Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 152.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: djm34 on March 21, 2017, 01:08:15 PM

Quote from: sp_ on March 21, 2017, 12:56:31 PM

Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

(if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...

That's not that simple; in most cases you only get partial overlap. Actually, it all depends on GPU usage. If one kernel alone uses 100% of the GPU (or close to it), the other will just wait for resources to become available. The only way to get two kernels to run fully in parallel is if each kernel uses only about 50% of the GPU. So in most cases it isn't practical.

lol, why is this discussion happening on an AMD thread? lol
parallelizing too?

I hope this thread will be bipartisan again once I catch up with Claymore's and Optiminer.
This AMD Zcash miner competition has been dragging on for way too long...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

It seems like I need to weave Round 0 into Rounds 1 through 8 explicitly with AMD drivers.
I thought this was an easy task, but, alas, it wasn't...

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on March 21, 2017, 12:56:31 PM

Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

(if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...

That's not that simple; in most cases you only get partial overlap. Actually, it all depends on GPU usage. If one kernel alone uses 100% of the GPU (or close to it), the other will just wait for resources to become available. The only way to get two kernels to run fully in parallel is if each kernel uses only about 50% of the GPU. So in most cases it isn't practical.

lol, why is this discussion happening on an AMD thread? lol
parallelizing too?

djeZo

hero member

Activity: 588

Merit: 520

Quote from: sp_ on March 21, 2017, 12:56:31 PM

Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

(if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block.

Source of these claims?

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

(if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...

djeZo

hero member

Activity: 588

Merit: 520

Quote

Only after the first kernel is finished the second one will execute.

Like I said... it doesn't matter if you have threads, streams, etc. In the end, on the GPU, only one kernel can be executed at a time. Equihash notably gets more speed with several threads because there are many kernels to be executed (from round0 to round9), and between each execution there is a pause that the CUDA driver can use to execute a kernel from another thread.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Use a double buffer and 2 CUDA streams in parallel.

do
1.launch round0 buffer1 (thread1)
2.launch round1-round8 buffer2 (thread2)
sync
swap buffer pointers
loop
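
The double-buffer loop above can be sketched on the host side roughly as follows. This is only a scheduling model in Python, not GPU code: `round0` and `rounds_1_to_8` are hypothetical stand-ins for the real kernels, and the two pool workers play the role of the two streams. It shows the ordering only (round0 of the next nonce fills the back buffer while rounds 1-8 consume the front buffer) and says nothing about whether a given driver actually overlaps the work.

```python
from concurrent.futures import ThreadPoolExecutor


def round0(nonce):
    """Hypothetical stand-in for the ALU-heavy hash generation pass."""
    return [nonce * 10 + i for i in range(4)]


def rounds_1_to_8(buf):
    """Hypothetical stand-in for the memory-heavy collision rounds."""
    return sum(buf)


def pipelined(nonces):
    """Run round0 for nonce n+1 while rounds 1-8 process nonce n."""
    if not nonces:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        front = round0(nonces[0])  # prime the pipeline with the first buffer
        for nxt in nonces[1:]:
            fill = pool.submit(round0, nxt)             # "stream 1": back buffer
            solve = pool.submit(rounds_1_to_8, front)   # "stream 2": front buffer
            results.append(solve.result())              # sync
            front = fill.result()                       # swap buffer pointers
        results.append(rounds_1_to_8(front))            # drain the last buffer
    return results
```

The results come out in nonce order, but each solution belongs to the buffer filled one iteration earlier, which is the "nonce found would be the result of the previous run" effect mentioned elsewhere in the thread.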

Or permute the rounds so that the rounds that give the most speed are executed in parallel. Here round3-round8 are running at the same time as round0:

For example:

do
launch round1-round2 (thread2)
wait for thread2
1.launch round0 buffer1 (thread1)
2.launch round3-round8 buffer2 (thread2)
sync
swap buffer pointers
loop

round0 takes around 20% of the total time
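
As a rough back-of-the-envelope model of what this could buy: if round0 is about 20% of an iteration and the rest is 80%, the gain depends on how much of round0 really hides under the other rounds. The `overlap` parameter below is a made-up knob for illustration (1.0 means round0 is fully hidden, 0.0 means the kernels end up running back to back); real numbers would have to come from profiling.

```python
def iteration_time(round0, rest, overlap=1.0):
    """Time per iteration when a fraction `overlap` of the shorter phase is hidden."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden = overlap * min(round0, rest)
    return round0 + rest - hidden


def speedup(round0, rest, overlap=1.0):
    """Speedup relative to running round0 and the other rounds serially."""
    return (round0 + rest) / iteration_time(round0, rest, overlap)


# Using the 20% / 80% split quoted above (units are arbitrary):
print(speedup(20, 80, overlap=1.0))  # perfect overlap: 1.25
print(speedup(20, 80, overlap=0.0))  # no overlap: 1.0
```

So even with perfect overlap the ceiling under this model is about 25%, and partial overlap gives proportionally less.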

Quote

Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.

On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.

This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.

Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.

http://stackoverflow.com/questions/8473617/are-cuda-kernel-calls-synchronous-or-asynchronous

djeZo

hero member

Activity: 588

Merit: 520

Quote from: sp_ on March 21, 2017, 11:10:25 AM

Quote from: nerdralph on March 21, 2017, 11:02:53 AM

Quote from: zawawa on March 21, 2017, 10:24:08 AM

I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice that the VALU was idle quite often during memory transfers, but I did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.

But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes doesn't. With proper code, you can make sure that this always happens. No need for 5 threads (NiceHash dj-ezo kernel on the GTX 1080).

At first, I didn't fully understand what you meant, but I think I do now. Your idea is the following: while round0 is being executed, you would like to execute other rounds in parallel with round0 but with a different nonce, so that the resources of the card can be better utilized (during round0 there are not many memory ops but rather ALU ops, while rounds 1+ have more memory ops and fewer ALU ops). I had this idea, but here is the problem for CUDA: you would need to be able to launch two kernels at the same time, and I am not talking about various threads, but about actually making the NVIDIA driver execute two kernels in parallel. That is not how CUDA works, to my knowledge. CUDA, at the driver level, will always execute a certain kernel, then move on to the next one. To achieve parallel solving of rounds, you would need to do it in code on your own (e.g. say that each odd thread block is doing round0 and each even thread block is doing round1), but the different needs of round0 and round1 would lower your occupancy and probably make everything slower (round0 doesn't need shared memory and needs more registers; round1 needs lots of shared memory and fewer registers).

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: nerdralph on March 21, 2017, 11:02:53 AM

Quote from: zawawa on March 21, 2017, 10:24:08 AM

I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice that the VALU was idle quite often during memory transfers, but I did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.

But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes doesn't. With proper code, you can make sure that this always happens. No need for 5 threads (NiceHash dj-ezo kernel on the GTX 1080).

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on March 21, 2017, 10:24:08 AM

I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice that the VALU was idle quite often during memory transfers, but I did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on March 21, 2017, 08:52:27 AM

Quote from: zawawa on March 21, 2017, 04:19:03 AM

Quote from: sp_ on March 21, 2017, 04:03:23 AM

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass be removed completely (round 0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one blake2s pass (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC DJezo kernel).

I must admit that this is a brilliant idea. Thank you so much for sharing it!

Meh. ZEC uses a truncated blake2 using 2x200 bits out of 512. I doubt you'll find a way to re-use the blake calculations for another algo like dcr or sia.

To optimize round0, use the same idea as bitcoin sha-256 optimization, by looking for parts of the algorithm that can be skipped. For example, since the last 112 bits are ignored, you might be able to skip some parts of the blake algo. And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block. Since you're already building a custom llvm, you can probably get the kernel compile and dispatch time down to a few ms.

p.s. Here's some bedtime reading for you on bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf

I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice that the VALU was idle quite often during memory transfers, but I did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on March 21, 2017, 04:19:03 AM

Quote from: sp_ on March 21, 2017, 04:03:23 AM

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass be removed completely (round 0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one blake2s pass (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC DJezo kernel).

I must admit that this is a brilliant idea. Thank you so much for sharing it!

Meh. ZEC uses a truncated blake2 using 2x200 bits out of 512. I doubt you'll find a way to re-use the blake calculations for another algo like dcr or sia.

To optimize round0, use the same idea as bitcoin sha-256 optimization, by looking for parts of the algorithm that can be skipped. For example, since the last 112 bits are ignored, you might be able to skip some parts of the blake algo. And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block. Since you're already building a custom llvm, you can probably get the kernel compile and dispatch time down to a few ms.

p.s. Here's some bedtime reading for you on bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
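
One simple instance of the "everything but the nonce is constant" idea above is to hash the fixed prefix once and reuse that state for every index in round0. The sketch below is host-side Python, not a GPU kernel, and it rests on some assumptions: BLAKE2b with a 50-byte digest (the 2x200 bits mentioned above) and a "ZcashPoW" personalization followed by n=200 and k=9, as I understand them from the Zcash protocol specification. The 140-byte prefix in the usage note is placeholder data, not a real block header.

```python
import hashlib
import struct

# Assumed Equihash (n=200, k=9) personalization: "ZcashPoW" || le32(n) || le32(k)
PERSON = b"ZcashPoW" + struct.pack("<II", 200, 9)


def new_hasher():
    """BLAKE2b truncated to 50 bytes, i.e. two 200-bit strings per call."""
    return hashlib.blake2b(digest_size=50, person=PERSON)


def make_midstate(header_and_nonce):
    """Absorb the part of the input that is constant for one Equihash run."""
    h = new_hasher()
    h.update(header_and_nonce)
    return h


def round0_hash(midstate, index):
    """Finish the hash for one index, reusing the shared midstate."""
    h = midstate.copy()
    h.update(struct.pack("<I", index))
    return h.digest()


def round0_hash_naive(header_and_nonce, index):
    """Reference version that rehashes the whole input every time."""
    h = new_hasher()
    h.update(header_and_nonce + struct.pack("<I", index))
    return h.digest()
```

For example, with `prefix = bytes(range(140))` and `mid = make_midstate(prefix)`, calling `round0_hash(mid, i)` gives the same 50 bytes as `round0_hash_naive(prefix, i)`. Generating a specialized kernel per block, as suggested above, would go further by folding the constant state into the code itself.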

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: laik2 on March 21, 2017, 07:22:46 AM

Quote from: sp_ on March 21, 2017, 04:03:23 AM

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass be removed completely (round 0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one blake2s pass (round0), merged into the round1-round8 code, per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC DJezo kernel).

Jeeesus, sometimes I think sp_ is a jerk and sometimes normal...I believe either you have personality disorder or your account is being used by 2 different people...

I just pointed the open-source development in the right direction. Time to give team Claymore some open-source competition...

laik2

sr. member

Activity: 652

Merit: 266

Quote from: sp_ on March 21, 2017, 04:03:23 AM

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass be removed completely (round 0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one blake2s pass (round0), merged into the round1-round8 code, per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC DJezo kernel).

Jeeesus, sometimes I think sp_ is a jerk and sometimes normal...I believe either you have personality disorder or your account is being used by 2 different people...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: sp_ on March 21, 2017, 04:03:23 AM

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass be removed completely (round 0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one blake2s pass (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC DJezo kernel).

I must admit that this is a brilliant idea. Thank you so much for sharing it!

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass be removed completely (round 0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one blake2s pass (round0), merged into the round1-round8 code, per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC DJezo kernel).

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I'm trying to hook Linux system calls from the user space so that GG can access a larger GDS segment without a kernel patch.
The work never ends...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: joaocha on March 20, 2017, 05:23:14 PM

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

I have been thinking about that for quite some time now.
I will wrap up Equihash optimizations once I'm done with helper functions.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

After I tried everything with my Equihash kernel, I reached the conclusion that the current bottleneck is not in my kernel but elsewhere.
Sure enough, I found that a considerable amount of CPU time was being spent in sgminer's helper functions.
I don't think anybody has touched them since the super-nice folks at Genesis Mining ported SA's old kernel to sgminer-gm.
Let me see...

joaocha

full member

Activity: 254

Merit: 100

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
