Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 152. (Read 214431 times)

sr. member
Activity: 728
Merit: 304
Miner Developer
Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

 (if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...
It's not that simple; in most cases you only get partial overlap. It all depends on GPU usage. If one kernel alone uses 100% of the GPU (or close to it), the other will just wait for resources to become available. The only way to get 2 kernels to run fully in parallel is if each kernel uses only about 50% of the GPU. So in most cases it isn't practical.

lol, why is this discussion happening on an AMD thread? lol
Parallelizing too?


I hope this thread will be bipartisan again once I catch up with Claymore's and Optiminer.
This AMD Zcash miner competition is dragging on for way too long...
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems like I need to weave Round 0 into Rounds 1 through 8 explicitly with AMD drivers.
I thought this was an easy task, but, alas, it wasn't...
legendary
Activity: 1400
Merit: 1050
Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

 (if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...
It's not that simple; in most cases you only get partial overlap. It all depends on GPU usage. If one kernel alone uses 100% of the GPU (or close to it), the other will just wait for resources to become available. The only way to get 2 kernels to run fully in parallel is if each kernel uses only about 50% of the GPU. So in most cases it isn't practical.

lol, why is this discussion happening on an AMD thread? lol
Parallelizing too?
hero member
Activity: 588
Merit: 520
Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

 (if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block.

Source of these claims?
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

 (if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...
hero member
Activity: 588
Merit: 520
Quote
Only after the first kernel is finished the second one will execute.

Like I said... it doesn't matter if you have threads, streams etc... in the end, on the GPU, only one kernel can be executed at a time. Equihash notably gets more speed with several threads, because there are many kernels to be executed (from round0 to round9) and between each execution there is a pause that can be used by the CUDA driver to execute another kernel from another thread.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Use a double buffer and 2 CUDA streams in parallel.

do
1.launch round0 buffer1 (thread1)
2.launch round1-round8 buffer2 (thread2)
sync
swap buffer pointers
loop

Or permute the rounds so that the rounds that give the most speed are executed in parallel. Here round3-round8 are running at the same time as round0:

For example:

do
launch round1-round2 (thread2)
wait for thread2
1.launch round0 buffer1 (thread1)
2.launch round3-round8 buffer2 (thread2)
sync
swap buffer pointers
loop

round0 takes around 20% of the total time
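A rough back-of-the-envelope sketch of what the double-buffer loop above could gain, in Python. It assumes round0 is 20% of one run (the figure quoted here), that rounds 1-8 are the remaining 80%, and that the two overlap perfectly, which is the best case and not a measurement. The names iteration_time and schedule are made up for this illustration.

```python
# Idealized model: round0 of the NEXT run overlaps rounds 1-8 of the CURRENT run.
ROUND0 = 0.2   # share of one run spent in round0 (figure from the post)
REST = 0.8     # rounds 1-8, assumed to be the remainder

def iteration_time(round0, rest, overlapped):
    """Time for one run in arbitrary units; perfect overlap is an assumption."""
    return max(round0, rest) if overlapped else round0 + rest

serial = iteration_time(ROUND0, REST, overlapped=False)
piped = iteration_time(ROUND0, REST, overlapped=True)
speedup = serial / piped   # upper bound on the gain from hiding round0

def schedule(n_iters):
    """Yield (iteration, buffer written by round0, buffer read by rounds 1-8)."""
    fill, solve = 0, 1
    for i in range(n_iters):
        # On iteration 0 the 'solve' buffer is still empty (pipeline warm-up).
        yield i, fill, solve
        fill, solve = solve, fill   # swap buffer pointers

if __name__ == "__main__":
    print(f"ideal speedup: {speedup:.2f}x")
    for step in schedule(4):
        print(step)
```

Under these assumptions the ceiling is 1/0.8, so hiding round0 completely can give at most about 25% more solutions per second; any contention between the two kernels eats into that.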




Quote
Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.

On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.

This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.

Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.

http://stackoverflow.com/questions/8473617/are-cuda-kernel-calls-synchronous-or-asynchronous
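The ordering rule in the quote can be put into a toy model: kernels launched into the same stream run back to back, kernels in different streams may overlap. The sketch below is plain Python, not CUDA, and it assumes perfect overlap across streams, which is only an upper bound since real overlap depends on how many resources each kernel leaves free. The function name makespan and the durations are invented for the example.

```python
from collections import defaultdict

def makespan(launches):
    """launches: list of (stream_id, duration).

    Kernels in the same stream are serialized; different streams are
    assumed to overlap perfectly (an idealization, not a guarantee).
    """
    busy = defaultdict(float)
    for stream, duration in launches:
        busy[stream] += duration
    return max(busy.values(), default=0.0)

# Both kernels in the default stream: they run one after the other.
same_stream = makespan([(0, 2.0), (0, 8.0)])

# One kernel per stream: the shorter one can hide behind the longer one.
two_streams = makespan([(1, 2.0), (2, 8.0)])

if __name__ == "__main__":
    print(same_stream, two_streams)
```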
hero member
Activity: 588
Merit: 520
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
Just running multiple instances of the kernel should help; just don't launch them at exactly the same time.  Ideally the 2nd instance should be launched after the first has finished round0.

But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes doesn't. With proper code, you can make sure that this always happens. No need for 5 threads. (nicehash dj-ezo kernel on the gtx 1080)


At first, I didn't fully understand what you meant, but I think I do now. Your idea is the following: while round0 is being executed, you would like to execute the other rounds in parallel with round0 but with a different nonce, so that the resources of the card are better utilized (round0 has few memory ops and mostly ALU ops, while rounds 1+ have more memory ops and fewer ALU ops). I had this idea too, but here is the problem for CUDA: you would need to be able to launch two kernels at the same time, and I am not talking about launching from various threads, but about actually making the NVIDIA driver execute two kernels in parallel. That is not how CUDA works to my knowledge. CUDA, at driver level, will always execute one kernel, then move on to the next one. To achieve parallel solving of rounds, you would need to do it in your own code (e.g. each odd thread block does round0 and each even thread block does round1), but the different needs of round0 and round1 would lower your occupancy and probably make everything slower (round0 doesn't need shared memory but needs more registers; round1 needs lots of shared memory but fewer registers).
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
Just running multiple instances of the kernel should help; just don't launch them at exactly the same time.  Ideally the 2nd instance should be launched after the first has finished round0.

But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes doesn't. With proper code, you can make sure that this always happens. No need for 5 threads. (nicehash dj-ezo kernel on the gtx 1080)
sr. member
Activity: 588
Merit: 251
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time.  Ideally the 2nd instance should be launched after the first has finished round0.
sr. member
Activity: 728
Merit: 304
Miner Developer
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass (round 0) be removed completely? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do 1 nonce search (round1-round8) and one blake2s (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA round0 takes 20% of the time. (The open-source ZEC DJezo kernel)

I must admit that this is a brilliant idea. Thank you so much for sharing it!

Meh.  ZEC uses a truncated blake2 using 2x200 bits out of 512.  I doubt you'll find a way to re-use the blake calculations for another algo like dcr or sia.

To optimize round0, use the same idea as bitcoin sha-256 optimization, by looking for parts of the algorithm that can be skipped.  For example, since the last 112 bits are ignored, you might be able to skip some parts of the blake algo.  And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block.  Since you're already building a custom llvm, you can probably get the kernel compile and dispatch time down to a few ms.

p.s.  Here's some bedtime reading for you on bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
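A minimal sketch of the "everything but the nonce is constant" point, using Python's hashlib instead of a GPU kernel. It assumes Equihash parameters n=200, k=9, a BLAKE2b instance with a 50-byte digest (2x200 bits) and the personalization "ZcashPoW" followed by n and k as little-endian 32-bit integers; treat those constants as assumptions to be checked against the Zcash spec. The helper names base_state and round0_hashes are hypothetical. In Python the copy mostly shows the structure; the real saving would come from precomputing the first compression block on the GPU.

```python
import hashlib
import struct

N, K = 200, 9                                     # assumed Equihash parameters
PERSON = b"ZcashPoW" + struct.pack("<II", N, K)   # assumed personalization, 16 bytes
DIGEST_BYTES = 2 * N // 8                         # 50 bytes = two 200-bit strings

def base_state(header_without_nonce):
    """Absorb the part of the input that is constant for the whole block."""
    h = hashlib.blake2b(digest_size=DIGEST_BYTES, person=PERSON)
    h.update(header_without_nonce)
    return h

def round0_hashes(state, nonce, index):
    """Finish the hash for one (nonce, index) pair from the saved state."""
    h = state.copy()                 # reuse the constant prefix, do not re-absorb it
    h.update(nonce)
    h.update(struct.pack("<I", index))
    digest = h.digest()
    half = N // 8
    return digest[:half], digest[half:]

if __name__ == "__main__":
    header = bytes(108)              # placeholder header bytes, not real block data
    state = base_state(header)
    a, b = round0_hashes(state, nonce=bytes(32), index=0)
    print(a.hex())
    print(b.hex())
```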



I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
sr. member
Activity: 588
Merit: 251
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass (round 0) be removed completely? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do 1 nonce search (round1-round8) and one blake2s (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA round0 takes 20% of the time. (The open-source ZEC DJezo kernel)

I must admit that this is a brilliant idea. Thank you so much for sharing it!

Meh.  ZEC uses a truncated blake2 using 2x200 bits out of 512.  I doubt you'll find a way to re-use the blake calculations for another algo like dcr or sia.

To optimize round0, use the same idea as bitcoin sha-256 optimization, by looking for parts of the algorithm that can be skipped.  For example, since the last 112 bits are ignored, you might be able to skip some parts of the blake algo.  And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block.  Since you're already building a custom llvm, you can probably get the kernel compile and dispatch time down to a few ms.

p.s.  Here's some bedtime reading for you on bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf

sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass (round 0) be removed completely? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do 1 nonce search (round1-round8) and one blake2s (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA round0 takes 20% of the time. (The open-source ZEC DJezo kernel)
Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by 2 different people...

I just pointed the open-source development in the right direction. Time to give team Claymore some open-source competition...
sr. member
Activity: 652
Merit: 266
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass (round 0) be removed completely? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do 1 nonce search (round1-round8) and one blake2s (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA round0 takes 20% of the time. (The open-source ZEC DJezo kernel)
Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by 2 different people...
sr. member
Activity: 728
Merit: 304
Miner Developer
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass (round 0) be removed completely? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do 1 nonce search (round1-round8) and one blake2s (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA round0 takes 20% of the time. (The open-source ZEC DJezo kernel)

I must admit that this is a brilliant idea. Thank you so much for sharing it!
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the blake2s pass (round 0) be removed completely? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with the round0 merged into the other rounds.

The miner will need to do 1 nonce search (round1-round8) and one blake2s (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA round0 takes 20% of the time. (The open-source ZEC DJezo kernel)
sr. member
Activity: 728
Merit: 304
Miner Developer
I'm trying to hook Linux system calls from the user space so that GG can access a larger GDS segment without a kernel patch.
The work never ends...
sr. member
Activity: 728
Merit: 304
Miner Developer
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

I have been thinking about that for quite some time now.
I will wrap up Equihash optimizations once I'm done with helper functions.
sr. member
Activity: 728
Merit: 304
Miner Developer
After I tried everything with my Equihash kernel, I reached the conclusion that the current bottleneck is not in my kernel but elsewhere.
Sure enough, I found that a considerable amount of CPU time was spent in sgminer's helper functions.
I don't think anybody touched them since super-nice folks at Genesis Mining ported SA's old kernel to sgminer-gm.
Let me see...
full member
Activity: 254
Merit: 100
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.