Author

Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner - page 118. (Read 877889 times)

member
Activity: 81
Merit: 1002
It was only the wind.
full member
Activity: 169
Merit: 100
 Here are all the bins getting produced:

  I get this from my output.log:

Code:
[14:07:01] Building binary neoscryptHawaiigw64l4ku0big7hs.bin
[14:07:01] Error -11: Building Program (clBuildProgram)
[14:07:01] "C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl", line 368: warning:
          variable "t" was declared but never referenced
   uint4 t, st[4];
        ^

"C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl", line 495: error:
          identifier "MAX_GLOBAL_THREADS" is undefined
   __global ulong16 *V = (__global ulong16 *)(padcache + (0x8000 * (get_global_id(0) % MAX_GLOBAL_THREADS)));
                                                                                      ^

"C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl", line 513: warning:
          argument of type "__global ulong16 *" is incompatible with parameter
          of type "__global uint16 *"
   SMix(X, V, flag);
          ^

1 error detected in the compilation of "C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl".

Frontend phase failed compilation.

Any ideas?  Could CGWatcher be interfering somehow when bins are made?

You have wrong marucoin-mod.cl.
Try to find right, and replace in ./kernels
fix https://bitcointalk.org/index.php?topic=854257.240

Thanks, I have actually been using that marucoin-mod.cl from the thread.  To be sure I double checked and it does have the correct line 96.  I'm all ears if you've got any other ideas.

You're building a Neoscrypt binary, not an X13 one.

Yeah, it's trying to build a weird one right, Building binary neoscryptHawaiigw64l4ku0big7hs.bin ?  Here are some other strange ones that actually got built.

darkcoin-modHawaiigw64l4ku0.bin (Correct)
darkcoin-modHawaiigw64l4ku0big7hs.bin
darkcoin-modHawaiigw64l4tc8192.bin
marucoin-modHawaiigw64l4ku0.bin
marucoin-modHawaiigw64l4ku0big7hs.bin (Correct)
marucoin-modHawaiigw64l4tc8192.bin
neoscryptHawaiigw64l4tc8192.bin (Correct)


full member
Activity: 169
Merit: 100
 Here are all the bins getting produced:

  I get this from my output.log:

Code:
[14:07:01] Building binary neoscryptHawaiigw64l4ku0big7hs.bin
[14:07:01] Error -11: Building Program (clBuildProgram)
[14:07:01] "C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl", line 368: warning:
          variable "t" was declared but never referenced
   uint4 t, st[4];
        ^

"C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl", line 495: error:
          identifier "MAX_GLOBAL_THREADS" is undefined
   __global ulong16 *V = (__global ulong16 *)(padcache + (0x8000 * (get_global_id(0) % MAX_GLOBAL_THREADS)));
                                                                                      ^

"C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl", line 513: warning:
          argument of type "__global ulong16 *" is incompatible with parameter
          of type "__global uint16 *"
   SMix(X, V, flag);
          ^

1 error detected in the compilation of "C:\Users\ANIMAL~1\AppData\Local\Temp\OCL4772T27.cl".

Frontend phase failed compilation.

Any ideas?  Could CGWatcher be interfering somehow when bins are made?

You have wrong marucoin-mod.cl.
Try to find right, and replace in ./kernels
fix https://bitcointalk.org/index.php?topic=854257.240

Thanks, I have actually been using that marucoin-mod.cl from the thread.  To be sure I double checked and it does have the correct line 96.  I'm all ears if you've got any other ideas.
member
Activity: 81
Merit: 1002
It was only the wind.
#2 is completely impossible to satisfy - you realize the SIMD algo by itself is so big, it has to be split over 8 threads or it'll cause register spillage? Imagine trying to fit all 11 hashes into one kernel and make the compiler NOT spill registers into memory. Would you rather do small, quick accesses, or get often used variables like the hash state dumped to global - forcing you to stall all the time?

Why don't I put the hashes variable in local memory? Because local memory is good for things that threads within a workgroup need to cooperate on, as well as often used constants. There would be no point putting the hashes in local memory, because first, I cannot set it from outside the kernel. Therefore, I would be loading the hashes from global to local, and then to registers? And then I would never need to access global again, so putting it in local serves no purpose.

I have another idea:
1. you don't store results from different kernels to the same memory area, but to different area for each algo
2. you can schedule all of them for execution in parallel (well as you currently do in "opencl_scanhash" function except that clCreateCommandQueue should specify "CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE")
3. kernels should be modified to wait for event to triger them. This event would be a signal from the previous kernel that it had finished the work. I'm not sure if this can be done via clEvents as they seem to work on entire queue rather than individual thread.

This would allow for kernels not to have to wait for all of them to finish (no end barrier) as the signaling to triger the next one would be explicit signal.... Reading OpenCl documentation I'm not sure this is possible...

Your point 1 is wasteful on memory - why store useless information? The output of blake becomes absolutely useless to me once BMW has finished. You cannot schedule them all for execution in parallel - not out of order! Look at X11, the output of the previous hash is the input to the next - therefore, BMW obviously depends on Blake's earlier result.

My understanding regarding barriers is that if one thread hits it then all of them have to in order for kernel to continue execution. So it is a sync mechanism (within a workgroup anyway). I agree that putting "hashes" in local mem would require one small kernel at the beginning and one at the end to copy local data into global mem and this would make sense if concurrent access to global mem from all threads is indeed a bottle neck. You say it's not I just mention it for the sake of completeness.

I completely forgot about gpu-threads, you're right that it keeps gpu busy. This is equivalent of just having 2 command queues that schedule kernels in parallel between them and in a sequential order within one gpu-thread .

BTW why is there a barrier in each kernel anyway? Since currently command queue does not allow for parallel execution of different kernels surely writes of blake should complete before next kernel bmw starts work...

My idea was that we could have some kind of say per thread counter/semaphore associated with "hashes" that would allow a kernel function searchN to start it's work only when this counter/semaphore was of value N. And this counter would be set at the end of each searchN fucntion to N+1. I see now this may not work as it maps poorly with what gpu was designed for.

Anyway I realize I need to spent some time with CodeXL for a while to gain some insight... what beats me is that all hash algos in X11 are designed to have efficient implementation in hardware. So they should be small in code and consume little memory. This kind of thing should be possible to implement directly in thread registers or cl terms "private memory".

Barriers work that way ONLY WITHIN A WORKGROUP. It's not global, except in the sense that writes before it are now globally visible. Also, you can't have a small kernel at the beginning and end to copy the local data - you don't understand; local memory is gone when the kernel exits.

As for the hash algos being small and not very heavy - except SIMD, they are. The implementations suck. The incompetence awards go to the current SGMiner 5 Echo implementation, though - that shit is terrible.  Grin
legendary
Activity: 1400
Merit: 1050
Hey djm34, there's something is wrong with the compiled binary that you post https://bitcointalksearch.org/topic/m.9817141

When hashing lyra2RE (vtc), the hashrate shown on the miner is indeed faster 5-10% than the metalicjames one, but on pool it will only record half the hashrate.
All stats normal, low rejected shares, no hardware errors that are shown on the miners.
Tested it on two different pools, coinotron and hashlink.eu, still when using your windows binary, only half hashrate will be recorded on the web.
Somehow only half the shares are sent by the miners or accepted by the pool.
And it's not  just estimations only, the coin received is halved for the same time period.
When back on using metalicjames version, the hashrate going back to normal again.

Where's that half hashrate gone??
Could you please recheck the binaries. Thanks.
must be related to the difficulty adjustment..., I will re-upload it. In the mean time there is also a difficulty multiplier option in sgminer (it shows as deprecated in the help, however it still works).
Also, I think that several pool are still tuning their hash report things...  Grin

Nah man it isn't the pool. I gave this a try and thx for it...finally got to that sweet 640-660 kh/s x 5 (3.2) but on Givemecoins pool showing exactly half (~.1.65). Change back to sgminer RC1 back to 550 kh/s but showing correct on pool.

What you are telling shows that it depends on which pool you run... so it is up to the pool admi to set up the pool corectly so that it coressponds the hashrate the miner tell you.
full member
Activity: 136
Merit: 100
Hey djm34, there's something is wrong with the compiled binary that you post https://bitcointalksearch.org/topic/m.9817141

When hashing lyra2RE (vtc), the hashrate shown on the miner is indeed faster 5-10% than the metalicjames one, but on pool it will only record half the hashrate.
All stats normal, low rejected shares, no hardware errors that are shown on the miners.
Tested it on two different pools, coinotron and hashlink.eu, still when using your windows binary, only half hashrate will be recorded on the web.
Somehow only half the shares are sent by the miners or accepted by the pool.
And it's not  just estimations only, the coin received is halved for the same time period.
When back on using metalicjames version, the hashrate going back to normal again.

Where's that half hashrate gone??
Could you please recheck the binaries. Thanks.
must be related to the difficulty adjustment..., I will re-upload it. In the mean time there is also a difficulty multiplier option in sgminer (it shows as deprecated in the help, however it still works).
Also, I think that several pool are still tuning their hash report things...  Grin

Nah man it isn't the pool. I gave this a try and thx for it...finally got to that sweet 640-660 kh/s x 5 (3.2) but on Givemecoins pool showing exactly half (~.1.65). Change back to sgminer RC1 back to 550 kh/s but showing correct on pool.
member
Activity: 81
Merit: 1002
It was only the wind.
Your approach (so called "darkcoin-mod.cl") was different in the way that you do not schedule whole X11 to be executed an one single go, but in that you have had to split X11 into 11 individual components (in fact hash algos that comprise X11) which execute one after the other thus resolving cache issues as all of the 11 components are small enough to fit into the cache. But now you're scheduling execution of 11 kernels (via 11 "clEnqueueNDRangeKernel" calls).

I only use xI 64 on Hawaii - 128 is better on Tahiti and Pitcairn.

Then "blake" hash is executed over that data in all threads. I noticed you use xIntensity of 64 so he would execute 64*2048 "blake" global threads on say R9 280X card. Results of all these threads is stored into global memory. Since all instructions are executed more or less in lock-step (lock-step within a compute unit and possibly out of sync between compute units) there is always a lot of contention for memory access at specific times. Finally GPU waits for all threads to finish via the "barrier" command.

Then the GPU pauses a bit and waits for sgminer to enqueue another kernel which in our case is: "bmw". Global memory contains "blake" hashed block data.

There's no real contention for memory access, if you mean in a locking sense - all workgroups are accessing different portions of global memory. Plus, the reads and writes are VERY small - which is why you never notice. The GPU has far more than enough memory bandwidth to handle those accesses. The barrier is to make sure the GPU ACTUALLY writes it to global memory, instead of optimizing it out. I think you assumed that the barrier was some sort of global thread synchronization - this is not the case. Also, the GPU doesn't have to wait for sgminer, "gpu-threads" being set to 2 in the miner means that 2 CPU threads are spawned to give work to the GPU at once. This allows it to stay busy.

So there are two major points of inefficiency in this approach if we can call it that as this "darkcoin-mod" approach is 40% faster than the original darkcoin.cl:

#1 GPU is not doing much between kernel schedules. This effect is even more problematic as all kernels end with a "barrier" statement thus waiting for all of threads to complete.
#2 kernels are spending time getting and storing data into global memory. Ideally this data should remain in GPU registers but I'm not sure this is possible with different kernels approach.

#1 is a non-issue, it's why we have 2 GPU threads - the barrier doesn't do what you think it does - it flushes writes to memory to ensure the next kernel can see them.

You say #2 is non issue and possibly I could agree with that. Although one could argue that as all the threads wait for the memory at the beginning for example you cannot take advantage of "waves in flight" as all waves will hit the same obstacle at the beginning of the kernel.
Question for you: why didn't you set "hashes" variable to reside in the local memory?

#2 is completely impossible to satisfy - you realize the SIMD algo by itself is so big, it has to be split over 8 threads or it'll cause register spillage? Imagine trying to fit all 11 hashes into one kernel and make the compiler NOT spill registers into memory. Would you rather do small, quick accesses, or get often used variables like the hash state dumped to global - forcing you to stall all the time?

Why don't I put the hashes variable in local memory? Because local memory is good for things that threads within a workgroup need to cooperate on, as well as often used constants. There would be no point putting the hashes in local memory, because first, I cannot set it from outside the kernel. Therefore, I would be loading the hashes from global to local, and then to registers? And then I would never need to access global again, so putting it in local serves no purpose.

I have another idea:
1. you don't store results from different kernels to the same memory area, but to different area for each algo
2. you can schedule all of them for execution in parallel (well as you currently do in "opencl_scanhash" function except that clCreateCommandQueue should specify "CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE")
3. kernels should be modified to wait for event to triger them. This event would be a signal from the previous kernel that it had finished the work. I'm not sure if this can be done via clEvents as they seem to work on entire queue rather than individual thread.

This would allow for kernels not to have to wait for all of them to finish (no end barrier) as the signaling to triger the next one would be explicit signal.... Reading OpenCl documentation I'm not sure this is possible...

Your point 1 is wasteful on memory - why store useless information? The output of blake becomes absolutely useless to me once BMW has finished. You cannot schedule them all for execution in parallel - not out of order! Look at X11, the output of the previous hash is the input to the next - therefore, BMW obviously depends on Blake's earlier result.
hero member
Activity: 658
Merit: 500
My sgminer versions are updated with Lyra2RE now for those that want to use it
Hello,where do i download the Sgminer for Lyra2RE friend ? Thank you
links are in my sig
both source code and precompiled binaries
newbie
Activity: 40
Merit: 0
My sgminer versions are updated with Lyra2RE now for those that want to use it
Hello,where do i download the Sgminer for Lyra2RE friend ? Thank you
member
Activity: 81
Merit: 1002
It was only the wind.
No, you can't set local memory like that - it is local to a workgroup.

Dear fucking god no - you don't understand basic GPU architecture. Without getting too technical, they cannot STAND large chunks of work - you MUST break it down into small chunks that can be parallelized. As a matter of fact, it's not good enough yet - 8 threads should be used per SIMD hash instead of one, because SIMD is too goddamned big to fit in the code cache, and it spills EVERYTHING to global memory. You want to figure out why X11 relies on memory when it shouldn't, look at SIMD. The access to get the work is seriously nothing at all.

Yes that's what I asked: if "hashes" variable were a local, then only those threads that belong to the same workgroup (64 threads) would have access to this common local resource.

I get the code cache argument, but then this leads to significantly different design, namely different workgroups should run different kernels. This we would have to tie in into sgminer gpu scheduling code.

I'm not as proficient with opencl as I would like to be, but looking at scrypt kernel implementation or SHA256 for that matter (at least from a year ago) all seemed to be rather simple: GPU was treated as a flat SIMD resource with say 2048 threads (7970) and global memory, so we would just schedule 8192 simultaneous threads (-I 13) , or more, so that hopefully we use up all computing resources.

X11 kernel (moded one) looks very different. The way I see it I should somehow take advantage of the fact that work scheduled to different workgroups is in fact executed in MIMD fashion.

Anyway I truly enjoy these conversations and I thank you for it.


If you wanted to take advantage of it, you could do 11 hashes at once in a parallel fashion. Dunno if it'd be much faster, though.
sr. member
Activity: 547
Merit: 250
Euw...
I have HW errors on neoscrypt with all the Lyra2RE miners ~~
Plop in an older bin file you had generated.
member
Activity: 81
Merit: 1002
It was only the wind.
By the way, I want to say thanks to Slix (ystarnaud) for adding the kernelfile option I suggested. Now I can develop without polluting the source tree - just setting kernelfile to "wolf-x13" or whatever it is per profile, then setting kernel-path to the new folder.  Cheesy
full member
Activity: 147
Merit: 100
Euw...
I have HW errors on neoscrypt with all the Lyra2RE miners ~~
member
Activity: 81
Merit: 1002
It was only the wind.
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J

Here are some stats, you do the math:

Mithra, running 2x270X gets 3.33MH/s & 3.45MH/s X11 for a total of 6.78MH/s, clocked at 875/1400 & 900/1500, both undervolted to 950mV, and she pulls 165W at the wall.
Screenshot (NSFW): https://ottrbutt.com/miner/x11localrigwolf-lowpower-11222014.png

I'm running all my cards at 1250 Mhz memory, in same instances I get better hashrates but power consumption are always lower. I guess it's lower memory latency at those clocks...

I noticed lowering it can help, too.

When mining scrypt with 290(x) reference cards there were sweet spots where hash was maximized.  Before Stilt's bios, the reference cards could only mine well with 1250 or 1500 memory and you just adjust the gpu clock to the sweet spot, I think it was usually a 1.4:1 ratio.  After flashing Stilt's bios you could change the memory in steps of 25 mhz and adjust the gpu again.  Maybe there are sweet spots for x11 type coins.  I haven't had time to test much so I have no idea.

No, not in the same way. The stock X11 kernel sucks, so it's kinda sensitive to memory speed, but not latency.
hero member
Activity: 896
Merit: 1000
I there some issue with 290x on Lyra2RE ,,because i cant get my 290x's to hash as fast as my 290 non x's
can you give the numbers for both as well as the setting

"290 non x Elipa 1.488 Mh/s

          "gpu-engine" : "980",
          "gpu-memclock" : "1500",
          "xintensity" : "64",
          "nfactor" : "10",
          "worksize" : "64",
          "algorithm" : "Lyra2RE",

290x hynix 1.450 Mh/s

          "gpu-engine" : "1070",
          "gpu-memclock" : "1400",
          "xintensity" : "64",
          "algorithm" : "Lyra2RE",
          "worksize" : "64",


I've had the clock speeds all over the place it just seems like the 290x's don't want to leave 1.450 Mh/s

and yet the 290's are happy to run faster easyer


Drop core, raise memclk.

True, it is memory intensive.
hero member
Activity: 528
Merit: 500
I there some issue with 290x on Lyra2RE ,,because i cant get my 290x's to hash as fast as my 290 non x's
can you give the numbers for both as well as the setting

"290 non x Elipa 1.488 Mh/s

          "gpu-engine" : "980",
          "gpu-memclock" : "1500",
          "xintensity" : "64",
          "nfactor" : "10",
          "worksize" : "64",
          "algorithm" : "Lyra2RE",

290x hynix 1.450 Mh/s

          "gpu-engine" : "1070",
          "gpu-memclock" : "1400",
          "xintensity" : "64",
          "algorithm" : "Lyra2RE",
          "worksize" : "64",


I've had the clock speeds all over the place it just seems like the 290x's don't want to leave 1.450 Mh/s

and yet the 290's are happy to run faster easyer
member
Activity: 81
Merit: 1002
It was only the wind.
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J

Here are some stats, you do the math:

Mithra, running 2x270X gets 3.33MH/s & 3.45MH/s X11 for a total of 6.78MH/s, clocked at 875/1400 & 900/1500, both undervolted to 950mV, and she pulls 165W at the wall.
Screenshot (NSFW): https://ottrbutt.com/miner/x11localrigwolf-lowpower-11222014.png

I'm running all my cards at 1250 Mhz memory, in same instances I get better hashrates but power consumption are always lower. I guess it's lower memory latency at those clocks...

I noticed lowering it can help, too.
legendary
Activity: 1400
Merit: 1050
I there some issue with 290x on Lyra2RE ,,because i cant get my 290x's to hash as fast as my 290 non x's
can you give the numbers for both as well as the setting
hero member
Activity: 528
Merit: 500
I there some issue with 290x on Lyra2RE ,,because i cant get my 290x's to hash as fast as my 290 non x's
member
Activity: 81
Merit: 1002
It was only the wind.
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J

Here are some stats, you do the math:

Mithra, running 2x270X gets 3.33MH/s & 3.45MH/s X11 for a total of 6.78MH/s, clocked at 875/1400 & 900/1500, both undervolted to 950mV, and she pulls 165W at the wall.
Screenshot (NSFW): https://ottrbutt.com/miner/x11localrigwolf-lowpower-11222014.png

OK i assume 50W is your idle draw, so that leaves us with 3450kH/58J~=60kH/J
If same ratios were to apply to 290X then 290X should do Cores_2816/1280 * Freq_1018/900 * 3450 kH/s ≃ 8600 kH/s, pull from the wall should be 145W. For X2 those numbers would be double.

This is phenomenal. Although I remember seeing your X2 hash rates to be even higher.

They are, but this is heavily undervolted - idle draw was 55W, by the way. But close enough.

Anyway, X11 can get far better than this. My rewrite is decent. It's not excellent, or even really good.
Jump to: