Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner - page 129. (Read 877903 times)

newbie
Activity: 4
Merit: 0
Regarding neoscrypt HW errors: You are probably not using the AMD Catalyst drivers that the neoscrypt.cl kernel requires to generate a bin properly. For this version of sgminer and the neoscrypt kernel that comes with it, you will most likely need the AMD Catalyst 14.6 RC2 drivers; however, there is an even easier fix. See the post linked below and follow all the directions. It should let you mine neoscrypt without any more HW errors, and maybe even with a bit of a performance increase as well:
https://forum.feathercoin.com/index.php?/topic/7780-dev-neoscrypt-gpu-miner-public-beta-test/page-29#entry70031

Hey that fixed the problem. It put me up to 97KH/s on each card.
After that I started messing with the intensity and worksize and got up to 123Kh/s per card. Has anyone gotten higher with 270X's? (Core at 1180MHz for now)
{
                "name" : "WestHash_NeoScrypt",
                "url" : "stratum+tcp://stratum.westhash.com:4341",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "neoscrypt",
                "nfactor" : "10",
                "worksize" : "96",
                "gpu-threads" : "1",
                "xintensity" : "9",
                "thread-concurrency" : "8192"
        },

Also, I'm still having issues with Scrypt-N. In 3 minutes each card has about 500 HW errors. Getting about 175KH/s per card.

{
                "name" : "WestHash_Scrypt-N",
                "url" : "stratum+tcp://stratum.westhash.com:4335",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "zuikkis",
                "nfactor" : "11",
                "xintensity" : "4",
                "thread-concurrency" : "5152"
        },

Actually, I just tested good ol' Scrypt and it's getting about twice as many HW errors at twice the hashrate. I used to mine Scrypt at 4MH/s with my four 270X's but now I'm getting 1.39MH/s. I think that's from a driver change though.



I wish I could help you with the Scrypt-N HW errors, but I have no experience with it :( As for the 270Xs: 123 KH/s isn't terrible, but I know (with FTC's implementation of neoscrypt at least) that it is possible to get a higher hash rate out of the 270X, in the range of 130-140 KH/s or so from what I recall. I have 7850s and 7870s, which are very similar to the 270X, so maybe try these settings to get more hashes out of them: gpu-threads 2, worksize 64, lookup-gap 2, thread-concurrency 8192, xintensity 3. Also, you don't need nfactor for neoscrypt, so take that out.
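
The suggestions above can be applied to a pool entry mechanically. This is only a minimal Python sketch, assuming the pool entry is held as a plain dict shaped like the config snippets in this thread. `apply_neoscrypt_suggestions` is a hypothetical helper (sgminer has no such function), and the values are just the ones suggested in the post for 7850/7870-class cards, not verified settings.

```python
import json

# Values suggested in the post above (untested here, hardware dependent).
SUGGESTED = {
    "gpu-threads": "2",
    "worksize": "64",
    "lookup-gap": "2",
    "thread-concurrency": "8192",
    "xintensity": "3",
}

def apply_neoscrypt_suggestions(pool):
    """Return a copy of a pool entry with the suggested NeoScrypt settings.

    Hypothetical helper for illustration only; not part of sgminer.
    """
    tuned = dict(pool)
    if tuned.get("algorithm") != "neoscrypt":
        return tuned                # leave other algorithms untouched
    tuned.pop("nfactor", None)      # per the post: not needed for neoscrypt
    tuned.update(SUGGESTED)
    return tuned

# Pool entry as posted earlier in the thread.
pool = {
    "name": "WestHash_NeoScrypt",
    "url": "stratum+tcp://stratum.westhash.com:4341",
    "user": "X",
    "pass": "X",
    "algorithm": "neoscrypt",
    "nfactor": "10",
    "worksize": "96",
    "gpu-threads": "1",
    "xintensity": "9",
    "thread-concurrency": "8192",
}

print(json.dumps(apply_neoscrypt_suggestions(pool), indent=4))
```

The helper returns a copy, so the original entry stays available for comparing hashrates before and after.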

I don't know why GPU-threads was set to 1... I just switched to 2 and went to 130Kh/s which made lower xintensity values usable. Before that I couldn't break 80Kh/s with xintensity below 7. It seems also that xintensity 2 gets me 2 KH/s more than xintensity 3. Worksize 96 still beats any multiple of 32 or 24 by at least 15KH/s.

So here's what I got.

{
                "name" : "WestHash_NeoScrypt",
                "url" : "stratum+tcp://stratum.westhash.com:4341",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "neoscrypt",
                "worksize" : "96",
                "gpu-threads" : "2",
                "xintensity" : "2",
                "thread-concurrency" : "8192"
        },

And the grand total is... 142KH/s! Thanks a bunch insanid!

Wolf0, if I were to drop the driver down to 13.12 wouldn't most other algos suffer? I thought X13 and X11 do much better on newer drivers?
full member
Activity: 322
Merit: 100
newbie
Activity: 4
Merit: 0

full member
Activity: 322
Merit: 100
member
Activity: 81
Merit: 1002
It was only the wind.
newbie
Activity: 4
Merit: 0
Hey guys, maybe you can help me out here. I've been struggling to get NeoScrypt to work correctly for a few weeks. I've got 4x270x's set at 1150MHz Core and 1250MHz Memory (tested and 100% stable) and for the life of me I cannot get a single accepted share using Neoscrypt. It says it's hashing away giving me around 79Kh/s per card. But no shares are being accepted. I've experimented with different intensities, worksizes and even nfactor. I should also note that NeoScrypt is giving me roughly one to three HW errors per card every minute.

Scrypt-N is giving me a ton of HW errors as well, like 10 per second though I still get around 420KH/s total.  I'm using sgminer 5.1 dev 2014-11-13 from westhash. Also, please let me know if I've got any "outdated" settings in here, I haven't been messing with the other algos for a few months.

Quote
{
"pools" : [
        {
                "name" : "WestHash_Scrypt",
                "url" : "stratum+tcp://stratum.westhash.com:4333",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "zuikkis",
                "nfactor" : "10",
                "xintensity" : "4",
                "thread-concurrency" : "8192"
        },
        {
                "name" : "WestHash_Scrypt-N",
                "url" : "stratum+tcp://stratum.westhash.com:4335",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "zuikkis",
                "nfactor" : "11",
                "xintensity" : "4",
                "thread-concurrency" : "5152"
        },
        {
                "name" : "WestHash_X11",
                "url" : "stratum+tcp://stratum.westhash.com:4336",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "darkcoin-mod",
                "nfactor" : "10",
                "xintensity" : "300",
                "thread-concurrency" : "8192"
        },
        {
                "name" : "WestHash_X13",
                "url" : "stratum+tcp://stratum.westhash.com:4337",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "marucoin-mod",
                "nfactor" : "10",
                "xintensity" : "300",
                "thread-concurrency" : "8192"
        },
        {
                "name" : "WestHash_Keccak",
                "url" : "stratum+tcp://stratum.westhash.com:4338",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "maxcoin",
                "nfactor" : "10",
                "xintensity" : "300",
                "thread-concurrency" : "8192"
        },
        {
                "name" : "WestHash_X15",
                "url" : "stratum+tcp://stratum.westhash.com:4339",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "bitblock",
                "nfactor" : "10",
                "intensity" : "19",
                "thread-concurrency" : "8192",
                "worksize" : "128"
        },
        {
                "name" : "WestHash_NIST5",
                "url" : "stratum+tcp://stratum.westhash.com:4340",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "talkcoin-mod",
                "nfactor" : "10",
                "xintensity" : "300",
                "thread-concurrency" : "8192"
        },
        {
                "name" : "WestHash_NeoScrypt",
                "url" : "stratum+tcp://stratum.westhash.com:4341",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "neoscrypt",
                "worksize" : "48",
                "gpu-threads" : "1",
                "intensity" : "13",
                "thread-concurrency" : "8192"
        },
        {
                "name" : "WestHash_X13_Backup",
                "url" : "stratum+tcp://stratum.westhash.com:3337",
                "user" : "X",
                "pass" : "X",
                "algorithm" : "marucoin-mod",
                "nfactor" : "10",
                "xintensity" : "300",
                "thread-concurrency" : "8192"
        }
],
"worksize" : "128",
"vectors" : "1",
"lookup-gap" : "2",
"shaders" : "1280",
"thread-concurrency" : "8192",
"gpu-engine" : "1150",
"gpu-memclock" : "1250",
"gpu-fan" : "30-100",
"gpu-powertune" : "50",
"gpu-platform" : "0",
"temp-cutoff" : "95",
"temp-overheat" : "90",
"temp-target" : "65",
"gpu-threads" : "2",
"auto-fan" : true,
"failover-only" : true,
"failover-switch-delay" : "30",
"no-pool-disable" : true,
"no-submit-stale" : true,
"queue" : "0",
"scan-time" : "7",
"expiry" : "28",
"api-listen" : true,
"api-mcast-port" : "4028",
"api-port" : "4001",
"api-allow" : "W:127.0.0.1"
}
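
One thing worth checking in a config like the one above is which value actually applies to each pool. The following is only a sketch: it assumes that a setting given inside a pool entry takes precedence over the global one of the same name (an assumption about sgminer's behaviour, not something confirmed in this thread), and `effective_settings` is a hypothetical helper. The JSON is a trimmed copy of the config above.

```python
import json

# Trimmed copy of the posted config: two pools plus a few global defaults.
CONFIG_TEXT = """
{
"pools" : [
        { "name" : "WestHash_X15", "algorithm" : "bitblock",
          "intensity" : "19", "worksize" : "128" },
        { "name" : "WestHash_NeoScrypt", "algorithm" : "neoscrypt",
          "worksize" : "48", "gpu-threads" : "1", "intensity" : "13" }
],
"worksize" : "128",
"lookup-gap" : "2",
"gpu-threads" : "2"
}
"""

KEYS = ("worksize", "gpu-threads", "lookup-gap", "intensity", "xintensity")

def effective_settings(config):
    """Map pool name -> settings, preferring the pool's own value and
    falling back to the global value (assumed precedence)."""
    result = {}
    for pool in config["pools"]:
        merged = {}
        for key in KEYS:
            if key in pool:
                merged[key] = pool[key]
            elif key in config:
                merged[key] = config[key]
        result[pool["name"]] = merged
    return result

config = json.loads(CONFIG_TEXT)
for name, settings in effective_settings(config).items():
    print(name, settings)
```

Under that assumption the NeoScrypt pool runs with gpu-threads 1 even though the global value is 2, which is the kind of mismatch that is easy to miss in a long file.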
full member
Activity: 219
Merit: 100
May I chime in?
please do.

At the risk of being considered a nazi, I have to point out there is specific terminology for those: instruction cache (often "I$") and data cache.

I fail to see your point here, so I pass.

Correct terminology is work item. Nothing in any GPU architecture ever looked like a CPU thread. That's just oversimplification for marketing. Also note "core" isn't the same thing either.

Cool. We all know that. I used sgminer terminology as I was discussing with someone proficient with that code. Perhaps I could suggest that you go over the sgminer codebase and make all the appropriate changes there. Most will be grateful, and I will buy you lunch if you choose to do so.

Run a program called CodeXL. You will see most NDRange calls are fully dispatched even before the kernel starts executing (at least, that's what happens for me).

Huh? You must realize that the dispatch of an NDRange call has nothing to do with the actual execution of a kernel in a queue. I assume my writing wasn't clear enough, so let me clarify: I was writing about kernels executing, not NDRange calls.

Please, not this out-of-order queue nonsense again! The algo is sequential, it will need a sequential queue and you also have understood the whole point so I guess I'll make this clear for all the people out there who believe GPUs exist for hashing: out of order queues for sequential algos are useless and possibly make the things worse!

Ok. My idea may be a dead end, I can live with that.

They are designed to be efficient in ASIC hardware or FPGA at most. The two problems here are:
  • massive I$ overload, because the AMD compiler is too dumb to not unroll stuff (as a side note: HLSL/GLSL compilers usually unroll much more smartly, I currently suspect HLSL compiler might be building a whole tree of possibilities).

True. Did you have any success using the HLSL compiler for this purpose? If that's the way to go, let's try.

  • registers must be shuffled across Work Items so most values cannot really be in private memory which brings us to the magic world of LDS layout.
  • register pressure: how soon you need the result. To my own surprise it seems GCN 1.0 and 1.1 still cannot dispatch dependent instructions one after another

Huh, this sounds like a description of a RISC architecture :) I guess again this is a compiler issue, as the compiler is supposed to hide this complexity...
hero member
Activity: 672
Merit: 500
May I chime in?
The way I see it, X(n) algos are much less suitable for GPU mining than Scrypt(n) because the composition of kernels produces large code which the GPU cannot execute efficiently, mostly due to limited cache size and cache algorithms.
At the risk of being considered a nazi, I have to point out there is specific terminology for those: instruction cache (often "I$") and data cache.
Anyway, I realize I need to spend some time with CodeXL to gain some insight... What beats me is that all hash algos in X11 are designed to have an efficient implementation in hardware. So they should be small in code and consume little memory. This kind of thing should be possible to implement directly in thread registers, or in CL terms "private memory".
They are designed to be efficient in ASIC hardware or FPGA at most. The two problems here are:
  • massive I$ overload, because the AMD compiler is too dumb to not unroll stuff (as a side note: HLSL/GLSL compilers usually unroll much more smartly, I currently suspect HLSL compiler might be building a whole tree of possibilities).
  • registers must be shuffled across Work Items so most values cannot really be in private memory which brings us to the magic world of LDS layout.
  • register pressure: how soon you need the result. To my own surprise it seems GCN 1.0 and 1.1 still cannot dispatch dependent instructions one after another
full member
Activity: 219
Merit: 100
#2 is completely impossible to satisfy - you realize the SIMD algo by itself is so big, it has to be split over 8 threads or it'll cause register spillage? Imagine trying to fit all 11 hashes into one kernel and make the compiler NOT spill registers into memory. Would you rather do small, quick accesses, or get often used variables like the hash state dumped to global - forcing you to stall all the time?

Why don't I put the hashes variable in local memory? Because local memory is good for things that threads within a workgroup need to cooperate on, as well as often used constants. There would be no point putting the hashes in local memory, because first, I cannot set it from outside the kernel. Therefore, I would be loading the hashes from global to local, and then to registers? And then I would never need to access global again, so putting it in local serves no purpose.

I have another idea:
1. you don't store results from different kernels to the same memory area, but to different area for each algo
2. you can schedule all of them for execution in parallel (well as you currently do in "opencl_scanhash" function except that clCreateCommandQueue should specify "CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE")
3. kernels should be modified to wait for an event to trigger them. This event would be a signal from the previous kernel that it had finished its work. I'm not sure if this can be done via clEvents as they seem to work on the entire queue rather than on individual threads.

This would allow kernels not to have to wait for all of them to finish (no end barrier), as the signal to trigger the next one would be explicit... Reading the OpenCL documentation, I'm not sure this is possible...

Your point 1 is wasteful on memory - why store useless information? The output of blake becomes absolutely useless to me once BMW has finished. You cannot schedule them all for execution in parallel - not out of order! Look at X11, the output of the previous hash is the input to the next - therefore, BMW obviously depends on Blake's earlier result.

My understanding regarding barriers is that if one thread hits one, then all of them have to in order for the kernel to continue execution. So it is a sync mechanism (within a workgroup anyway). I agree that putting "hashes" in local mem would require one small kernel at the beginning and one at the end to copy local data into global mem, and this would make sense if concurrent access to global mem from all threads is indeed a bottleneck. You say it's not; I just mention it for the sake of completeness.

I completely forgot about gpu-threads; you're right that it keeps the GPU busy. This is equivalent to having 2 command queues that schedule kernels in parallel between them and in sequential order within one gpu-thread.

BTW why is there a barrier in each kernel anyway? Since the command queue currently does not allow parallel execution of different kernels, surely the writes of blake should complete before the next kernel, bmw, starts work...

My idea was that we could have some kind of, say, per-thread counter/semaphore associated with "hashes" that would allow a kernel function searchN to start its work only when this counter/semaphore had the value N. This counter would be set to N+1 at the end of each searchN function. I see now this may not work, as it maps poorly to what the GPU was designed for.

Anyway, I realize I need to spend some time with CodeXL to gain some insight... What beats me is that all hash algos in X11 are designed to have an efficient implementation in hardware. So they should be small in code and consume little memory. This kind of thing should be possible to implement directly in thread registers, or in CL terms "private memory".
full member
Activity: 219
Merit: 100
If you wanted to take advantage of it, you could do 11 hashes at once in a parallel fashion. Dunno if it'd be much faster, though.

No, that's not what I had in mind. Let me explain in detail for the benefit of others in the forum (and to verify my understanding):

The way I see it, X(n) algos are much less suitable for GPU mining than Scrypt(n) because the composition of kernels produces large code which the GPU cannot execute efficiently, mostly due to limited cache size and cache algorithms. So a plain vanilla approach to programming a kernel will not be efficient. An example of this is the original "darkcoin.cl".

Your approach (the so-called "darkcoin-mod.cl") was different in that you do not schedule the whole of X11 to be executed in one single go; instead you split X11 into 11 individual components (the hash algos that comprise X11) which execute one after the other, thus resolving the cache issues, as each of the 11 components is small enough to fit into the cache. But now you're scheduling execution of 11 kernels (via 11 "clEnqueueNDRangeKernel" calls).

So each call to the sgminer "opencl_scanhash" function, where all the fun is, starts with the job setup, and GPU memory contains the block data which needs to be hashed.

Then the "blake" hash is executed over that data in all threads. I noticed you use an xIntensity of 64, so it would execute 64*2048 "blake" global threads on, say, an R9 280X card. The results of all these threads are stored in global memory. Since all instructions are executed more or less in lock-step (lock-step within a compute unit and possibly out of sync between compute units), there is always a lot of contention for memory access at specific times. Finally the GPU waits for all threads to finish via the "barrier" command.

Then the GPU pauses a bit and waits for sgminer to enqueue another kernel, which in our case is "bmw". Global memory now contains the "blake"-hashed block data.

It then proceeds to execute all the threads for the "bmw" hash function. First it loads the "blake"-hashed block data from memory into registers (or wherever the compiler decides the variables should be stored), does its thing and saves the result (the bmw-hashed, blake-hashed block data) into global memory again.

The whole process repeats for all the kernels (11 of them), and the final hash function checks whether the resulting hash of hash of hash ... of the block data is below target. If so, it returns the result, which sgminer reads and checks (calculating it again, only using the CPU this time) and, if all is OK, submits to the pool.
(As we have very few results (hashes below target) per run this last CPU part is not an issue.)
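
The data flow described above, where each stage consumes the previous stage's output and only the final digest is compared against the target, can be sketched on the CPU in a few lines of Python. This is an illustration of the chaining only: the three hashlib functions are stand-ins (hashlib does not provide Blake, BMW and the other X11 functions), and the header bytes, nonce encoding, byte order and target are made-up assumptions.

```python
import hashlib

# Stand-ins only: three stdlib hashes play the role of the X11 stages.
STAGES = [hashlib.sha256, hashlib.sha512, hashlib.sha3_256]

def chained_hash(block_data: bytes) -> bytes:
    """Run the stages in order; each stage hashes the previous stage's output."""
    state = block_data
    for stage in STAGES:
        state = stage(state).digest()
    return state

def meets_target(digest: bytes, target: int) -> bool:
    """Final check: read the digest as an integer and compare with the target.
    Little-endian byte order is an assumption made for this sketch."""
    return int.from_bytes(digest, "little") < target

header = b"example block header"   # placeholder, not real block data
target = 1 << 250                  # deliberately easy target for the demo

found = []
for nonce in range(1000):
    digest = chained_hash(header + nonce.to_bytes(4, "little"))
    if meets_target(digest, target):
        found.append(nonce)

print(len(found), "candidate nonces out of 1000")
```

Because stage N cannot start for a given nonce until stage N-1 has produced its output, the stages form a strict dependency chain per nonce; the parallelism comes from running many nonces at once, which is what the global threads do on the GPU.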

So there are two major points of inefficiency in this approach if we can call it that as this "darkcoin-mod" approach is 40% faster than the original darkcoin.cl:

#1 GPU is not doing much between kernel schedules. This effect is even more problematic as all kernels end with a "barrier" statement thus waiting for all of threads to complete.
#2 kernels are spending time getting and storing data into global memory. Ideally this data should remain in GPU registers but I'm not sure this is possible with different kernels approach.

You say #2 is a non-issue, and possibly I could agree with that. Although one could argue that, as all the threads wait for memory at the beginning, you cannot take advantage of "waves in flight", since all waves will hit the same obstacle at the beginning of the kernel.
Question for you: why didn't you set "hashes" variable to reside in the local memory?

My thinking is that #1 is a real pain.

One way around it would be to schedule all 11 kernels in parallel on different compute units, some kernels would need more compute units than others due to their relative speed difference as you suggested.

I have another idea:
1. you don't store results from different kernels to the same memory area, but to different area for each algo
2. you can schedule all of them for execution in parallel (well as you currently do in "opencl_scanhash" function except that clCreateCommandQueue should specify "CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE")
3. kernels should be modified to wait for an event to trigger them. This event would be a signal from the previous kernel that it had finished its work. I'm not sure if this can be done via clEvents as they seem to work on the entire queue rather than on individual threads.

This would allow kernels not to have to wait for all of them to finish (no end barrier), as the signal to trigger the next one would be explicit... Reading the OpenCL documentation, I'm not sure this is possible...


member
Activity: 81
Merit: 1002
It was only the wind.
Wolf0 said that sgminer will not solo and that you would need to use cgminer. I do not know if Wolf0's kernel works with cgminer; that is something I want to know.

YES!!!  I already said it could be done..



Thank you; it seems Hippie Tech has a bone to pick.
full member
Activity: 169
Merit: 100
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J

Here are some stats, you do the math:

Mithra, running 2x270X gets 3.33MH/s & 3.45MH/s X11 for a total of 6.78MH/s, clocked at 875/1400 & 900/1500, both undervolted to 950mV, and she pulls 165W at the wall.
Screenshot (NSFW): https://ottrbutt.com/miner/x11localrigwolf-lowpower-11222014.png

I'm running all my cards at 1250 MHz memory; in some instances I get better hashrates, but power consumption is always lower. I guess it's lower memory latency at those clocks...

I noticed lowering it can help, too.

When mining scrypt with 290(X) reference cards there were sweet spots where the hashrate was maximized. Before Stilt's BIOS, the reference cards could only mine well with 1250 or 1500 memory and you just adjusted the GPU clock to the sweet spot; I think it was usually a 1.4:1 ratio. After flashing Stilt's BIOS you could change the memory in steps of 25 MHz and adjust the GPU clock again. Maybe there are sweet spots for X11-type coins. I haven't had time to test much so I have no idea.
sr. member
Activity: 419
Merit: 250
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J

Here are some stats, you do the math:

Mithra, running 2x270X gets 3.33MH/s & 3.45MH/s X11 for a total of 6.78MH/s, clocked at 875/1400 & 900/1500, both undervolted to 950mV, and she pulls 165W at the wall.
Screenshot (NSFW): https://ottrbutt.com/miner/x11localrigwolf-lowpower-11222014.png

I'm running all my cards at 1250 MHz memory; in some instances I get better hashrates, but power consumption is always lower. I guess it's lower memory latency at those clocks...
full member
Activity: 219
Merit: 100
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J

Here are some stats, you do the math:

Mithra, running 2x270X gets 3.33MH/s & 3.45MH/s X11 for a total of 6.78MH/s, clocked at 875/1400 & 900/1500, both undervolted to 950mV, and she pulls 165W at the wall.
Screenshot (NSFW): https://ottrbutt.com/miner/x11localrigwolf-lowpower-11222014.png

OK, I assume 50W is your idle draw, so that leaves us with 3450 kH/s / 58 W ≈ 60 kH/J.
If the same ratios were to apply to the 290X, then the 290X should do Cores_2816/1280 * Freq_1018/900 * 3450 kH/s ≃ 8600 kH/s, and the pull from the wall should be 145W. For the X2 those numbers would be double.
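
For anyone who wants to redo that estimate with their own numbers, here is the same arithmetic as a small Python sketch. The 50W idle draw, the even split of the remaining power across the two cards, and linear scaling with shader count and core clock are the assumptions made in the post, not measurements; the function names are made up for this example.

```python
def efficiency_kh_per_joule(hashrate_khs, wall_watts, idle_watts, cards):
    """kH/J for one card, splitting the non-idle wall power evenly across cards."""
    per_card_watts = (wall_watts - idle_watts) / cards
    return hashrate_khs / per_card_watts

def scale_hashrate(base_khs, base_cores, base_mhz, cores, mhz):
    """Scale a hashrate linearly with shader count and core clock (assumed)."""
    return base_khs * (cores / base_cores) * (mhz / base_mhz)

# 270X at 900 MHz doing 3450 kH/s; the rig pulls 165 W at the wall with 2 cards.
eff = efficiency_kh_per_joule(3450, wall_watts=165, idle_watts=50, cards=2)

# Projected 290X: 2816 shaders at 1018 MHz versus 1280 shaders at 900 MHz.
est_khs = scale_hashrate(3450, base_cores=1280, base_mhz=900, cores=2816, mhz=1018)
est_watts = est_khs / eff   # power above idle at the same efficiency

print(f"{eff:.1f} kH/J, {est_khs:.0f} kH/s, {est_watts:.0f} W above idle")
```

This gives about 60 kH/J, roughly 8585 kH/s and roughly 143 W, in line with the rounded 8600 kH/s and 145W quoted above.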

This is phenomenal. Although I remember seeing your X2 hash rates to be even higher.
full member
Activity: 219
Merit: 100
No, you can't set local memory like that - it is local to a workgroup.

Dear fucking god no - you don't understand basic GPU architecture. Without getting too technical, they cannot STAND large chunks of work - you MUST break it down into small chunks that can be parallelized. As a matter of fact, it's not good enough yet - 8 threads should be used per SIMD hash instead of one, because SIMD is too goddamned big to fit in the code cache, and it spills EVERYTHING to global memory. You want to figure out why X11 relies on memory when it shouldn't, look at SIMD. The access to get the work is seriously nothing at all.

Yes that's what I asked: if "hashes" variable were a local, then only those threads that belong to the same workgroup (64 threads) would have access to this common local resource.

I get the code cache argument, but then this leads to a significantly different design, namely that different workgroups should run different kernels. We would have to tie this into the sgminer GPU scheduling code.

I'm not as proficient with opencl as I would like to be, but looking at scrypt kernel implementation or SHA256 for that matter (at least from a year ago) all seemed to be rather simple: GPU was treated as a flat SIMD resource with say 2048 threads (7970) and global memory, so we would just schedule 8192 simultaneous threads (-I 13) , or more, so that hopefully we use up all computing resources.
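
The thread counts mentioned in this discussion follow from two simple relations, sketched below in Python. Both are taken from the figures quoted in the thread (8192 threads for -I 13, and 64*2048 threads for xintensity 64 on a 2048-shader card) rather than from the sgminer source, so treat the formulas as assumptions; the function names are made up for this example.

```python
def threads_from_intensity(intensity: int) -> int:
    """Plain intensity: thread count is a power of two (assumed 2**I)."""
    return 1 << intensity

def threads_from_xintensity(xintensity: int, shaders: int) -> int:
    """xintensity: thread count is a multiple of the card's shader count."""
    return xintensity * shaders

print(threads_from_intensity(13))           # the -I 13 case above
print(threads_from_xintensity(64, 2048))    # xintensity 64 on a 2048-shader card
print(threads_from_xintensity(2, 1280))     # xintensity 2 on a 270X (1280 shaders)
```

The practical difference is that xintensity scales with the card's shader count, while plain intensity jumps in powers of two regardless of the card.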

The X11 kernel (the modded one) looks very different. The way I see it, I should somehow take advantage of the fact that work scheduled to different workgroups is in fact executed in MIMD fashion.

Anyway I truly enjoy these conversations and I thank you for it.
full member
Activity: 219
Merit: 100
Still X2 works like a charm. Currently it runs on stock clocks, 0.9V, 56C, about 150W/core and delivers 5.4MH/s/core....or: 35 kH/J. Are you saying you're doing 60+ kH/J
full member
Activity: 219
Merit: 100
Yes, I understand. BTW is there an easier way to pass a parameter to "searchX" function other than through global memory? Wouldn't local memory work as long as all threads are in the same local group/compute unit?

Still I wonder: wouldn't it be better for a single thread to compute the entire hash (all 11 functions) rather than having multiple threads evaluating different functions of X11, from a message/argument passing standpoint? (With the former approach you would not need any.)



member
Activity: 81
Merit: 1002
It was only the wind.
I have three 5870s and would like to mine FTC at TMB. Does anyone have neoscrypt running decently on the AMD 5870? I'm only getting 20kh/s per card, using badman's current mod miner.

Check the FTC forums - they have a cl for old cards.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Code that uses only half the damned GPU is horribly inefficient. The GPU draws power just because it's on - now, yes, unless you reduce instructions, you will get a higher power usage, but it is always worth it. At 50% hashrate improvement, I had 17% more power usage than stock.

Reduce memory usage, reduce power.
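
To put the quoted figures in efficiency terms: a 50% hashrate gain at 17% more power works out as below. A tiny Python sketch of the arithmetic, using only the two percentages from the post above; `efficiency_change` is a made-up name.

```python
def efficiency_change(hashrate_gain: float, power_gain: float) -> float:
    """Relative change in hashes per joule for given fractional gains."""
    return (1 + hashrate_gain) / (1 + power_gain) - 1

change = efficiency_change(0.50, 0.17)
print(f"{change:.1%} more hashes per joule")
```

That is 1.50 / 1.17, or roughly 28% more hashes per joule, so the faster kernel is also the more efficient one on those numbers.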
full member
Activity: 219
Merit: 100
Same as the others unfortunately, I guess only Wolf and few others enjoy better kernels at the moment.

I see. Well, I'll look into the kernel files in the next few days and will post the results if there are any :) ...
... I see where the main problem is: darkcoin_mod.cl consists of 11 functions, one for each algo in X11 (search0-10), all of which end up writing their results to global memory. This is slow, and on top of that all threads compete for writes to global memory.

Interestingly, the original darkcoin.cl does not do that and still it is slower. Perhaps other optimizations of darkcoin_mod make more of an impact.