Author

Topic: OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.0 - page 649. (Read 5805546 times)

-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Hey Con,

The global_work_offset value you pass to clEnqueueNDRangeKernel is used as the starting value for get_global_id(0). If global_work_offset were 10 and we had 5 work-items, the nonces generated would be 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, so we lose nonces 10 to 19 and use ones beyond the intended base ... would that create duplicate work? You pass -D GOFFSET to the kernel, right?
I suspect the problem is to do with overflow on 32 bit unsigned integers. Imagine a value close to 2^32. Since the values are doubled, and what happens on overflowing 32 bits is undefined, you may well get repeated ranges of nonces checked.
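To illustrate (a hypothetical example, not cgminer code): uint arithmetic in OpenCL C wraps modulo 2^32, so doubling a global id that already carries a large offset can land back inside a range another work-item is already covering.

Code:
/* Hypothetical illustration of the wrap: uint arithmetic is modulo 2^32. */
uint id_a = 0x00000001u;      /* small global id                                */
uint id_b = 0x80000001u;      /* global id that already includes a large offset */
uint nonce_a = id_a << 1;     /* 0x00000002                                     */
uint nonce_b = id_b << 1;     /* also 0x00000002 - the high bit is lost         */
/* Both work-items would end up hashing the same nonce range.                   */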
hero member
Activity: 769
Merit: 500
I thought the NDRangeKernel went like this...Edit: Oh and Dia, in your CommandQueue try adding this property, "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

Code:
clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu],
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);

cgminer has used this for a very long time.

As I wrote, I think OoE mode is not supported on AMD GPUs ... is there a debug or verbose message if that mode was successfully activated?
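One way to check, as a hypothetical sketch rather than anything cgminer currently does (assumes a valid device and queue, error handling omitted):

Code:
/* Check whether out-of-order execution is supported and actually active. */
cl_command_queue_properties supported = 0, active = 0;

clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES,
                sizeof(supported), &supported, NULL);
clGetCommandQueueInfo(queue, CL_QUEUE_PROPERTIES,
                      sizeof(active), &active, NULL);

if (supported & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)
    printf("device supports out-of-order queues\n");
if (active & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)
    printf("queue was created with out-of-order mode\n");

The spec also says clCreateCommandQueue sets its status parameter to CL_INVALID_QUEUE_PROPERTIES if the device does not support the requested property, so that would show up as an error at queue creation.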

Dia
hero member
Activity: 769
Merit: 500
I thought the NDRangeKernel went like this...

cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)

If you tell it the global_work_size and local_work_size, why would the global_work_offset be the starting value for get_global_id(0)? Or am I completely off the ball?

What runtime are you using? I'm using the newest 12.2 preview runtime with no problems after adding the OpenCL 1.2 preview drivers.

Edit: Oh and Dia, in your CommandQueue try adding this property, "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"

I think it would be beneficial to start incorporating out-of-order execution in the kernels, especially if you are moving to the NDRangeKernel. After adding this in and a couple of other changes, VECTORS8 is actually running faster than VECTORS4. Getting very high nonce output.

OoE is not supported on AMD GPUs afaik. But I'm sure Con has code to try to use it with the command queue, if available.

The driver that contains the problematic runtime is this one: http://support.amd.com/us/kbarticles/Pages/hd7700series7support.aspx

Global worksize is the number of global work-items processed in one kernel execution; local worksize is the work-group size, i.e. the number of work-items that execute together and share __local memory.
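As a hypothetical host-side sketch (not cgminer's actual enqueue code; queue, kernel, nonce_base, threads and worksize are placeholder names), the offset only shifts the ids, while the global and local sizes still control how many work-items run and how they are grouped:

Code:
/* Launch 'threads' work-items whose get_global_id(0) starts at nonce_base. */
size_t global_work_offset = nonce_base;  /* starting value of get_global_id(0) */
size_t global_work_size   = threads;     /* total work-items in this launch    */
size_t local_work_size    = worksize;    /* work-items per work-group          */

cl_int status = clEnqueueNDRangeKernel(queue, kernel,
                                       1,                    /* work_dim */
                                       &global_work_offset,
                                       &global_work_size,
                                       &local_work_size,
                                       0, NULL, NULL);
/* Inside the kernel, get_global_id(0) now ranges over
 * [nonce_base, nonce_base + threads - 1].                                   */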

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
I thought the NDRangeKernel went like this...Edit: Oh and Dia, in your CommandQueue try adding this property, "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

Code:
clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu],
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);

cgminer has used this for a very long time.
newbie
Activity: 46
Merit: 0
I thought the NDRangeKernel went like this...

cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)

If you tell it the global_work_size and local_work_size, why would the global_work_offset be the starting value for get_global_id(0)? Or am I completely off the ball?

What runtime are you using? I'm using the newest 12.2 preview runtime with no problems after adding the OpenCL 1.2 preview drivers.

Edit: Oh and Dia, in your CommandQueue try adding this property, "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"

I think it would be beneficial to start incorporating out-of-order execution in the kernels, especially if you are moving to the NDRangeKernel. After adding this in and a couple of other changes, VECTORS8 is actually running faster than VECTORS4. Getting very high nonce output.
hero member
Activity: 769
Merit: 500
Con, maybe we can talk about integrating global offset parameter support into CGMINER?

Take a short look at http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html and the global_work_offset parameter. All that has to be taken into consideration from the kernel-side is in DiaKGCN. OpenCL 1.1 detection is in your code, too, which is needed, but I can't do the other required changes without a compiler.

In short, when GOFFSET is enabled the nonce base is not supplied via the base parameter but via the global_work_offset parameter, and is read through the global work-item ID in the kernel. This saves a few instructions and can give us a small boost.

Dia
I tried writing code to send nonce as the global offset parameter and your code returned duplicate work on 2 or more vectors. Looking at your code:
Code:
u nonce = ((uint)get_global_id(0) << 1) + (u)(0, 1);
This won't be right, as the global id will now be the global thread id plus the global offset parameter, so doubling it will give random results.

Hey Con,

The global_work_offset value you pass to clEnqueueNDRangeKernel is used as the starting value for get_global_id(0). If global_work_offset were 10 and we had 5 work-items, the nonces generated would be 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, so we lose nonces 10 to 19 and use ones beyond the intended base ... would that create duplicate work? You pass -D GOFFSET to the kernel, right?

Edit: Oh, and I have to warn you: yesterday I tested a new AMD driver with a new OpenCL runtime, and I was unable to use CGMINER on Windows. Perhaps AMD "fixed" the binary generation, but that breaks your solution from a few days ago, because the old message that no .bin could be generated was back. I switched back to a former runtime :-(. Seems to suck!

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
None of that code makes much sense, but I think Dia is suggesting that he wants to use the natively built-in global_work_offset parameter instead of the one you reference by using clEnqueueNDRangeKernel
Umm... I actually do understand the code  Wink I was explaining what was wrong with it.
newbie
Activity: 46
Merit: 0
None of that code makes much sense, but I think Dia is suggesting that he wants to use the natively built-in global_work_offset parameter instead of the one you reference by using clEnqueueNDRangeKernel
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Con, maybe we can talk about integrating global offset parameter support into CGMINER?

Take a short look at http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html and the global_work_offset parameter. All that has to be taken into consideration from the kernel-side is in DiaKGCN. OpenCL 1.1 detection is in your code, too, which is needed, but I can't do the other required changes without a compiler.

In short, when GOFFSET is enabled the nonce base is not supplied via the base parameter but via the global_work_offset parameter, and is read through the global work-item ID in the kernel. This saves a few instructions and can give us a small boost.

Dia
I tried writing code to send nonce as the global offset parameter and your code returned duplicate work on 2 or more vectors. Looking at your code:
Code:
u nonce = ((uint)get_global_id(0) << 1) + (u)(0, 1);
This won't be right, as the global id will now be the global thread id plus the global offset parameter, so doubling it will give random results.
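As a rough sketch of the distinction being discussed (hypothetical and simplified to one nonce per work-item; the vectored variants would additionally need the doubling kept consistent with the offset on the host side):

Code:
/* Hypothetical kernel-side sketch, one nonce per work-item.
 * GOFFSET path: the host passes the nonce base as global_work_offset,
 * so get_global_id(0) already contains it.
 * Non-GOFFSET path: the base arrives as a kernel argument.            */
#ifdef GOFFSET
	uint nonce = (uint)get_global_id(0);
#else
	uint nonce = base + (uint)get_global_id(0);
#endif
/* With 2 vectors the kernel shifts the id left before adding (0, 1);
 * that shift would also double an offset baked into the id, which is
 * the duplicate-work problem described above.                         */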
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
cgminer sets clocks all back to default on exit... if it exits cleanly, and of course on windows it's a miracle when it does.

Sadly this is not the case.

windows machine, pair of 6950's, set to 850/1300 for normal operation, in cgminer they are set to 700-880/300... when cgminer exits it leaves the cards at 880/300.

even updated to 12.1 drivers, both 2.4 and 2.6 SDK.


Instead windows decides to blow your balls off and feed them back to you by collecting up the splatter, mincing it, putting it into a glass and forcing you to drink it through a straw via  your left nostril.
I'll try to code a workaround for this Windows fail next time by stopping mining, resetting device values and pausing for a bit before letting Windows crash cgminer when it tries to exit.
donator
Activity: 1218
Merit: 1079
Gerald Davis
Oh well, I guess I was wrong; I read that BAMT didn't work with cgminer. Maybe they just mean it's not part of the original download, which is fine. It didn't make sense to me either, but I figured it had something to do with some of the programming.

So I guess you just download BAMT and then download cgminer and you're all set?

What drivers and SDK does BAMT use?

BAMT already has cgminer installed and it is integrated into BAMT's other tools (mgpumon, web monitor, gpumon, etc.).

It is only cgminer 2.1.2 I believe, but you can install a newer copy if you like. The SDK is 2.4; I am not sure which driver. It doesn't have the 100% bug.

So it is: write BAMT to a flash drive, run the fixer to grab the latest updates, change 2 config files (1 for BAMT, 1 for cgminer) and you are mining. You can then take that flash drive, record the image, put that image on 8 flash drives, put them in 8 rigs, make a few changes via SSH, and power up a 20 GH/s farm in a few minutes.
hero member
Activity: 535
Merit: 500
Oh well, I guess I was wrong; I read that BAMT didn't work with cgminer. Maybe they just mean it's not part of the original download, which is fine. It didn't make sense to me either, but I figured it had something to do with some of the programming.

So I guess you just download BAMT and then download cgminer and you're all set?

What drivers and SDK does BAMT use?
donator
Activity: 798
Merit: 500
We need a new stripped down version of linuxcoin, or call it whatever, just for mining and maybe built around optimal settings/set-up for cgminer. Seems BAMT doesn't support or work with cgminer.

What am I missing here, cuz my 5970 BAMT rig has been mining with cgminer for a week now - and it took 3 mins to set up  Huh
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
cgminer sets clocks all back to default on exit... if it exits cleanly, and of course on windows it's a miracle when it does.

Sadly this is not the case.

windows machine, pair of 6950's, set to 850/1300 for normal operation, in cgminer they are set to 700-880/300... when cgminer exits it leaves the cards at 880/300.

even updated to 12.1 drivers, both 2.4 and 2.6 SDK.


Instead windows decides to blow your balls off and feed them back to you by collecting up the splatter, mincing it, putting it into a glass and forcing you to drink it through a straw via  your left nostril.
donator
Activity: 1218
Merit: 1079
Gerald Davis
cgminer sets clocks all back to default on exit... if it exits cleanly, and of course on windows it's a miracle when it does.

Sadly this is not the case.

windows machine, pair of 6950's, set to 850/1300 for normal operation, in cgminer they are set to 700-880/300... when cgminer exits it leaves the cards at 880/300.

even updated to 12.1 drivers, both 2.4 and 2.6 SDK.



Is it exiting or crashing?
sr. member
Activity: 467
Merit: 250
cgminer sets clocks all back to default on exit... if it exits cleanly, and of course on windows it's a miracle when it does.

Sadly this is not the case.

windows machine, pair of 6950's, set to 850/1300 for normal operation, in cgminer they are set to 700-880/300... when cgminer exits it leaves the cards at 880/300.

even updated to 12.1 drivers, both 2.4 and 2.6 SDK.

legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
hero member
Activity: 535
Merit: 500
I second a thread just for cgminer settings and what cards are being used.

I use I 9 and then just set memclock, fan and engine in my bash script for linuxcoin.

I'm still struggling to get my 5970's stable.

Fan 100, engine 820 and mem 410 seem best for me, but of course for wattage I'd love a lower memclock.

Should I be using I 8, is that my issue? Also, I am on cgminer 2.0.7 because there is no way in hell I can do anything in linuxcoin without screwing it up, and I hate the full Ubuntu OS; too difficult to work with IMO.

We need a new stripped down version of linuxcoin, or call it whatever, just for mining and maybe built around optimal settings/set-up for cgminer. Seems BAMT doesn't support or work with cgminer.

My farm is up to 8 rigs: 10 5970's, 8 5870's and 1 6990. I plan on building a water-cooled 4x5970 with a 1500 watt PSU. I have the stuff, but want to make sure I can get it set up as fast and stable as possible, as it will be in a semi-remote location.

I know I need to figure out how to use SSH, etc., but I'd also like to have the best SDK, drivers, etc. along with some other folks' settings for engine and memclock.

Can we start a cgminer optimization thread with some guides? I'll throw in a bounty of 10btc if it has a set up guide and an optimal settings database.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Okay, so I tested the fastest DiabloMiner has to offer on a 7970, and current cgminer is 1.5 MHash faster with defaults, so I'm pleased Smiley. I guess I should keep working on my kernel Wink
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Con, maybe we can talk about integrating global offset parameter support into CGMINER?

Take a short look at http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html and the global_work_offset parameter. All that has to be taken into consideration from the kernel-side is in DiaKGCN. OpenCL 1.1 detection is in your code, too, which is needed, but I can't do the other required changes without a compiler.

In short, when GOFFSET is enabled the nonce base is not supplied via the base parameter but via the global_work_offset parameter, and is read through the global work-item ID in the kernel. This saves a few instructions and can give us a small boost.
Yes I like that idea. But 8 and 16 vectors perform shithouse followed by appalling so it's not worth pursuing those.