Author

Topic: OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.0 - page 646. (Read 5805983 times)

-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Meh, it ended up being of no advantage for unnecessary complexity.
* ckolivas forgets all about goffset for now.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Btw, the performance of it is pretty average, after all that discussion...

Perhaps the changes needed to make it work ate the small benefits the solution offers ... but I had to LOL when I saw we came up with the same solution ^^. I posted and read your version after that and they look equal for VEC2 Cheesy.
Cheesy I'd say you're right. Oh well, always other things to try.
You know I could make cgminer "skip" nonce ranges when it's using goffset so that the code can work with less ops. This will drop efficiency though since it will decrease the amount of work a device gets before it needs new work.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Btw, the performance of it is pretty average, after all that discussion...

Perhaps the changes needed to make it work ate the small benefits the solution offers ... but I had to LOL when I saw we came up with the same solution ^^. I posted and read your version after that and they look equal for VEC2 Cheesy.
Cheesy I'd say you're right. Oh well, always other things to try.
hero member
Activity: 772
Merit: 500
Btw, the performance of it is pretty average, after all that discussion...

Perhaps the changes needed to make it work ate the small benefits the solution offers ... but I had to LOL when I saw we came up with the same solution ^^. I posted and read your version after that and they look equal for VEC2 Cheesy.

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Btw, the performance of it is pretty average, after all that discussion...
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Code:
#if defined VECTORS4
#ifdef GOFFSET
u nonce = (uint)get_global_id(0) + (u)(0, get_global_size(0), get_global_size(0) << 1, get_global_size(0) * 3);
#else
u nonce = ((uint)get_group_id(0) * (uint)get_local_size(0) << 2) + ((uint)get_local_id(0) << 2) + base;
#endif
#elif defined VECTORS2
#ifdef GOFFSET
u nonce = (uint)get_global_id(0) + (u)(0, get_global_size(0));
#else
u nonce = ((uint)get_group_id(0) * (uint)get_local_size(0) << 1) + ((uint)get_local_id(0) << 1) + base;
#endif
#else
should do it

and cgminer already takes vectors into account when increasing nonce value to pass to base on the next pass. This doesn't change it. cgminer effectively sends twice as much work when vectors go from 1 to 2 so the intensity is effectively different at different vectors.
hero member
Activity: 772
Merit: 500
Quote from: ckolivas
Thanks. My issue with your code being:
Code:
u nonce = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
as I said is it really won't be testing the nonce range we're asking it to test. If "base" is 2^31 and worksize is 2^24 (intensity 9), then get_global_id(0) will return 2^31 for the very first thread. Then if we shift it << 2 it's going to be undefined and in most implementations will just be zero again. Which means we'll be repeating 2^24 operations on nonces 0 - 2^24, which we would have done initially on getting that work item.

get_global_id(0) for the very first thread is simply base, if passed as global_work_offset parameter. So range is from "base" till "base + (2^24 - 1)". If base is 2^31 and we shift left by 2 for Vec4, you are right and we are undefined here. Code is easy for no vectors, but wrong for vectors in it's current form.

Edit: Would that work for Vec2 (base: 10 / global-worksize: 4)?
Code:
u nonce = (uint)get_global_id(0) + (u)(0, (uint)get_global_size(0));

base 10 the nonces would be: 10, 14, 11, 15, 12, 16, 13, 17

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset  + global_worksize

You're doubling the global id, not the worksize.


Global ID without global offset is: 0 till (global worksize - 1) queried in the kernel via get_global_size(0). So for 2-component vectors we need a global worksize / 2 (as argument in the enqueue kernel call), because there are 2 nonces processed in each work-item.

Global ID with global offset is: global offset till (global worksize + global offset - 1).

Dia
I still think it should be something like:
      u nonce = ((uint)get_global_id(0) + get_global_size(0) * (0,1,2,3));
for vectors4 etc.


Thinking loud again:

get_global_id(0) == ranges from global_offset for the 1st work-item till (global_offset + (global_worksize - 1)) for the last work-item
get_global_size(0) == global_worksize (constant value)

global_offset == nonce-base, that results in:

nonce.x = nonce-base + global_worksize * 0;
nonce.y = nonce-base + global_worksize * 1;
nonce.z = nonce-base + global_worksize * 2;
nonce.w = nonce-base + global_worksize * 3;

Let's consider 10 as nonce-base and 4 as global_worksize. This leads to the following nonces that get checked during 1 kernel execution:

Work-Item 0:
10 + 4 * 0 = 10
10 + 4 * 1 = 14
10 + 4 * 2 = 18
10 + 4 * 3 = 22

Work-Item 1:
11 + 4 * 0 = 11
11 + 4 * 1 = 15
11 + 4 * 2 = 19
11 + 4 * 3 = 23

Work-Item 2:
12 + 4 * 0 = 12
12 + 4 * 1 = 16
12 + 4 * 2 = 20
12 + 4 * 3 = 24

Work-Item 0:
13 + 4 * 0 = 13
13 + 4 * 1 = 17
13 + 4 * 2 = 21
13 + 4 * 3 = 25

So we have nonces from 10 to 25

Now if we divide the passed global worksize by 4 (because of 4-component vector usage in your example) and use 1 for it this leads to:

Work-Item 0:
10 + 1 * 0 = 10
10 + 1 * 1 = 11
10 + 1 * 2 = 12
10 + 1 * 3 = 13

So I guess your code works, if you divide the global worksize by the vec-size before passing that argument to clEnueueNDRangeKernel.

Thanks. My issue with your code being:
Code:
u nonce = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
as I said is it really won't be testing the nonce range we're asking it to test. If "base" is 2^31 and worksize is 2^24 (intensity 9), then get_global_id(0) will return 2^31 for the very first thread. Then if we shift it << 2 it's going to be undefined and in most implementations will just be zero again. Which means we'll be repeating 2^24 operations on nonces 0 - 2^24, which we would have done initially on getting that work item.
hero member
Activity: 772
Merit: 500
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset  + global_worksize

You're doubling the global id, not the worksize.


Global ID without global offset is: 0 till (global worksize - 1) queried in the kernel via get_global_size(0). So for 2-component vectors we need a global worksize / 2 (as argument in the enqueue kernel call), because there are 2 nonces processed in each work-item.

Global ID with global offset is: global offset till (global worksize + global offset - 1).

Dia
I still think it should be something like:
      u nonce = ((uint)get_global_id(0) + get_global_size(0) * (0,1,2,3));
for vectors4 etc.


Thinking loud again:

get_global_id(0) == ranges from global_offset for the 1st work-item till (global_offset + (global_worksize - 1)) for the last work-item
get_global_size(0) == global_worksize (constant value)

global_offset == nonce-base, that results in:

nonce.x = nonce-base + global_worksize * 0;
nonce.y = nonce-base + global_worksize * 1;
nonce.z = nonce-base + global_worksize * 2;
nonce.w = nonce-base + global_worksize * 3;

Let's consider 10 as nonce-base and 4 as global_worksize. This leads to the following nonces that get checked during 1 kernel execution:

Work-Item 0:
10 + 4 * 0 = 10
10 + 4 * 1 = 14
10 + 4 * 2 = 18
10 + 4 * 3 = 22

Work-Item 1:
11 + 4 * 0 = 11
11 + 4 * 1 = 15
11 + 4 * 2 = 19
11 + 4 * 3 = 23

Work-Item 2:
12 + 4 * 0 = 12
12 + 4 * 1 = 16
12 + 4 * 2 = 20
12 + 4 * 3 = 24

Work-Item 0:
13 + 4 * 0 = 13
13 + 4 * 1 = 17
13 + 4 * 2 = 21
13 + 4 * 3 = 25

So we have nonces from 10 to 25

Now if we divide the passed global worksize by 4 (because of 4-component vector usage in your example) and use 1 for it this leads to:

Work-Item 0:
10 + 1 * 0 = 10
10 + 1 * 1 = 11
10 + 1 * 2 = 12
10 + 1 * 3 = 13

So I guess your code works, if you divide the global worksize by the vec-size before passing that argument to clEnueueNDRangeKernel.

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
If you're trying to use vectors then there is a type mis-match either stick with putting a (u) in front or use (uint4) and the (0, 1, 2, 3) should be on the outside parenthesis.

Here is a float4 example...

float4 f = (float4)(1.0f, 2.0f, 3.0f, 4.0f);

Also in the _kernal void search if you keep (_global uint * output) then you're not really utilizing vectors correctly

And, sorry was just trying to provide some general feedback with Out of Order Execution, wasn't trying to offend you, I'm just not sure how to edit cgminer directly.
Thanks.

Are you saying the existing code is losing shares with __global uint * output? 99% of users on cgminer are currently using 2 vectors. Again I doubt that's the case.
newbie
Activity: 46
Merit: 0
If you're trying to use vectors then there is a type mis-match either stick with putting a (u) in front or use (uint4) and the (0, 1, 2, 3) should be on the outside parenthesis.

Here is a float4 example...

float4 f = (float4)(1.0f, 2.0f, 3.0f, 4.0f);

Also in the _kernal void search if you keep (_global uint * output) then you're not really utilizing vectors correctly

And, sorry was just trying to provide some general feedback with Out of Order Execution, wasn't trying to offend you, I'm just not sure how to edit cgminer directly.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset  + global_worksize

You're doubling the global id, not the worksize.


Global ID without global offset is: 0 till (global worksize - 1) queried in the kernel via get_global_size(0). So for 2-component vectors we need a global worksize / 2 (as argument in the enqueue kernel call), because there are 2 nonces processed in each work-item.

Global ID with global offset is: global offset till (global worksize + global offset - 1).

Dia
I still think it should be something like:
      u nonce = ((uint)get_global_id(0) + get_global_size(0) * (0,1,2,3));
for vectors4 etc.

Anyway I ended up trying it both ways with your nonce code or mine and neither led to any improvement (actually detriment if anything).
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
I thought the NDRangeKernel went like this...Edit: Oh and Dia, in your CommandQueue try adding this property, "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

Code:
clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu],
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);

cgminer has used this for a very long time.

As I wrote, I think OoE mode is not supported on AMD GPUs ... is there a debug or vebose message, if that mode was successfully activated?
It is successfully activated on windows and linux, but osx fails. It does not improve throughput with current GPUs but is harmless to enable for if/when they do.

I saw a significant increase in average nonces being found and a 3 Mhash/sec higher throughput.


Modified from Dia's code I used the following...

self.commandQueue = cl.CommandQueue(self.context, self.device, cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

Then I saw that read buffer was set to this > cl.enqueue_read_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=True)

Since OoE Mode will NOT work if is_blocking=True I set them all to false, and re-enabled self.commandQueue.finish()

Similarly I changed the write buffer

cl.enqueue_write_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=False)

on the cl.output_buf I changed mem flags to cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR | cl.mem_flags.ALLOC_HOST_PTR

For Async to work the 11.11 AMD drivers tell you to add environmental variable to your system. GPU_ASYNC_MEM_COPY=2

Again, this might only be a 69xx feature, but for my 6970 I turn off BFI_INT and GOFFSET and increased my Memory speed and VECTORS8 was running at over 446 MHash/s. Now it'll find between 5-14 nonces per minute without choking up or freezing system. Before it was struggling to find 5 nonces per minute if at all.

Next, I want to add the Async functions

event_t async_work_group_copy (__local T *dst, const __global T *src,
size_t num_gentypes, event_t event)

event_t async_work_group_copy (__global T *dst, const __local T *src,
size_t num_gentypes, event_t event)

One is for global and other is for local work groups

Then create a prefetch for global cache

void prefetch (const __global T *p, size_t num_gentypes)

Again, you're not remotely talking about cgminer:

Code:
	status = clEnqueueReadBuffer(clState->commandQueue, clState->outputBuffer, CL_FALSE, 0,
BUFFERSIZE, thrdata->res, 0, NULL, NULL);
Please... seriously... I could take your advice if you were talking about how it relates here, but the stuff you're saying is not set is not cgminer...

edit: That's not to say you have nothing useful to add, but the signal to noise ratio gets low when you're talking about other code first and foremost.
newbie
Activity: 46
Merit: 0
I thought the NDRangeKernel went like this...Edit: Oh and Dia, in your CommandQueue try adding this property, "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

Code:
clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu],
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);

cgminer has used this for a very long time.

As I wrote, I think OoE mode is not supported on AMD GPUs ... is there a debug or vebose message, if that mode was successfully activated?
It is successfully activated on windows and linux, but osx fails. It does not improve throughput with current GPUs but is harmless to enable for if/when they do.

I saw a significant increase in average nonces being found and a 3 Mhash/sec higher throughput.


Modified from Dia's code I used the following...

self.commandQueue = cl.CommandQueue(self.context, self.device, cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

Then I saw that read buffer was set to this > cl.enqueue_read_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=True)

Since OoE Mode will NOT work if is_blocking=True I set them all to false, and re-enabled self.commandQueue.finish()

Similarly I changed the write buffer

cl.enqueue_write_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=False)

on the cl.output_buf I changed mem flags to cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR | cl.mem_flags.ALLOC_HOST_PTR

For Async to work the 11.11 AMD drivers tell you to add environmental variable to your system. GPU_ASYNC_MEM_COPY=2

Again, this might only be a 69xx feature, but for my 6970 I turn off BFI_INT and GOFFSET and increased my Memory speed and VECTORS8 was running at over 446 MHash/s. Now it'll find between 5-14 nonces per minute without choking up or freezing system. Before it was struggling to find 5 nonces per minute if at all.

Next, I want to add the Async functions

event_t async_work_group_copy (__local T *dst, const __global T *src,
size_t num_gentypes, event_t event)

event_t async_work_group_copy (__global T *dst, const __local T *src,
size_t num_gentypes, event_t event)

One is for global and other is for local work groups

Then create a prefetch for global cache

void prefetch (const __global T *p, size_t num_gentypes)
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset  + global_worksize

You're doubling the global id, not the worksize.


Global ID without global offset is: 0 till (global worksize - 1) queried in the kernel via get_global_size(0). So for 2-component vectors we need a global worksize / 2 (as argument in the enqueue kernel call), because there are 2 nonces processed in each work-item.

Global ID with global offset is: global offset till (global worksize + global offset - 1).

Dia
I still think it should be something like:
      u nonce = ((uint)get_global_id(0) + get_global_size(0) * (0,1,2,3));
for vectors4 etc.
hero member
Activity: 772
Merit: 500
Having said all of that it may just be the value I'm passing since it expects an array and I'm passing a single value...

Right, should be an array, because the ND-range can be 3-dimensional and we only use 1-dimension.

Dia
hero member
Activity: 772
Merit: 500
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset  + global_worksize

You're doubling the global id, not the worksize.


Global ID without global offset is: 0 till (global worksize - 1) queried in the kernel via get_global_size(0). So for 2-component vectors we need a global worksize / 2 (as argument in the enqueue kernel call), because there are 2 nonces processed in each work-item.

Global ID with global offset is: global offset till (global worksize + global offset - 1).

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Having said all of that it may just be the value I'm passing since it expects an array and I'm passing a single value...
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset  + global_worksize

You're doubling the global id, not the worksize.
hero member
Activity: 772
Merit: 500
Hey Con,

The global_work_offset value you pass to clEnqueueNDRangeKernel is used as the starting value for get_global_id(0). If global_work_offset would be 10 and we had 5 work-items, the nonces generated would be 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29 so we loose nonces from 10 to 19 and use ones, that should not be base ... would that create duplicate work? You pass -D GOFFSET to the kernel, right?
I suspect the problem is to do with overflow on 32 bit unsigned integers. Imagine a value close to 2^32. Since the values are doubled, and what happens on overflowing 32 bits is undefined, you may well get repeated ranges of nonces checked.

That would prove true for a base of (2147483647 - global worksize) because doubled it's over 2^32. How big are the nonce bases and what would be the global worksize for -I == 14 (as this is the maximum)?

Dia

global worksize = 2^(15 + intensity) so it's 29 max and it's double that for 2 vectors and so on...

cgminer always tries to test the entire range of nonces up to 2^32 so it will *always* get to a value above 2^31 where it will wrap with a doubling of the global id regardless of what intensity it's at.

For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?

Dia
Jump to: