further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 7.

miscreanity

legendary

Activity: 1316

Merit: 1005

Quote from: Diapolo on August 11, 2011, 10:42:23 AM

Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure ;) Remember to use VECTORS2 as the switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions, and it should no longer generate invalid shares if more than one positive nonce is found in a work-group!

If a few of you could compare accepted shares over a certain period of time (against older or other kernel versions), that would be pretty cool!

Dia

6950 @ 920/300; Linux 2.6.38, 11.6/2.4; 2x 5 min runs for each setting with Phoenix 1.50

AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS2

WORKSIZE=128
[374.89 Mhash/sec] [28 Accepted] [0 Rejected] [RPC (+LP)]
- Negligible difference from 2011-08-02 kernel.

WORKSIZE=256
[344.50 Mhash/sec] [25 Accepted] [0 Rejected] [RPC (+LP)]
- Significant drop of ~25-30 Mh/s from 08-02 kernel.

Diapolo

hero member

Activity: 772

Merit: 500

Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure ;) Remember to use VECTORS2 as the switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions, and it should no longer generate invalid shares if more than one positive nonce is found in a work-group!

If a few of you could compare accepted shares over a certain period of time (against older or other kernel versions), that would be pretty cool!

Dia

Phateus

newbie

Activity: 52

Merit: 0

Quote from: Diapolo on August 09, 2011, 02:49:07 PM

I tried to implement this, but the kernel crashes the display driver so hard that I get a bluescreen every time ... weird.

Code:

	// Round 124
	Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

	...

	// lo 16 Bits OR hi 16 Bits
	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	positive = (positive & 0x00FFU) | (positive & 0xFF00U);
	// lo 4 Bits OR hi 4 Bits
	positive = (positive & 0x0FU) | (positive & 0xF0U);
	// lo 2 Bits OR hi 2 Bits
	positive = (positive & 0x3U) | (positive & 0xCU);
	// lo 1 Bit NOR hi 1 Bit
	positive = ~((positive & 0x1U) | (positive & 0x2U));

	// nonce AND positive
	uint position = W_3.x & positive;
	// lo 16 Bits XOR hi 16 Bits
	position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	position = (position & 0x00FFU) | (position & 0xFF00U);
		
	output[position] = W_3.x;

Dia

You need to shift the bits at each stage.

For example, ORing the top bits into the bottom bits should be:

Code:

uint positive = (Vals[7].x & 0x0000FFFFU) | ((Vals[7].x & 0xFFFF0000U) >> 16);

or just:

Code:

uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);

because the shift already leaves the upper 16 bits as 0.

Without the shift, you just get the original Vals[7] value back.
If you want to do it that way, the code would be:

Code:

	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
	// lo 8 Bits OR hi 8 Bits
	positive = (positive & 0x00FFU) | (positive >> 8);
	// lo 4 Bits OR hi 4 Bits
	positive = (positive & 0x0FU) | (positive >> 4);
	// lo 2 Bits OR hi 2 Bits
	positive = (positive & 0x3U) | (positive >> 2);
	// lo 1 Bit NOR hi 1 Bit
	positive = ~((positive & 0x1U) | (positive >> 1));

However, similar to what I said earlier, the following code does the same thing:

Code:

	uint positive = 0xFFFFFFFFU + min(Vals[7].x, 1u);

If Vals[7].x == 0, then min(Vals[7].x, 1u) == 0; otherwise it equals 1:
0xFFFFFFFF + 0 = 0xFFFFFFFF
0xFFFFFFFF + 1 = 0 (unsigned wrap-around)

Oh yeah... you are getting blue screens because, without the shifts, your address would be an essentially random 32-bit number, and it was probably trying to access memory that your video card doesn't have.
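For anyone who wants to check the shifted fold and the min() shortcut above on the CPU first, here is a minimal sketch in plain C rather than OpenCL. The function names are made up for illustration, and the ternary only stands in for OpenCL's min(). One detail worth noting: in the last stage the `~` also sets bits 1..31, so the sketch masks the NOR result down to a single bit and then widens it, which is what makes it match the all-ones / zero mask from the min() version.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit word down to one bit by ORing halves, shifting the
 * high half down at every stage. Returns 1 if h == 0, else 0. */
uint32_t is_zero_fold(uint32_t h)
{
    uint32_t p = (h & 0x0000FFFFu) | (h >> 16); /* 32 -> 16 bits */
    p = (p & 0x00FFu) | (p >> 8);               /* 16 -> 8 bits  */
    p = (p & 0x0Fu) | (p >> 4);                 /*  8 -> 4 bits  */
    p = (p & 0x3u) | (p >> 2);                  /*  4 -> 2 bits  */
    /* NOR of the last two bits; "& 1" drops bits 1..31 that ~ turned on */
    return ~((p & 0x1u) | (p >> 1)) & 0x1u;
}

/* min()-based shortcut: all-ones mask when h == 0, zero otherwise. */
uint32_t zero_mask_min(uint32_t h)
{
    uint32_t m = h < 1u ? h : 1u;   /* stands in for OpenCL min(h, 1u) */
    return 0xFFFFFFFFu + m;         /* wraps around to 0 when m == 1 */
}

/* Widen the 1-bit flag into the same all-ones / zero mask. */
uint32_t zero_mask_fold(uint32_t h)
{
    return 0u - is_zero_fold(h);
}

/* Sanity check: both routes should give the same mask for a given h. */
void check_masks_agree(uint32_t h)
{
    assert(zero_mask_fold(h) == zero_mask_min(h));
}
```

Calling check_masks_agree() on a handful of values (0, 1, 0x00010000, 0x80000000) is a cheap way to convince yourself before touching the kernel.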

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: jedi95 on August 08, 2011, 09:56:28 PM

Quote from: bcforum on August 08, 2011, 09:50:54 PM

Quote from: jedi95 on August 08, 2011, 01:57:58 PM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel crashes the display driver so hard that I get a bluescreen every time ... weird.

Code:

	// Round 124
	Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

	...

	// lo 16 Bits OR hi 16 Bits
	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	positive = (positive & 0x00FFU) | (positive & 0xFF00U);
	// lo 4 Bits OR hi 4 Bits
	positive = (positive & 0x0FU) | (positive & 0xF0U);
	// lo 2 Bits OR hi 2 Bits
	positive = (positive & 0x3U) | (positive & 0xCU);
	// lo 1 Bit NOR hi 1 Bit
	positive = ~((positive & 0x1U) | (positive & 0x2U));

	// nonce AND positive
	uint position = W_3.x & positive;
	// lo 16 Bits XOR hi 16 Bits
	position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	position = (position & 0x00FFU) | (position & 0xFF00U);
		
	output[position] = W_3.x;

Dia

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: indio007 on August 09, 2011, 11:35:18 AM

Just out of curiosity, how many unique downloads of your modification have there been? If you know, of course.

The sum of all downloads is > 5500 (for all released versions).

Dia

indio007

full member

Activity: 224

Merit: 100

Just out of curiosity, how many unique downloads of your modification have there been? If you know, of course.

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: indio007 on August 09, 2011, 11:14:13 AM

Sent you half a bit to keep you motivated. :D

Keep up the good work Diapolo

Woohoo, I feel damn motivated ;)

... thanks mate!

Dia

indio007

full member

Activity: 224

Merit: 100

Sent you half a bit to keep you motivated. :D

Keep up the good work Diapolo

jedi95

full member

Activity: 219

Merit: 120

Quote from: bcforum on August 08, 2011, 09:50:54 PM

Quote from: jedi95 on August 08, 2011, 01:57:58 PM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

bcforum

full member

Activity: 140

Merit: 100

Quote from: jedi95 on August 08, 2011, 01:57:58 PM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

Phateus

newbie

Activity: 52

Merit: 0

Quote from: jedi95 on August 08, 2011, 01:57:58 PM

The steps:

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce, assuming there are a small number of valid nonces. Even in the worst case (more than 1 hash found in a single execution) at least 1 will be returned, and if 3 or fewer nonces are found per execution, all of them should be returned in most cases.

Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...
Can't you just replace all of the steps with:

Code:

Valid = 1 - min(H, 1u);
Nonce = W[3];
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

if you are trying to remove all control flow? Any invalid nonce will be written into OUTPUT[0] and the valid nonces will be distributed through the rest of the array according to their low bits.

I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...
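To make the indexing in the three-line replacement concrete, here is a rough CPU-side sketch in plain C. OUTPUT_MASK = 0xFF, the OUTPUT_LEN name and the function name are assumptions picked for the example, not values taken from the kernel; the ternary stands in for OpenCL's min().

```c
#include <assert.h>
#include <stdint.h>

#define OUTPUT_MASK 0xFFu                /* assumed: a power of two minus one */
#define OUTPUT_LEN  (OUTPUT_MASK + 2u)   /* slot 0 plus OUTPUT_MASK + 1 result slots */

/* One work-item's store: h is the final hash word, nonce the candidate.
 * output must have OUTPUT_LEN entries. Returns the slot that was written. */
uint32_t store_branchless(uint32_t *output, uint32_t h, uint32_t nonce)
{
    assert((OUTPUT_MASK & (OUTPUT_MASK + 1u)) == 0u);   /* mask must be 2^n - 1 */
    uint32_t valid = 1u - (h < 1u ? h : 1u);            /* 1 if h == 0, else 0 */
    uint32_t slot  = ((nonce & OUTPUT_MASK) + 1u) * valid;
    output[slot] = nonce;                               /* slot 0 collects invalid nonces */
    return slot;
}
```

The "+ 1" is what keeps valid nonces out of slot 0, so the buffer needs OUTPUT_MASK + 2 entries for the highest index to stay in bounds.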

Also, it is easy enough to make it work with VECTORS:

Code:

Valid = 1 - min(min(H.x, H.y), 1u);
//If .y is valid (H.y == 0), add 1 to the nonce.
Nonce = W[3].x + (1 - min(H.y, 1u));
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

(or you could just double the code for .x and .y)

OR

Code:

Valid = 1 - min(min(H.x, H.y), 1u);
//Always store the .x nonce here.
Nonce = W[3].x;
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

and have the __init__ file check both Nonce and Nonce+1

another way of doing it would be (the compiler should replace the if statement with a set conditional):

Code:

Nonce = W[3];
Position = W[3] & OUTPUT_MASK;
if(H)
   Position = OUTPUT_MASK + 1;
//Invalid nonce are at the last position of the array, valid are distributed at the front
OUTPUT[Position] = Nonce;

Slightly faster would be to have the Position = the local thread # (since you save an &) and make sure that the size of the output* array is WORKSIZE + 1:

Code:

Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position = WORKSIZE; //last element of the WORKSIZE + 1 array
OUTPUT[Position] = Nonce;

EDIT: Ooh, just thought of something else:

If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...

Code:

Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position += WORKSIZE;
OUTPUT[Position] = Nonce;

Then all of the threads in a workgroup will write to a different address. The valid nonces will be in the first half, and the invalid will be in the second.
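A quick way to see the 2*WORKSIZE layout is to run the same logic on the CPU. This is only a sketch in plain C: WORKSIZE = 128 and the function name are assumptions for the example, and local_id is passed in by hand instead of coming from get_local_id(0).

```c
#include <assert.h>
#include <stdint.h>

#define WORKSIZE 128u   /* assumed work-group size */

/* output must hold 2 * WORKSIZE entries. Valid nonces (h == 0) land in
 * [0, WORKSIZE), invalid ones in [WORKSIZE, 2 * WORKSIZE).
 * Returns the position that was written. */
uint32_t store_split(uint32_t *output, uint32_t local_id, uint32_t h, uint32_t nonce)
{
    assert(local_id < WORKSIZE);
    uint32_t position = local_id;
    if (h)                      /* non-zero hash word: not a valid nonce */
        position += WORKSIZE;
    output[position] = nonce;
    return position;
}
```

The host side would still need some way to tell a freshly written slot in the first half from a stale one (for example by clearing the first half between runs); that part is not covered by the sketch.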

Now I have no idea if any of these things would be faster, but I think all of them would work...

Sorry to put so much code down... but this kind of coding isn't really an exact science...

jedi95

full member

Activity: 219

Merit: 120

Quote from: Diapolo on August 07, 2011, 02:21:39 AM

You have to compare the loss of valid nonces to the higher efficiency because of the removed control flow in the kernel (all current GPUs dislike if/else and so on). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.

Any good ideas for that part of the kernel will be a big plus!

Dia

After looking at the code more carefully your method is only problematic if more than 1 vector component returns a valid nonce. The odds of this happening are EXTREMELY small, since you would have to find more than 1 valid hash in a range of only 2 or 4 hashes.

That said, I have devised a way to remove the if(nonce) control structure entirely. This makes a couple assumptions:

1. Control flow instructions have a large clock cycle penalty regardless of the branch taken (so you get 44 cycle penalty on Cypress and Cayman regardless of if H == 0)
2. Writing values to output[] for every nonce even if the nonce is invalid does not incur a significant clock cycle cost relative to the control flow instructions. (ideally <10 clocks, but if it's below ~30 the code below will still be faster than the current code)

The steps:

1. OR the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce, assuming there are a small number of valid nonces. Even in the worst case (more than 1 hash found in a single execution) at least 1 will be returned, and if 3 or fewer nonces are found per execution, all of them should be returned in most cases.

output[0] will be overwritten constantly by invalid nonces (since the 1-bit number from step 5 will be 0 unless the hash satisfies H == 0, the resulting 8-bit number will also be 0). output[>0] will contain valid nonces with a small chance of collisions.

Cypress and Cayman (58xx and 69xx respectively) have a 44 cycle latency for control flow instructions

Steps 1 - 8 should execute in 1 clock each (however they can't be vectorized, so this won't exploit any ILP)

Step 9 takes no longer than the current code for valid nonces, but this will now also apply to invalid nonces.

Overall, this should be fast, return only valid nonces, and retain the capability to return more than one nonce, provided the assumptions above hold.

An example of how even a single 1 in the input will cause the output of steps 1-5 to be 0:
--------------------------------------------------------------------------------------

H = 0000000000000001 0000000000000000

00000000 00000001
00000000 00000000
-------------------OR
00000000 00000001

0000 0000
0000 0001
----------OR
0000 0001

00 00
00 01
------OR
00 01

0 0
0 1
---OR
0 1

0
1
-NOR
0
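Steps 6-8 can be tried out the same way on the CPU. The sketch below is plain C, not kernel code, and the function name is made up. It assumes the 1-bit flag from steps 1-5 is first widened to an all-ones / zero mask, since ANDing a nonce with a literal 1 would only keep its lowest bit.

```c
#include <assert.h>
#include <stdint.h>

/* Steps 6-8: turn (valid flag, nonce) into an 8-bit output index.
 * valid is the 1-bit result of steps 1-5 (1 if H == 0, else 0). */
uint32_t output_index(uint32_t valid, uint32_t nonce)
{
    uint32_t mask = 0u - (valid & 1u);   /* widen the bit: 0 or 0xFFFFFFFF  */
    uint32_t v = nonce & mask;           /* step 6: nonce if valid, else 0  */
    v = (v & 0xFFFFu) ^ (v >> 16);       /* step 7: XOR low and high 16 bits */
    v = (v & 0xFFu) | (v >> 8);          /* step 8: OR low and high 8 bits   */
    assert(v <= 0xFFu);                  /* index fits a 256-entry buffer    */
    return v;
}
```

For example, output_index(1, 0x12345678) gives 0x4C, while output_index(1, 0x00010001) gives 0, so a valid nonce whose two 16-bit halves are equal would end up in slot 0 together with the invalid ones.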

ssateneth

legendary

Activity: 1344

Merit: 1004

I still don't know why people are doing "VECTORS VECTORS2". VECTORS is an invalid argument for diapolo phatk ever since 8-04. The only valid arguments are VECTORS2 and VECTORS4.

Quote

Important: since version 2011-08-04 (pre-release) you have to use the switch VECTORS2 instead of VECTORS. I made this change to be clear what vectors are used in the kernel (2- or 4-component). To use 4-component vectors use switch VECTORS4.

RedLine888

full member

Activity: 236

Merit: 109

Hi! Dunno whether the info I provide would be of any use but nevertheless...

Installed the 2011-08-04 kernel version and got about +4 MH/s on the 6950 and about -3 MH/s on the 5870, and my 5870 became unstable!!!

It works at 990 core and 360 mem with the previous version of your kernel and is perfectly stable but with this new version the driver crashes after a few seconds at even 980 core. The temps are perfect and stay at less than 78 C.

Thanx though for your work!

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: joulesbeef on August 07, 2011, 06:49:40 PM

Quote

WORKSIZE=128p

typo or something new I don't know about?

It's only a typo there ...

drlatino999

sr. member

Activity: 335

Merit: 250

Typo, let me edit that to reflect.

joulesbeef

sr. member

Activity: 476

Merit: 250

moOo

Quote

WORKSIZE=128p

typo or something new I don't know about?

drlatino999

sr. member

Activity: 335

Merit: 250

Using the recommended settings -

Code:

-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=128

My 6950 dropped 3C, 5830 stayed the same.

Beta-coiner1

hero member

Activity: 532

Merit: 500

Quote from: Diapolo on August 07, 2011, 02:15:22 PM

I can confirm the temps difference, which I thought was strange. Using Catalyst 11.6B/SDK 2.5 on a 6950 @867/1250 using V 4 W64 F3, temps are 3 C lower using GUI miner. Hash rate has also increased 3 Mh/s using those settings, and invalids are definitely much lower vs. Phateus.

I have to ask to understand you ... you say that my current pre-release version generates 3°C less heat for your card and invalid share rate is lower in comparison to the latest Phateus phatk?

Dia
Yes, that would be correct. Also sent a Bitcent your way to help out, even though it might not be much. Here's hoping for more development for the 69xx architecture. ;)
