Pages:
Author

Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 7. (Read 106928 times)

legendary
Activity: 1316
Merit: 1005
Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure Wink. Remember to use VECTORS2 as switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions PLUS it should not generate invalid shares, if more than 1 positve nonce is found in a work-group!

If a few of you could make a comparison (with older or other kernel versions) of accepted shares over a certain period of time, this woule be pretty cool!

Dia


6950 @ 920/300; Linux 2.6.38, 11.6/2.4; 2x 5 min runs for each setting with Phoenix 1.50

AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS2

WORKSIZE=128
[374.89 Mhash/sec] [28 Accepted] [0 Rejected] [RPC (+LP)]
- Negligible difference from 2011-08-02 kernel.

WORKSIZE=256
[344.50 Mhash/sec] [25 Accepted] [0 Rejected] [RPC (+LP)]
- Significant drop of ~25-30 Mh/s from 08-02 kernel.
hero member
Activity: 772
Merit: 500
Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure Wink. Remember to use VECTORS2 as switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions PLUS it should not generate invalid shares, if more than 1 positve nonce is found in a work-group!

If a few of you could make a comparison (with older or other kernel versions) of accepted shares over a certain period of time, this woule be pretty cool!

Dia
newbie
Activity: 52
Merit: 0


1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)


I don't claim to understand this, but step (1) should be an OR, not an AND.


Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel only crashes the display driver THAT hard, I get a Bluescreen everytime ... weird.

Code:
	// Round 124
Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

...

// lo 16 Bits OR hi 16 Bits
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive & 0xFF00U);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive & 0xF0U);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive & 0xCU);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive & 0x2U));

// nonce AND positive
uint position = W_3.x & positive;
// lo 16 Bits XOR hi 16 Bits
position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
position = (position & 0x00FFU) | (position & 0xFF00U);

output[position] = W_3.x;

Dia

You need to shift the the bits for each stage:

For example, oring the top bits to the bottom bits should be:

Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | ((Vals[7].x & 0xFFFF0000U) >> 16);
or just:
Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
because the upper 16 bits will already be 0 because of the shift;

Otherwise, you will just get the original Vals[7] value;
if you want to do it that way, the code would be:
Code:
	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive >> 8);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive >> 4);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive >> 2);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive >> 1));

However, similar to what I said earlier, the following code does the same thing:
Code:
	uint positive = 0xFFFFFFFF + min(Vals[7], 1u);
if Vals[7] ==0, then min(Vals[7], 1u) == 0, otherwise it equals 1
0xFFFFFFFF + 0 = 0xFFFFFFFF
0xFFFFFFFF + 1 = 0


oh yeah...  you are getting blue screens because your address would be a random 32 bit number and it was probably trying to access memory that your video card doesn't have
hero member
Activity: 772
Merit: 500


1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)


I don't claim to understand this, but step (1) should be an OR, not an AND.


Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel only crashes the display driver THAT hard, I get a Bluescreen everytime ... weird.

Code:
	// Round 124
Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

...

// lo 16 Bits OR hi 16 Bits
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive & 0xFF00U);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive & 0xF0U);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive & 0xCU);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive & 0x2U));

// nonce AND positive
uint position = W_3.x & positive;
// lo 16 Bits XOR hi 16 Bits
position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
position = (position & 0x00FFU) | (position & 0xFF00U);

output[position] = W_3.x;

Dia
hero member
Activity: 772
Merit: 500
Just out of curiosity , how many unique downloads of your modification have there been? If you know of course.

The sum of all downloads is > 5500 (for all released versions).

Dia
full member
Activity: 224
Merit: 100
Just out of curiosity , how many unique downloads of your modification have there been? If you know of course.
hero member
Activity: 772
Merit: 500
Sent you half a bit to keep you motivated.   Grin

Keep up the good work Diapolo

Woohoo I feel damn motivated Wink ... thanks mate!

Dia
full member
Activity: 224
Merit: 100
Sent you half a bit to keep you motivated.   Grin

Keep up the good work Diapolo
full member
Activity: 219
Merit: 120


1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)


I don't claim to understand this, but step (1) should be an OR, not an AND.


Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.
full member
Activity: 140
Merit: 100


1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)


I don't claim to understand this, but step (1) should be an OR, not an AND.
newbie
Activity: 52
Merit: 0
The steps:

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefuly unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or less nonces are found per execution all of them should be returned in most cass.


Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...
Can't you just replace all of the steps  with:
Code:
Valid = 1 - min(H, 1u);
Nonce = W[3];
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
if you are trying to remove all control flow?  Any invalid nonce will be written into Output[0] and the valid nonces will be randomly distributed through the rest of the array.

I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...

Also, it is easy enough to make it work with VECTORS ;

Code:
Valid = 1 - (min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x + min(H.y, 1);
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
(or you could just double the code for .x and .y)

OR
Code:
Valid = 1 - (min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x;
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
and have the __init__ file check both Nonce and Nonce+1


another way of doing it would be (the compiler should replace the if statement with a set conditional):
Code:
Nonce = W[3];
Position = W[3] & OUTPUT_MASK;
if(H)
   Position = OUTPUT_MASK + 1;
//Invalid nonce are at the last position of the array, valid are distributed at the front
OUTPUT[Position] = Nonce;

Slightly faster would be to have the Position = the local thread # (since you save an &) and make sure that the size of the output* array is WORKSIZE + 1:
Code:
Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position = WORKSIZE + 1;
OUTPUT[Position] = Nonce;

EDIT:  Ooh, just thought of something else: 

If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...
Code:
Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position += WORKSIZE;
OUTPUT[Position] = Nonce;
Then all of the threads in a workgroup will write to a different address.  The valid nonces will be in the first half, and the invalid will be in the second.

Now I have no idea if any of these things would be faster, but I think all of them would work...

Sorry to put so much code down... but this kind of coding isn't really an exact science...
hero member
Activity: 772
Merit: 500
full member
Activity: 219
Merit: 120
You have to compare the loss of valid nonces to the higher efficiency because of the removed control flow in the kernel (all current GPUs dislike if/else and so on). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.

Any good ideas for that part of the kernel will be a big plus!

Dia

After looking at the code more carefully your method is only problematic if more than 1 vector component returns a valid nonce. The odds of this happening are EXTREMELY small, since you would have to find more than 1 valid hash in a range of only 2 or 4 hashes.

That said, I have devised a way to remove the if(nonce) control structure entirely. This makes a couple assumptions:

1. Control flow instructions have a large clock cycle penalty regardless of the branch taken (so you get 44 cycle penalty on Cypress and Cayman regardless of if H == 0)
2. Writing values to output[] for every nonce even if the nonce is invalid does not incur a significant clock cycle cost relative to the control flow instructions. (ideally <10 clocks, but if it's below ~30 the code below will still be faster than the current code)

The steps:

1. OR the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefuly unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or less nonces are found per execution all of them should be returned in most cass.

output[0] will be overwritten constantly by invalid nonces (since the 1-bit number from step 5 will be 0 unless the hash satisfies H == 0, the resulting 8-bit number will also be 0) output[>0] will contain valid nonces will a small chance of collisions.

Cypress and Cayman (58xx and 69xx respectively) have a 44 cycle latency for control flow instructions

Steps 1 - 8 should execute in 1 clock each (however they can't be vectorized, so this won't exploit any ILP)

Step 9 takes no longer than the current code for valid nonces, but this will now also apply to invalid nonces.

overall this should be fast, return only valid nonces, and retain the capability to return more than one nonce if the assumptions above are true.

An example of how even a single 1 in the input will cause the output of steps 1-5 to be 0:
--------------------------------------------------------------------------------------

H = 0000000000000001 0000000000000000

00000000 00000001
00000000 00000000
-------------------OR
00000000 00000001

0000 0000
0000 0001
----------OR
0000 0001

00 00
00 01
------OR
00 01

0 0
0 1
---OR
0 1

0
1
-NOR
0
legendary
Activity: 1344
Merit: 1004
I still don't know why people are doing "VECTORS VECTORS2". VECTORS is an invalid argument for diapolo phatk ever since 8-04. The only valid arguments are VECTORS2 and VECTORS4.
Quote
Important: since version 2011-08-04 (pre-release) you have to use the switch VECTORS2 instead of VECTORS. I made this change to be clear what vectors are used in the kernel (2- or 4-component). To use 4-component vectors use switch VECTORS4.
full member
Activity: 236
Merit: 109
Hi! Dunno whether the info I provide would be of any use but nevertheless...

Installed the 2011-08-04 kernel version and got + ~4 MHs on 6950 and - ~3 MHs on 5870 and my 5870 became unstable!!!

It works at 990 core and 360 mem with the previous version of your kernel and is perfectly stable but with this new version the driver crashes after a few seconds at even 980 core. The temps are perfect and stay at less than 78 C.

Thanx though for your work!
hero member
Activity: 772
Merit: 500
Quote
WORKSIZE=128p
typo or something knew I dont know about?


It's only a typo there ...
sr. member
Activity: 335
Merit: 250
Typo, let me edit that to reflect.
sr. member
Activity: 476
Merit: 250
moOo
Quote
WORKSIZE=128p
typo or something knew I dont know about?
sr. member
Activity: 335
Merit: 250
Using the recommended settings -
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=128

My 6950 dropped 3C, 5830 stayed the same.
hero member
Activity: 532
Merit: 500
I can confirm the temps difference,which I thought was strange.Using Catalyst 11.6B/SDK 2.5 on a 6950 @867/1250 using V 4 W64 F3 temps are 3 C lower using GUI miner.Hash rate has also increased 3 Mh's using those settings as well as invalids are definitely much lower vs. Phataeus.

I have to ask to understand you ... you say that my current pre-release version generates 3°C less heat for your card and invalid share rate is lower in comparison to the latest Phateus phatk?

Dia
[/quote]Yes,that would be correct.also sent a Bitcent your way to help out even though it might not be much.Here's hoping to more development for the 69xx architecture.Wink
Pages:
Jump to: