
Topic: Scrypt based coins: Any proof that this algorithm is really good? (Read 6350 times)

legendary
Activity: 2142
Merit: 1009
Newbie
>> Could it really be simplified? That's the main question.

Theoretically it could be. But like factoring a very large number, it's hard to do. And I hope no one manages to do it, at least while I'm holding a lot of LTC.

>> So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs does it actually eliminate?

I failed to complete the optimization. I replaced it with a modified salsa that does precached result lookups, trading CPU for RAM. The number of XORs was reduced to 50%.

>> You lost me here. What are you trying to prove by posting some piece of useless code?

It's useful code. It's easier to optimize coz u could move it into the 1st or 2nd cycle, or leave it alone. It also helps to access memory in a "symmetrical" way.

>> And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?

The 2nd part of Scrypt looks like jumps from one chunk of memory to another. If u count how often each chunk is accessed, u'll notice that some of them rn't accessed at all and some very often. Good parameters r supposed to make the probability of access equal for all chunks. That's how GOOD random number generators work. With [1024, 1, 1] I always got a BAD distribution of access probability. So [1024, 1, 1] is a BAD choice.
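[Editor's note] The measurement described above can be sketched as follows. This is a hedged model, not the real scrypt second loop: std::mt19937 stands in for the hash-derived index (X[16] & 1023). One caveat worth noting: even an *ideal* uniform source leaves roughly 1/e (about 37%) of the 1024 chunks untouched after 1024 draws, so "some chunks are never accessed" is expected and is not by itself evidence of a bad generator; what matters is whether the distribution deviates from that baseline.

```cpp
#include <array>
#include <cstdint>
#include <random>

// Count how often each of the 1024 chunks of V would be hit over the
// 1024 lookups of scrypt(1024,1,1)'s second loop. A uniform PRNG is
// used here as a stand-in for the hash-derived index.
std::array<int, 1024> chunk_access_histogram(uint32_t seed)
{
    std::array<int, 1024> hits{};
    std::mt19937 rng(seed);
    for (int i = 0; i < 1024; ++i)
        ++hits[rng() & 1023];   // j = index & 1023, as in the second loop
    return hits;
}

// With a good source, about 1024/e (~377) chunks are never hit; a badly
// skewed source would deviate noticeably from that figure.
int never_accessed(uint32_t seed)
{
    int zero = 0;
    for (int h : chunk_access_histogram(seed))
        if (h == 0) ++zero;
    return zero;
}
```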
newbie
Activity: 39
Merit: 0
Coz I posted the code which could be simplified using NAND/NOR-optimizations.
Could it really be simplified? That's the main question.

Look at my function named Salsa(). Each cell of the 16-element vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone gets rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization.
So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs does it actually eliminate?

Quote
I didn't say that posted code eliminates any XORs.
You lost me here. What are you trying to prove by posting some piece of useless code?

And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?
legendary
Activity: 2142
Merit: 1009
Newbie
>> So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?

Coz I posted the code which could be simplified using NAND/NOR-optimizations. The assembler edition is very hard to read. I wasn't going to show an efficient SIMD implementation.

>> Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster".

English is not my mother tongue, sorry if I failed to explain my idea. In a few words, it's like "the CPU spends less time".

>> That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.

Look at my function named Salsa(). Each cell of the 16-element vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone gets rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization. I didn't say that posted code eliminates any XORs.

>> You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops.

I thought so before I tried both ways. The CPU detects when code does sequential memory access and prefetches data into the cache before the code needs it.

>> Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.

The merged loop accesses 3 different chunks of memory. This leads to more MOVDQA instructions and makes the CPU maintain 3 prefetch streams. My edition needs only 2 streams, which improves memory access.

>> The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?

I got rid of PSHUFDs this way. I collect the data from uiStorage to be XORed with 4 hashes into a register, using 4 other registers as pointers (EAX, EBX, ECX, EDX). The disadvantage of PSHUFDs is much bigger than the disadvantage of collecting data into a register. EAX, EBX, ECX and EDX r set in the previous loop iteration, so when I really need this data it's already in the cache.
newbie
Activity: 39
Merit: 0
Code I posted here is the C++ edition of the algo (for better reading). The CPU runs assembler code which looks completely different
So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?

Quote
My implementation uses less calculations coz the CPU processes 3 instructions at once using a trick to avoid register stalls. So an extra cycle between the 2 others is necessary for "less calculations" and "sequential memory access".
Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster". That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.

Quote
Btw, a CPU with 128K L1 per core is quite... uncommon and ur statement about 3 passes is wrong.
You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops. Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.
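[Editor's note] The loop merge being suggested can be illustrated in miniature. This is a hedged sketch with a toy mix() step (not the real salsa20/8 core, and not the exact scrypt data dependencies): two sequential passes over a buffer versus one fused pass that computes the same result while streaming the data through cache only once.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a per-word mixing step (NOT the real salsa20/8 core).
static uint32_t mix(uint32_t x) { return x * 2654435761u + 12345u; }

// Two separate passes, as in the posted code: the second pass re-reads
// data the first pass has already pushed out of L1.
std::vector<uint32_t> two_pass(std::vector<uint32_t> s)
{
    for (auto& x : s) x ^= 0xDEADBEEFu;  // pass 1: XOR
    for (auto& x : s) x = mix(x);        // pass 2: mix
    return s;
}

// Fused single pass: identical result, one trip through memory.
std::vector<uint32_t> one_pass(std::vector<uint32_t> s)
{
    for (auto& x : s) x = mix(x ^ 0xDEADBEEFu);
    return s;
}
```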

The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?
full member
Activity: 131
Merit: 100
u, ur, r, coz.

legendary
Activity: 2142
Merit: 1009
Newbie
OK, thanks for the clarifications. Now we can see that it was just scaremongering and/or handwaving. So much for "do less calculations" and "sequential memory access" promises. In reality the demonstrated code snippet has more XOR operations and does three separate passes over 128K buffer instead of two (which means more L1 cache misses unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.

Well... There are some explanations:

Code I posted here is the C++ edition of the algo (for better reading). The CPU runs assembler code which looks completely different, and u should know this unless u r just a coin trader. Ur post looks like an attempt to defend ArtForz's position. But everyone sees that he is just upset that there are other miners which are better than his (pooler's, mine, and I know 2 other guys who developed much better miners).

My implementation uses less calculations coz the CPU processes 3 instructions at once using a trick to avoid register stalls. So an extra cycle between the 2 others is necessary for "less calculations" and "sequential memory access". Btw, a CPU with 128K L1 per core is quite... uncommon and ur statement about 3 passes is wrong.

Regarding the replacement of left rotations with right ones: I suspect that u r NOT serious. Even a young programmer won't believe that this was made for optimization or something like that. I made it coz there was an already-defined function for calculating SHA-256 that used right rotation.
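[Editor's note] The rotation substitution is indeed harmless, because a right rotation by 32-n is exactly a left rotation by n: the posted ROTR constants 25, 23, 19, 14 correspond to salsa20's left rotations 7, 9, 13, 18. A quick check:

```cpp
#include <cstdint>

// 32-bit rotations (n must be in 1..31 to avoid undefined shifts).
// rotr32(x, 32-n) == rotl32(x, n), so ROTR(x,25) in the posted code
// matches salsa20's ROTL(x,7); likewise 23<->9, 19<->13, 14<->18.
uint32_t rotl32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }
uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
```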

PS: Ask more if u have questions :)
newbie
Activity: 39
Merit: 0
OK, thanks for the clarifications. Now we can see that it was just scaremongering and/or handwaving. So much for "do less calculations" and "sequential memory access" promises. In reality the demonstrated code snippet has more XOR operations and does three separate passes over 128K buffer instead of two (which means more L1 cache misses unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.
sr. member
Activity: 406
Merit: 257
Reference C implementation slower than optimized asm, news at 11.

As I wrote:

"When I wrote it using SSE2 (calc 4 hashes at once) I got 2500 h/s from each GHz of an Intel CPU."

See?


Tip: Statements like that make me think "oh dear, another crypto kook".

I don't need tips from a guy who is so infantile. I don't know why u think that this topic has something personal against u.

Read https://bitcointalksearch.org/topic/core-bus-cache-what-is-more-important-for-mining-55670 to see the results of my implementation. The hashrate was proven by litecoinpool.org statistics, and the tests were made by other guys. Ur miner is so slow that even simple optimizations give a good boost to performance (+200% bonus).

---
Starting from this point I put u into the ignore list so u won't waste my time.
*facepalm*
Again, completely avoiding the topic of outrageous claims without *any* proof.
Oh wait, the "proof" is writing a miner that's 20% slower than pooler's. Right.
Ignored.
legendary
Activity: 2142
Merit: 1009
Newbie
Reference C implementation slower than optimized asm, news at 11.

As I wrote:

"When I wrote it using SSE2 (calc 4 hashes at once) I got 2500 h/s from each GHz of an Intel CPU."

See?


Tip: Statements like that make me think "oh dear, another crypto kook".

I don't need tips from a guy who is so infantile. I don't know why u think that this topic has something personal against u.

Read https://bitcointalksearch.org/topic/core-bus-cache-what-is-more-important-for-mining-55670 to see the results of my implementation. The hashrate was proven by litecoinpool.org statistics, and the tests were made by other guys. Ur miner is so slow that even simple optimizations give a good boost to performance (+200% bonus).

---
Starting from this point I put u into the ignore list so u won't waste my time.
sr. member
Activity: 406
Merit: 257
My implementation of Scrypt is optimized for an Intel CPU and runs faster than ur code.
Reference C implementation slower than optimized asm, news at 11.

The 1st step I made was optimizing ur miner, and I wasn't satisfied with the results, so I moved to the original Tarsnap code and then to the implementation posted here. I took into account a REAL CPU that has limited cache for code and data, a not-very-clever branch predictor, register stalls, and that is able to process 3 instructions at once.
So... exactly what pooler did, except he managed it without big secrecy or talking about boolean simplification and factor 1000 speedups "if it only had worked".

Tip: Statements like that make me think "oh dear, another crypto kook".
legendary
Activity: 2142
Merit: 1009
Newbie
Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be large enough for modern FPGAs.
... says the guy claiming +1024 is negative.

:) I got u.

U assumed that the time of calculating a Scrypt-based hash is proportional to the number of operations. It's wrong.

For example, i*4+i could be faster than i*5.
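[Editor's note] The point is about latency, not operation count: compilers classically strength-reduce a multiply into a shift plus an add, which is *more* operations but could be faster on cores where a 32-bit multiply takes several cycles. The two forms agree for all 32-bit inputs:

```cpp
#include <cstdint>

// i*5 written two ways: one multiply, or a shift plus an add (i*4+i).
// More instructions, potentially lower latency on older cores.
uint32_t times5_mul(uint32_t i)   { return i * 5u; }
uint32_t times5_shift(uint32_t i) { return (i << 2) + i; }
```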

My implementation of Scrypt is optimized for an Intel CPU and runs faster than ur code. The 1st step I made was optimizing ur miner, and I wasn't satisfied with the results, so I moved to the original Tarsnap code and then to the implementation posted here. I took into account a REAL CPU that has limited cache for code and data, a not-very-clever branch predictor, register stalls, and that is able to process 3 instructions at once.
sr. member
Activity: 406
Merit: 257
Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be large enough for modern FPGAs.
... says the guy claiming +1024 is negative.
legendary
Activity: 2142
Merit: 1009
Newbie
Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be large enough for modern FPGAs.
legendary
Activity: 2142
Merit: 1009
Newbie
Let's play a game of counting XORs, shall we?

...

Congratulations. Your implementation *increased* the total # of 16-dword XORs vs. reference scrypt by 1024... ?!?

Perhaps. I didn't check it myself, but I suppose ur counting is correct. So?
sr. member
Activity: 462
Merit: 250
Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.
sr. member
Activity: 406
Merit: 257
Let's play a game of counting XORs, shall we?

Code:
for (i=0; i<32768; i+=16)
{
    for (j=0; j<16; j++)
        uiStorage[i+32+j]=uiStorage[i+j]^uiStorage[i+16+j];
    Salsa(&uiStorage[i+32]);
}
for (i=0; i<32768; i+=32)
{
    for (j=0; j<16; j++)
        uiStorage[i+j]^=uiStorage[i+16+j];
}
for (i=0; i<32768; i+=16)
{
    j=((uiStorage[32784]&1023)<<5)+(i&31);
    for (k=0; k<16; k++)
        uiStorage[32768+(i&31)+k]=uiStorage[32768+k]^uiStorage[32784+k]^uiStorage[j+k];
    Salsa(&uiStorage[32768+(i&31)]);
}
1st loop: 2048 times 1 16-dword XOR
2nd loop: 1024 times 1 16-dword XOR
3rd loop: 2048 times 2 16-dword XORs (sorry, a 3-input XOR is still 2 XORs)
total # of 16-dword XORs ... 7168

vs. a simple stupid scrypt(1024,1,1) (using 16-dword vectors for X and V to keep things more readable):
Code:
for (i = 0; i < 1024; i++) {
    V[i*2+0] = X[0];
    V[i*2+1] = X[1];
    X[0] ^= X[1];
    X[0] = salsa20(X[0]);
    X[1] ^= X[0];
    X[1] = salsa20(X[1]);
}
for (i = 0; i < 1024; i++) {
    j = X[1].vec_element[0] & 1023;
    X[0] ^= V[j * 2 + 0];
    X[1] ^= V[j * 2 + 1];
    X[0] ^= X[1];
    X[0] = salsa20(X[0]);
    X[1] ^= X[0];
    X[1] = salsa20(X[1]);
}
let's count again...
1st loop 1024 times 2 16-dword XORs
2nd loop 1024 times 4 16-dword XORs
total # of 16-dword XORs ... 6144

Congratulations. Your implementation *increased* the total # of 16-dword XORs vs. reference scrypt by 1024... ?!?
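[Editor's note] The tallies above can be re-derived mechanically from the loop bounds (iteration count times 16-dword XORs per iteration):

```cpp
// Posted three-loop variant: 32768/16 = 2048 iterations with 1 XOR each,
// 32768/32 = 1024 with 1 XOR, and 2048 with 2 XORs (a 3-input XOR is 2 XORs).
int posted_xors()    { return 2048 * 1 + 1024 * 1 + 2048 * 2; }

// Reference scrypt(1024,1,1): 1024 iterations with 2 XORs in the first
// loop, then 1024 iterations with 4 XORs in the second.
int reference_xors() { return 1024 * 2 + 1024 * 4; }
```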
legendary
Activity: 2142
Merit: 1009
Newbie
I think there's no need to discuss this topic. We should just pray that no one solves the problem of how to simplify SALSA.
Praying is not very useful and obscurity is definitely not the right way to do things here :) Even without sharing the sources yet, do you have any statistics for the number of 32-bit XOR, ADD and ROTATE operations before and after your optimization? Or even better would be the statistics for 128-bit SIMD operations (XOR, ADD, ROTATE, SHUFFLE).

I didn't count the statistics. I tried to simplify the algo using usual NAND/NOR logic optimization techniques. (Un)fortunately I ran out of memory (I had only 16 GB), so the optimization wasn't completed.

If u wish, u could try to do it urself using my implementation of [1024, 1, 1]-Scrypt. When I wrote it using SSE2 (calc 4 hashes at once) I got 2500 h/s from each GHz of an Intel CPU.

Code:
#define ROTR(x, n) (((x)>>(n))|((x)<<(32-(n))))

void Salsa(unsigned __int32 uiData[16])
{
    unsigned __int32 uiA[16], i, j;

    for (i=0; i<16; i++) uiA[i]=uiData[i];

    for (i=0; i<4; i++)
    {
        uiA[4]^=ROTR((uiA[0]+uiA[12]), 25);
        uiA[9]^=ROTR((uiA[5]+uiA[1]), 25);
        uiA[14]^=ROTR((uiA[10]+uiA[6]), 25);
        uiA[3]^=ROTR((uiA[15]+uiA[11]), 25);

        uiA[8]^=ROTR((uiA[4]+uiA[0]), 23);
        uiA[13]^=ROTR((uiA[9]+uiA[5]), 23);
        uiA[2]^=ROTR((uiA[14]+uiA[10]), 23);
        uiA[7]^=ROTR((uiA[3]+uiA[15]), 23);

        uiA[12]^=ROTR((uiA[8]+uiA[4]), 19);
        uiA[1]^=ROTR((uiA[13]+uiA[9]), 19);
        uiA[6]^=ROTR((uiA[2]+uiA[14]), 19);
        uiA[11]^=ROTR((uiA[7]+uiA[3]), 19);

        uiA[0]^=ROTR((uiA[12]+uiA[8]), 14);
        uiA[5]^=ROTR((uiA[1]+uiA[13]), 14);
        uiA[10]^=ROTR((uiA[6]+uiA[2]), 14);
        uiA[15]^=ROTR((uiA[11]+uiA[7]), 14);

        uiA[1]^=ROTR((uiA[0]+uiA[3]), 25);
        uiA[6]^=ROTR((uiA[5]+uiA[4]), 25);
        uiA[11]^=ROTR((uiA[10]+uiA[9]), 25);
        uiA[12]^=ROTR((uiA[15]+uiA[14]), 25);

        uiA[2]^=ROTR((uiA[1]+uiA[0]), 23);
        uiA[7]^=ROTR((uiA[6]+uiA[5]), 23);
        uiA[8]^=ROTR((uiA[11]+uiA[10]), 23);
        uiA[13]^=ROTR((uiA[12]+uiA[15]), 23);

        uiA[3]^=ROTR((uiA[2]+uiA[1]), 19);
        uiA[4]^=ROTR((uiA[7]+uiA[6]), 19);
        uiA[9]^=ROTR((uiA[8]+uiA[11]), 19);
        uiA[14]^=ROTR((uiA[13]+uiA[12]), 19);

        uiA[0]^=ROTR((uiA[3]+uiA[2]), 14);
        uiA[5]^=ROTR((uiA[4]+uiA[7]), 14);
        uiA[10]^=ROTR((uiA[9]+uiA[8]), 14);
        uiA[15]^=ROTR((uiA[14]+uiA[13]), 14);
    }

    for (i=0; i<16; i++) uiData[i]+=uiA[i];
}

unsigned __int32 uiStorage[32768+32];
..........................
..........................
..........................

for (i=0; i<32768; i+=16)
{
    for (j=0; j<16; j++) uiStorage[i+32+j]=uiStorage[i+j]^uiStorage[i+16+j];
    Salsa(&uiStorage[i+32]);
}
for (i=0; i<32768; i+=32)
{
    for (j=0; j<16; j++) uiStorage[i+j]^=uiStorage[i+16+j];
}
for (i=0; i<32768; i+=16)
{
    j=((uiStorage[32784]&1023)<<5)+(i&31);
    for (k=0; k<16; k++) uiStorage[32768+(i&31)+k]=uiStorage[32768+k]^uiStorage[32784+k]^uiStorage[j+k];
    Salsa(&uiStorage[32768+(i&31)]);
}
..........................
..........................
..........................

I can't post my semi-optimized solution here. It used some precomputing and required a lot of memory. Nothing special, but I decided to keep it in a cold dark place.
newbie
Activity: 39
Merit: 0
I think there's no need to discuss this topic. We should just pray that no one solves the problem of how to simplify SALSA.
Praying is not very useful and obscurity is definitely not the right way to do things here :) Even without sharing the sources yet, do you have any statistics for the number of 32-bit XOR, ADD and ROTATE operations before and after your optimization? Or even better would be the statistics for 128-bit SIMD operations (XOR, ADD, ROTATE, SHUFFLE).
legendary
Activity: 2142
Merit: 1009
Newbie
I'm going to publish the miner when I create a "pool" for it.
legendary
Activity: 2058
Merit: 1431
PS: My modified Scrypt code does less calculations. It gives an approx. 20% boost compared to the original code of the Tarsnap project, due to sequential memory access. Imagine what will happen if someone gets rid of redundant XORs of 16 temporary memory cells in SALSA...
do share!