Topic: Scrypt Mem usage/ CPU power tradeoff (Read 2459 times)

eZc
member
Activity: 74
Merit: 10
April 01, 2013, 08:33:47 AM
#10
A bit OT, but anyway: let's do a rough calculation for the idea of pairing SRAM with an FPGA.
According to http://ch.mouser.com/Semiconductors/Memory/SRAM/_/N-4bzpt?P=1yzrctc&Keyword=SRAM&OrgTerm=sram&FS=True an SRAM chip costs around 5 USD and uses around 50 I/O pins.
Say we can scale a 100-200 USD FPGA up to 1000 I/O pins; that gives us 20 parallel lookups. So the math:
20 * 100,000,000 / 2500 ≈ 800 kH/s for around 500-600 USD. This is a very quick, rough, and quite optimistic calculation (running 1000 I/O pins at 100 MHz), so going FPGA is probably not worth it...
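The back-of-the-envelope estimate above can be written out explicitly. All the figures are the post's own assumptions (pin counts, 100 MHz I/O clock, ~2500 lookup cycles per hash), not measurements:

```python
# Rough FPGA + SRAM hash-rate estimate, using only the numbers assumed
# in the post above.
IO_PINS_PER_SRAM = 50            # assumed I/O pins consumed per SRAM chip
FPGA_IO_PINS = 1000              # assumed usable I/O on a 100-200 USD FPGA
CLOCK_HZ = 100_000_000           # optimistic 100 MHz on all 1000 pins
CYCLES_PER_HASH = 2500           # assumed lookup cycles per scrypt hash

parallel_lookups = FPGA_IO_PINS // IO_PINS_PER_SRAM   # 20 SRAMs in parallel
hashrate = parallel_lookups * CLOCK_HZ / CYCLES_PER_HASH

print(f"{parallel_lookups} parallel lookups -> {hashrate / 1000:.0f} kH/s")
# 20 * 100e6 / 2500 = 800 kH/s, for an estimated 500-600 USD of parts
```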
sr. member
Activity: 403
Merit: 251
March 28, 2013, 08:51:24 AM
#9

It looks like the scratchpad is regenerated extensively for cases of LOOKUP_GAP != 2 but not for LOOKUP_GAP == 2

If I'm not mistaken, the two versions are logically the same for LOOKUP_GAP == 2
(salsa(V) is not called at all if j is even, and called exactly once if j is odd).
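The claimed equivalence is easy to check exhaustively, assuming the generic path applies salsa(V) exactly j % LOOKUP_GAP times:

```python
# Check that the LOOKUP_GAP == 2 special case matches the generic path:
# with gap 2, "j % gap extra salsa rounds" is 1 when j is odd and 0 when
# j is even -- exactly what `if (j&1) salsa(V);` does.
LOOKUP_GAP = 2
for j in range(1024):
    generic_rounds = j % LOOKUP_GAP        # generic path: j % gap calls
    special_rounds = 1 if (j & 1) else 0   # special case: at most one call
    assert generic_rounds == special_rounds
print("identical for all j")
```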
sr. member
Activity: 403
Merit: 251
March 28, 2013, 08:43:52 AM
#8

Thanks; tc seems clear to me now. Do we use multiple calcs per shader (a higher tc than there are shaders) because we hope to get several useful values out of a single bulk memory transfer?
But what exactly does lookup gap do?
I can't believe it's easier to regenerate the scratchpad for lookups than to wait for the memory access (I know random memory access takes extremely long, but still, regenerate the scratchpad??)

Several threads per Stream Processor, apparently.

I'm just guessing: the GPU could work faster, but VRAM random-access speed is the limiting factor
(the entire VRAM is in use after all, and a lookup gap > 1 means you have to write+read several times
instead of once).

The performance hit from waiting on twice as much memory (HyperMemory, or regular RAM instead of cache)
would be worse than the lookup gap's: double the lookup gap and you always get 50% less RAM use for roughly a 50% performance hit.
The same would hold for hypothetical ASICs and FPGAs.


CPU is more interesting. LTC's 128KB scrypt scratchpad fits in L2 cache, but L3 cache is almost as fast.

http://www.xbitlabs.com/articles/cpu/display/core-i7-3770k-i5-3570k_2.html
http://www.sisoftware.net/?d=qa&f=gpu_mem_latency&l=fr&a=

i7 8MB L3 cache vs. normal DDR3 RAM:      ~10x faster  (latency ~4 ns vs ~40 ns)
HD6850 VRAM vs. Llano shared DDR3 RAM:    <2x faster   (random-access pattern test: 703 ns vs 1110 ns)
HD6850 256KB L2 cache vs. HD6850 VRAM:    ~2x faster   (365 ns vs 703 ns)

So a 1 or 2MB lookup table size would be the "sweet spot" for most CPUs? Even Celerons have 2MB L3 cache...
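The time/memory shape of the tradeoff described above can be sketched numerically. This assumes the simple model from this thread: with gap g you store every g-th of litecoin's 1024 scratchpad entries (128 bytes each) and pay on average (g-1)/2 extra salsa calls per random lookup:

```python
# Lookup-gap tradeoff for litecoin scrypt (N = 1024, 128-byte entries,
# i.e. a 128 KB scratchpad). Memory shrinks by the gap factor; the
# average regeneration cost per lookup grows linearly with the gap.
N, ENTRY_BYTES = 1024, 128

for gap in (1, 2, 8, 16, 64):
    stored_kb = N // gap * ENTRY_BYTES / 1024
    avg_extra_salsa = (gap - 1) / 2
    print(f"gap {gap:3}: {stored_kb:6.1f} KB stored, "
          f"~{avg_extra_salsa:.1f} extra salsa calls per lookup")
```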
legendary
Activity: 1484
Merit: 1005
March 26, 2013, 01:40:08 PM
#7
This is the scrypt core from reaper's implementation,
Code:
void scrypt_core(uint4 X[8], __global uint4*restrict lookup)
{
	shittify(X);
	const uint zSIZE = 8;
	const uint ySIZE = (1024/LOOKUP_GAP+(1024%LOOKUP_GAP>0));
	const uint xSIZE = CONCURRENT_THREADS;
	uint x = get_global_id(0)%xSIZE;

	for(uint y=0; y<1024/LOOKUP_GAP; ++y)
	{
#pragma unroll
		for(uint z=0; z<zSIZE; ++z)
			lookup[CO] = X[z];
		for(uint i=0; i<LOOKUP_GAP; ++i)
			salsa(X);
	}
#if (LOOKUP_GAP != 2) && (LOOKUP_GAP != 4) && (LOOKUP_GAP != 8)
	{
		uint y = (1024/LOOKUP_GAP);
#pragma unroll
		for(uint z=0; z<zSIZE; ++z)
			lookup[CO] = X[z];
		for(uint i=0; i<1024%LOOKUP_GAP; ++i)
			salsa(X);
	}
#endif
	for (uint i=0; i<1024; ++i)
	{
		uint4 V[8];
		uint j = X[7].x & 0x3FF;
		uint y = (j/LOOKUP_GAP);
#pragma unroll
		for(uint z=0; z<zSIZE; ++z)
			V[z] = lookup[CO];
#if (LOOKUP_GAP == 2)
		if (j&1)
			salsa(V);
#else
		uint val = j%LOOKUP_GAP;
		for (uint z=0; z<val; ++z)
			salsa(V);
#endif

#pragma unroll
		for(uint z=0; z<zSIZE; ++z)
			X[z] ^= V[z];
		salsa(X);
	}
	unshittify(X);
}

It looks like the scratchpad is regenerated extensively for LOOKUP_GAP != 2 but not for LOOKUP_GAP == 2, in which case we just do
Code:
		if (j&1)
salsa(V);

As compared to
Code:
		uint val = j%LOOKUP_GAP;
		for (uint z=0; z<val; ++z)
			salsa(V);

The first snippet is a simplification of the bottom loop for this specific case, and it looks like it needs at most one run of salsa to keep moving.
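The lookup-gap trick in the kernel above can be modeled in a few lines. This is only a toy sketch: `step()` stands in for salsa, and a plain int stands in for the uint4[8] state, but the structure is the same: store every gap-th scratchpad entry and re-step on lookups:

```python
def step(x):
    """Stand-in for salsa(): any deterministic mixing function works."""
    return (x * 2654435761 + 1) & 0xFFFFFFFF

N = 1024  # scratchpad entries, as in litecoin scrypt

def build_scratchpad(x0, gap=1):
    """Sequentially generate the scratchpad, storing every gap-th entry."""
    table, x = [], x0
    for i in range(N):
        if i % gap == 0:
            table.append(x)
        x = step(x)
    return table

def lookup(table, gap, j):
    """Recover entry j from the nearest stored entry, j % gap steps back."""
    v = table[j // gap]
    for _ in range(j % gap):   # the "extra salsa calls" in the kernel
        v = step(v)
    return v

full = build_scratchpad(12345, gap=1)
for gap in (2, 4, 8):
    small = build_scratchpad(12345, gap=gap)
    assert len(small) == N // gap                      # memory shrinks by gap
    for j in (0, 1, 513, 1023):
        assert lookup(small, gap, j) == full[j]        # results identical
print("gap lookups match the full table")
```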
eZc
member
Activity: 74
Merit: 10
March 26, 2013, 01:34:05 PM
#6

So how is it currently done on a 7970?
2048 parallel calcs would use 256MB of RAM; with time-shifted starts you could probably reduce the memory requirement to ~190MB. But apparently with a high tc it uses 1-2GB. So how is it done?


I think tc and parallel calcs are the same thing (at least in the reaper
config file, "gpu_thread_concurrency" means "how many parallel calcs")

for example:

gpu_thread_concurrency 4000
lookup_gap 1
>LTC buffer size: 500MB.              <--reaper says it uses 500MB, and 4000*128KB = 512000KB = 500MB, so it matches


What's somewhat counterintuitive is that RAM usage can be reduced by a lot without killing
the hash rate completely:

lookup_gap 8
>LTC buffer size: 62.5MB.
95kH/s

lookup_gap 16
>LTC buffer size: 31.25MB.
58kH/s

lookup_gap 64
>LTC buffer size: 7.8125MB.
16kH/s

Scrypt makes it just too easy to regenerate the Scratchpad
for every lookup.


Thanks; tc seems clear to me now. Do we use multiple calcs per shader (a higher tc than there are shaders) because we hope to get several useful values out of a single bulk memory transfer?
But what exactly does lookup gap do?
I can't believe it's easier to regenerate the scratchpad for lookups than to wait for the memory access (I know random memory access takes extremely long, but still, regenerate the scratchpad??)
Does lookup gap 64 mean that only 1/64 of the scratchpad is stored (chosen randomly, or is there some logic to which part is kept???), and on a miss it gets recalculated?
sr. member
Activity: 403
Merit: 251
March 22, 2013, 03:17:33 PM
#5

So how is it currently done on a 7970?
2048 parallel calcs would use 256MB of RAM; with time-shifted starts you could probably reduce the memory requirement to ~190MB. But apparently with a high tc it uses 1-2GB. So how is it done?


I think tc and parallel calcs are the same thing (at least in the reaper
config file, "gpu_thread_concurrency" means "how many parallel calcs")

for example:

gpu_thread_concurrency 4000
lookup_gap 1
>LTC buffer size: 500MB.              <--reaper says it uses 500MB, and 4000*128KB = 512000KB = 500MB, so it matches


What's somewhat counterintuitive is that RAM usage can be reduced by a lot without killing
the hash rate completely:

lookup_gap 8
>LTC buffer size: 62.5MB.
95kH/s

lookup_gap 16
>LTC buffer size: 31.25MB.
58kH/s

lookup_gap 64
>LTC buffer size: 7.8125MB.
16kH/s

Scrypt makes it just too easy to regenerate the Scratchpad
for every lookup.
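The buffer sizes reaper reports above follow directly from tc and lookup_gap: each concurrent hash needs a 128KB scratchpad, divided by the gap. A quick check of the quoted numbers:

```python
# Reproduce the reaper-reported LTC buffer sizes quoted above.
TC = 4000                 # gpu_thread_concurrency: parallel scrypt instances
SCRATCHPAD_KB = 128       # per-hash scratchpad for litecoin scrypt

for gap in (1, 8, 16, 64):
    mb = TC * SCRATCHPAD_KB / gap / 1024
    print(f"lookup_gap {gap:2}: {mb:g} MB")
# lookup_gap  1: 500 MB
# lookup_gap  8: 62.5 MB
# lookup_gap 16: 31.25 MB
# lookup_gap 64: 7.8125 MB
```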
eZc
member
Activity: 74
Merit: 10
March 17, 2013, 01:46:55 PM
#4
You can regenerate the lookup table on the fly instead of storing the whole thing (just work in the space you have; for every lookup, you can generate the whole table by building the beginning of it in the RAM you have and then overwriting it to get to the next series of values).  This is used to speed up the GPU implementation a bit (lookup gap = 2, use approximately half size memory tables as I understand it).

Note that by having to regenerate the table in sequential fragments, things get very expensive very quickly in terms of having to overwrite the RAM buffer a lot and the ALU cycles required to regenerate the table.  This is why it is considered memory hard, and why it's proving a giant pain in the ass to make FPGAs for so far.

The implementation of scrypt for litecoin uses N=1024, r=1, p=1, hence the 128 KB of usage.  You can use more RAM by increasing N or p.  See: www.tarsnap.com/scrypt/scrypt.pdf

Using more than 128 KB with N=1024, r=1, p=1 will yield no faster results

OK, then I didn't fail to understand :-). It just uses 128kB.
So memory-intensive means in terms of memory traffic, not memory size, right (which is obviously true)?

However, the scratchpad generation is almost sequential, so it's no big deal in DDRx RAM. The real trouble comes with the random access in the second part.

The LX150 has only around 1.2MB of BRAM, so only around 10, or with optimization maybe 15, parallel calcs are possible. With several thousand clocks required to compute one scrypt hash, that's not a good yield...
Given how highly random the access is, I consider FPGA/SRAM combinations worth thinking about, using LVDS deserializers to boost I/O speed. But probably no chance against GPUs.

So how is it currently done on a 7970?
2048 parallel calcs would use 256MB of RAM; with time-shifted starts you could probably reduce the memory requirement to ~190MB. But apparently with a high tc it uses 1-2GB. So how is it done?
legendary
Activity: 1344
Merit: 1001
March 17, 2013, 01:38:03 PM
#3
It will take a computer science breakthrough to find a way to beat this trade off.
legendary
Activity: 1484
Merit: 1005
March 17, 2013, 12:23:30 PM
#2
You can regenerate the lookup table on the fly instead of storing the whole thing (just work in the space you have; for every lookup, you can generate the whole table by building the beginning of it in the RAM you have and then overwriting it to get to the next series of values).  This is used to speed up the GPU implementation a bit (lookup gap = 2, use approximately half size memory tables as I understand it).

Note that by having to regenerate the table in sequential fragments, things get very expensive very quickly in terms of having to overwrite the RAM buffer a lot and the ALU cycles required to regenerate the table.  This is why it is considered memory hard, and why it's proving a giant pain in the ass to make FPGAs for so far.

The implementation of scrypt for litecoin uses N=1024, r=1, p=1, hence the 128 KB of usage.  You can use more RAM by increasing N or p.  See: www.tarsnap.com/scrypt/scrypt.pdf

Using more than 128 KB with N=1024, r=1, p=1 will yield no faster results
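The 128 KB figure falls out of the parameters directly: scrypt's table V holds N blocks of 128*r bytes each. A one-line check:

```python
# scrypt scratchpad size from its parameters: V holds N blocks of
# 128 * r bytes, so memory = 128 * r * N bytes.
def scrypt_scratchpad_bytes(N, r):
    return 128 * r * N

# litecoin's N=1024, r=1 gives exactly 128 KB
assert scrypt_scratchpad_bytes(1024, 1) == 128 * 1024
print(scrypt_scratchpad_bytes(1024, 1) // 1024, "KB")
# Raising N or r grows the table; p instead adds independent parallel
# instances, multiplying work (and memory only if they run concurrently).
```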
eZc
member
Activity: 74
Merit: 10
March 17, 2013, 12:10:26 PM
#1
Hi
In many algorithms you can trade memory usage against CPU power, for example by storing some results instead of recalculating them when needed.
What I've failed to understand so far is how scrypt can be made faster by using more memory. The scratchpad for LTC is just 128kB in size. What is the trick there (any way of storing and reusing salsa results)?

        for (i = 0; i < N; i++) {
            arraycopy(XY, Xi, V, i * (128 * r), 128 * r);
            blockmix_salsa8(XY, Xi, Yi, r);
        }

This just builds up the Scratchpad

        for (i = 0; i < N; i++) {
            int j = integerify(XY, Xi, r) & (N - 1);
            blockxor(V, j * (128 * r), XY, Xi, 128 * r);
            blockmix_salsa8(XY, Xi, Yi, r);
        }

And this pseudorandomly accesses elements of the scratchpad. So 128kB of RAM is used.
Let's say I now have 1TB of RAM -> how can I make it faster?
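The two loops quoted above are scrypt's ROMix. A toy Python sketch of their structure, with `mix()` standing in for blockmix_salsa8 and a plain int for the 128*r-byte block, shows why the table is exactly N blocks, no more and no less:

```python
# Toy model of ROMix: loop 1 fills the scratchpad V sequentially,
# loop 2 does N data-dependent lookups into it. mix() is only a
# stand-in for blockmix_salsa8.
N = 1024

def mix(x):
    return (x * 2654435761 + 0x9E3779B9) & 0xFFFFFFFF

def romix(x):
    V = []
    for _ in range(N):            # loop 1: build the scratchpad
        V.append(x)
        x = mix(x)
    for _ in range(N):            # loop 2: pseudorandom lookups
        j = x & (N - 1)           # integerify: pick an index from the state
        x = mix(x ^ V[j])         # blockxor + blockmix
    return x

print(hex(romix(42)))
```

Extra RAM beyond the N blocks buys nothing here: the second loop never references anything outside V, so a 1TB machine computes exactly the same 1024 lookups as a 128kB one.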