...how hard it is to do the proof of work versus the cost of evaluating the proof. For example, for hashcash the ratio is exceptionally good: it takes a single hash to check a huge amount of work, up to 2^128 work for a perfect 256-bit hash...
...you make both the miners and those verifying their work compute a difficult variant of scrypt; each time a block is solved and advertised, every node on the network must redo the work of the miner (in order to check it before broadcasting it as valid); this clearly does not scale: every node must have processing power equal to that of the whole mining network...
Incorrect. The miners are all competing, computing scrypt numerous times while trying to guess an input that produces the required output, whereas those verifying that a provided input yields the required output only have to compute scrypt once.
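To make that asymmetry concrete, here is a minimal Python sketch using the standard library's hashlib.scrypt. The header, target and nonce handling are simplified placeholders of my own, not Litecoin's actual serialization or difficulty encoding:

```python
import hashlib

# Hypothetical stand-ins: a real block header and target come from the network.
HEADER = b"example-block-header"
TARGET = 2 ** 248          # placeholder, very easy difficulty (higher = easier)
N, R, P = 1024, 1, 1       # the Litecoin-style parameters discussed below

def pow_hash(header: bytes, nonce: int) -> int:
    digest = hashlib.scrypt(header + nonce.to_bytes(4, "little"),
                            salt=header, n=N, r=R, p=P, dklen=32)
    return int.from_bytes(digest, "little")

def mine(header: bytes) -> int:
    """Miner: evaluates scrypt over and over, guessing nonces."""
    nonce = 0
    while pow_hash(header, nonce) >= TARGET:
        nonce += 1
    return nonce

def verify(header: bytes, nonce: int) -> bool:
    """Verifier: a single scrypt evaluation, regardless of how long mining took."""
    return pow_hash(header, nonce) < TARGET

nonce = mine(HEADER)
assert verify(HEADER, nonce)
```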
Parameters used are N=1024, p=1, r=1 -> 128 KB of V.
Not exactly much, but enough to give current-gen GPUs a major headache.
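For reference, the V array in scrypt's ROMix is 128 · r · N bytes per instance, which is where the 128 KB figure comes from (quick check, my arithmetic):

```python
def scrypt_v_bytes(n: int, r: int) -> int:
    # ROMix stores N blocks of 128 * r bytes each.
    return 128 * r * n

assert scrypt_v_bytes(1024, 1) == 128 * 1024   # 128 KiB with N=1024, r=1
print(scrypt_v_bytes(1024, 1) // 1024, "KiB")  # -> 128 KiB
```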
Basically, instead of aiming for "enough memory use to hit external memory on any device", the aim is "small enough to fit in L2 cache on a CPU, but big enough that running enough parallel instances on GPU/FPGA/VLSI requires too much on-chip memory/cache to be worth it."
So... maybe your GPU is 100 times faster than a CPU executing pure ALU code, but to do that you actually need to... like... store the per-thread state somewhere?
So, where to put 128 KiB (or 32 KiB using a factor-4 time/space tradeoff) of memory for V *per thread*?
Registers? Way too small.
LDS? Works, but you've only got 64 KiB/CU, so you end up with a max of 2 threads/CU even with a factor-4 time/space tradeoff.
GDS? Way too small again.
L2 read cache? You've got 512 KiB of that, so with a /4 reduction you can keep 16 threads' worth of V arrays there, but that's not really helping a whole lot, for reasons I'll explain shortly.
External GDDR5? Completely horrible for randomly reading 128-byte items all over the place...
...The best option so far is "use LDS and live with only being able to run 2 threads/CU (due to LDS size)"... well... thanks to what I explained above, that means you're effectively getting the ALU throughput of half of one VLIW4 core per CU per clock.
So a whole 6970 ends up effectively performing as if you only had 12 fully utilized VLIW4 cores...
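Roughly the arithmetic behind that "12 cores" figure, assuming the Cayman (HD 6970) numbers of 24 CUs and 64 KiB LDS per CU (my assumptions, not stated above):

```python
# Back-of-the-envelope for the LDS option (assumed HD 6970 / Cayman figures).
CUS = 24                    # compute units on a 6970 (assumption)
LDS_PER_CU_KIB = 64         # local data share per CU
V_PER_THREAD_KIB = 128 / 4  # 128 KiB V with a factor-4 time/space tradeoff

threads_per_cu = int(LDS_PER_CU_KIB // V_PER_THREAD_KIB)   # -> 2
# With only 2 threads' worth of work per CU, the claim above is that each CU
# sustains roughly half of one VLIW4 core's ALU throughput per clock:
effective_vliw4_cores = CUS * 0.5                           # -> 12.0
print(threads_per_cu, effective_vliw4_cores)
```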
...So, if you have any constructive suggestions on what other approaches to try to wrangle better performance out of an AMD GPU for this variant of scrypt, I'm listening.
Given the roughly order-of-magnitude ratio in measured kHash/Watt efficiency we now see between, say, an HD 7970 and (all cores of) a Core i7 920, I assume your mistake was thinking the latency of reading random 128-byte chunks of GDDR5 would be a significant factor. Given that the measured kHash ratio (~700 ÷ 37) of roughly 20 corresponds roughly to the ratio in memory bandwidth between the two (except that it appears to be a ratio of 1 if the CPU is doing all accesses from L1 cache, which is 20 times faster than the 25 GB/s), and that the cgminer code is not employing a time/space tradeoff factor (i.e. V is 1024 x 32 x 4 bytes and no elements are computed on the fly to shrink the buffer), it appears that both CPU and GPU are memory bound at their maximum memory bandwidth. Since the HD 7970 has 6 GB of GDDR5, 6 x 1024 x 1024 KB ÷ 128 KB = 48 x 1024 threads could be run to maximize memory throughput, because computation in other threads could continue while some threads were blocked waiting for memory to load into local caches and registers. I assume the HD 7970 is either able to load only 128 bytes per random access while maintaining full memory bandwidth, or it is grouping blocked threads whose accesses fall in contiguous memory areas and servicing their loads together.
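A quick restatement of that arithmetic (all figures taken from the post above, including the 6 GB assumption):

```python
# Measured kHash/s figures quoted above: ~700 (HD 7970) vs ~37 (Core i7 920).
print(round(700 / 37))              # -> 19, i.e. roughly 20x

# Threads needed to fill 6 GB of GDDR5 with one 128 KB V buffer each:
gddr5_kb = 6 * 1024 * 1024
v_kb = 128
print(gddr5_kb // v_kb, 48 * 1024)  # -> 49152 49152
```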
In theory, cgminer could gain even more of an advantage than it has now by implementing a configurable time/space factor, because it is not thread bound (48 x 1024 is greater than 20) and the relative FLOPS between the HD 7970 and the Core i7 920 is 8192 ÷ 25 ≈ 328 (which is greater than 20); thus the HD 7970 is not likely FLOPS bound, so computing some of the V[i] instead of loading them from memory should increase kHash (though I don't know whether the incremental energy consumed for this incremental increase in kHash will come at the same efficiency).
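For readers unfamiliar with the time/space tradeoff being discussed: the idea (often called a "lookup gap" in GPU miners) is to store only every k-th element of V and recompute the rest on demand. A minimal sketch, with a generic hash standing in for scrypt's BlockMix, so this is illustrative rather than real scrypt:

```python
import hashlib

def H(block: bytes) -> bytes:
    # Stand-in for scrypt's BlockMix (Salsa20/8-based); any fixed-width mixing
    # function illustrates the time/space tradeoff equally well.
    return hashlib.blake2b(block, digest_size=64).digest()

def integerify(block: bytes, n: int) -> int:
    # Interpret part of the block as a little-endian integer mod N.
    return int.from_bytes(block[:8], "little") % n

def romix_lookup_gap(seed: bytes, n: int = 1024, gap: int = 4) -> bytes:
    """ROMix-style loop with a time/space tradeoff: store only every `gap`-th
    V element and recompute the missing ones on demand (gap=1 stores all N)."""
    v = H(seed)                      # normalize the input to the block width
    stored = []                      # V[0], V[gap], V[2*gap], ...
    for i in range(n):               # step 2: fill V
        if i % gap == 0:
            stored.append(v)
        v = H(v)
    x = v
    for _ in range(n):               # step 3: data-dependent reads
        j = integerify(x, n)
        v_j = stored[j // gap]       # nearest stored element...
        for _ in range(j % gap):     # ...then recompute forward to reach V[j]
            v_j = H(v_j)
        x = H(bytes(a ^ b for a, b in zip(x, v_j)))
    return x

# gap=4 shrinks the buffer to a quarter, at the cost of ~1.5 extra H calls
# per random read on average.
print(romix_lookup_gap(b"seed", n=1024, gap=4).hex()[:16])
```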
Note the Core i7 920 only has 32 KB of L1 data cache per core. So even if you are employing all 4 cores by running 4 instances, I don't think you are keeping all 128 KB of memory accesses within L1, unless it is memory bound in L1 and pooler's cpuminer is employing a time/space factor of 4. I fear that at 128 KB instead of 32 KB, the CPU's cpuminer may be accumulating memory latency, whereas the GPU's cgminer may not be accumulating as much.
Inb4 the obvious: the first thing I tried was the KISS approach of "just allocate a huge buffer for the V arrays in global memory, run 64-256 threads and let the memory and L2 controllers figure it out", also with time/space tradeoff factors of 2 to 16. It ends up slower than going with LDS.
I wonder why that didn't work? Could it be that you were not associating each V buffer with its own thread ("a huge buffer" instead of 48 x 1024 separate 128 KB buffers), so the interaction between thread blocking and memory loading wasn't optimized?
If a future generation of GPUs has several MB of fast on-chip cache (which would make them competitive), a forced update with increased memory usage and parallelism is on the cards (e.g. N=4096, p=4 or 8).
And it'd still be "fast enough" not to cause too much of an issue with the speed of verifying PoW: with N=4096, p=8 it's still only ~20 ms per hash on a decent current CPU core.
Notice that N=4096, p=8, r=1 is one of the parameter sets used in the scrypt paper for estimating relative cost on specialized hardware.
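A quick way to sanity-check the verification-cost claim on your own hardware (Python's hashlib.scrypt wraps OpenSSL's scrypt; the ~20 ms figure above is the poster's, your numbers will differ):

```python
import hashlib, time

def time_scrypt(n, r, p, iters=20):
    start = time.perf_counter()
    for i in range(iters):
        hashlib.scrypt(b"block-header-%d" % i, salt=b"salt",
                       n=n, r=r, p=p, maxmem=64 * 1024 * 1024, dklen=32)
    return (time.perf_counter() - start) / iters

print("N=1024, r=1, p=1: %.1f ms" % (1000 * time_scrypt(1024, 1, 1)))
print("N=4096, r=1, p=8: %.1f ms" % (1000 * time_scrypt(4096, 1, 8)))
```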
Based on my theory above, the settings that would perhaps restore LTC to a CPU-only mined coin would be to set N x r x p x 128 bytes = 2 GB (are you sure you meant p above and not r?), so that a GPU is limited to running only a few threads. When thinking about the time cost of verifying (the speed of computing) this hash, note that it could be done once for the entire pool. Success of this strategy relies on the assumption (apparently false by a factor of 20, i.e. up to 500 ÷ 25 if cpuminer was keeping all accesses within L1 cache, which I doubt) that memory bandwidth does not drop significantly (given that more accesses will fall outside the caches), or that the current CPU scrypt is FLOPS bound. We probably also want to maximize the number of threads p (up to the CPU's hardware thread count) so as to minimize cumulative memory latency losses on the CPU (but then the GPU gets these additional threads too). Note that even if the assumption above is false, reducing the GPU to just a few threads might inhibit its ability to hide cumulative memory latency (as compared to running up to 48 x 1024 threads), because each CPU core is more powerful than a GPU core and the CPU is more optimized for a lower number of hardware threads.
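For concreteness, one hypothetical parameter choice hitting the 2 GB figure (my example numbers, not a recommendation):

```python
# Solve N * r * p * 128 bytes = 2 GiB for N, with example r and p.
TARGET_BYTES = 2 * 1024**3
r, p = 1, 8                        # hypothetical: one lane per CPU hardware thread
n = TARGET_BYTES // (128 * r * p)  # -> 2_097_152 = 2**21
print(n, n == 2**21)
# A 6 GB GPU could then hold only ~3 such working sets at once,
# versus the ~48k threads possible at today's 128 KB per thread.
```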
The other strategy to try is to reduce N x r x 128 bytes to 32 KB to fit into L1 cache, and see whether that raises the speed of cpuminer by a significant factor, if the CPU is not FLOPS (compute) bound in the 256 KB L2 cache (which on the Core i7 920 is per core). Moreover, the L1 cache bandwidth needs to be greater than the memory bandwidth of the GPU, and the CPU has to be L1-cache memory bound.
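And the corresponding arithmetic for the L1-resident variant (again just my illustration of the numbers in the paragraph above):

```python
# Fit the V array into a 32 KB L1 data cache: N * r * 128 bytes <= 32 KB.
L1_BYTES = 32 * 1024
r = 1
n = L1_BYTES // (128 * r)   # -> 256
print(n)                    # N=256, r=1 keeps V at exactly 32 KB
```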
Also, as an improvement over the original scrypt, run full Salsa20 in step 2 of scrypt's ROMix and Salsa20/8 (or ChaCha/4?) in step 3, so that time/space tradeoff factors become less advantageous (even though cgminer is apparently not yet employing them). The hash in step 2 should be made as slow as possible without step 2 becoming significant compared to step 3. To accomplish this, and as another improvement over the original scrypt, I see no reason why the N in the for loop (not the mod N) couldn't be a larger value in step 3 than in step 2. Note that, as Percival pointed out in his discussion of memory latency, the ratio of memory bandwidth to FLOPS will continue to fall, so the above adjustments have to account for that future.
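A sketch of what that asymmetric ROMix might look like, with a configurable-cost stand-in hash; "slow_h", "fast_h" and the separate loop lengths are my naming for the proposal above, purely illustrative:

```python
import hashlib

def make_h(cost: int):
    # Stand-in mixing function whose per-call cost scales with `cost`
    # (a proxy for running Salsa20 with more or fewer rounds).
    def h(block: bytes) -> bytes:
        for _ in range(cost):
            block = hashlib.blake2b(block, digest_size=64).digest()
        return block
    return h

slow_h = make_h(4)   # "full Salsa20" analogue for step 2 (filling V)
fast_h = make_h(1)   # "Salsa20/8" analogue for step 3 (random reads)

def asymmetric_romix(seed: bytes, n_fill: int = 1024, n_mix: int = 4096) -> bytes:
    """Step 2 uses the slow hash and n_fill iterations; step 3 uses the fast
    hash and a larger n_mix, so recomputing V[j] on the fly (a time/space
    tradeoff) costs relatively more than simply reading it."""
    v = slow_h(seed)
    V = []
    for _ in range(n_fill):          # step 2: expensive fill
        V.append(v)
        v = slow_h(v)
    x = v
    for _ in range(n_mix):           # step 3: cheap but longer mixing loop
        j = int.from_bytes(x[:8], "little") % n_fill
        x = fast_h(bytes(a ^ b for a, b in zip(x, V[j])))
    return x

print(asymmetric_romix(b"seed").hex()[:16])
```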
Because you (assumed you) kept CPU memory accesses within L1, you apparently did not concern yourself with the latency overhead that Colin Percival mentioned in section 6 of his scrypt paper. If N x r x p is increased, then r will need to be large enough to make latency insignificant; otherwise the CPU would be at a disadvantage to the GPU and its apparently superior thread-blocking memory controller (which exists because the GPU has more hardware threads to justify it).
The above two paragraphs ensure the modified scrypt is memory bound against ASICs (and FPGAs), regardless of whether any improvement in CPU vs. GPU relative efficiency is achieved. Forcing ASICs to be memory bound means they are unlikely to ever significantly beat GPUs.
If the above ideas (staying within the 32 KB cache on the CPU but not on the GPU, or masking latency on the CPU with multiple threads while allowing too few threads on the GPU at 2+ GB) do not sufficiently improve the CPU vs. GPU relative efficiency, then the other ideas to try are as follows.
I am reasonably sure the first idea below will solve the problem!
- Run a scrypt of a scrypt, such that the inner scrypt operates on 32 KB or 64 KB blocks (to stay within L1 or L2 cache) and the outer scrypt pseudo-randomly picks one of those blocks from XX GB of blocks, based on Integerify() of the result of the inner scrypt. This will virtually eliminate the memory latency and keep it running at cache memory bandwidth (if it is not compute bound), yet the GPU will not be able to run as many threads because of the XX GB requirement.
- A GPU's GDDR5 RAM has up to (double) an order of magnitude more memory bandwidth than a CPU's DDR3 RAM (with upcoming DDR4 on CPUs set to double their memory bandwidth). Thus, in a large-memory strategy (which also deters ASICs), probably the only sure way to exclude GPUs is for the CPU-only memory-bound scrypt to use more memory than any GPU will have, which is currently 8 GB at 176 GB/s for the upcoming Sony PS4. GDDR5 is estimated to cost double what DDR3 does (not including additional cooling and more expensive memory controllers and buses), so a CPU-only user would only need to spend less than $50 - 100 to upgrade to 8 - 16 GB, whereas there isn't much demand for a CPU with 8 - 16 GB of GDDR5, so even if it were available it would probably be in the $400 - 550+ range (currently), i.e. half (or more) of an order of magnitude more. This would allow only 8+ threads on the GPU, so in theory it would suffer a latency slowdown relative to the CPU running 8+ threads (because each CPU core is more powerful than a GPU core and the CPU is more optimized for a lower number of hardware threads). Given the 18 - 24 month doubling in Moore's law, the relative capital outlay on the GPU (and its ROI) might outweigh any remaining efficiency advantage of the GPU. Stopping fools from wasting their (and society's) capital on expensive, debt-fueled GPU obsessions would make for a saner and less boom-and-bust economy.
An 8 - 16 GB requirement would also reduce the scale of botnet attacks (since most hijacked computers would need to add DDR RAM). Although the average RAM in a new consumer-grade computer today is probably 6 GB, not all older computers have that, so 8 - 16 GB would favor those who do serious desktop computing (since they likely already have that much DDR RAM), which means we put more capital in the hands of programmers and those who are advancing technology (all human improvement is due to technology, as the rest of society is always trying to get something for free via debt and socialism). They would run the miner on their desktops when they weren't working. This might also increase demand for DDR RAM, which has been declining along with the desktop due to the rise of mobile devices, and so help advance DDR4.
We probably want to scale the memory requirement by Moore's Law, but what the heck will IT professionals need 500 - 1600 GB of RAM for in a decade, other than this? A peer-to-peer cloud (but wouldn't that need to be always on)? A cache of the web, including a proliferation of apps and modules (reloaded from hard disk every day)?
The price per GB historically halved every 18 months between at least 1990 and 2010, i.e. 0.5 ^ (20 ÷ 1.5) ≈ 1÷10,000, whereas 0.5 ^ (20 ÷ 2) = 1÷1024. The GB in a consumer-grade computer also roughly follows the 10,000-to-1 ratio, given that the Commodore Amiga 600 for $500 in 1992 had 1 MB (i.e. 1÷1024 GB) and the Commodore 128D for $500 in 1987 had 128 KB (i.e. 1÷8192 GB; 4.5 yrs is a factor of 8 = 2 ^ (4.5÷1.5)). Note I had an Atari ST 520 computer with 512 KB in 1985, on which I produced one of the world's first commercial consumer-grade WYSIWYG word processors, Word Up, so this reinforces my point that the technology movers will have more RAM than average. Before 1990 there may have been more demand for memory because it was far too minuscule, and after 2010 we may experience less demand due to mobile computing cannibalizing desktops.
GPU memory appears to be scaling at a doubling every 1.8 years (roughly 21 months). So if we scaled at this slower rate and started at, say, 8 GB (above the average of roughly 6 GB for a new computer), the average computer would eventually catch up, and thus mining of the coin would become more inclusive over time.
There's also a tree search algorithm here which runs significantly faster with smaller data sets on CPUs/MICA versus GPUs; if it could be integrated into an encryption algorithm it might destroy GPU performance:
www.webislands.net/pubs/FAST__SIGMOD10.pdf
For the non-Larrabee CPU architecture it is only 2x faster within L2 cache, where it is compute bound. But small memory is where ASICs (and FPGAs) may be able to compete? And the GPU could run multiple instances in parallel to overcome this advantage. The vaporware Larrabee (AMD purportedly has a similar architecture planned for 2014) is faster than a GPU in large memory, but only because the "high speedup over GPUs comes from the inefficiency of performing horizontal SIMD operations on GPUs". I wonder whether GPUs could gain that capability by the time such a new CPU architecture is released.
Right now Salsa20/8 is used as a core part of scrypt (8 rounds in the inner loop). Replacing it with something like Salsa20/2 (just two rounds in the inner loop) would significantly improve performance on the CPU, because 4x fewer calculations would be involved. The memory access pattern would remain the same, resulting in almost no improvement for the miners that depend on memory performance (GPU miners, and also to some extent the Cell/BE miner). So what's the problem? The variants of Salsa20 with lower round counts are supposedly significantly less secure, at least in some cases:
Indeed, that would sacrifice the uniformity of the distribution in step 3 of ROMix, which might open an attack (only computing some unique paths) that violates Percival's criterion that Integerify(H(x)) mod N is not significantly faster to compute than H(x).
Indeed, Percival mentioned in section 6, as one of his 5 criteria, that the hash's computation cycles should be minimized (and my improvement applies only in step 3, which btw should also employ SIMD (SSE)).
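For reference, the round count being debated above is just the loop bound in the Salsa20 core. A minimal pure-Python version, matching the double-round structure that scrypt's Salsa20/8 uses (real miners implement this in C/SSE or OpenCL):

```python
def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def salsa20_core(words, rounds=8):
    """Salsa20/rounds core over 16 32-bit words; rounds=8 is what scrypt uses,
    rounds=2 would be the cheaper (and weaker) variant discussed above."""
    x = list(words)

    def qr(a, b, c, d):
        # Salsa20 quarter-round.
        x[b] ^= rotl32((x[a] + x[d]) & 0xffffffff, 7)
        x[c] ^= rotl32((x[b] + x[a]) & 0xffffffff, 9)
        x[d] ^= rotl32((x[c] + x[b]) & 0xffffffff, 13)
        x[a] ^= rotl32((x[d] + x[c]) & 0xffffffff, 18)

    for _ in range(rounds // 2):
        # Column round...
        qr(0, 4, 8, 12); qr(5, 9, 13, 1); qr(10, 14, 2, 6); qr(15, 3, 7, 11)
        # ...then row round (together: one double-round).
        qr(0, 1, 2, 3); qr(5, 6, 7, 4); qr(10, 11, 8, 9); qr(15, 12, 13, 14)

    return [(a + b) & 0xffffffff for a, b in zip(words, x)]

state = list(range(16))
print(salsa20_core(state, rounds=8)[:4])
print(salsa20_core(state, rounds=2)[:4])
```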