You wouldn't need 6GB. The latency of main memory on a GPU is horribly bad, so bad that any process which needs random access to GPU main memory will be annihilated by a CPU in terms of performance. GPU memory is designed to stream textures, and as such it couples massive bandwidth with extreme latency. Scrypt was designed to fill the L3 cache of a CPU. The developers of alt-coins had to intentionally lower the memory requirements by 99% to make GPUs competitive. Yes, Litecoin and clones use about 128KB of cache. The MINIMUM memory requirement for Scrypt is 12MB. It doesn't take 16GB. Try it out yourself or check out various hacking forums: the OpenCL performance for Scrypt (2^14, 8, 1) is beyond pathetic. A cheap CPU will run circles around it.
This begs the question: Was this known to the devs of Litecoin and/or Tenebrix? I mean, why else did they intentionally lower the memory requirements? Big scam after all (hurr durr gpu resistant)?
I have never gotten a satisfactory answer. I will point out that it is intentional, though. The default parameters for Scrypt are (N=2^14, r=8, p=1); the parameters used by Litecoin are (N=2^10, r=1, p=1).
I am not sure if it was a scam, but the end result is the same: Litecoin is 99% less memory hard than default Scrypt, and about 1/7000th as memory hard as the parameters recommended by the author for high-security (not realtime) applications.
Why not use Scrypt as intended? Scrypt with default parameters has beyond horrible performance on GPUs. The Litecoin developers modified it to make it roughly 128x less memory hard (using only 128KB total).
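For concreteness, scrypt's scratchpad is roughly 128*r*N bytes per hash, so the ratios above can be sanity-checked in a few lines. (The (2^20, 8, 1) "high security" setting below is my assumption of what the author recommended; the other two parameter sets are from the posts above.)
Code:
# scrypt's V array is N blocks of 128*r bytes, so memory per hash is ~128*r*N.
def scrypt_mem_bytes(n, r):
    return 128 * r * n

litecoin = scrypt_mem_bytes(2**10, 1)   # 131,072 bytes   = 128 KiB
default  = scrypt_mem_bytes(2**14, 8)   # 16,777,216 bytes = 16 MiB
high_sec = scrypt_mem_bytes(2**20, 8)   # ~1 GiB (assumed high-security setting)
print(default // litecoin, high_sec // litecoin)   # 128, 8192 (roughly "1/7000th")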
I'm working on this too, and the problem I'm anticipating is that it will take time to verify the hash. The performance I'm seeing when running scrypt configured to require 256MB of memory is a hash time of at least 0.5 seconds, with the time increasing linearly with the memory required. That's not so bad for mining, where you can just have a low difficulty, but it creates a problem for clients verifying the block chain: it makes verification a slower process, and could start to bite as the block chain gets longer.
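If anyone wants to reproduce that kind of measurement, here is a minimal sketch using Python's hashlib.scrypt (requires a Python built against OpenSSL 1.1+; the input bytes are placeholders and the timings will obviously vary by machine):
Code:
import hashlib, time

def time_scrypt(n, r=8, p=1):
    # The scratchpad is ~128*r*n bytes; give OpenSSL some headroom via maxmem.
    start = time.perf_counter()
    hashlib.scrypt(b"block header bytes", salt=b"nonce", n=n, r=r, p=p,
                   maxmem=128 * r * n + (64 << 20), dklen=32)
    return time.perf_counter() - start

for exp in range(14, 19):                     # 16 MiB up to 256 MiB at r=8
    n = 1 << exp
    mib = (128 * 8 * n) >> 20
    print(f"N=2^{exp} (~{mib} MiB): {time_scrypt(n):.2f} s per hash")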
Artforz said in one of his last posts on this forum that the verification time is the reason you can't jack up the n values.
So, the algorithm is fine as it is. If you increase the amount of memory required, you end up with a GPU-favoured implementation of scrypt.
I don't understand this line, but the rest of your post is welcome commentary that I do intend to provide counter-arguments for.
I would assume that the more memory required, the *less* feasible GPU mining becomes. For instance, you could (if artforz released the code) mine scrypt coins with a GPU, but it would be so inefficient that you might as well just mine them with the CPU. My understanding is that increasing the amount of memory required further would make GPUs even more pitiful. If you kept increasing the memory required, CPUs would decrease in hash power. Some CPUs with smaller and/or slower caches (or inefficient cache usage) would fail to keep up. This would push innovation to improve memory management in CPUs as people try to design ways to make CPUs address large cache sizes faster or make more efficient use of L2 and L3 cache.
We would first see more efficient mining software, just as people keep improving the existing scrypt miners, but ultimately we would be pushing for CPUs that are continuously improving at memory-hard math.
Although you argue it is difficult to make large amounts of cache easy to address, there is room for competition and innovation in this area as people push the boundaries of what is possible with the CPU.
Yes, it sounds like a lot of very difficult work, I agree, but that's the whole idea. It is a speculation market for emerging CPU technology.
Short version: compared to (1024,1,1) increasing N and r actually helps GPUs and hurts CPUs.
Longer version:
While things are small enough to fit in L2, each CPU core can act mostly independently and has pretty large read/write bandwidth; make it big enough to hit external memory and you've got ~15GB/s shared between all cores.
Meanwhile, GPU caches are too small to be of much use, so... with random reads at 128B/item, a 256-bit GDDR5 bus ends up well below 20% of peak bandwidth; at 1024B/item that percentage increases very significantly.
End result: a 5870 ends up about 6 times as fast as a Phenom II for scrypt(8192, 8, 1) (without really trying to optimize either side, so YMMV).
The only way to make scrypt win on CPU-vs-GPU again would be to go WAAAY bigger, think a >128MB V array, so you don't have enough RAM on GPUs to run enough parallel instances to mask latencies... but that also means it's REALLY slow (hash/sec? sec/hash!) and you need the same amount of memory to check results... Now who wants a *coin where a normal node needs several seconds and 100s of megs to gigs of RAM just to check a block PoW for validity?
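A back-of-the-envelope version of that last point (the 1 GB card and N=2^17 are illustrative assumptions, not measurements):
Code:
# Per-hash scratchpad vs. GPU RAM budget: with a huge V array, a card can't
# keep enough scrypt instances in flight to hide GDDR5 latency.
def scratchpad_mib(n, r):
    return 128 * r * n / 2**20

per_hash_mib = scratchpad_mib(2**17, 8)        # ~128 MiB per instance
gpu_ram_mib  = 1024                            # assume a 1 GB card
print(int(gpu_ram_mib // per_hash_mib))        # ~8 concurrent hashes, far too few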
The guys who maintain YACoin disagree with artforz and tacotime's position. They claim that verification time is not a problem based on their testing.
Sorry if this has been answered before, but I just found out about YACoin and I don't want to read all 170 pages.
Is YACoin just continually raising the N value? Does this mean it will eventually take a huge amount of time to check a block PoW for validity? How could this possibly be a good idea?
Probably the YACoin ongoing development thread will give you a better idea while reading much less than 170 pages:
https://bitcointalksearch.org/topic/annyac-yacoin-ongoing-development-206577
My post with my benchmarks for hash rates at various values of N, and when YACoin will switch to those values of N, is in the 15th post:
https://bitcointalksearch.org/topic/m.2162620
I benchmarked with a 4 year old dual Xeon E5450 server (almost stone age technology, but similar combined performance to today's i7-2600k's). It appears it'll be a few decades before even today's hardware (or hardware from 4 years ago) would have a problem with the time needed to validate a block PoW.
As time goes on, doubling of N becomes further and further apart in time. Advances in computing power will rapidly outpace the rising N over the long term.
Thanks for the reply! So am I to understand that artforz's analysis is wrong? I guess that wouldn't be the first time....
The other thread has the majority of the GPU discussion, including benchmarks from mtrlt, the developer of Reaper (the first GPU kernel released for Litecoin, in response to ArtForz's claim that Litecoin was GPU resistant). I disagree with ArtForz's claim that increasing N helps GPUs once both CPUs and GPUs are computing hashes large enough to be pushed to external RAM. I would say ArtForz's analysis was cherry-picked based on the specific value of N (8192) where computation gets pushed out of the L2 cache on the AMD Phenom II he was testing with.
Indications, including from mtrlt's benchmarks, are that the performance spread between CPUs and GPUs narrows as N rises, as long as we don't cherry-pick a specific result from a certain value of N on an AMD Phenom II CPU.
Also note that YACoin doesn't use the same scrypt variant as Litecoin. The mixing algorithm is switched from Salsa20/8 to ChaCha20/8, and the hashing algorithm is switched from SHA-256 to Keccak-512. Direct comparisons between the hash rates of the two at a given value of N are therefore going to be a bit of an apples-vs-oranges comparison.
Well, the N factor increases the memory required to compute a single hash (thus using more memory and memory bandwidth). Current GPUs will quickly run out of memory (or there's some other GPU-specific constraint that prevents the code from running at higher N, dunno). However, it also hits CPUs really hard (around a 40% hashrate decrease, if I remember correctly).
Nah, all you have to do is increase the lookup gap (via the previously published TMTO solution for scrypt from cgminer/reaper) and then you can compute the same hashes with less memory.
There's probably a bug in mtrlt's current code that doesn't allow calculation above N=4096, but it's possible that this particular TMTO implementation is not really well optimized for the GPU and that in the future, with some hacking, we'll see the gap widen further.
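For anyone unfamiliar with the lookup gap: it's the standard time-memory trade-off on scrypt's second loop, i.e. store only every G-th V entry and recompute the missing ones on demand. A simplified sketch of the idea (SHA-256 stands in for scrypt's real BlockMix step, so this is illustrative, not the actual cgminer/reaper code):
Code:
import hashlib

def H(block):
    # Stand-in for scrypt's BlockMix (really Salsa20/8, or ChaCha20/8 in YACoin).
    return hashlib.sha256(block).digest()

def romix_lookup_gap(x, n, gap):
    # First loop: keep only every gap-th V entry, cutting memory by ~gap.
    stored, v = [], x
    for i in range(n):
        if i % gap == 0:
            stored.append(v)
        v = H(v)
    # Second loop: missing entries are recomputed from the nearest stored one,
    # trading extra hashing time for the memory saved.
    for _ in range(n):
        j = int.from_bytes(v[:4], "little") % n
        vj = stored[j // gap]
        for _ in range(j % gap):
            vj = H(vj)
        v = H(bytes(a ^ b for a, b in zip(v, vj)))
    return v
With gap=1 this is plain full-memory scrypt; gap=2 halves the scratchpad at the cost of extra mixing calls on average, which is the knob the posts above are referring to.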
The further up the N value you get, the greater the dependence on memory access speeds you typically observe (or at least, that's what I observed using scrypt-jane on a CPU). I wouldn't be surprised if eventually an optimized GPU implementation came along that destroyed CPUs on efficiency and speed.
BLAKE is used as an entropy source to randomize memory access too; I wouldn't be surprised if you looked at accesses to the lookup table and found that they end up being less than random as well, due to consistent ordering of some types of data in the block header (thus also diminishing the amount of memory required). I think pooler observed this when he was writing his CPU miner.
The whole point of trying to make a GPU-hard coin is to get a more even initial coin distribution than bitcoin/litecoin did. The number of people with CPUs is way higher than the number with good GPUs. There is no point to making a new alt-coin to change the mining algorithm if it doesn't promote a wider distribution by cutting out the GPU farmers. The algorithm doesn't have to last forever; it only has to last a few years until ASICs are developed. Litecoin lost its whole purpose when it was taken over by GPUs. If we knew a way to assign an equal amount of coins to everyone on the planet in a decentralized way, we would do that, but that technology is decades away. Distributing it to everyone with a CPU is way less fair, but it is still vastly superior to giving it to everyone with a GPU.
Here are the benchmarks from mtrlt, who was the first to write a GPU miner for Litecoin. He switched to mining YACoin because it was more profitable for him. Now he is apparently writing a GPU primecoin miner, but I haven't paid attention for several months.
Here are all my GPU benchmarking results, and also the speed ratio of GPUs and CPUs, for good measure.
GPU: HD6990, underclocked 830/1250 -> 738/1250, undervolted 1.12V -> 0.96V; assuming 320W power usage.
CPU: WindMaster's 4 year old dual Xeon, assuming 80W power usage. In reality it's probably more, but newer processors achieve the same performance with less power usage.
N       GPU speed     CPU speed     GPU/CPU power-efficiency ratio
32      10.02 MH/s    358.8 kH/s    6.98
64      6.985 MH/s    279.2 kH/s    6.25
128     3.949 MH/s    194.0 kH/s    5.1
256     2.004 MH/s    119.2 kH/s    4.2
512     1.060 MH/s    66.96 kH/s    3.95
1024    544.2 kH/s    34.80 kH/s    3.9
2048    278.7 kH/s    18.01 kH/s    3.88
4096    98.5 kH/s     9.077 kH/s    2.72
8192+   0 H/s         4.595 kH/s    0
GPUs are getting comparatively slower bit by bit, until (as I've stated in an earlier post) at N=8192, GPU mining seems to break altogether.
EDIT: Replaced GPU/CPU ratio with a more useful power-efficiency ratio.
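For what it's worth, that ratio column appears to be (GPU hashes per watt) divided by (CPU hashes per watt) using the 320 W and 80 W assumptions above; checking the N=32 row reproduces it:
Code:
gpu_hps, gpu_watts = 10.02e6, 320      # HD6990 row at N=32
cpu_hps, cpu_watts = 358.8e3, 80       # dual Xeon row at N=32
print((gpu_hps / gpu_watts) / (cpu_hps / cpu_watts))   # ~6.98, matches the table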
TacoTime asked if he had played with the lookup gap, and he said he had played with it quite a bit and couldn't get it to mine faster. You can see here that jacking up the N value DOES make GPU mining substantially less effective relative to the CPU, and apparently they are not having problems with verification times. YAcoin switches to N=8192 on August 13th. You should probably get WindMaster in here to comment.