To begin with, there are no consumer-grade GPUs with more than 6GB on the market.
Demand is a basic Economics 101 phenomenon: it causes supply to rise to meet it along the price curve, assuming Memorycoin became significant. Although ASICs would probably take over before that point anyway.
The point is we want to test what the technical limitations are, not just what the market currently bears, because a coin needs to be future-proof.
Because if GPUs can become more efficient at solving the hash by adding more memory, then we need to factor that into our analysis.
However, see below: I no longer think more memory is necessary to increase parallelization.
Besides, I have already done all the coalescing possible, both statistically and logically, and overall it only yields a 3x-4x advantage over the CPU (10 hpm on a 7870).
10 hpm is 2.5x, correct? (FreeTrade reported 4 hpm on the CPU.) That is faster than the last report I had seen from you in this thread.
That is consistent with the conjecture that it is AES compute bound. Do you have any measurement giving an estimate of how close to compute bound your implementation is?
The PoW itself is not parallelizable due to CBC encryption.
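To illustrate why CBC chaining blocks parallelization, here is a minimal sketch. The "cipher" below is a hypothetical 16-bit stand-in permutation, not the actual AES used by the PoW; only the chaining structure matters to the argument: each ciphertext block is an input to the next, so block i cannot be computed until block i-1 is done.

```python
# Toy illustration of why CBC-mode encryption is inherently sequential.
# toy_block_cipher is a stand-in bijection on 16-bit values, NOT real AES.

def toy_block_cipher(block: int) -> int:
    # Odd multiplier mod 2^16 plus XOR: an invertible (bijective) mixing step.
    return ((block * 31337) ^ 0xBEEF) & 0xFFFF

def cbc_encrypt(plaintext_blocks, iv: int):
    ciphertext = []
    prev = iv
    for p in plaintext_blocks:
        # Each block's input depends on the PREVIOUS ciphertext block,
        # so the loop iterations cannot run in parallel.
        prev = toy_block_cipher(p ^ prev)
        ciphertext.append(prev)
    return ciphertext

print(cbc_encrypt([0x1111, 0x2222, 0x3333], iv=0xABCD))
```

A consequence of the chaining is that changing any earlier plaintext block changes every later ciphertext block, which is exactly the dependency that defeats per-block parallelism.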
I had forgotten from the upthread discussion that the hash can run up to 16,384 threads simultaneously without needing more than 1 GB.
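The arithmetic behind that thread count, assuming each thread's working set is 64 KB as discussed upthread:

```python
# How many 64 KB per-thread working sets fit in a 1 GB memory budget?
working_set = 64 * 1024          # bytes per thread (per upthread discussion)
memory_budget = 1 * 1024 ** 3    # 1 GB
threads = memory_budget // working_set
print(threads)  # 16384
```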
How many threads are you running? Did you try increasing the number of threads?
The point, I believe, is to get multiple random memory accesses to overlap statistically; they will be stored in the 768 KB cache so latency is masked by memory bandwidth. Although I am not sure how sophisticated the GPU is at merging coincident random memory accesses across threads into a sequential memory access.
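A rough way to estimate how much overlap is needed for latency masking is Little's law: the number of in-flight bytes required to saturate the bus equals bandwidth times latency. The bandwidth, latency, and transaction-size figures below are illustrative assumptions for a 7870-class GPU, not measured values:

```python
# Little's-law sketch: in-flight requests needed to hide DRAM latency.
# All three figures are assumptions for illustration, not measurements.
bandwidth = 154e9        # bytes/second (assumed peak memory bandwidth)
latency = 400e-9         # seconds (assumed DRAM round-trip latency)
request_size = 64        # bytes per coalesced memory transaction

in_flight_bytes = bandwidth * latency          # Little's law: L = lambda * W
requests_needed = in_flight_bytes / request_size
print(requests_needed)
```

On these assumed numbers, on the order of a thousand memory requests must be in flight at once, which is why running many threads (far more than the number of ALUs) matters.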
Of course, an ASIC may employ different techniques to reduce the latency, such as 3D memory.
As far as I can see, it simply needs to have a main memory similar to the CPU's (and perhaps an L2 cache), or it could perhaps even be a PCIe card that runs in your PC.
The point is that AES can be made to run much faster if the CPU is compute bound, as I showed (see the link to the upthread post).
but in 64-byte chunks of 64k linear ranges.
I thought it was working on a random chunk 64 KB in size? So the random-access latency shouldn't be a factor, except that perhaps the 64 KB is loaded very quickly due to the GPU's very fast memory bandwidth. I am wondering if you did something wrong, if you are misinterpreting some statistics you've analyzed, or if I am not understanding the algorithm? Or if you are not running enough threads to statistically mask the latency?
Good luck in designing such an ASIC though.
Upthread I cited references for low-transistor-count ASIC designs that run AES much faster.
But the GPU is pretty much limited in what it can and cannot do.
The GPU is limited only by its very slow memory latency and by the lack of specialized AES instructions. The former can't be rectified, as it is fundamental to what makes the memory bandwidth so fast. The latter could perhaps be added to GPUs, since the transistor counts required are relatively small, as I cited with references upthread.