It should be pointed out that there is nothing impossible about mining through a browser app, whether that is CPU or GPU mining, and that browser mining does not by itself make scrypt GPU-friendly
(though it does raise an interesting question about the possibility of mining-enabled "rich content" banners that earn you both ad money and coins)
As to TBX memory-hardness, this is relevant:
https://bitcointalksearch.org/topic/tenebrix-scaling-questions-45849

I see now.
Basically, instead of aiming for "enough memory use to hit external memory on any device", the aim is "small enough to fit in L2 cache on CPUs, but big enough that running enough parallel instances to make GPU/FPGA/VLSI worthwhile would require too much on-chip memory/cache."
Which makes sense. But most CPUs have an L3 cache now too, so why not use a problem size that reaches into that range as well?
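For a sense of the numbers: scrypt's big scratchpad takes 128 * r * N bytes, so the cost parameters pin down which cache level the working set lands in. A quick sketch below (Tenebrix-style scrypt with r = 1 is my assumption here, not something stated in the thread):

#include <stdio.h>

int main(void) {
    unsigned long r = 1;                        /* scrypt block-size parameter (assumed) */
    for (unsigned long N = 256; N <= 16384; N *= 2) {
        unsigned long bytes = 128UL * r * N;    /* scratchpad V = 128 * r * N bytes */
        printf("N = %5lu -> %5lu KiB %s\n", N, bytes / 1024,
               bytes <= 128UL * 1024 ? "(fits in a 128 KiB L2)"
                                     : "(spills past 128 KiB)");
    }
    return 0;
}

With N = 1024 and r = 1 the scratchpad is exactly 128 KiB, which matches the 128 kB figure mentioned below; a few more doublings of N is what it would take to target L3-sized working sets instead.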
This article also helps make sense of it:
Another important distinction in memory performance between the GPU and the CPU is the role of the cache. Unlike the cache on the CPU, the GPU texture cache exists primarily to accelerate texture filtering. As a result, GPU caches need to be only as large as the size of the filter kernel for the texture sampler (typically only a few texels), which is too small to be useful for general-purpose computation. GPU cache formats are also optimized for locality in two dimensions, which is not always desirable. This is in contrast to the Pentium 4 cache, which operates at a much higher clock rate and contains megabytes of data. In addition, the Pentium 4 is able to cache both read and write memory operations, while the GPU cache is designed for read-only texture data. Any data written to memory (that is, the frame buffer) is not cached but written out to memory.
What does this mean for general-purpose computation on GPUs? The read-write CPU cache permits programmers to optimize algorithms to operate primarily out of the cache. For example, if your application data set is relatively small, it may fit entirely inside the Pentium 4 cache. Even with larger data sets, an application writer can "block" the computation to ensure that most reads and writes occur in the cache. In contrast, the limited size and read-only nature of the GPU cache puts it at a significant disadvantage. Therefore, an application that is more limited by sequential or random read bandwidth, such as adding two large vectors, will see much more significant performance improvements when ported to the GPU. The vector addition example sequentially reads and writes large vectors with no reuse of the data—an optimal access pattern on the GPU.
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html

So as long as the algorithm requires cache utilization above 128 kB, it will probably be pretty slow on ASICs or GPUs. This is not an easily remedied problem, because the only way to get that much cache is to physically add billions more transistors onto the die for it.
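To tie that back to scrypt itself: what makes the cache requirement hard to dodge is the second loop of ROMix, where each read index is derived from the data just computed, so the reads jump pseudo-randomly across the whole scratchpad and cannot be prefetched or held in a few-texel read-only texture cache. A simplified sketch of that structure (the real BlockMix/Salsa20/8 step is replaced by a dummy mix() here, so this shows only the shape of the algorithm, not actual scrypt):

#include <stdint.h>
#include <string.h>

#define N 1024            /* cost parameter: scratchpad holds N blocks           */
#define BLOCK_WORDS 32    /* one 128-byte block as 32 x 32-bit words (r = 1)      */

/* Dummy stand-in for scrypt's BlockMix/Salsa20/8 step -- not the real function. */
static void mix(uint32_t x[BLOCK_WORDS]) {
    for (int i = 0; i < BLOCK_WORDS; i++)
        x[i] = (x[i] ^ (x[i] << 13)) + x[(i + 1) % BLOCK_WORDS];
}

void romix(uint32_t x[BLOCK_WORDS]) {
    static uint32_t V[N][BLOCK_WORDS];           /* the 128 KiB scratchpad */

    /* Pass 1: fill the scratchpad sequentially. */
    for (int i = 0; i < N; i++) {
        memcpy(V[i], x, sizeof V[i]);
        mix(x);
    }

    /* Pass 2: the index j depends on the block just produced, so the reads are
       data-dependent and scattered over all 128 KiB of V -- exactly the access
       pattern a tiny read-only GPU cache cannot absorb.                        */
    for (int i = 0; i < N; i++) {
        uint32_t j = x[BLOCK_WORDS - 1] % N;
        for (int k = 0; k < BLOCK_WORDS; k++)
            x[k] ^= V[j][k];
        mix(x);
    }
}

Running thousands of such instances in parallel on a GPU would mean thousands of independent 128 KiB scratchpads, which is where the on-chip memory budget runs out.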