Topic: I got an idea that could make scrypt fpga/asics a lot faster. [TECHNICAL] (Read 914 times)

hero member
Activity: 672
Merit: 500
You cannot find GPU memory latency easily because it's a function of the kernel being run.
For graphics kernels, latency is effectively 0.
Scrypt on my 7750 is not memory limited at all (using GAP 2); my performance scales perfectly with GPU clock.

The memory you're talking about exists in commercial products. GPUs have it, it's called Local Data Share or "local memory". It's already twice as big as L1 (just because some idiots think cache is efficient), it has lower latency (potentially a couple of clocks), twice the bandwidth, it has a 32-way crossbar (!!!) and I've been told it burns 1/4 of the power.
hero member
Activity: 896
Merit: 1000
You need fast memory directly connected to the cores, not the GDDR5 memory.
newbie
Activity: 28
Merit: 0
Quote
you are not talking about a few bits of memory to store, but a lot more...

Yeah, I know. 128 kB (1,048,576 bits) per hash per core, if I remember correctly, which would require 1,048,576 flip-flops (plus some overhead from the various calculations that have to be done). Remember that memory usage only starts to get gigantic when you run multiple cores. Each modern ASIC chip has as many cores as its designers could fit on the die. As for GPUs, my 7970, for example, has 2048 cores. 2048 times 128 kB equals 262,144 kB (256 MB), which is right in the ballpark of how much GPU RAM it uses while hashing. Divide the number of hashes per second it runs (700 kH/s) by the number of cores (2048), multiply by the number of memory accesses per hash (I forgot how many), then take the reciprocal of that (divide one by it), and I'd bet you would come up with a figure somewhere close to the memory latency of the GPU's RAM (I googled for an hour and still can't find it).
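The arithmetic in this post can be checked with a short sketch. The scratchpad size, core count, and hash rate are the figures from the post. The number of memory accesses per hash is an assumption, since the post leaves it out: with Litecoin-style scrypt parameters (N = 1024, r = 1), each hash writes 1024 blocks to the scratchpad and then reads 1024 blocks back. The result is the average time between accesses per core, which the post treats as a rough latency estimate.

```python
# Back-of-the-envelope check of the figures in the post.
SCRATCHPAD_BYTES = 128 * 1024   # 128 kB scratchpad per hash (from the post)
CORES = 2048                    # cores on a 7970 (from the post)
HASHRATE = 700_000              # 700 kH/s (from the post)
ACCESSES_PER_HASH = 2 * 1024    # ASSUMPTION: N writes + N reads, with N = 1024


def scratchpad_bits(num_bytes: int = SCRATCHPAD_BYTES) -> int:
    """Bits (and therefore flip-flops, at one per bit) for one scratchpad."""
    return num_bytes * 8


def total_scratchpad_bytes(cores: int = CORES,
                           per_core: int = SCRATCHPAD_BYTES) -> int:
    """Memory needed when every core keeps its own scratchpad."""
    return cores * per_core


def seconds_per_access(hashrate: float = HASHRATE,
                       cores: int = CORES,
                       accesses: int = ACCESSES_PER_HASH) -> float:
    """Average time between scratchpad accesses on a single core."""
    accesses_per_second_per_core = hashrate / cores * accesses
    return 1.0 / accesses_per_second_per_core


if __name__ == "__main__":
    print(scratchpad_bits())                  # 1048576 bits
    print(total_scratchpad_bytes() / 2**20)   # 256.0 (MiB)
    print(seconds_per_access() * 1e6)         # roughly 1.43 (microseconds)
```

Under these assumptions the estimate comes out at about 1.4 microseconds per access. That is a throughput figure rather than a true latency, because a GPU overlaps many accesses at once.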

But can you fit that many flip-flops on a chip, you might ask? Well, I figure, if you can fit it on an FPGA, you can fit it on an ASIC. Look at pages 10-11 of this user manual. According to that, you can get up to 2,443,000 flip-flops on an FPGA, and that's just that brand/series alone, so it can certainly be done on an ASIC.
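To put the flip-flop count in perspective, here is a minimal sketch of how many 128 kB scratchpads would fit in the quoted 2,443,000 flip-flops, assuming one flip-flop per bit and ignoring the logic the hashing cores themselves need. The function name is made up for illustration.

```python
FLIP_FLOPS_AVAILABLE = 2_443_000        # figure quoted from the manual in the post
BITS_PER_SCRATCHPAD = 128 * 1024 * 8    # one flip-flop per scratchpad bit


def scratchpads_that_fit(flip_flops: int = FLIP_FLOPS_AVAILABLE,
                         bits: int = BITS_PER_SCRATCHPAD) -> tuple[int, int]:
    """Return (whole scratchpads that fit, flip-flops left over)."""
    return divmod(flip_flops, bits)


if __name__ == "__main__":
    count, left_over = scratchpads_that_fit()
    print(count, left_over)   # 2 345848
```

So on that part the budget covers two full scratchpads, with the remainder available for everything else.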
legendary
Activity: 1400
Merit: 1050
you are not talking about a few bits of memory to store, but a lot more...
newbie
Activity: 28
Merit: 0
I had a relatively simple design idea a while ago about how to speed up the hashrate of any ASIC/FPGA design by a couple of orders of magnitude at the expense of a lot more logic. While I don't know if it was genius or just dumb, I've been keeping that idea to myself in the hopes that someday I'd learn enough about chip design to make my own FPGA, and then later that I'd make a couple of friends in the industry who would let me make a small run of ASICs at a doable price.

Unfortunately, my interests have been moving in a different direction lately, and none of those hopes have come to fruition. So, I might as well share the idea with some of you fine folks. Just promise me that if you're a developer out there who wants to implement it in your design, you'll reserve me a spot on the pre-order and maybe give me an employee discount. Of course, I can't make you do anything, but it'd be nice.

Anyway, on to the idea: The main problem with designing a fast scrypt ASIC is the fact that scrypt is memory intensive. We all know that already, right? Well, what you probably don't know is that the problem doesn't have so much to do with the physical amount of memory needed as it does with the number of individual memory accesses needed. Why? Because memory, or more specifically, external RAM/cache memory, is slow. Really slow. And, if you design one, you'll find that the speed of your ASIC soon becomes the speed of your ASIC's memory, divided by the number of cores you implemented, plus or minus a small bit of overhead.
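A toy model of that claim, as a sketch only: once the shared memory is saturated, total hash rate stops growing and each extra core just gets a smaller share. All of the numbers here (memory access rate, per-core compute rate, accesses per hash) are hypothetical, chosen for illustration and not measured from any real chip.

```python
def total_hashrate(mem_accesses_per_s: float,
                   accesses_per_hash: int,
                   cores: int,
                   core_hashrate: float) -> float:
    """Hash rate of a chip whose cores all share one external memory."""
    compute_bound = cores * core_hashrate                   # limit if memory were free
    memory_bound = mem_accesses_per_s / accesses_per_hash   # limit set by the memory
    return min(compute_bound, memory_bound)


if __name__ == "__main__":
    MEM_RATE = 100e6        # hypothetical: 100 million accesses per second
    ACCESSES = 2048         # hypothetical: accesses per hash
    CORE_RATE = 10_000.0    # hypothetical: H/s one core could do with ideal memory

    for cores in (1, 4, 8, 16):
        total = total_hashrate(MEM_RATE, ACCESSES, cores, CORE_RATE)
        # Past the saturation point, total stays flat and per-core speed drops.
        print(cores, total, total / cores)
```

With these made-up numbers the chip is compute-bound at 1 and 4 cores and memory-bound from 8 cores on, which is the "memory speed divided by the number of cores" behaviour described above.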

So, my idea is, why use memory at all? Just because we have to store a bunch of bits at various points in the hashing process to use later doesn't mean we have to resort to a different chip entirely. Just build your own memory, in logic! Why use slow, multipurpose RAM when you could use fast, application-specific memory instead? After all, "application-specific" is in the name of the thing you're designing!

All you need is a bit of clever design and a whole shitload of flip-flops (as in 2 NOR gates feeding back into each other, not the shoe). There's no seek time because you know where each bit is located, and now reads and writes are less complicated than adding 1+1. Memory latency is now just as fast as the rest of the circuit.
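For anyone who hasn't seen the "2 NOR gates" storage cell, here is a minimal software sketch of it. Strictly speaking, two cross-coupled NOR gates form an SR latch, and a clocked flip-flop is built from latches plus extra gating. The class and method names are made up for illustration, and this models logic levels only, not timing.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


class SRLatch:
    """One bit of storage made from two cross-coupled NOR gates."""

    def __init__(self) -> None:
        self.q = 0        # stored bit
        self.q_bar = 1    # complement output

    def step(self, s: int, r: int) -> int:
        """Apply set/reset inputs, let the gates settle, return the stored bit."""
        if s and r:
            raise ValueError("S and R must not both be 1")
        for _ in range(4):  # a few passes are enough for the feedback to settle
            q_next = nor(r, self.q_bar)
            q_bar_next = nor(s, self.q)
            if (q_next, q_bar_next) == (self.q, self.q_bar):
                break
            self.q, self.q_bar = q_next, q_bar_next
        return self.q


if __name__ == "__main__":
    bit = SRLatch()
    print(bit.step(1, 0))  # 1: set
    print(bit.step(0, 0))  # 1: holds the value with both inputs low
    print(bit.step(0, 1))  # 0: reset
    print(bit.step(0, 0))  # 0: still holding
```

Each stored bit costs at least these two gates, so a 128 kB scratchpad built this way needs over a million such cells per core.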

So... Good idea, or bad idea? Because I assume someone had to think of this before me.