Interesting. So the GPU threads stall until the memory read completes (given that with the full scratchpad, each blockmix cycle needs a 128-byte read from an address generated by the previous blockmix).
Yes, and the GPU implements something like hyperthreading, but significantly beefed up (not just 2 virtual threads per core as on a CPU, but a lot more). A stalled GPU thread does not mean that the GPU's ALU resources sit idle; they are simply allocated to executing other threads.
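As a purely illustrative example (made-up round numbers): if a random DRAM access costs ~400 cycles and a thread has only ~40 cycles of blockmix arithmetic to do per access, then roughly ten resident threads per ALU lane are needed to keep that lane busy. GPUs keep dozens of warps/wavefronts in flight per compute unit for exactly this reason.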
Regarding bandwidth vs. latency: fortunately, reads in 128-byte chunks are a perfect fit for SDRAM. SDRAM is generally optimized for large burst reads/writes, which are used for cache line fills and evictions, and processor cache lines are roughly in the same ballpark (typically even smaller than 128 bytes). With such large bursts, the memory bandwidth can be fully utilized without any problems, and the latency can be hidden.
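For scale (assuming Litecoin's scrypt parameters, N = 1024 and r = 1, and ignoring the first loop's writes): the second loop reads N x 128 bytes = 128 KiB of scratchpad per hash, so a device with, say, 25 GB/s of usable bandwidth tops out around 25e9 / 131072, or roughly 190,000 hashes per second, on memory traffic alone, regardless of ALU speed.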
That makes sense given the huge number of threads available on a GPU, but I wonder whether the same approach works on an FPGA too (using external SDRAM): use internal block RAM to hold the per-thread state (B/Bo) and switch threads while waiting for the SDRAM. Not sure that actually works. Food for thought, thanks.
There is a widely used software optimization technique called pipelining, which makes it possible to fully hide the memory access latency for scrypt. In the Cell/BE miner (which was developed long before mtrlt's GPU miner) I was calculating 8 hashes at once per SPU core. These hashes were split into two groups of 4 hashes for pipelining purposes. So the second loop, where the addresses depend on previous calculations, looks like this:
dma request the initial four 128-byte chunks for the first group
dma request the initial four 128-byte chunks for the second group
loop {
    check dma transfer completion and do calculations for the first group
    dma request the next needed four 128-byte chunks for the first group
    check dma transfer completion and do calculations for the second group
    dma request the next needed four 128-byte chunks for the second group
}
The idea is that while the DMA transfer from external memory to local memory is in progress, we simply do calculations for the other group of hashes instead of blocking. The actual code for this loop is here:
https://github.com/ssvb/cpuminer/blob/058795da62ba45f4/scrypt-cell-spu.c#L331. The Cell in the PlayStation 3 has enough memory bandwidth headroom (roughly 25 GB/s in total) and is limited only by the ALU performance of its 6 SPU cores (or 7 SPU cores with hacked firmware). So there was no need to implement scratchpad lookup-gap compression for that particular hardware.
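To make the pattern easier to adapt, here is a simplified self-contained C sketch of the same two-group interleaving (an illustration, not the actual SPU code): dma_get()/dma_wait() are hypothetical stand-ins for the non-blocking mfc_get()-style DMA primitives (modeled with a plain memcpy so the skeleton compiles anywhere), and blockmix() is a trivial stub instead of the real salsa20/8-based BlockMix:

#include <stdint.h>
#include <string.h>

#define GROUPS 2          /* two groups of hashes, interleaved          */
#define LANES  4          /* hashes per group                           */
#define CHUNK  128        /* one scrypt block for r = 1: 128 bytes      */
#define N      1024       /* scratchpad entries (Litecoin scrypt)       */

typedef struct { uint8_t b[CHUNK]; } chunk_t;

/* Stand-in for a non-blocking DMA request; on the SPU this would queue
 * an mfc_get() and return immediately instead of copying synchronously. */
static void dma_get(chunk_t *dst, const chunk_t *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Stand-in for waiting on a DMA tag group; a no-op in this sketch. */
static void dma_wait(int tag)
{
    (void)tag;
}

/* Stub standing in for the salsa20/8-based blockmix; it just XORs the
 * fetched chunk into the state and derives the next lookup index.     */
static uint32_t blockmix(chunk_t *x, const chunk_t *v)
{
    size_t k;
    for (k = 0; k < CHUNK; k++)
        x->b[k] ^= v->b[k];
    return (uint32_t)x->b[0] | ((uint32_t)x->b[1] << 8);
}

/* The second scrypt loop for GROUPS * LANES hashes, pipelined so that
 * while one group's chunks are in flight, the other group computes.   */
static void second_loop(chunk_t x[GROUPS][LANES],
                        chunk_t buf[GROUPS][LANES],
                        chunk_t scratch[GROUPS][LANES][N],
                        uint32_t idx[GROUPS][LANES])
{
    int g, i, j;

    /* Prime the pipeline: request the initial chunks for both groups. */
    for (g = 0; g < GROUPS; g++)
        for (i = 0; i < LANES; i++)
            dma_get(&buf[g][i], &scratch[g][i][idx[g][i]]);

    for (j = 0; j < N; j++) {
        for (g = 0; g < GROUPS; g++) {
            /* While this group's data was in flight we were computing
             * the other group, so this wait should rarely stall.      */
            dma_wait(g);
            for (i = 0; i < LANES; i++)
                idx[g][i] = blockmix(&x[g][i], &buf[g][i]) % N;
            if (j + 1 < N)
                for (i = 0; i < LANES; i++)
                    dma_get(&buf[g][i], &scratch[g][i][idx[g][i]]);
        }
    }
}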
I believe that similar pipelining to hide the latency of external DRAM accesses can also be implemented fairly easily on an FPGA or ASIC. But an FPGA or ASIC still needs a lot of memory bandwidth even after the scratchpad size reduction; otherwise the external memory becomes the performance bottleneck. Beating GPUs equipped with fast GDDR5 is going to be a tough challenge.
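Some rough numbers for perspective (approximate, from memory): a single 64-bit DDR3-1600 channel gives ~12.8 GB/s, while a GDDR5 card like the Radeon HD 7970 delivers around 264 GB/s. And note that lookup-gap compression with gap G mainly shrinks the scratchpad's size and the first loop's write traffic: each second-loop lookup still reads one 128-byte chunk (plus up to G-1 recomputed blockmix steps), so the per-hash read bandwidth requirement barely drops.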
PS: ssvb, you have some very interesting threads linked in your post history. Thank you for posting here; I'm late to this party and this helps enormously.
Well, I've been away from the party for a long time now.