Topic: Can the asic miners mine scrypt currencies ? (Read 8901 times)

hero member
Activity: 774
Merit: 500
Lazy Lurker Reads Alot
August 18, 2013, 12:23:58 PM
#36
Wow, you're only the tenth person to ask, and the answer is still no.

They're chips designed to do one thing, and one thing only: mine Bitcoins (or other pointless SHA256 altcoins).

lol no dude, they are totally not designed for doing ONLY btc
in fact ASICs are designed for all sorts of tasks totally unrelated to bitcoin
in most cases these chips are made to serve a single purpose and see long-term use in many processing applications
they have also been used to perform graphics calculations, for example in the 80's in the ZX Spectrum
Indeed, most can do just one thing, which is calculate very fast, so they can be used for crypto calculations
These chips have one big disadvantage compared to, for instance, an FPGA: they cannot be changed once made.
That is why ASIC chips are so darn cheap and so easy to make, yet in the bitcoin world they seem to be made of diamonds and gold.
Again, these chips are very cheap to produce.
sr. member
Activity: 384
Merit: 250
Thanks very much for that explanation. I've been working off the original scrypt.c in cgminer (the OpenCL GPU code is rather beyond my ken), but your Cell code does look useful.

I believe that similar pipelining for hiding the latency of external DRAM accesses can also be easily implemented with an FPGA or ASIC. But the FPGA or ASIC must still have a lot of memory bandwidth even after the scratchpad size reduction, otherwise the external memory will become a performance bottleneck. Beating the GPUs equipped with fast GDDR5 is going to be a tough challenge.

I'm just playing with the FPGA implementation as a hobby, though I'm hoping it may be of some use with all those bitcoin FPGA boards that are going to be just junk in a few months as the nethash climbs exponentially. So my code just uses the internal block RAM resource, which is very limited (4.5Mbit on an LX150, enough for 4 full or 9 half scratchpads). Fitting the cores is no problem, but routing them is a nightmare. I've recently been looking at pipelining, but it seems this just makes it even more unroutable. Still, there may be a way forward, and your input was very welcome.

Jasinlee has his own project looking at using external SDRAM, which I guess will look a lot like a GPU-style solution (with exactly the same problems of RAM bandwidth and latency).

Quote
Well, I've already been away from the party for a long time :)

I wish you well. I'm currently reading through your (and D&T's, and others') old threads for inspiration, so your words do live on :)
newbie
Activity: 39
Merit: 0
Interesting. So the GPU threads stall until the memory read is completed (given that, for the full scratchpad, each blockmix cycle needs a 128-byte read from an address generated by the previous blockmix).
Yes, and the GPU implements something like hyperthreading, but significantly beefed up (not just 2 virtual threads per core as on a CPU, but a lot more). A stalled GPU thread does not mean that the GPU ALU resources are idle; they are just allocated to executing other threads.

Regarding bandwidth vs. latency: fortunately, reads of 128-byte chunks are just perfect for SDRAM. SDRAM is generally optimized for large burst reads/writes to do cache line fills and evictions, and the size of cache lines in processors is roughly in the same ballpark (typically even smaller than 128 bytes). Using such large bursts means that the memory bandwidth can be fully utilized without any problems, and the latency can be hidden.
Quote
It makes sense for the huge number of threads available on a GPU, but I wonder if this approach works with an FPGA too (using external SDRAM): using internal block RAM to hold the thread state (B/Bo) and switching threads while waiting for the SDRAM. Not sure that works, actually. Food for thought, thanks.
There is a software optimization technique called pipelining, which is rather widely used. It makes it possible to fully hide the memory access latency for scrypt. In the Cell/BE miner (which was developed long before mtrlt's GPU miner) I was calculating 8 hashes at once per SPU core. These hashes were split into two groups of 4 hashes for pipelining purposes. So the second loop, where the addresses depend on previous calculations, looks like this:
Code:
dma request the initial four 128 byte chunks for the first group
dma request the initial four 128 byte chunks for the second group
loop {
    check dma transfer completion and do calculations for the first group
    dma request the next needed four 128 byte chunks for the first group
    check dma transfer completion and do calculations for the second group
    dma request the next needed four 128 byte chunks for the second group
}
The idea is that while the DMA transfer from external memory to local memory is in progress, we just do calculations for the other group of hashes without blocking. The actual code for this loop is here: https://github.com/ssvb/cpuminer/blob/058795da62ba45f4/scrypt-cell-spu.c#L331. The Cell in the PlayStation 3 has enough memory bandwidth headroom (with its total ~25GB/s memory bandwidth) and is only limited by the performance of the ALU computations done by its 6 SPU cores (or 7 SPU cores with a hacked firmware). So there was no need to implement scratchpad lookup-gap compression for that particular hardware.
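To illustrate the same trick outside the Cell, here is a minimal generic-C sketch of two-group software pipelining (my own illustration, not ssvb's actual code; prefetch/wait_done are hypothetical stand-ins for the SPU's mfc_get-style DMA primitives):
Code:
/* Two-group software pipelining sketch: while one group's memory
   reads are in flight, compute on the other group. A hypothetical
   stand-in for the real scrypt-cell-spu.c DMA code. */
#include <stdint.h>
#include <string.h>

#define HASHES 4                       /* hashes per group */
#define CHUNK  128                     /* bytes fetched per hash per step */

typedef struct {
    uint32_t j[HASHES];                /* next scratchpad index per hash */
    uint8_t  buf[HASHES][CHUNK];       /* landing buffer for fetched chunks */
} group_t;

static uint8_t *V;                     /* external-memory scratchpad (set by caller) */
static uint32_t N = 1024;              /* scratchpad entries per hash */

/* On Cell this would issue an mfc_get DMA and return immediately;
   here it is a plain copy so the sketch stays self-contained. */
static void prefetch(group_t *g) {
    for (int h = 0; h < HASHES; h++)
        memcpy(g->buf[h], V + (size_t)g->j[h] * CHUNK, CHUNK);
}

/* On Cell: wait on the DMA tag group; nothing to do in this sketch. */
static void wait_done(group_t *g) { (void)g; }

/* Mix the fetched chunk into the hash state and derive the next index.
   The real code runs BlockMix here; a stand-in keeps the sketch short. */
static void mix(group_t *g) {
    for (int h = 0; h < HASHES; h++)
        g->j[h] = (g->j[h] * 1103515245u + 12345u) % N;
}

void second_loop(group_t *a, group_t *b, int iters) {
    prefetch(a);                       /* prime the pipeline */
    prefetch(b);
    for (int i = 0; i < iters; i++) {
        wait_done(a); mix(a);          /* compute on A while B's fetch is in flight */
        prefetch(a);                   /* A's next addresses are known only now */
        wait_done(b); mix(b);
        prefetch(b);
    }
}
The point is simply that the dependent reads of group A are overlapped with the computation of group B and vice versa, so the memory latency never leaves the ALUs idle, exactly as in the pseudocode above.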

I believe that similar pipelining for hiding the latency of external DRAM accesses can also be easily implemented with an FPGA or ASIC. But the FPGA or ASIC must still have a lot of memory bandwidth even after the scratchpad size reduction, otherwise the external memory will become a performance bottleneck. Beating the GPUs equipped with fast GDDR5 is going to be a tough challenge.
Quote
PS ssvb, you have some very interesting threads linked in your post history. Thank you for posting here, I'm late to this party and this helps enormously.
Well, I've already been away from the party for a long time :)
sr. member
Activity: 347
Merit: 250
That's not how scrypt GPU mining works. You are implying that the GPU memory is not used at all, but this is bullshit (just try downclocking the GPU memory and see the effect for yourself). You are implying that the memory latency is somehow important, but this is also bullshit. The memory bandwidth is the limiting factor. You are implying that only a single 128K scratchpad is used for the whole GPU (or per SIMD unit), but this is also wrong. In fact, thousands of hashes are calculated simultaneously, and each of them needs its own scratchpad (of configurable size, and not necessarily 128K). You really have no idea what you are talking about.

+1

I was a bit surprised and taken aback by DeathAndTaxes' description of scrypt mining on GPUs and the lack of understanding of how it is accomplished, given his post history.  The idea that scrypt implementations on GPUs do not store the scrypt scratchpad in external RAM, and instead can fit it in on-die RAM (with more than a handful of shaders processing scrypt), is way, way incorrect and pretty far out there.

EDIT - Reviewing DeathAndTaxes' post history going back to the early days of Litecoin, I'm stumped.  Hey DeathAndTaxes, were you just trolling?  Or has someone hacked your account and posted it as a joke at your expense?
sr. member
Activity: 384
Merit: 250
You are implying that the memory latency is somehow important, but this is also bullshit.

Interesting. So the GPU threads stall until the memory read is completed (given that, for the full scratchpad, each blockmix cycle needs a 128-byte read from an address generated by the previous blockmix). It makes sense for the huge number of threads available on a GPU, but I wonder if this approach works with an FPGA too (using external SDRAM): using internal block RAM to hold the thread state (B/Bo) and switching threads while waiting for the SDRAM. Not sure that works, actually. Food for thought, thanks.

PS ssvb, you have some very interesting threads linked in your post history. Thank you for posting here, I'm late to this party and this helps enormously.
newbie
Activity: 39
Merit: 0
However "LTC Scrypt" uses a mere 128KB of RAM.  It all occurs on the GPU die (which has more than enough register space and L2 cache to hold the scratch pad).  GPU memory latency to main memory (i.e. the 2GB of RAM on a graphics card) is incredibly long and the memory latency from GPU die off card to main memory is measured in fractional seconds.  Utterly useless for Scrypt.   If LTC required that to be used, a GPU would be far inferior to CPU with their 2MB+ of L2 and 6MB+ of L3 low latency cache.  "Luckily" the modified parameters selected for LTC use a tiny fraction (~1%) of what is recommended by the Scrypt author for memory hardness even in low security applications and roughly 1/6000th of what is recommended for high security applications.  It makes the scratchpad just small enough to fit inside a GPU and allow significant acceleration relative to a CPU.  

Try bumping the parameters up just a little: GPU performance falls off a cliff, while CPU performance degrades far more gradually.  It doesn't matter if you attempt this on a system with 16GB (or even 32GB) of main memory.  You can even try using a 1GB vs a 2GB graphics card, with negligible change in performance.  The small memory scratchpad ensures that neither the GPU's main memory nor the computer's main memory is used.  The cache, inside the CPU die for CPU mining or inside the GPU die for GPU mining, is what is used.  Ever wonder why GPU-accelerated password cracking programs don't include scrypt?  The default parameters make the average GPU execution time <1 hash per second.  Not a typo.  Not 1 MH/s or 1 KH/s, but <1 hash per second.

That is why "reaper" was so revolutionary but only for the weakened version of Scrypt used by LTC.  It requires much less memory but still too much memory for a single SIMD unit and GPU main memory has far too much latency.  That makes LTC impossible to mine on a GPU right?  Well people thought so for a year.  Reaper used a workaround by slaving multiple SIMD units together it stores the scratchpad across the cache and registers of multiple SIMD units.  Now this reduces the parallelism of the GPU (which is why a GPU is only up to 10x better than a CPU vs 100x better on SHA-256).  The combined register/cache across multiple SIMD units is large enough to contain the Scrypt scratchpad.  This wouldn't be possible at the default parameters (~20MB of low latency memory) but it certainly possible at the reduce parameters used by LTC.
That's not how scrypt GPU mining works. You are implying that the GPU memory is not used at all, but this is bullshit (just try downclocking the GPU memory and see the effect for yourself). You are implying that the memory latency is somehow important, but this is also bullshit. The memory bandwidth is the limiting factor. You are implying that only a single 128K scratchpad is used for the whole GPU (or per SIMD unit), but this is also wrong. In fact, thousands of hashes are calculated simultaneously, and each of them needs its own scratchpad (of configurable size, and not necessarily 128K). You really have no idea what you are talking about.
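A back-of-the-envelope check of the bandwidth argument (my own illustrative numbers, not measurements from this thread): with N=1024 and 128-byte entries, each hash writes the scratchpad once and reads it once, roughly 256KB of traffic, so peak memory bandwidth alone caps the achievable hash rate.
Code:
/* Ceiling on scrypt hash rate from memory traffic alone
   (illustrative arithmetic only). */
#include <stdio.h>

int main(void) {
    const double bytes_per_hash = 2.0 * 1024 * 128; /* N=1024 entries x 128B, written once + read once */
    const double bandwidth = 264e9;                 /* ~264 GB/s: peak for a GDDR5 flagship of the era */
    printf("ceiling: ~%.0f kH/s\n", bandwidth / bytes_per_hash / 1e3);
    return 0;   /* prints ~1007 kH/s */
}
That this theoretical ceiling lands in the same ballpark as the ~1000 kHash/s figure quoted below is exactly why downclocking the memory hurts: the workload is bandwidth-bound, not latency-bound.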

About password hashing: that's a totally different application of the scrypt algorithm, with different requirements. To prevent password brute-forcing, you want the calculation of a single hash to be as slow as possible (within reasonable limits, so that verifying passwords does not become too slow). That's why the recommended scrypt parameters are set so high. Just to give you an example, imagine that the LTC scrypt parameters were used for hashing passwords. With a GPU you can easily get ~1000 kHash/s LTC scrypt performance, which means you can try 1,000,000 different passwords per second for brute-forcing purposes. When using only lowercase letters and not particularly long passwords, it's a matter of just seconds or minutes to brute-force a password at that speed. That's why the parameters used for LTC scrypt are not fit for password hashing. Check http://en.wikipedia.org/wiki/Password_strength for more information.
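To make the brute-forcing arithmetic concrete (a worked example of my own, assuming exactly the 10^6 guesses/s figure above): an all-lowercase password of length n falls in 26^n / 10^6 seconds.
Code:
/* Brute-force time for all-lowercase passwords at 1,000,000 guesses/s
   (illustrative arithmetic only; compile with -lm). */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double rate = 1e6;                  /* guesses per second */
    for (int len = 4; len <= 8; len++) {
        double keyspace = pow(26.0, len);     /* lowercase letters only */
        printf("length %d: %.0f s (~%.1f h)\n",
               len, keyspace / rate, keyspace / rate / 3600.0);
    }
    return 0;   /* length 6: ~309 s; length 8: ~58 h */
}
A 6-character password falls in about five minutes, which is why password hashing wants parameters vastly heavier than LTC's.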

However, for mining purposes, making a single hash calculation as slow as possible is not a requirement. The absolute hashing speed is irrelevant; the difficulty is adjusted anyway, based on the total cryptocurrency network hashing speed. We just kinda care about fairness between CPU/GPU/FPGA/ASIC, so that none of them gets a really huge advantage (normalized per device cost or transistor budget). And scrypt performance nicely depends both on the memory speed and on the speed of arithmetic calculations, doing a better job of levelling the differences than bitcoin's sha256.
sr. member
Activity: 252
Merit: 250
Amateur Professional
All I can say is, "10 seconds of googling..."
sr. member
Activity: 384
Merit: 250
DeathAndTaxes is spot on, though he omits that using a smaller scratchpad allows more processing units to be fitted on a given die for only a small reduction in performance per core (a 64kB scratchpad will run at about 80% of the speed of a 128kB one).
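For the curious, here is a rough C sketch of that scratchpad-reduction trade-off, often called a "lookup gap" (the identifiers and the stand-in BlockMix are mine, not from any particular miner): store only every second scratchpad entry and recompute the skipped ones on demand.
Code:
/* Time/memory trade-off sketch: keep every GAP-th scratchpad entry,
   recompute the rest on demand. blockmix() is a stand-in for the
   real Salsa20/8 BlockMix, just to keep this self-contained. */
#include <stdint.h>
#include <string.h>

#define N     1024                       /* scratchpad entries (the LTC value) */
#define GAP   2                          /* store every 2nd entry: half the RAM */
#define WORDS 32                         /* 128 bytes = 32 uint32 words */

static uint32_t V[(N / GAP) * WORDS];    /* the halved scratchpad */

static void blockmix(uint32_t X[WORDS]) {       /* stand-in mixer */
    for (int w = 0; w < WORDS; w++)
        X[w] = X[w] * 2654435761u + (uint32_t)w;
}

void fill_scratchpad(uint32_t X[WORDS]) {
    for (int i = 0; i < N; i++) {
        if (i % GAP == 0)                /* store only every GAP-th entry */
            memcpy(&V[(i / GAP) * WORDS], X, 128);
        blockmix(X);
    }
}

void lookup(uint32_t j, uint32_t T[WORDS]) {
    memcpy(T, &V[(j / GAP) * WORDS], 128);  /* nearest stored entry at or below j */
    for (uint32_t k = 0; k < j % GAP; k++)  /* redo the skipped steps */
        blockmix(T);
}
With GAP=2 the average lookup costs half an extra blockmix, taking the total work from roughly 2N to 2.5N blockmix calls in exchange for half the memory, which is about where the 80%-of-the-speed figure above comes from.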

The main issue with a Scrypt ASIC is the fact that modern GPUs are incredibly well optimised for the emasculated LTC Scrypt algorithm (one might even speculate that this was deliberate :-\ ). So the investment to create a competitive ASIC would be huge, as it would need to use a state-of-the-art process similar to that used by the GPUs. And at the end of it you just get a chip with pretty much the same performance that just can't render graphics :( I suppose you'll save on all the overhead of VRAM and DACs etc., so there will be some small advantage.

Anyway, I've been writing some code for FPGAs which you might like to take a look at. The performance is crap (2kH/s on a DE0-Nano, 5-6kH/s on a single LX150), but it's early days yet (don't expect miracles; jasinlee's project is still the more practical one). https://github.com/kramble/FPGA-Litecoin-Miner
hero member
Activity: 798
Merit: 1000
Well, given that sASICs exist with 10x to 20x as much memory (as in on-die, negligible-latency SRAM) as is required for "LTC Scrypt", I don't see the "128KB barrier" being much more than paper thin.

It boils down to requiring more chips instead of more memory, and each of these chips must have access to a small amount of fast memory. Cost to build, electricity usage, etc. are still factors in play, and there is no "slam dunk" that this is a more or less effective way to be ASIC-resistant.

Quote
The only real barrier is the market cap (and thus annual mining revenue), which is still laughably low.

$50m is laughable?

Quote
LTC (et al.) could have been memory hard, but they chose (either by negligence or malice) to set the "barrier" incredibly low.  The minimum recommendation by the AUTHOR (not some random guy, but the guy who wrote it) is ~20MB of scratchpad.  LTC chose to use ~1% of that.

The security of the network is not at risk because of the scrypt parameters chosen. Do not imply that it is.
hero member
Activity: 798
Merit: 1000
‘Try to be nice’
The only point I'd disagree with DeathAndTaxes on is the strict relationship between market cap and price with regard to Scrypt ASICs and the LTC exchange price.

Markets being made out of humans, and Scrypt coins being most of the altcoin "market", the desire for a Scrypt ASIC will be less rationally related to any specific price and more irrationally related to the intangible possibilities of future Scrypt cryptocurrencies.

So it wouldn't surprise me at all if they are already in the works.

As stated before, the ASIC market is now well developed; the "idea" of ASICs is now a platform which many companies have ventured into, and if their aim is to turn a profit, I would say there is more incentive to create a Scrypt ASIC now than there is to continue creating larger and larger SHA256 ASICs.
donator
Activity: 1218
Merit: 1079
Gerald Davis
No, current SHA256 ASIC miners can't. The scrypt algorithm has loops which require a lot of memory, and all operations have to be sequential and cannot be made parallel, so it needs a full computing environment (CPU plus RAM) for one hashing process. The algorithm was designed this way to avoid hacking.

None of that is correct.  SHA-256 miners can never run Scrypt, just like they can never run SHA-512.  They are designed to do one thing and one thing only.

However "LTC Scrypt" uses a mere 128KB of RAM.  It all occurs on the GPU die (which has more than enough register space and L2 cache to hold the scratch pad).  GPU memory latency to main memory (i.e. the 2GB of RAM on a graphics card) is incredibly long and the memory latency from GPU die off card to main memory is measured in fractional seconds.  Utterly useless for Scrypt.   If LTC required that to be used, a GPU would be far inferior to CPU with their 2MB+ of L2 and 6MB+ of L3 low latency cache.  "Luckily" the modified parameters selected for LTC use a tiny fraction (~1%) of what is recommended by the Scrypt author for memory hardness even in low security applications and roughly 1/6000th of what is recommended for high security applications.  It makes the scratchpad just small enough to fit inside a GPU and allow significant acceleration relative to a CPU.  

Try bumping the parameters up just a little: GPU performance falls off a cliff, while CPU performance degrades far more gradually.  It doesn't matter if you attempt this on a system with 16GB (or even 32GB) of main memory.  You can even try using a 1GB vs a 2GB graphics card, with negligible change in performance.  The small memory scratchpad ensures that neither the GPU's main memory nor the computer's main memory is used.  The cache, inside the CPU die for CPU mining or inside the GPU die for GPU mining, is what is used.  Ever wonder why GPU-accelerated password cracking programs don't include scrypt?  The default parameters make the average GPU execution time <1 hash per second.  Not a typo.  Not 1 MH/s or 1 KH/s, but <1 hash per second.

That is why "reaper" was so revolutionary but only for the weakened version of Scrypt used by LTC.  It requires much less memory but still too much memory for a single SIMD unit and GPU main memory has far too much latency.  That makes LTC impossible to mine on a GPU right?  Well people thought so for a year.  Reaper used a workaround by slaving multiple SIMD units together it stores the scratchpad across the cache and registers of multiple SIMD units.  Now this reduces the parallelism of the GPU (which is why a GPU is only up to 10x better than a CPU vs 100x better on SHA-256).  The combined register/cache across multiple SIMD units is large enough to contain the Scrypt scratchpad.  This wouldn't be possible at the default parameters (~20MB of low latency memory) but it certainly possible at the reduce parameters used by LTC.
donator
Activity: 1218
Merit: 1079
Gerald Davis
The default Scrypt parameters were designed to do that.  The parameters changed in LTC (and copied over in all clones) were weakened to reduce the memory hardness by 99%.  

Memory, in general, is cheap. Fast memory is not. If the memory parameters for scrypt are large, an ASIC could be built using cheap memory with a smaller number of processing units (or slower/cheaper ones). The LTC scrypt design seems to be in a very reasonable range where lots of chips can be used, but each requires a reasonably sized amount of expensive memory and fast buses to keep up. It's difficult to say how it will/would have played out when/if ASICs are designed for the LTC scrypt algorithm. But for now, GPUs being in the sweet spot could only be a good thing imo; botnet coins are no fun.

Well, given that sASICs exist with 10x to 20x as much memory (as in on-die, negligible-latency SRAM) as is required for "LTC Scrypt", I don't see the "128KB barrier" being much more than paper thin. I mean, we are talking about KB here, not MB or GB, and Moore's law is still alive and well.

The only real barrier is the market cap (and thus annual mining revenue), which is still laughably low. Just like nobody was looking into Bitcoin ASICs when the price was $1 USD per BTC, nobody is going to look into LTC ASICs until it is justified. That means either LTC forever remains uselessly small, or it sheds its "ASIC resistance" like a paper dragon when it breaks into any meaningful exchange rate. If LTC sustains price action above $10, expect to see existing ASIC manufacturers turn their attention there.

Remember, eventually the margins on BTC ASIC production will dry up due to oversupply and limited demand. So you will have a handful of (by then) experienced companies looking for a market to exploit. Bitcoin hardware is now a commodity play, and if LTC prices support it, there is a chance to play out the ASIC mania all over again. Never discount an economic incentive: the prospect of making 80%, 90%, 95% or more gross margins on the first batch will be attractive to companies facing paper-thin margins, low barriers to entry, and heavy competition.

LTC (et al.) could have been memory hard, but they chose (either by negligence or malice) to set the "barrier" incredibly low.  The minimum recommendation by the AUTHOR (not some random guy, but the guy who wrote it) is ~20MB of scratchpad.  LTC chose to use ~1% of that.
legendary
Activity: 1151
Merit: 1003
No, current SHA256 ASIC miners can't. The scrypt algorithm has loops which require a lot of memory, and all operations have to be sequential and cannot be made parallel, so it needs a full computing environment (CPU plus RAM) for one hashing process. The algorithm was designed this way to avoid hacking.
full member
Activity: 140
Merit: 100
"Don't worry. My career died after Batman, too."
The NSA IS Satoshi, therefore the whole SHA256 mining boom was pre-ordained to provide some sort of service to our ever-vigilant government.
Let's just see what the REAL SHA256 botnet does...
hero member
Activity: 798
Merit: 1000
The default Scrypt parameters were designed to do that.  The parameters changed in LTC (and copied over in all clones) were weakened to reduce the memory hardness by 99%. 

Memory, in general, is cheap. Fast memory is not. If the memory parameters for scrypt are large, an ASIC could be built using cheap memory with a smaller number of processing units (or slower/cheaper ones). The LTC scrypt design seems to be in a very reasonable range where lots of chips can be used, but each requires a reasonably sized amount of expensive memory and fast buses to keep up. It's difficult to say how it will/would have played out when/if ASICs are designed for the LTC scrypt algorithm. But for now, GPUs being in the sweet spot could only be a good thing imo; botnet coins are no fun.
donator
Activity: 1218
Merit: 1079
Gerald Davis
This would require a lot of engineering time to figure out. It also depends on how good those engineers are. Scrypt is a very complex algorithm that attempts to punish you by requiring more memory the faster you go; this is called a time-memory trade-off. In contrast, SHA256 is rather simple: faster is better.

The default Scrypt parameters were designed to do that.  The parameters changed in LTC (and copied over in all clones) were weakened to reduce the memory hardness by 99%. 
hero member
Activity: 798
Merit: 1000
Can the asic miners mine scrypt currencies ?

No. The current crop of machines referred to as ASICs around here is built for SHA256 hashes. ASIC simply means "application-specific integrated circuit"; it is not specific to bitcoin, or to SHA256, or to scrypt, or to anything else. It is a general term for hardware that performs a specific task.

Quote
i mean supposed the there is a new software version for enabling them to mine SCRYPT currencies (which is a lot) ?

ASICs are pieces of hardware, written in silicon. There cannot be new software versions. New designs must be created and new machines built for scrypt, or for any other purpose.

Quote
and if yes then at what performance?

This would require a lot of engineering time to figure out. It also depends on how good those engineers are. Scrypt is a very complex algorithm that attempts to punish you by requiring more memory the faster you go; this is called a time-memory trade-off. In contrast, SHA256 is rather simple: faster is better.
member
Activity: 60
Merit: 10
So, I think what the OP meant to ask was:

"When is the answer to 'Can the asic miners mine scrypt currencies?' gonna be 'yes' instead of 'no'?"

I would say pretty soon...if I was drunk

Well that is the confusing thing about questions like this.

When will current SHA-256 ASICs be able to mine scrypt currencies?  Never.
When will someone develop an ASIC that implements scrypt* hashing?  When it becomes profitable to do so.

* By scrypt I mean the watered-down, memory-lite (only uses 128KB) version used by LTC and clones.  The actual memory-hard version as designed by the author will probably never be ASIC-accelerated.

LTC kinda uses 1.6GB of my VRAM :)  Of course, this is at 24000 thread concurrency. Unless someone can design a Scrypt ASIC that works wonders at 1 TC, the RAM needed will increase tremendously.
donator
Activity: 1218
Merit: 1079
Gerald Davis
So, I think what the OP meant to ask was:

"When is the answer to 'Can the asic miners mine scrypt currencies?' gonna be 'yes' instead of 'no'?"

I would say pretty soon...if I was drunk

Well that is the confusing thing about questions like this.

When will current SHA-256 ASICs be able to mine scrypt currencies?  Never.
When will someone develop an ASIC that implements scrypt* hashing?  When it becomes profitable to do so.

* By scrypt I mean the watered-down, memory-lite (only uses 128KB) version used by LTC and clones.  The actual memory-hard version as designed by the author will probably never be ASIC-accelerated.
sr. member
Activity: 439
Merit: 250
So, I think what the OP meant to ask was:

"When is the answer to 'Can the asic miners mine scrypt currencies?' gonna be 'yes' instead of 'no'?"

I would say pretty soon...if I was drunk