Pages:
Author

Topic: Creating an altcoin that will never perform well on GPUs/FPGAs/ASICs (Read 1839 times)

legendary
Activity: 1540
Merit: 1011
FUD Philanthropist™
then it will be a cpu botnet coin....................
newbie
Activity: 37
Merit: 0
Hi all - I've posted an idea for an ASIC/FPGA restant PoW here.
sr. member
Activity: 399
Merit: 257
Hi!

Being annoyed by the fact that all altcoins are at least mineable on a GPU (since my GPU sucks and my FPGA is quite low-end), I thought about creating an altcoin that will never perform well on a GPU, FPGA or ASIC. So I thought: "What is it that CPUs can do much better than GPUs?" and the answer of course is "Conditional branching!".

So my idea is basically this:
Many coins already chain multiple hash algorithms together, but they do so in an order that is known beforehand. This means it's no problem for a GPU and even on an FPGA/ASIC, you can just chain it. Instead, I want to make the next hash algorithm to use dependent on the output of the previous hash function. This can be done by giving each hash function a number and determining the number of the next hash function to use using (hash % num_hashfn).

So why is this hard for a GPU/FPGA/ASIC? Let's start with the GPU:
GPUs are good at processing lots of data using the same operations, but as soon as you need to branch, their performance is laughable. Since the next algorithm to use is different for each block you try (except for the first hash function to use, as that should be dependent on something like the last hash of the last block in the block chain), you are not using the same operations on a lot of data - you are using different operations on a lot of data and need to look at the output of the last hash to determine which code to execute next. GPUs suck at this, so much that it's commonly recommended to do 10 calculations at once and check at the end which one you need and then throw away the other 9.

Ok, so this is hard on GPUs. But why is it hard for FPGAs/ASICs? Well, there are two options you have to implement this on an FPGA/ASIC. Let's start with option A: Option A is to use a state machine and do the hash depending on the state. At the end, you set the state to the next hash function to use. This requires a flip flop to store the state and a flip flop needs one cycle to change. So you can only do one hash for each cycle, which means you would need to have a really high clock rate. Affordable FPGAs and ASICs usually have quite low clock rates and thus, this would significantly slow them down.
Option B is to use a lot of mux and demux. The output of (hash % num_hashfn) can be given as input to a number of demuxes which then give the wires to the correct hash function module and at the end of the circuit use a mux to select the output from the right hashing functions. However, that would make the circuit extremely complex, making it not fit many times on an FPGA and thus only allow very limited parallelism. The same is true for an ASIC, which would need an extremely large die.

As you can see, this algorithm works quite good on a CPU which has many GHz and is good at branching (well, compared to the others, that is), but is slow on a GPU and either slow or very expensive on an FPGA/ASIC.

So what do people think of this? Is this something people are interested in and I should implement? Any feedback on the design?

// Edit: Possible names for this would be BranchCoin or JumpCoin Wink.

I think combining various principles employed by hashing algorithms that stayed CPU-only for a decent amount of time combined with heavy usage of conditional branching would suit your purpose well. I read that Primecoin's use of integers is doing extremely well at being CPU-only, but I haven't really reviewed the actual algorithm to verify that. Protoshares' Momentum algorithm looked interesting as well. Consecutive rounds of hashing with different algorithms and large memory requirements were also effective for some time.

A quick implementation would be something like:

Hash1(input + nonce1) -> a
Hash2(a) -> b
if(nonce1.ToInt) is even { Hash3(input + a + b) -> c } else { Hash4(input + a + b) -> c }
Hash4(Hash3(Hash2(Hash1(c.ToInt.ToString)))) -> d
d must be lower than current diff

Hash1(input + nonce2) -> e
Hash2(e) -> f
if(nonce.ToInt) is even { Hash3(input + e + f) -> g } else { Hash4(input + e + f) -> g }
Hash4(Hash3(Hash2(Hash1(g.ToInt.ToString)))) -> h
h must be lower than current diff

next hash based on (block number ^ 2) % 4
Hash(d + h) must be lower than current diff


Or something...

That's prolly a bad algorithm now that I look at it. But something like that, with more branches and more memory requirements.
hero member
Activity: 535
Merit: 502
so, what exactly would be the point to yet another alt coin?

does it offer anything at all that 100 other alt coins cant already?  would any organisation choose to accept this 'jump' coin over BTC or LTC or DRK etc?

i'd put your energy into something more useful, you seem like quite an intelligent person, use it for something that can benefit the uptake of crypto currencies instead of creating another destined-to-fail alt coin.
hero member
Activity: 714
Merit: 500
It's called get a job coin. To mine it you have to do a human proof of work for 8 to 12 hours a day. The proof of work is tested by the work manager and if you hashed it successfully you will generate a block of coins on average every 14 days. It will even automatically convert to the currency of your citizenship and deposit right to your bank account. It is 100% gpu and asic resistant.
newbie
Activity: 22
Merit: 0
If you managed to develop a coin profitable enough, someone somewhere can design hardware for the task given the financial motivation and backing...then they can take preorders until their machines lose value enough to ship em out.
donator
Activity: 1218
Merit: 1079
Gerald Davis
I would love to have a miner that runs in my car while I am driving, no extra power cost

That would be called magic.
legendary
Activity: 1162
Merit: 1000
your best bet is Mediterranean coin it uses sha256 asics but they have to be met with an equal amount of cpu power to mine so you don't end up with any big whales plus its actually profitable to mine with block erupter usb's
newbie
Activity: 52
Merit: 0
I would get rich if we could make my coffee maker into a miner, that thing is running all the time. You guys are going to figure it out and I think it will be useful to have a coin that anyone can mine and feel like they are actually contributing to the network and the crypto economy. I would love to have a miner that runs in my car while I am driving, no extra power cost, and I have a 30 min commute each way. On long trips I could make a chest full of coin. Keep moving forward.
newbie
Activity: 37
Merit: 0
Quote
Hashcash with a memory bound hash function like Scrypt is a dead-end for CPU-friendly PoWs because verification is linear in memory usage, which prevents the use of the large amounts of memory required to present a serious obstacle to ASICs.

The ultimate CPU-friendly coin will have a PoW that takes GBs to prove, but can be verified instantly.

This implies memory is/becomes cheaper on ASICs than on general purpose machines.  ASICs are prevalent now because the opposite is true - compute is cheaper.

Until evidence indicates otherwise, I disagree.  A democratized PoW enables search and verification to be the same person.

That said, I think a memory-trapdoor function would be nice to have.  But it doesn't seem aligned with the requirements of a good PoW, and there isn't a good body of vetted peer-reviewed work in this area.
legendary
Activity: 990
Merit: 1108
What I can say is this:

1. X11 is trivial to obliterate via FPGAs and ASICs, and will be obliterated in the near future if X11 coins continue to appreciate.

2. Designs that focus on branching also play towards the strengths of ASICs and FPGAs.  ASICs and FPGAs evaluate branches in parallel.

3. When engineering ASICs, going from most expensive to least: RAM > I/O bandwidth* > Compute
  * I/O bandwidth is really a proxy for RAM, think DDR.

ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (i.e. X11).

Recommendations:

Scrypt was a good start. Scrypt does the following things well:

1. Focuses on the economics of ASICs and FPGAs (i.e. RAM > Compute)

Hashcash with a memory bound hash function like Scrypt is a dead-end for CPU-friendly PoWs because verification time is linear in memory usage, which prevents the use of the large amounts of memory required to present a serious obstacle to ASICs.

The ultimate CPU-friendly coin will have a PoW that takes GBs to prove, but can be verified instantly.

newbie
Activity: 37
Merit: 0
Here's a good starting point when thinking about the relative costs of gates vs flip flops on ASICs:

http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/tsmc18_component.pdf

While it is only for TSMC 180nm, it is useful for napkin analyses as the relative cell sizes are similar to more advanced process nodes which tend to require NDAs.

XOR2X1 (1 bit XOR, drive strength 1) is 5.0 x 5.3um; DFFX1 is 5.0 x 11.2um.  So actually 2 gates of compute for every bit of memory (reinforces the trade off even more).

Also rotates have no cost as they are just wiring, and XORs are one of the larger logic cell sizes - AND2X1 is 5.0 x  2.6um so 4 gates for every bit of memory.  N-bit adders are slightly more tricky, as they tend to have N log N gates in unoptimized implementations.



If hypothetically you were presented with the following proof-of-work construction:

Code:
HashFns[10] // each is a fn ptr to a unique hash fn
for (int i = 0; i < 10; ++i) {
  state = HashFns[state % 10](state);
}

You would likely implement it on an ASIC/FPGA with a proportionate number of cores relative to the size of each hash function.  Each core would have a FIFO in front.  All branching would do is have you "enqueue" the output of each hash function to the FIFO belonging to the core of the next hash function.  This way all cores would be kept occupied, and no core would need to go idle.  Other than first few clocks, no work or silicon would be wasted. So 10 cores + 10 FIFOs.

This is what I meant by ASICs/FPGAs executing branches in parallel.  Well designed circuits always break up algorithms into separate steps/execution stages, which are then pipelined so each execution stage is kept occupied.


The implementation above is neither option A nor option B.  It exploits the lack of RAM usage to maximize compute area.  I agree that if options A and B were the only ways of implementing the design then yes branching would work, but professional designers would not do options A or B because they are inefficient like you call out Smiley.


Quote
Quote
ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (i.e. X11).
That's just not true. There is a limit of how many transistors you can place on a chip. If that's reached, you have to fall back to doing things serially. Considering that you need slow clock rates in order to be energy efficient, that will make it slow pretty quickly.

We're not disagreeing on the transistor limit.  I'm saying that when designing ASICs/FPGAs any time you hit serial execution just means you pipeline.  This is exactly how SHA256d ASICs work (128 serial rounds of SHA in SHA256d).


Separately - the reason cryptographic ASICs tend to run at lower clock rates is for efficiency and not due to any speed limitation of the flip-flops.  Take the (obsolete) 180nm cells above, a DFF (d-flip flop) has a max propagation delay of .28ns, 1second/.28ns = ~3.57GHz.  Clearly there must be another reason for lower clock speeds.

Clock signal propagation is one of the most power expensive events on an ASIC (a pulse that has to propagate through the entire ASIC).  Lowering the clock speed lets you implement more stages per clock since the signal has more time to propagate through gates (more time = lower drive voltage, fewer CMOS signal transitions).  It is a balance because after too many gates the output becomes unstable due to asymmetries in most algorithms.  Another bonus is that if you can do 2x the work per clock, you need fewer flip flops (due to fewer stages, where the FFs are present at the beginning of each stage to capture state), which means less die area and power.

This is why most SHA256 ASICs perform at least 2 SHA rounds per clock/pipeline stage.
newbie
Activity: 8
Merit: 0
A good rule of thumb is 32 bits of logic roughly equals 32 bits of RAM.
What is 32 bits of logic for you? Is one XOR with 32 flip flops 64 bits of logic (storing 32 bit, XORing 32 bit)? That's a really weird "unit" for the size of logic that I've never seen used before.

Quote
X11 plays towards the strengths of FPGAs and ASICs because it requires no RAM (and minimally, only flip flops for pipeline state management).
Flip flops already make it slow, as they need one cycle to change. FPGAs and ASICs usually have slow clock rates, so you want to keep the amount of cycles low. That's why implementing a CPU on an FPGA is dead slow compared to a real CPU. My proposal makes use of this and this is actually option A I mentioned, where you need to use flip flops and need M cycles instead of 1, where M is the number of hashing functions that you chain.

Quote
Designs that focus on branching also play towards the strengths of ASICs and FPGAs.  ASICs and FPGAs evaluate branches in parallel.
It's not that easy. There are two options, see above for option A and option B. One option is slow, the other extremely expensive because you then need a *lot* of unused logic. And option B is not only just expensive: It will also violate timing constraints if it gets too long, meaning it's not possible to program an FPGA that way / build an ASIC. So this already is a pretty good defense against FPGAs and ASCIs. Only a processor manufacturer like Intel could make an ASIC that does this better than a CPU at acceptable costs.

Quote
Choosing the next hash function to execute based on the current state of a hash doesn't impact performance since you can saturate both hardware branches by temporarily "saving" work in the pipeline, and then "releasing" work so that each subsequent hash unit is always fully occupied.
Uhm, there are no branches in an FPGA/ASIC. You can implement a CPU-like state machine to allow "branches", that's it. In this case, that would even be overkill. A state machine would already be enough. But that would need a flip flop, significantly slowing down performance.

Quote
This is easier than it seems because cryptographically secure hash functions are equally likely to take each branch (otherwise it wouldn't be cryptographically sound as it would be biased, which would lower entropy).
Actually, that's what makes it harder to implement on an FPGA/ASIC/GPU: It is not predictable. Therefore you have to either attach a hashing unit for each hash used to the output of the one before and then do the same for each output again, basically meaning that you grow exponentially. So, 10 hashing functions with 10 rounds: 10^10 already! And only 10 of them are used!. That really does not perform well, performance-wise *and* power consumption-wise.

Quote
When engineering ASICs, going from most expensive to least: RAM > I/O bandwidth* > Compute
  * I/O bandwidth is really a proxy for RAM, think DDR.
Well, this is neither. This is logic used on the die, in other words: Number of transistors. Option B makes the number of transistors so high that the efficiency is a joke. Option A makes it so slow that a CPU will be faster.

Quote
ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (i.e. X11).
That's just not true. There is a limit of how many transistors you can place on a chip. If that's reached, you have to fall back to doing things serially. Considering that you need slow clock rates in order to be energy efficient, that will make it slow pretty quickly.

Quote
Don't do something silly like X11 which only adds implementation complexity and no real ASIC cost complexity.
As shown above, the non-deterministic part of my proposal adds *real* ASIC cost complexity. 10^10 hashing units for calculating just one block, or 10 hashing units for one block that needs 10 cycles.

However, I do agree that introducing memory into all this will make it even harder. But it will also make it harder for CPUs. Memory is the slowest thing in modern computers.
newbie
Activity: 46
Merit: 0
I would be looking forward to see a truly ASIC/FPGA defeating algorithm emerge. The majority of the current mining population probably does not care about such a development, but there are still good reasons to do this. We would stop losing mining efficiency to people using specialized hardware, which is expensive and botnets, however nasty, are still very cheap and possibly accessible to many and they would still be competing with cloud miners for a possible 51% attack. This could be an important step towards creating a coin which does not favor the big player and does not create a huge barrier for an idividual to start mining on basically any machine.
newbie
Activity: 37
Merit: 0
OP is on the right track.

What I can say is this:

1. X11 is trivial to obliterate via FPGAs and ASICs, and will be obliterated in the near future if X11 coins continue to appreciate.  X11 only requires compute resources, which are cheapest as far as die area goes when dealing with FPGAs and ASICs.  A good rule of thumb is 32 bits of logic roughly equals 32 bits of RAM.  X11 plays towards the strengths of FPGAs and ASICs because it requires no RAM (and minimally, only flip flops for pipeline state management).

2. Designs that focus on branching also play towards the strengths of ASICs and FPGAs.  ASICs and FPGAs evaluate branches in parallel.  Choosing the next hash function to execute based on the current state of a hash doesn't impact performance since you can saturate both hardware branches by temporarily "saving" work in the pipeline, and then "releasing" work so that each subsequent hash unit is always fully occupied.  This is easier than it seems because cryptographically secure hash functions are equally likely to take each branch (otherwise it wouldn't be cryptographically sound as it would be biased, which would lower entropy).

3. When engineering ASICs, going from most expensive to least: RAM > I/O bandwidth* > Compute
  * I/O bandwidth is really a proxy for RAM, think DDR.



ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (i.e. X11).



Recommendations:

Scrypt was a good start. Scrypt does the following things well:

1. Focuses on the economics of ASICs and FPGAs (i.e. RAM > Compute)
2. Amplifies RAM cost with cryptographic operations between RAM accesses.  Requiring 64 clock cycles of Salsa20/8 between memory accesses amplifies memory requirements since the memory is tied up until the hash calculation completes.
3. Uses unpredictable state transitions to determine where to read next.


What Scrypt didn't do well and led to its downfall:

1. Used predictable state transitions to determine where to write next.  Because memory was filled in order from 0-1023, it enabled a compute/RAM tradeoff where you could skip every other memory access and compute it instead.  I.e. you could store values 0,2,4,...1022, (ignoring 1,3,5,...1023) and then if something tried to read position 5, lookup 4, and get 5 by taking 4 and running it though Salsa20/8 once.  This significantly weakened the memory cost (this is what the --lookup-gap GPU miner setting does).

2. Used too few cycles between RAM accesses.  Salsa 20/8 ties up memory for ~64 clocks per memory accesses.  Increasing the number of rounds would increase memory requirements without impacting cache on CPU miners.

How to implement a serious CPU-only miner:

1. Don't do something silly like X11 which only adds implementation complexity and no real ASIC cost complexity.

2. Fix predictable memory writing in Scrypt - right now Scrypt looks like:

Code:
for (i = 0; i < 1024; ++i)
{
  RAM[i] = state // Memory access is sequential
  state = Salsa20/8(state)
}

// BETTER fill unpredictable
Code:
for (i = 0; i < 1024; ++i)
{
  RAM.Insert(state.e % i, state) // Memory access is unpredictable
  state = Salsa20/8(state)
}

The 1 line change above causes computability problems that we have no good solution for.  It adds a requirement for O(1) memory access and O(1) memory insert.  The most well known solution for these requirements are hashtables, but hashtables work best when there are no collisions, and keys are immutable.

RAM.Insert() creates a worst-case for hashtables because each insert causes a collision, and modifies the keys of all items after the insert position.  Alternate data structures offer even worse amortized performance (for example trees are O(log(N)) memory access and O(log(N)) memory insert).

Also - we are also performing a modulus with a variable denominator (well...incrementing), which can get expensive in hardware.

3. Switch from Salsa20/8 to something like Sha256 between Scrypt memory accesses.  As mentioned above in Scrypt issue #2, increasing clocks between memory access has the effect of increasing FPGA/ASIC memory requirements without impacting CPU cache requirements.  Also Sha256 is convenient because it uses 32bit internal registers which democratizes it to x86 and ARM devices.

4. Consider floating-point crypto functions; FP operations are very hard to do on FPGAs/ASICs because they require a significant amount of R&D or IP.  If you coupled FP-crypto with the above, you'd be well on your way.


Also - I'm not saying everything here is perfect, but I'd love to have a serious discussion if there is interest.  Finally, wrote this long post quickly, so please excuse any grammar errors / typos.

EDIT: clarity Smiley.
newbie
Activity: 8
Merit: 0
doesn't Myriad solve this whole "what hardware to use" problem by letting everyone mine?
Well, it basically means that people who bought hybrid ASICs that can do SHA-256 and Scrypt at the same time have an advantage, doesn't it? Wink
legendary
Activity: 1232
Merit: 1000
doesn't Myriad solve this whole "what hardware to use" problem by letting everyone mine?
sr. member
Activity: 392
Merit: 250
Thats would create botnet like what others mention. I suggest creating a pure GPU coins instead (truely anti FPGA and ASICs).
donator
Activity: 1218
Merit: 1079
Gerald Davis
Well, and GPU-only coins favor gamers botnet operators,

The average botnet has 10,000x the computing power of the average gamer. 
most botnet computers don't have opencl installed

The whole point of a botnet is that the botnet has control over the zombie nodes.  Whatever is needed to accomplish the goals of the botnet will be installed.
newbie
Activity: 8
Merit: 0
Quote
The topic was if it would perform well. It would... If it is cost efficient would to develop and produce will depend on the exchange rate of the "jumpcoin". 

Well, as outlined in the initial post, option A would not perform and option B would be too expensive. Sure, an ASIC would beat any CPU if you ignore the cost and go for option B. The idea is independent from the exchange rate: If it would be more expensive than CPUs that have the same performance, why do it? The power savings will be quite minor if you have so much unused, yet powered logic (and powering off units takes a clock cycle, bringing you back to the performance bottleneck of option A).
Pages:
Jump to: