1GH/s, 20w, $700 (was $500) — Butterflylabs, is it for real? (Part 2) - page 28.

makomk

hero member

Activity: 686

Merit: 564

Quote from: 2112 on January 31, 2012, 05:01:08 PM

It seems like some of them are indeed pipelined, but the level of pipelining is equal to the level of unrolling. It seems like ztex uses 125-way unrolling and 125-way pipelining. So the design computes in a single clock rounds of hashes for nonces (N-124 to N). When nonce N is on the input the output shows the final hash for nonce N-124.

In general a 125-way unrolled design can be pipelined anywhere from 1 to 125 stages.

ztex's latest code actually has two pipeline stages for every SHA-256 round, which is partly why it's so much faster; ISE has trouble routing the design efficiently. It varies as to how much sense this makes though. Also, the FPGA synthesis tools support something called register rebalancing where they move the registers that divide up the calculations into pipeline stages backwards and forwards in order to get the best speed, so it's not necessarily a simple question of one (or two) pipeline stages per round.

Quote from: 2112 on January 31, 2012, 05:32:56 PM

Somebody else posted a code that explicitly uses ternary adders Y = A + B + C. As far as I know Xilinx ISE will always synthesize adder trees Y = (A + B) + C or Y = A + (B + C) or Y = (A + C) + B.

Actually, I seem to recall that it's quite happy to automatically use ternary adders on Spartan-6.

fizzisist

hero member

Activity: 720

Merit: 528

Quote from: RandyFolds on January 31, 2012, 06:10:53 PM

Anyone remember that thread where a guy had some crazy graphic utility for FPGA design? I didn't understand a lick of what everyone was talking about, but it seemed that he was hand plotting it, and the pictures were awesome...

https://bitcointalksearch.org/topic/algorithmically-placed-fpga-miner-255mhschip-supports-all-known-boards-49971

RandyFolds

sr. member

Activity: 448

Merit: 250

Quote from: 2112 on January 31, 2012, 06:07:45 PM

Quote from: rjk on January 31, 2012, 05:39:19 PM

So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?

I recall somebody posting a screenshot of a control session for an Amazon EC2 farm containing over 50 machines doing the Xilinx design. I really don't think that using more brute-force would be helpful.

The SHA-family of algorithms are very regular and pretty much every bit depends on every bit. This hits a weak spot in the global optimization algorithm used by the FPGA tools.

I think that the way forward goes through the use of specialized synthesis tools that don't make generic assumptions about what kind of circuitry is being synthesized.

Anyone remember that thread where a guy had some crazy graphic utility for FPGA design? I didn't understand a lick of what everyone was talking about, but it seemed that he was hand plotting it, and the pictures were awesome...

2112

legendary

Activity: 2128

Merit: 1074

Quote from: rjk on January 31, 2012, 05:39:19 PM

So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?

I recall somebody posting a screenshot of a control session for an Amazon EC2 farm containing over 50 machines doing the Xilinx design. I really don't think that using more brute-force would be helpful.

The SHA-family of algorithms are very regular and pretty much every bit depends on every bit. This hits a weak spot in the global optimization algorithm used by the FPGA tools.

I think that the way forward goes through the use of specialized synthesis tools that don't make generic assumptions about what kind of circuitry is being synthesized.

RandyFolds

sr. member

Activity: 448

Merit: 250

Quote from: DiabloD3 on January 31, 2012, 04:08:21 AM

Quote from: RandyFolds on January 30, 2012, 06:39:12 PM

To the FPGA guys here: Why is it 'rolled' and not 'furled'? It seems way more appropriate.

Because it's always been unrolling loops. Ears are unfurled, loops are unrolled.

And just because "it's always been that way", it's ok? You don't happen to live in Alabama, now, do you? :P

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: rjk on January 31, 2012, 05:39:19 PM

Quote from: 2112 on January 31, 2012, 05:01:08 PM

My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.

So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?

They may even use lube.

rjk

sr. member

Activity: 448

Merit: 250

1ngldh

Quote from: 2112 on January 31, 2012, 05:01:08 PM

My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.

So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?

2112

legendary

Activity: 2128

Merit: 1074

Quote from: 2112 on January 31, 2012, 05:01:08 PM

My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.

I apologize, I'm having problems posting and editing the posts.

There are just so many logically equivalent ways to synthesize SHA-256. For example, somebody earlier posted a snippet of his synthesis where he used the adders in the DSP blocks on the Virtex 6 chip. For this to be really beneficial on Spartan 6 chips one has to write location-dependent Verilog: when near a DSP block use its adder, when far away synthesize the adder using local slice resources.

The number of available trade-offs is immense.

And thus far I have talked only about synthesis. But the full working design requires two more steps: place and route. This opens another set of dimensions that need to be explored for optimization.

One guy here on this forum is working on a design where he wrote a Java program to generate a Verilog program that does hashing. The Verilog is all location-constrained to the particular slices.

Somebody else posted a code that explicitly uses ternary adders Y = A + B + C. As far as I know Xilinx ISE will always synthesize adder trees Y = (A + B) + C or Y = A + (B + C) or Y = (A + C) + B.
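To make the adder-tree versus ternary-adder distinction concrete, here is a small Python sketch. It only models the arithmetic: `add_tree` chains two 2-input adders, while `add_ternary` does one 3:2 (carry-save) compression followed by a single carry chain. Both function names are made up for illustration, and nothing here claims to show how ISE actually maps either form onto slices.

```python
MASK = 0xFFFFFFFF  # SHA-256 operates on 32-bit words


def add_tree(a, b, c):
    """Two chained 2-input adders: Y = (A + B) + C."""
    return (((a + b) & MASK) + c) & MASK


def add_ternary(a, b, c):
    """3-input add: one 3:2 compression step, then a single carry chain."""
    partial = a ^ b ^ c                            # per-bit sum, carries ignored
    carries = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved up one bit
    return (partial + carries) & MASK


# Hypothetical sample operands, including one that wraps around 2**32.
samples = [
    (0xFFFFFFFF, 1, 1),
    (0x6A09E667, 0xBB67AE85, 0x3C6EF372),
    (0, 0, 0),
]
results = [(add_tree(*s), add_ternary(*s)) for s in samples]
```

Every pair in `results` holds two equal values, so the two forms are logically equivalent. What differs in hardware is how many carry chains the signal has to cross, which is why the choice matters for timing.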

On some other site I've found an implementation that pipelines rounds in pairs: 128-way unrolled Bitcoin hash would've had 64-way pipelining. Again, it wasn't for Spartan 6, but some other Xilinx chip.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: Inspector 2211 on January 31, 2012, 05:00:40 PM

Thus, I find it very hard to believe that current designs are not pipelined.

Yeah, you are right and I was wrong. It seems like the N-way unrolled designs are also N-way pipelined. But the degree of pipelining doesn't have to equal the degree of unrolling.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: rjk on January 31, 2012, 04:18:52 PM

My understanding is that the lack of pipelining is due to the lack of registers in an FPGA, is this correct or not?

No, I don't think so. I think that the limitations are due to the heuristics used by FPGA synthesis tools. At least in Xilinx parts the registers are essentially free. Pretty much each slice can have direct combinatorial outputs or registered outputs mixed with no restrictions.

I shouldn't have written about no pipelining. The more accurate way would be inflexible pipelining. It would be better to describe the level of unrolling and the level of pipelining as two variables that are somewhat independent.

I just looked again into the folder that I used to store the Verilog source code for Bitcoin hashers.

It seems like some of them are indeed pipelined, but the level of pipelining is equal to the level of unrolling. It seems like ztex uses 125-way unrolling and 125-way pipelining. So the design computes in a single clock rounds of hashes for nonces (N-124 to N). When nonce N is on the input the output shows the final hash for nonce N-124.
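The "nonce N on the input, hash for N-124 on the output" behaviour can be sketched with a toy Python model. It tracks only which nonce occupies which stage, not the hash rounds themselves; `run_pipeline` is a hypothetical helper and the stage count of 125 is simply taken from the description above.

```python
from collections import deque


def run_pipeline(stages, nonces):
    """Return (nonce_on_input, nonce_at_output) pairs, one per clock tick."""
    # One slot per pipeline stage; None marks a stage that is still empty.
    pipe = deque([None] * stages, maxlen=stages)
    trace = []
    for n in nonces:
        # With maxlen set, appendleft drops the oldest entry off the right end,
        # which is what a clock edge does to the last stage.
        pipe.appendleft(n)
        trace.append((n, pipe[-1]))
    return trace


trace = run_pipeline(125, range(200))
```

For the first 124 ticks the output slot is still empty. From tick 124 onwards each entry pairs nonce N with nonce N-124, so one finished result leaves the pipeline per clock.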

In general a 125-way unrolled design can be pipelined anywhere from 1 to 125 stages.

There are also other possible ways of pipelining SHA-256. For example the (W(i) + K(i)) expansion function uses a five-input adder: K(i) + S1(W(i-2)) + W(i-7) + S0(W(i-15)) + W(i-16). One could factor out the last two addends S0(W(i-15)) + W(i-16) and precompute them in the previous round as S0(W(i-14)) + W(i-15). Or even go two rounds deep and compute S0(W(i-13)) + W(i-14). And so forth.
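A quick way to check that the factoring is valid is to compute the SHA-256 message schedule both ways in Python and compare. This is a sketch of the arithmetic only: `schedule_direct`, `schedule_precomputed` and the sample block are made up for illustration, and K(i) is left out because it is a per-round constant that could be folded into either form.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def s0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule_direct(block_words):
    """All four addends summed in the round that needs them."""
    w = list(block_words)
    for i in range(16, 64):
        w.append((s1(w[i - 2]) + w[i - 7] + s0(w[i - 15]) + w[i - 16]) & MASK)
    return w


def schedule_precomputed(block_words):
    """S0(W(i-15)) + W(i-16) is prepared one round early and held in `partial`."""
    w = list(block_words)
    partial = (s0(w[1]) + w[0]) & MASK  # ready before round 16 starts
    for i in range(16, 64):
        w.append((s1(w[i - 2]) + w[i - 7] + partial) & MASK)
        # Prepare the next round's pair: S0(W(i-14)) + W(i-15).
        partial = (s0(w[i - 14]) + w[i - 15]) & MASK
    return w


# Hypothetical 16-word input block, just to have something to expand.
words = [(i * 0x01010101) & MASK for i in range(16)]
same = schedule_direct(words) == schedule_precomputed(words)
```

`same` comes out `True`, and the in-round sum shrinks from four addends to three. Whether that actually buys clock speed depends on what the tools do with the extra register.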

My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.

Anyway, those are just my speculations. I haven't spent much eyeball time analyzing the available codes.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: rjk on January 31, 2012, 04:18:52 PM

Quote from: 2112 on January 31, 2012, 04:04:11 PM

Lets not confuse unrolling with pipelining. Current open-source designs are fully-unrolled, but I have yet to see a proper pipelined design that is open-sourced. Maybe pipelining is what BFL did? By pipelining the unrolled design they could significantly crank up the clock, since the FPGAs are limited more by the propagation delay in the signal routing than in the propagation delay in the actual logic.

My understanding is that the lack of pipelining is due to the lack of registers in an FPGA, is this correct or not?

A Spartan6-LX150 has 184000 flip-flops, and for a double SHA-256 only 32768 flip-flops are needed. 128 stages x 256 width = 32768. Fits easily if you have 184000 at your disposal.

Thus, I find it very hard to believe that current designs are not pipelined. Also, a typical design such as the ZTEX design achieves 200 MH/s with 200 MHz. Assuming it is not a fully pipelined design, that would mean that all 128 (or 125) stages have to percolate through in a mere 5 ns, because 5 ns is the clock period of 200 MHz. 40 ps (picoseconds) per stage? I don't think so.
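The back-of-envelope numbers above can be reproduced in a few lines of Python. The constants are the ones quoted in this post; like the post, the count covers only the 256-bit working state per stage, so treat it as a rough lower bound rather than a full resource estimate.

```python
FLIP_FLOPS_AVAILABLE = 184_000   # figure quoted above for the Spartan6-LX150
STAGES = 128                     # two SHA-256 passes of 64 rounds each
STATE_WIDTH_BITS = 256           # SHA-256 working state per stage
CLOCK_HZ = 200e6                 # 200 MHz

flip_flops_needed = STAGES * STATE_WIDTH_BITS            # 32768
utilisation = flip_flops_needed / FLIP_FLOPS_AVAILABLE   # fraction of flip-flops used

clock_period_ns = 1e9 / CLOCK_HZ                         # 5.0 ns
# If nothing were pipelined, all stages would share one clock period:
per_stage_ps = clock_period_ns * 1000 / STAGES           # about 39 ps
```

That is under a fifth of the flip-flops, and roughly 39 ps per stage with 128 stages (exactly 40 ps with 125). That is the implausible figure the argument rests on.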

RandyFolds

sr. member

Activity: 448

Merit: 250

Quote from: rjk on January 31, 2012, 03:55:50 PM

Quote from: yochdog on January 31, 2012, 03:31:11 PM

4-6 weeks. :o

I had to do it.....I'm sorry......

Bad yochdog! That line is reserved for use by RandyFold's god-size presence ^(TM) only.

Fixed that for ya...

rjk

sr. member

Activity: 448

Merit: 250

1ngldh

Quote from: 2112 on January 31, 2012, 04:04:11 PM

Lets not confuse unrolling with pipelining. Current open-source designs are fully-unrolled, but I have yet to see a proper pipelined design that is open-sourced. Maybe pipelining is what BFL did? By pipelining the unrolled design they could significantly crank up the clock, since the FPGAs are limited more by the propagation delay in the signal routing than in the propagation delay in the actual logic.

My understanding is that the lack of pipelining is due to the lack of registers in an FPGA, is this correct or not?

2112

legendary

Activity: 2128

Merit: 1074

Quote from: Inspector 2211 on January 31, 2012, 10:29:17 AM

Yes, current FPGA designs are fully pipelined, as long as they *fit* into the FPGA, and thus you get a hash rate of 200 MH/s at a clock frequency of 200 MHz. And it's not 1300 cycles, but literally 122 (or 128 or something like that).

Let's not confuse unrolling with pipelining. Current open-source designs are fully-unrolled, but I have yet to see a ~~proper~~ general pipelined design that is open-sourced. Maybe a different scheme of pipelining and unrolling is what BFL did? By flexibly pipelining the unrolled design they could significantly crank up the clock, since the FPGAs are limited more by the propagation delay in the signal routing than by the propagation delay in the actual logic.

Quote from: kano on January 31, 2012, 10:02:37 AM

As I asked before, do they execute in a manner like dominoes where the clock process advances data through the FPGA in steps?

No, this is a horrible analogy.

To understand the unrolling as applied to the logic design you need to understand the difference between combinatorial and sequential logic.

In combinatorial logic the outputs are simply a function of inputs.

In sequential logic the outputs are a function of inputs and the internal state. All the current FPGA hashing appliances use a variant of sequential logic called synchronous sequential logic: there is a dedicated clock input and a change on the clock is when the internal state gets updated.

Fully unrolled (128-way or 125-way) Bitcoin hash means that the logic that computes it is fully combinatorial: there is no internal state used inside the cascade of the two SHA-256 hashers. The clock is still used in the fully-unrolled design: to increment the nonce counter and to sample the zero-comparator at the output of the hasher.

64-way unrolled Bitcoin hash means that there is one internal state register that stores the intermediate state. During the odd clock cycles it does a single SHA-256 of the input (midstate and nonce) and stores it in the internal state. During the even clock cycles it does a single SHA-256 of the internal state and presents it on the output. This circuit is only about half the size of the above circuit.
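As a rough software analogy for the 64-way case, the Python sketch below reuses one SHA-256 block on alternating clock ticks and compares the result with the fully combinatorial double hash. It is deliberately simplified: it hashes a whole placeholder 80-byte header with `hashlib` instead of starting from a midstate, and the `HalfSizeHasher` class is hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def double_sha256_combinational(header: bytes) -> bytes:
    """Fully unrolled view: both passes happen with no state in between."""
    return sha256(sha256(header))


class HalfSizeHasher:
    """One SHA-256 block, reused on alternating clock ticks."""

    def __init__(self):
        self.state = None        # internal state register
        self.first_pass = True   # which pass the next tick performs
        self.output = None

    def clock(self, header: bytes) -> None:
        if self.first_pass:
            self.state = sha256(header)        # odd tick: hash the input
        else:
            self.output = sha256(self.state)   # even tick: hash the stored state
        self.first_pass = not self.first_pass


# Placeholder header: 76 zero bytes plus a 4-byte little-endian nonce.
header = bytes(76) + (12345).to_bytes(4, "little")

hasher = HalfSizeHasher()
hasher.clock(header)
after_one_tick = hasher.output   # nothing on the output yet
hasher.clock(header)
matches = hasher.output == double_sha256_combinational(header)
```

It takes two ticks per result instead of one, which is the price paid for halving the logic.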

Now a fully unrolled and pipelined design would be about the same size as the above fully-unrolled design but it would have some internal state registers. If there is one level of pipelining then in each clock cycle the first half of the circuit would compute a single SHA-256 for nonce "N" and the second half of the circuit would compute a single SHA-256 for the nonce "N-1".

In an unpipelined design the input signals have to race through the full 128 (or 125) rounds of SHA-256 to the zero-comparator. In a design with one level of pipelining the inputs have to race through 64 rounds of the first SHA-256 to the internal register, simultaneously with another set of signals racing through 64 (or 61) rounds from the internal register to the zero-comparator. Simplistically one could say that the clock rate of this design could be almost double the clock rate of the non-pipelined design.
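The "almost double" estimate can be sketched with a simplistic timing model in Python. The per-round delay and register overhead are invented placeholder numbers, not measurements from any real FPGA, and `max_clock_mhz` is a hypothetical helper. The only point is how the longest register-to-register path shrinks as the 128 rounds are cut into more segments.

```python
import math


def max_clock_mhz(rounds, segments, round_delay_ns, register_overhead_ns):
    """Clock limit when `rounds` rounds are split evenly into `segments` pieces."""
    longest_path_rounds = math.ceil(rounds / segments)
    period_ns = longest_path_rounds * round_delay_ns + register_overhead_ns
    return 1000.0 / period_ns


ROUNDS = 128
ROUND_DELAY_NS = 1.0          # hypothetical logic + routing delay per round
REGISTER_OVERHEAD_NS = 0.5    # hypothetical setup + clock-to-output cost

# 1 segment = unpipelined, 2 = one internal register, 128 = one register per round.
table = {
    s: max_clock_mhz(ROUNDS, s, ROUND_DELAY_NS, REGISTER_OVERHEAD_NS)
    for s in (1, 2, 64, 128)
}
speedup = table[2] / table[1]
```

With these made-up numbers `speedup` lands just under 2, because the register overhead is paid once per segment. The same overhead is why the gain flattens out as the segments approach one round each.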

Again, I haven't seen anyone publishing open-source designs that are ~~both~~ independently unrolled and pipelined. I think this is due to the limitations of the FPGA synthesis tools. They either require tens or hundreds of GB of RAM or months of CPU time.

Anyway, kano, I suggest that you throw away your domino set. Get yourself a free version of Xilinx ISE or Altera Quartus and use it to play the FPGA design game. It is like playing Tetris, chess & contract bridge all on the same board.

Costia

newbie

Activity: 28

Merit: 0