New scalable pipelined FPGA core for SHA-256 - any interest?

magik

newbie

Activity: 44

Merit: 0

when I last looked/compiled the open source FPGA miner from that other thread, I saw the longest delay path/critical path involved 2 32-bit adders - I believe this line: "w_out(511 downto 480) <= s1 + w_in(319 downto 288) + s0 + w_in(31 downto 0);"

not sure how you've implemented your SHA-256 transform, and I never had time to optimize it myself, but faster clock speeds could probably be hit by breaking those 3 adds into 2 rounds. I was compiling onto a Xilinx FPGA, and a very low end one at that Spartan 3E-1600, and the Xilinx chips seems to have more trouble routing, and thus more clock problems ( or rather having to used reduced clock speeds ). But maybe that's something to think about when trying to cram as many cores running as fast as possible into your design. Also, if that 32-bit adder's critical path is taking up too much time, that also can be broken into multiple cycles, or maybe a different adder implementation with a carry look ahead.

mpfrank

sr. member

Activity: 247

Merit: 250

Cosmic Cubist

Quote from: lame.duck on June 25, 2011, 04:23:57 PM

Hello, of course such a core would be very intresting, and i will gladly help you testing.

For a board with 2 serial lines (ep2C35) i did testcompile a design with 2 cores @ LOOP=4 instead of of one with LOOP=3 and got 10% more performance clockwise. My (wild guess) is that thats of because lower routing requirement, but i did not check this. Maybe it is possible to start the cores one after one or even every other cycle so at the end we would only need 1 unit checking for the golden ticket.

Yeah, a little unrolling might help and it would certainly make sense to bundle this into 1 top-level module that takes care of iterating nonces and sends the golden ticket out to the customer when found... A generic map can specify the correct number of copies of the interior core to fully utilize the FPGA's resources.

I'll upload the project files to Github sometime in the next few days...

Olaf.Mandel

member

Activity: 70

Merit: 10

Quote from: mpfrank on June 25, 2011, 04:50:53 PM

[...]
A 64x MUX just needs three 5-bit (32-entry) LUTs for each bit of width [...]

I hadn't understood that "5 input" meant "5-bit wide selector=32 input" LUT. That is really wide. Thanks.

mpfrank

sr. member

Activity: 247

Merit: 250

Cosmic Cubist

Quote from: Olaf.Mandel on June 25, 2011, 04:26:52 PM

Quote from: mpfrank on June 25, 2011, 04:09:08 PM

...
I would be happy to post my code on Github, but I feel I really should test & debug it first... Unless you want to help out with that. Cheesy

Sorry, I am just now learning about how to use testbenches. Before, I always added some logic that wiggled a couple of pins as debug output...

Currently I am concentrating on hardware, and when I have that, I hope that the different code branches have merged: maybe there are just some generics to be configured for the miner, e.g. num_cores, depth, do_addition_pipeline, ...

That is why I am curious about your code: how much overhead does the loop add to the code? If I understand you correctly, not much (though I don't understand how a 64x MUX can be "cheap").

A 64x MUX just needs three 5-bit (32-entry) LUTs for each bit of width - two that use bits 4-0 of the select input, and one that uses bit 5 and the outputs from the first two LUTs. I'm assuming it's compiled that way, or else as a memory - but I haven't actually de-obfuscated the synthesized circuit enough to tell yet.

Olaf.Mandel

member

Activity: 70

Merit: 10

Quote from: mpfrank on June 25, 2011, 04:09:08 PM

...
I would be happy to post my code on Github, but I feel I really should test & debug it first... Unless you want to help out with that. Cheesy

Sorry, I am just now learning about how to use testbenches. Before, I always added some logic that wiggled a couple of pins as debug output...

Currently I am concentrating on hardware, and when I have that, I hope that the different code branches have merged: maybe there are just some generics to be configured for the miner, e.g. num_cores, depth, do_addition_pipeline, ...

That is why I am curious about your code: how much overhead does the loop add to the code? If I understand you correctly, not much (though I don't understand how a 64x MUX can be "cheap").

lame.duck

legendary

Activity: 1270

Merit: 1000

Hello, of course such a core would be very intresting, and i will gladly help you testing.

For a board with 2 serial lines (ep2C35) i did testcompile a design with 2 cores @ LOOP=4 instead of of one with LOOP=3 and got 10% more performance clockwise. My (wild guess) is that thats of because lower routing requirement, but i did not check this. Maybe it is possible to start the cores one after one or even every other cycle so at the end we would only need 1 unit checking for the golden ticket.

mpfrank

sr. member

Activity: 247

Merit: 250

Cosmic Cubist

Quote from: Olaf.Mandel on June 25, 2011, 02:32:02 PM

Hi,

short question of unrolling vs. not unrolling the loops: assuming a fully unrolled loop with 128 stages, the round variable k (cube roots of different primes) are hardcoded onto each flipflop. No MUX is needed. If one rolls up the pipeline, doesn't this add long MUXes?

These could probably be removed by utilising BlockRAM. I assume you used that. Is the compiler smart enough to do that for itself or did you have to code that?

I would really love to see your code to compare the hashes per area per time to a fully unrolled version on the same chip.

The "long" (64-input x 32-bit) muxes for the K constants should compile, I think, into just a couple of 5-input LUTs per output bit (i.e., 64 LUTs, in parallel) plus another 32 LUTs to use the 6th select bit. This <100 cells doesn't add much to the overall size of the design, and doesn't slow it down at all either, since the mux lookup can be done in parallel with (and should be faster than) the adds that happen in each pipeline stage.

Doing a few stages of loop unrolling might help amortize away the overhead of the auxilliary hardware such as preprocessor/postprocessor pipeline stages and the work dispatcher, but other than that, I think unrolling should not affect the peak frequency or cost-efficiency very much, since this design is already very fast and compact...

I would be happy to post my code on Github, but I feel I really should test & debug it first... Unless you want to help out with that. Cheesy

mpfrank

sr. member

Activity: 247

Merit: 250

Cosmic Cubist

Quote from: O_Shovah on June 25, 2011, 01:14:18 PM

Hello Mike

Your idea sure sounds interesting.

I Personally started a thread for the developement of a custom FPGA miner platform http://forum.bitcoin.org/index.php?topic=22426.0 so you might want to have look.

Your concept would help to reduce the minimum size needed für an FPGA chip.

Some questions towards your data:

- As i understand it you are using an Altera board , these 2.113 cells are LE's ?

- Can you please try to simulate your programm on a Cylone III with 39600 Cells http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=544-2501-ND

- Has your design been tested on a real board or have there just been done calculations so far ?

regards

Jens

Yah, by cells I meant LEs. On the Stratix, these are I think mostly 5-input LUTs (lookup tables).

On your EP3C40F780C8, I get about the same number of cells (2,051) and 3,692 memory bits. With some trimming you could fit about 20 of them. It's a slower chip (8 ns speed grade) so the fmax is down to 127-142 MHz. That gives an aggregate hash rate of about 20-22 Mhash/s on this $111 chip.

I have not tested yet, but in my experience these kinds of speed calculations are pretty accurate (a little conservative actually).

Olaf.Mandel

member

Activity: 70

Merit: 10

Hi,

short question of unrolling vs. not unrolling the loops: assuming a fully unrolled loop with 128 stages, the round variable k (cube roots of different primes) are hardcoded onto each flipflop. No MUX is needed. If one rolls up the pipeline, doesn't this add long MUXes?

These could probably be removed by utilising BlockRAM. I assume you used that. Is the compiler smart enough to do that for itself or did you have to code that?

I would really love to see your code to compare the hashes per area per time to a fully unrolled version on the same chip.

O_Shovah

sr. member

Activity: 410

Merit: 252

Watercooling the world of mining

Hello Mike

Your idea sure sounds interesting.

I Personally started a thread for the developement of a custom FPGA miner platform http://forum.bitcoin.org/index.php?topic=22426.0 so you might want to have look.

Your concept would help to reduce the minimum size needed für an FPGA chip.

Some questions towards your data:

- As i understand it you are using an Altera board , these 2.113 cells are LE's ?

- Can you please try to simulate your programm on a Cylone III with 39600 Cells http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=544-2501-ND

- Has your design been tested on a real board or have there just been done calculations so far ?

regards

Jens

coinage

member

Activity: 60

Merit: 10

Quote from: mpfrank on June 23, 2011, 03:14:37 PM

This particular FPGA is rather expensive; I haven't yet researched which FPGA platform would be most cost-effective for this design. But, if anyone else is interested in exploring this line of work, and helping to integrate this new core into a more complete mining solution, I would be happy to release the code.

150+ MHash/s might be good if you can achieve something similar on a less expensive device. The analyses I've seen so far (brief reading only) implied the FPGAs they had checked could yield an excellent hash rate per watt but not per dollar sunk in the hardware.

As soon as you get out of the sandbox, please repost in the appropriate forum, or you could try PM someone over there now to come look at your existing post. Your expertise and attitude is valuable.

Once electricity costs become a prime concern later on, FPGAs/ASICs may especially shine. That is a relief to me because I am disappointed at the extraordinarily amount of intentionally useless* & wasteful* computational processing that this project encourages people to undertake across the globe, and the possible impact on the environment since most of our high-powered computers are ultimately ... coal-fired.

*At the same time, financial transactions done the bitcoin way may be one of the most useful & valuable services those people could ever help facilitate. It's just disturbing that a more useful computational project has not been adopted to serve as proof-of-work (though I've read a proposal for just that in these forums. (That was probably the discussion proposing a new coin project "Bitcoin Plus" aka "BCP" which would accept one-way conversions from original BTC. No relation to bitcoinplus.com by the way.)

mpfrank

sr. member

Activity: 247

Merit: 250

Cosmic Cubist

Hi, I've been developing a new optimized SHA-256 core in VHDL. The design philosophy of this version revolves around these points:

Reorganize and aggressively pipeline the round processor so as to achieve a clock frequency (and hardware efficiency) close to the maximum possible on a given FPGA. (The critical-path delay of this particular design is, I believe, no more than one 32-bit add delay, plus register setup time.)
For improved scalability to maximally utilize FPGAs of any size, don't unroll the round loop, but instead build a small iterative, single-round processor, many copies of which can be operated in parallel. Each of these cores can simultaneously hash as many block candidates as it has pipeline stages (4 in this design). A properly designed work-dispatch unit (still to be written) can ensure that all cores always stay fully utilized hashing block candidates.

As an example of this approach's performance, here are some example stats derived for the current design, based on its compilation for a Stratix III FPGA (EP3SL150F1152C2N, as found in the Altera/Terasic DE3 board).

Area for 1 core, including test rig: 2,113 cells (plus a little memory)
Maximum frequency: 385 - 421 MHz (depending on temperature)
Clock cycles per SHA-256 (1 chunk): 64 (on average, if pipeline is kept full)
Clock cycles per double-SHA-256: 128 (ditto)
Bitcoin Mhash/s per core: 3.0 - 3.3 (temp-dependent)
Cores per FPGA: At least 50
Bitcoin Mhash/s per FPGA: 150 - 165 Mhash/s (temp-dependent)

This particular FPGA is rather expensive; I haven't yet researched which FPGA platform would be most cost-effective for this design. But, if anyone else is interested in exploring this line of work, and helping to integrate this new core into a more complete mining solution, I would be happy to release the code.

Regards,
-Mike

Topic: New scalable pipelined FPGA core for SHA-256 - any interest? (Read 3245 times)