
Topic: FPGA mining - massively parallel looped vs unrolled (Read 1568 times)

legendary
Activity: 2128
Merit: 1073
any real numbers to argue for or against what I propose?
Check out the posts by bitfury. He was/is offering such a design for sale.

https://bitcointalksearch.org/topic/bitfury-design-licensing-mass-production-83332

http://www.bitfury.org/bitfury110.html

http://www.bitfury.org/xc6slx150.html  <- the planahead porn is here

Briefly: almost everyone is offering the unrolled designs because the rolled sea-of-hashers design hit some worst cases in the synthesis/place/route toolchains: they either fail to converge or converge to shamefully bad implementations. Any practical implementation would have to utilise some sort of workaround for the toolchain's lack of convergence.

I believe that the most recent/fastest bitstreams from ngzhang are also closed-source because of the effort he had to expend to successfully implement them. The default Xilinx ISE wasn't cutting it anymore.
full member
Activity: 226
Merit: 100
The fully unrolled version is the best use of the FPGA resources. When you start folding the design you need to add control logic and large muxes that select the combinatorial path for the current round. In FPGAs everything is implemented in LUTs, so all that muxing will steal LUTs from the combinatorial logic for the actual algorithm.

You will reduce the number of registers needed, but that's not the most scarce resource. At least not with any normal FPGA fabric...
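
To put the folding argument above into rough numbers, here is a toy model in Python. It is only a sketch: the per-round LUT count, the mux cost and the control cost are made-up placeholders (not synthesis results), and it ignores clock speed and routing entirely.

```python
def core_luts(fold, rounds=128, luts_per_round=200, mux_luts=60, control_luts=100):
    """Estimate LUTs for one core that reuses each round circuit `fold` times.

    fold=1 is fully unrolled; fold=rounds is a single looped round.
    All default figures are hypothetical placeholders.
    """
    if fold < 1 or rounds % fold != 0:
        raise ValueError("fold must be a positive divisor of rounds")
    stages = rounds // fold                  # physical round circuits left after folding
    datapath = stages * luts_per_round       # LUTs doing the actual hashing
    # Folding adds an input mux per physical stage plus one shared round counter.
    overhead = 0 if fold == 1 else stages * mux_luts + control_luts
    return datapath + overhead


def luts_per_hash_per_clock(fold, **kwargs):
    """LUTs spent per unit of throughput; a folded core finishes 1/fold hashes per clock."""
    return core_luts(fold, **kwargs) * fold


if __name__ == "__main__":
    for fold in (1, 2, 4, 128):
        print(fold, core_luts(fold), luts_per_hash_per_clock(fold))
```

With any non-zero mux or control cost, the folded variants come out worse per hash per clock in this model. How much worse depends entirely on the placeholder figures, so plug in numbers from a real synthesis report before drawing conclusions.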


::EDIT

And if we take FPGA pricing into consideration, I think you would still go for the fully unrolled design. I think the FPGA with the most LUTs per dollar will fit such a design...
hero member
Activity: 1596
Merit: 502
3 Spartan LX-150s vs 50 Spartan LX-9s?
In terms of logic it is 3 * 150 = 450 and 50 * 9 = 450, so the same?
Except that 50 Spartan LX-9s are probably a lot more expensive than 3 Spartan LX-150s.
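
The comparison above is easy to check with a few lines of Python. This is a sketch only: the logic figures are the 150 and 9 from the post, the function names are made up for illustration, and no real prices are assumed, so the price is left as an input.

```python
def total_logic(chips, logic_per_chip):
    """Total logic across a set of identical chips, in the post's units."""
    return chips * logic_per_chip


def break_even_small_price(big_price, big_chips=3, small_chips=50):
    """Price per small chip at which both options cost the same in silicon alone.

    Ignores boards, power supplies, and the extra interconnect 50 chips would need.
    """
    return big_price * big_chips / small_chips


if __name__ == "__main__":
    print(total_logic(3, 150), total_logic(50, 9))
    # Hypothetical price of 100 units per LX-150, purely for illustration.
    print(break_even_small_price(100.0))
```

So the LX-9 option only wins on chip cost if one LX-9 costs less than 3/50 (6%) of one LX-150, and that is before counting the extra board area and support components per chip.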
newbie
Activity: 14
Merit: 0
I started a thread in the newbie section, but I would rather discuss it in its rightful place.

Before I start: I have no desire to debate ASICs, as that's not the point of the question. First, I slightly doubt anything will transpire, and even if it does, I would like to talk about FPGAs and not theoretical ASICs.

Now, I am interested in making an array of FPGA chips that mine coins, and everywhere I look people are unrolling and pipelining to achieve 1 or 2 bitcoin mining (2x SHA) cores. What I am wondering is what happens if you go the other way, and instead try to make a SHA core as small as possible (looped and hand designed), then repeat it many times in cheaper FPGA chips to make a massively parallel setup instead.
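
For reference, the "2x SHA" operation that every mining core computes can be written in a few lines of Python with the standard `hashlib` module. This is a software sketch of the function only, not a model of any FPGA design; the all-zero 80-byte header is a placeholder, not real block data.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as used for Bitcoin block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def meets_target(header: bytes, target: int) -> bool:
    """True if the double hash, read as a little-endian integer, is at or below target."""
    return int.from_bytes(double_sha256(header), "little") <= target


if __name__ == "__main__":
    header = bytes(80)  # placeholder: a block header is 80 bytes
    print(double_sha256(header).hex())
    print(meets_target(header, 2**256 - 1))
```

Each SHA-256 pass is 64 rounds of the compression function per 64-byte block, which is the unit that gets either unrolled into a pipeline or looped over a single round circuit.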

The way I see it, it's a trade-off: the speed of one complete bitcoin hash vs the number of logic blocks used, with the maximum logic blocks of the FPGA chip as the threshold.

A simple example would be that having 2 SHA cores linked up to do 1 bitcoin hash is similar in speed (I think) to 2x one SHA core doing 1 bitcoin hash in 2 steps. The only difference is that one can fit on a chip with half as many logic blocks. Or am I wrong?

Any real numbers to argue for or against what I propose?
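
The example above can be written down as an idealised throughput count. This sketch assumes 64 rounds per SHA-256 pass and two passes per bitcoin hash, counts each pass as a single block, and deliberately ignores mux/control overhead and any difference in achievable clock speed, so it shows only the best case for the looped approach.

```python
from fractions import Fraction

ROUNDS_PER_SHA256 = 64
SHA256_PER_BITCOIN_HASH = 2
TOTAL_ROUNDS = ROUNDS_PER_SHA256 * SHA256_PER_BITCOIN_HASH  # 128 rounds per hash


def hashes_per_clock(round_circuits_per_core, cores=1):
    """Ideal bitcoin hashes per clock for `cores` identical cores.

    A core with fewer round circuits than TOTAL_ROUNDS loops over them,
    so its throughput scales with the circuits it actually has.
    """
    return Fraction(cores * round_circuits_per_core, TOTAL_ROUNDS)


if __name__ == "__main__":
    print(hashes_per_clock(128))            # one fully unrolled core
    print(hashes_per_clock(64, cores=2))    # two half-size cores, two steps each
    print(hashes_per_clock(1, cores=128))   # 128 single-round looped cores
```

In this ideal count, throughput depends only on the total number of round circuits, not on how they are grouped into cores, which matches the intuition in the example. Whether the looped version still breaks even in practice comes down to how much extra logic the looping costs.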