
Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 25.

hero member
Activity: 592
Merit: 501
We will stand and fight.
Quote
Do you know if the code will run/fit on a Spartan XC2S30? I have a ProxMark3 (RFID hacking tool) that I've been playing with the FPGA on, and I'm wondering if it's capable of mining. I can deal with the ARM code to interface between the FPGA and USB.

I'm very sad to say it is impossible...
The LX150 we use has approx. 150,000 logic cells, but the XC2S30 has fewer than 1,000 of them.
In addition, the logic cells in Spartan-6 are far more capable than those in Spartan-II.
legendary
Activity: 1260
Merit: 1000
Drunk Posts
Do you know if the code will run/fit on a Spartan XC2S30? I have a ProxMark3 (RFID hacking tool) that I've been playing with the FPGA on, and I'm wondering if it's capable of mining. I can deal with the ARM code to interface between the FPGA and USB.
hero member
Activity: 686
Merit: 564
BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous!

Not sure if the compiler catches this optimization automatically or not.
I'm reasonably sure Altera's compiler for Cyclone IV does because of the large decrease in resource usage. On Cyclone IV it uses less resources to store the partially pre-calculated T1 value than it does to store tx_state[`IDX(7)] because registering logic outputs is practically free but registering the output of another register ties up an entire LE per bit that can't be used for anything else. No idea if Xilinx's tools catch this though.

Double check me on this:

Code:
tx_pre_w <= s0(rx_w[2]) + rx_w[1];     // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Right, though tx_pre_w can be stored in w[0]'s slot that is passed on to the next loop, which saves a register.


Oooh, cunning - nice one Anoynomous! Costs a register overall due to having to get rx_w[2] out of storage, but might be worthwhile. In theory could it be cheaper to do this with s1(rx_w[14]) + rx_w[9] instead?

Code:
tx_pre_w <= s1(rx_w[15]) + rx_w[10];     // Calculate the next round's s1 + the next round's w[9].
tx_new_w <= s0(rx_w[1]) + rx_w[0] + rx_pre_w;
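Both variants above can be checked in software before spending hours on a compile. The following is only a rough Python model, not the project's Verilog: it assumes a 16-word sliding window where index 0 is the oldest word, and the names `schedule_reference` and `schedule_precalc` are made up for this sketch. It compares the one-cycle-ahead partial sum (either the s0 half or the s1 half) against the plain SHA-256 message schedule.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def s0(x):
    """SHA-256 message-schedule sigma0."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x):
    """SHA-256 message-schedule sigma1."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule_reference(block):
    """Plain message schedule: all four terms summed in the same step."""
    w = list(block)
    for t in range(16, 64):
        w.append((s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) & MASK)
    return w


def schedule_precalc(block, variant):
    """Sliding window with half of the sum computed one step ahead.

    variant "s0": pre = s0(w[2]) + w[1], as in the first snippet.
    variant "s1": pre = s1(w[15]) + w[10], as in the alternative snippet.
    """
    win = list(block)
    out = list(block)
    # Prime the partial sum for the very first step (hypothetical start-up).
    if variant == "s0":
        pre = (s0(win[1]) + win[0]) & MASK
    else:
        pre = (s1(win[14]) + win[9]) & MASK
    for _ in range(48):
        if variant == "s0":
            new_w = (s1(win[14]) + win[9] + pre) & MASK
            pre = (s0(win[2]) + win[1]) & MASK    # for the next step
        else:
            new_w = (s0(win[1]) + win[0] + pre) & MASK
            pre = (s1(win[15]) + win[10]) & MASK  # for the next step
        win = win[1:] + [new_w]                   # shift the window
        out.append(new_w)
    return out


block = [(i * 0x9E3779B1) & MASK for i in range(16)]
print(schedule_precalc(block, "s0") == schedule_reference(block))
print(schedule_precalc(block, "s1") == schedule_reference(block))
```

The model says nothing about register count or timing; it only shows that both splits produce the same W words, so the choice between them comes down to which one costs less in the fabric.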
newbie
Activity: 11
Merit: 0
Double check me on this:

Code:
tx_pre_w <= s0(rx_w[2]) + rx_w[1];     // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Right, though tx_pre_w can be stored in w[0]'s slot that is passed on to the next loop, which saves a register.
hero member
Activity: 560
Merit: 517
Quote
Well, I had used LX150_test, and I don't have an LX150 dev board, so I think I will just share my ideas here.
Oops, sorry, LX150_Test isn't really usable at the moment. I really need to add a useful README outlining all those different project variations ...

Thank you for contributing your idea!

Please take a look at the project variation I linked: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

You will find that your idea, for the most part, has already been implemented in there. Specifically look around this line.

BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous!

Not sure if the compiler catches this optimization automatically or not.

Quote
again s0_w can be calculated a loop ahead and added to  rx_w[31:0]. this way our new_w will be shortened to:
Now that, I hadn't thought of. Another fantastic catch, Anoynomous!

Double check me on this:

Code:
tx_pre_w <= s0(rx_w[2]) + rx_w[1];     // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Quote
if the above solution is applied, the calculation of new_w will be the new critical path...
The calculation of tx_state[0] is the current critical path:
Code:
t1 = rx_t1_part + e1_w + ch_w;
tx_state[0] <= t1 + e0_w + maj_w;
Which is actually pretty good, since it's implemented as only two adders.
newbie
Activity: 11
Merit: 0
if the above solution is applied, the calculation of new_w will be the new critical path...
new_w = s1_w + rx_w[319:288] + s0_w + rx_w[31:0];

again s0_w can be calculated a loop ahead and added to  rx_w[31:0]. this way our new_w will be shortened to:

new_w = s1_w + rx_w[319:288] + rx_w[31:0];

decreasing the critical path and possibly increasing the clock frequency...

Can anybody tell me the percentage of LUTs utilized after synthesis? There may be a possibility of replacing the adder logic... Smiley

newbie
Activity: 11
Merit: 0
For S6-LX150, this is probably the preferred project to start from:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test
You'll want to adjust main_pll.v:98 to 5 for 50MHz, to make the compile easier and the firmware actually usable (assuming you have the S6-LX150T dev board) without cooling.

Well, I had used LX150_test, and I don't have an LX150 dev board, so I think I will just share my ideas here.

the critical path in this circuit is "t1 = rx_state[`IDX(7)] + e1_w + ch_w + rx_w[31:0] + k"..
but k and rx_w[31:0] can be calculated one loop ahead and added to rx_state[`IDX(7)] at the point below:

state_buf[`IDX(7)] <= rx_state[`IDX(6)];


the new code should look like this:
state_buf[`IDX(7)] <= rx_state[`IDX(6)] + rx_w[31:0] + k;
----> where k and rx_w are of next loop

This will reduce the adders to:
t1 = rx_state[`IDX(7)] + e1_w + ch_w;

this should improve clock speed, provided routing issues don't interfere...
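The idea above can be sanity-checked with a small software model. This is only a sketch under assumptions, not the project's code: the name `t1_part` is borrowed from the rx_t1_part register mentioned elsewhere in the thread, the model keeps it in a separate variable rather than overwriting state slot 7, and the helper functions are invented for the example. The round constants are derived from cube roots of the first 64 primes to keep the listing short, and the result is compared against Python's hashlib.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


PRIMES = first_primes(64)
K = [icbrt(p << 96) & MASK for p in PRIMES]             # round constants
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]   # initial state


def expand(block_words):
    """Standard SHA-256 message schedule (16 words in, 64 words out)."""
    w = list(block_words)
    for t in range(16, 64):
        x, y = w[t - 15], w[t - 2]
        sig0 = rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
        sig1 = rotr(y, 17) ^ rotr(y, 19) ^ (y >> 10)
        w.append((w[t - 16] + sig0 + w[t - 7] + sig1) & MASK)
    return w


def compress(state, block_words, precalc):
    w = expand(block_words)
    a, b, c, d, e, f, g, h = state
    t1_part = (h + K[0] + w[0]) & MASK       # primed before round 0
    for t in range(64):
        e1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        e0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        if precalc:
            t1 = (t1_part + e1 + ch) & MASK  # short path: three operands
            if t < 63:
                # g becomes h in the next round, so add next k and w now.
                t1_part = (g + K[t + 1] + w[t + 1]) & MASK
        else:
            t1 = (h + e1 + ch + K[t] + w[t]) & MASK
        t2 = (e0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256_one_block(message, precalc):
    """Hash a message short enough to fit in one padded 64-byte block."""
    assert len(message) < 56
    padded = (message + b"\x80" + b"\x00" * (55 - len(message))
              + struct.pack(">Q", 8 * len(message)))
    return struct.pack(">8I", *compress(H0, struct.unpack(">16I", padded), precalc))


print(sha256_one_block(b"abc", True) == hashlib.sha256(b"abc").digest())
```

Whether the shorter adder chain actually raises the clock still depends on placement and routing, which a software model cannot show.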





hero member
Activity: 560
Merit: 517
Quote
I am having a little trouble here. I had some experience in designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize the code, and then the software said I had overused my resources. So I wanted to know, where did I go wrong?
Which project did you use?

For S6-LX150, this is probably the preferred project to start from:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test
You'll want to adjust main_pll.v:98 to 5 for 50MHz, to make the compile easier and the firmware actually usable (assuming you have the S6-LX150T dev board) without cooling.
newbie
Activity: 11
Merit: 0
hi to all,
I am having a little trouble here. I had some experience in designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize the code, and then the software said I had overused my resources. So I wanted to know, where did I go wrong?
legendary
Activity: 1270
Merit: 1000
Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:

I've got so far:

EP3C25C6 135MHz
EP2C35C6 111(108)MHz
EP2C35C8 80MHz

for the 85 °C slow timing model, after playing with the options suggested by the Timing Optimization Advisor. One observation: rerunning the compile process a second time does not always give a better or equal result (with timing-driven options on), so it could be wise to work with revisions, or make some other provision for keeping the best bitstream.

One idea I got from the numbers: for these cases, would it perform better to use only one pipeline that computes both hashes alternately, at a lower resource count, even if the two pipelines are not 100% equal?
hero member
Activity: 686
Merit: 564
Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying  Grin

As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).
I saw similar failures at one point. Try enabling register duplication for the Map stage and/or register rebalancing during synthesis. I think I can probably hit at least 140 MHz for 70 Mhash/s on SLX75 with two pipeline stages per round and both of those enabled, plus some other bits, but I need to fix some stuff and test the changes in simulation.
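To put the quoted path numbers (8 ns of routing, 2 ns of logic) in perspective, here is a back-of-the-envelope Python sketch. It assumes the clock period is simply logic delay plus routing delay and ignores setup time, clock skew and jitter; the function names are made up for this example.

```python
def fmax_mhz(logic_ns, routing_ns):
    """Upper bound on clock frequency for a register-to-register path."""
    return 1000.0 / (logic_ns + routing_ns)


def routing_share(logic_ns, routing_ns):
    """Fraction of the path delay spent in routing."""
    return routing_ns / (logic_ns + routing_ns)


print(fmax_mhz(2.0, 8.0))       # the path quoted above
print(routing_share(2.0, 8.0))
print(fmax_mhz(2.0, 4.0))       # hypothetical: same logic, half the routing
```

Under these assumptions the quoted path tops out at about 100 MHz, and halving the routing delay alone would allow roughly 167 MHz, which is why placement matters so much here.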
legendary
Activity: 1270
Merit: 1000
Hm, the datasheet gives 126 MHz performance and 1 clock cycle per hashing round. I would interpret these numbers as giving us approx. 1 MHash/s for bitcoin hashing.

There are some papers on the net on SHA-2 cores (McEvoy and another one) that are capable of running at 120 MHz on a quite old Virtex-II using about 1k LUTs/slices (not sure which), reaching similar performance numbers. They use a pipelined design which needs ca. 68 rounds for a single SHA-2 hash.

But regarding resource usage, there are only numbers for a single core, not for how the design scales up to an FPGA full of cores.

Apologies if I have misunderstood, but the datasheet says the IP core can run at 126 MHz and provide a throughput of 977 Mbps.


That means approx. 8 bits/clk.
And it also means a bitcoin hashing rate of 3.8 MH/s (1 bitcoin hash is 256 bits of data, is that right?).

IMHO no: one bitcoin hash uses 2 'normal' SHA-256 hashes. But this would give 1.9 MHash/s, which is still double what the MHz/64 = MHash/s assumption gives. I have no clue how the throughput is counted; my understanding so far was that it is the output data rate when hashing 64-byte chunks of input data. (If the input data to be hashed is larger than 64 bytes, it is processed in 64-byte chunks, each folded into the 256-bit state, so the output does not grow in size.)
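The disagreement above comes down to what the 977 Mbps figure counts. The Python sketch below just redoes the arithmetic from the thread under two readings; it is not taken from the datasheet. The first reading divides by 256 output bits per hash, as in the 228 MH/s estimate. The second assumes the figure counts 512-bit input blocks and that each nonce costs two block compressions (one for the rest of the header after the midstate, one for the second hash). The function name and dictionary keys are invented for the example.

```python
def core_estimate(core_mbps, clock_mhz, slices_per_core, chip_slices, usable_fraction):
    bits_per_clock = core_mbps / clock_mhz
    cycles_per_block = 512 / bits_per_clock          # if Mbps counts input bits
    cores = int(chip_slices * usable_fraction // slices_per_core)
    total_mbps = cores * core_mbps
    return {
        "bits_per_clock": bits_per_clock,
        "cycles_per_block": cycles_per_block,
        "cores": cores,
        "mhs_if_256_bits_per_hash": total_mbps / 256,     # reading 1
        "mhs_if_two_blocks_per_hash": total_mbps / 1024,  # reading 2
    }


est = core_estimate(core_mbps=977, clock_mhz=126, slices_per_core=309,
                    chip_slices=23038, usable_fraction=0.8)
for name, value in est.items():
    print(name, round(value, 2))
```

With these inputs the core count floors to 59 (the post rounds to 60), the first reading lands near 225 MH/s and the second near 56 MH/s. The roughly 66 cycles per 512-bit block implied by the second reading fits the "1 clock cycle per hashing round" remark, but that remains an interpretation.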
hero member
Activity: 592
Merit: 501
We will stand and fight.

EDIT1:
I found this :
http://www.heliontech.com/downloads/fast_hash_xilinx_datasheet.pdf#view=Fit
This commercial IP core uses 309 slices (the SLX150 has 23,038 of them) and delivers a throughput of 977 Mbps.
If we use 80% of the slices of one SLX150, we can implement 60 of these cores, for a throughput of about 58 Gbps, or about 228 MH/s.

So reaching 200 MH/s is very possible, isn't it?

Hm, the datasheet gives 126 MHz performance and 1 clock cycle per hashing round. I would interpret these numbers as giving us approx. 1 MHash/s for bitcoin hashing.

There are some papers on the net on SHA-2 cores (McEvoy and another one) that are capable of running at 120 MHz on a quite old Virtex-II using about 1k LUTs/slices (not sure which), reaching similar performance numbers. They use a pipelined design which needs ca. 68 rounds for a single SHA-2 hash.

But regarding resource usage, there are only numbers for a single core, not for how the design scales up to an FPGA full of cores.

Apologies if I have misunderstood, but the datasheet says the IP core can run at 126 MHz and provide a throughput of 977 Mbps.


That means approx. 8 bits/clk.
And it also means a bitcoin hashing rate of 3.8 MH/s (1 bitcoin hash is 256 bits of data, is that right?).

legendary
Activity: 1270
Merit: 1000

EDIT1:
I found this :
http://www.heliontech.com/downloads/fast_hash_xilinx_datasheet.pdf#view=Fit
This commercial IP core uses 309 slices (the SLX150 has 23,038 of them) and delivers a throughput of 977 Mbps.
If we use 80% of the slices of one SLX150, we can implement 60 of these cores, for a throughput of about 58 Gbps, or about 228 MH/s.

So reaching 200 MH/s is very possible, isn't it?

Hm, the datasheet gives 126 MHz performance and 1 clock cycle per hashing round. I would interpret these numbers as giving us approx. 1 MHash/s for bitcoin hashing.

There are some papers on the net on SHA-2 cores (McEvoy and another one) that are capable of running at 120 MHz on a quite old Virtex-II using about 1k LUTs/slices (not sure which), reaching similar performance numbers. They use a pipelined design which needs ca. 68 rounds for a single SHA-2 hash.

But regarding resource usage, there are only numbers for a single core, not for how the design scales up to an FPGA full of cores.
hero member
Activity: 592
Merit: 501
We will stand and fight.
Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying  Grin
Good luck!  Wink


Besides pipelining, there is another way to enhance performance in IC design: logic replication.
I haven't read the code yet (I'm putting all my spare time into the dual XC6SLX150 mining board design), but after reading this thread: if we are facing terrible routing problems, why not try another architecture?
One possible way is to implement a rolled-up, optimized core, something like a calculation unit (maybe better built with DSP48As) around a single 512-bit register (maybe using LUTs as distributed RAM instead of registers), running at 200 MHz+ (very possible) and taking about 64 clocks per hash. We could implement 100+ of them per chip.
This way, we could also reach a very high MH/s.

Certainly, I'm not an expert; this is just for discussion.
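For a rough feel of what such a rolled-up design would deliver, here is a small Python estimate. It is only a sketch: the defaults assume 64 clocks per SHA-256 compression and two compressions per bitcoin nonce, and the function name is invented for this example. The figure of 100 cores at 200 MHz comes from the proposal above and has not been shown to fit or route.

```python
def rolled_core_mhs(cores, clock_mhz, clocks_per_compression=64, compressions_per_nonce=2):
    """Aggregate MH/s for identical rolled cores, one round per clock each."""
    return cores * clock_mhz / (clocks_per_compression * compressions_per_nonce)


# 100 cores at 200 MHz, counting both compressions needed per nonce.
print(rolled_core_mhs(100, 200))
# Same chip if "64 clocks per hash" already covered a whole bitcoin hash.
print(rolled_core_mhs(100, 200, compressions_per_nonce=1))
```

The two results (156.25 and 312.5 MH/s) bracket the proposal, so the outcome hinges on how many clocks one full bitcoin hash really takes in such a core.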


EDIT1:
I found this :
http://www.heliontech.com/downloads/fast_hash_xilinx_datasheet.pdf#view=Fit
This commercial IP core uses 309 slices (the SLX150 has 23,038 of them) and delivers a throughput of 977 Mbps.
If we use 80% of the slices of one SLX150, we can implement 60 of these cores, for a throughput of about 58 Gbps, or about 228 MH/s.

So reaching 200 MH/s is very possible, isn't it?
newbie
Activity: 35
Merit: 0
Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying  Grin

As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).

Yeah, it looks like a "giant snake" that traverses the chip Cheesy

Quote
The current critical path is approximately two 3-way 32-bit adders implemented as 16 total slices, thanks to the Spartan-6 fast carry look ahead chains. Is there a means of optimizing that logic that I have missed?

These are the adders that I tried to move into DSP48s, as they have dedicated carry paths to and from adjacent DSPs in a column. I didn't look at how to optimize the actual math/operations at all, though.
hero member
Activity: 560
Merit: 517
Quote
My criticism of this design (your design?)  is that there is too much pipelining.
Thank you for the criticism. I really do appreciate the feedback, and I am by no means an expert Smiley

My intuition is similar to yours, in that a more traditional serial design should achieve better utilization and performance on the Spartan-6 architecture. But it is very easy to underestimate the massive amount of optimizations that occur in the fully unrolled design that takes my current primary focus.

I have a functioning serial implementation, but so far my estimates for its total performance once put in parallel on the S6-LX150 is not exciting. Something like 120MH/s of performance. It's in the back of my mind, and there is plenty more work to be done in optimizing and perfecting it, but it hasn't shown me enough promise to warrant being in my mental spotlight like the unrolled design.

Quote
The logic you are using to compute the basic hashes is not optimal, and you have not spent any time trying to optimize for your critical path.
The current critical path is approximately two 3-way 32-bit adders implemented as 16 total slices, thanks to the Spartan-6 fast carry look ahead chains. Is there a means of optimizing that logic that I have missed?
newbie
Activity: 54
Merit: 0
Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying  Grin
Good luck!  Wink

Quote
As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).
My criticism of this design (your design?)  is that there is too much pipelining.  If you have ever taken a computer architecture class, pipelining can be a very serious impediment to having a high speed design.  This sounds counter-intuitive, but the cost you pay for all those registers is very high.  This can be mitigated quite a bit on FPGAs, since registers are part of each CLB (and in a sense they can come for "free" if you have enough combinatorial logic).  This is certainly part of your routing problem.
Quote
If W is buffered between each round as a 512-bit register, instead of chains of shift registers and BRAMs, then the rounds can be isolated, but ISE fails to Map such a design for reasons I have not yet nailed down. 512-bits*~100 is quite a lot of registers  Undecided

If I, or someone else, can find a way to isolate the rounds and put them into a more consistent chain, then I highly suspect that both performance and area will improve considerably.

I may create a "fake" design that focuses specifically on the W calculations (without digester rounds), and see if I can somehow get them routed into a sensible structure (even if it requires manual placement  Angry )
The roadblock to having a high density FPGA design (in this case) is not your routing issues.  The logic you are using to compute the basic hashes is not optimal, and you have not spent any time trying to optimize for your critical path.  I would suggest you concentrate your efforts in this domain (hint hint).  Keep in mind that you are duplicating each round 128 times, so any logic savings per round is magnified by 128x.
hero member
Activity: 560
Merit: 517
Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying  Grin

As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).

If W is buffered between each round as a 512-bit register, instead of chains of shift registers and BRAMs, then the rounds can be isolated, but ISE fails to Map such a design for reasons I have not yet nailed down. 512-bits*~100 is quite a lot of registers  Undecided

If I, or someone else, can find a way to isolate the rounds and put them into a more consistent chain, then I highly suspect that both performance and area will improve considerably.

I may create a "fake" design that focuses specifically on the W calculations (without digester rounds), and see if I can somehow get them routed into a sensible structure (even if it requires manual placement  Angry )
newbie
Activity: 54
Merit: 0
Quote
But the question is: how many MH/s will two XC6SLX150 -3N parts finally give us? For FPGA mining to be a feasible choice, its MH/s per $ must come close to GPU mining.
One HD6870 (new HD5850s are now difficult to buy) costs about $180 and provides a hashing power of 270 MH/s, about 1.5 MH/s per $. I think at least 1 MH/s per $ is necessary for FPGA mining.
Can we optimize the XC6SLX150 to about 200 MH/s? Is it possible?

200MH/s is not possible on this part.  Artforz was able to tweak his to approx 118MH, and that was with a well optimized design with overclocking.  This open-source design isn't terribly efficient, but actually produces a very good result of approx 100MH on the Spartan 150 if the design can be routed at 100 MHz.

200MH is simply way out of the question for an S6-LX150.
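The cost comparison in this exchange is easy to redo with other numbers. The Python sketch below uses only figures quoted in the thread (270 MH/s for about $180 on the GPU side, a 1 MH/s per $ target, 100 to 200 MH/s per FPGA); no FPGA board price appears in the thread, so the output is the price a board would have to hit, and the function names are made up for this example.

```python
def mhs_per_dollar(mhs, price_usd):
    """Hash rate bought per dollar of hardware."""
    return mhs / price_usd


def max_price_for_target(mhs, target_mhs_per_dollar):
    """Highest hardware price that still meets the MH/s per $ target."""
    return mhs / target_mhs_per_dollar


print(mhs_per_dollar(270, 180))         # GPU figure from the thread
print(max_price_for_target(100, 1.0))   # FPGA at the ~100 MH/s estimate
print(max_price_for_target(200, 1.0))   # FPGA at the hoped-for 200 MH/s
```

This leaves out power draw, which was often cited in favour of FPGAs, so it only covers the purchase-price side of the argument.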