
Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 44. (Read 432950 times)

newbie
Activity: 54
Merit: 0
3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values as they are in the code on the public repo. What that code doesn't do, however, is just add K with the known W values, thus saving you an adder for the first 16 rounds.

After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.

I just wanted to add that this statement is not true. Computing later values of W (beyond the initial 16) uses the mixing functions S0() and S1() (in SHA nomenclature), which do not distribute over addition: S0(W[0]) + A is not the same as S0(W[0] + A). The nonce feeds the S0() function, so you need to recompute the entire W block whenever it changes.

It's a great idea (I tried this too), but it just doesn't work. Wink
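
For anyone who wants to check how far the nonce actually reaches into W, here is a rough sketch. It assumes the standard SHA-256 schedule recurrence W[i] = s1(W[i-2]) + W[i-7] + s0(W[i-15]) + W[i-16] and the nonce at W[3] (0 based), as stated in the thread. The function names are made up for illustration. It only tracks dependencies and does not compute any real hash values.

```python
# Sketch: which SHA-256 message-schedule words depend on the nonce?
# Assumption: standard recurrence for i >= 16
#   W[i] = s1(W[i-2]) + W[i-7] + s0(W[i-15]) + W[i-16]
# Assumption: the nonce sits at W[3] (0-based), as described in the thread.

NONCE_INDEX = 3
TAPS = (2, 7, 15, 16)  # offsets of the four terms feeding W[i]

def nonce_dependency(rounds=64):
    """Return a list of booleans: True where W[i] depends on the nonce."""
    dep = [i == NONCE_INDEX for i in range(16)]
    for i in range(16, rounds):
        dep.append(any(dep[i - t] for t in TAPS))
    return dep

def constant_terms(dep):
    """For each i >= 16, count how many of its four terms are nonce-free."""
    return {i: sum(not dep[i - t] for t in TAPS) for i in range(16, len(dep))}

dep = nonce_dependency()
terms = constant_terms(dep)

# Words that never see the nonce at all.
fully_constant = [i for i in range(16, 64) if not dep[i]]
# Words that depend on the nonce but still contain at least one constant term.
partly_constant = [i for i in range(16, 64) if dep[i] and terms[i] > 0]

print("fully nonce-independent W[i], i >= 16:", fully_constant)
print("last W[i] with at least one constant term:", max(partly_constant))
```

Under these assumptions only W[16] and W[17] come out fully independent of the nonce, and W[33] is the last word that still has a nonce-free term. A nonce-free term can be pre-added off-chip, but a term that passes the nonce through S0() or S1() has to be recomputed on the FPGA.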
hero member
Activity: 560
Merit: 517
Quote
Regarding your optimizations, I'm fairly sure that at least part 1 and possibly also part 4 will be performed automatically by xst.
I don't trust synthesizers. Tongue I've gotten optimizations by simply moving things into separate modules to clearly define the hierarchy.

Quote
Have you tried routing it at a lower clock frequency? Start out with something ridiculous like 10MHz to check if the design can be routed at all, and then increase it until things get tight.
Yeah, I will have to try that. Last time I tried, it was at 50MHz.

I tried routing 16 rounds only. Worked fine. 32 was fine. But then at 64 it failed. So it's doing something weird ...

Quote
If it doesn't route for 10MHz as well, my suspicion would be that the Spartan series just don't support as flexible routing as the Virtex series do.
I certainly hope not. The Spartan series is supposed to compete with Altera's Cyclone which routes these designs without any problems. Their architectures are certainly different, but if it can't even route a 64-round SHA-256 design on an LX150 then ... I dunno what to think about Xilinx.

Quote
Thank you! It will take me some time to double check it and commit it to the repo; it's a busy week at my job. I'm posting these replies during compilation rounds  Wink

Anyway, this is great input TheSeven, I really appreciate it. It makes me a little more worried about cramming the design into my LX150, but it's all good discussion and I'll keep working at it regardless.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!  Cheesy

Nice one, liked it Smiley

Quote
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.
Sure, that'd be great!

http://dl.dropbox.com/u/23683845/fpgaminer-xilinx.zip

Quote
Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know which frequency it reached yet.
You mention 131 pipelined stages. I assume you mean something like, 128 round modules, 2 final adder stages, and a third ... misc stage?

2x 64 digesters, 2 final adders and the =0 comparison

My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimation on equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx, as compared to Altera.

The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming; at 80MHz and one cycle per full-hash. Assuming LE == LC that would allow two of those to fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash-rate of 160MH/s.

Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I've synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do; figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.

I don't think two of those pipelines actually fit in there. I've got 60% slice usage on the LX150, and on the LX100 the placer failed even though it has more slices than slices actually used on the LX150.
Have you tried routing it at a lower clock frequency? Start out with something ridiculous like 10MHz to check if the design can be routed at all, and then increase it until things get tight.
If it doesn't route for 10MHz as well, my suspicion would be that the Spartan series just don't support as flexible routing as the Virtex series do.

Regarding your optimizations, I'm fairly sure that at least part 1 and possibly also part 4 will be performed automatically by xst.
hero member
Activity: 560
Merit: 517
Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!  Cheesy

Quote
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.
Sure, that'd be great!

Quote
Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know which frequency it reached yet.
You mention 131 pipelined stages. I assume you mean something like, 128 round modules, 2 final adder stages, and a third ... misc stage?

My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimation on equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx, as compared to Altera.

The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming; at 80MHz and one cycle per full-hash. Assuming LE == LC that would allow two of those to fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash-rate of 160MH/s.

Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I've synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do; figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.

Quote
Could you provide some details on your optimizations? Is the optimized verilog code available somewhere?
The code isn't on the public repo yet, no, because it's a work in progress and I haven't taken the time to put it up yet. It's on the list of things to do Wink

There are four classes of optimizations:

1) The last 3 rounds of the second SHA-256 pass are not needed. You only need to check that the final H word (Round64.H plus the initial state's H) is equal to 0, and the last three rounds do not affect H. Here's the math:

Code:
// Round numbers are 1 based, so we go from Round1 to Round64
Round61.E = Round60.D + Round60.H + s1 + ch + k[61] + w[61]
Round62.F = Round61.E
Round63.G = Round62.F
Round64.H = Round63.G

Let's simplify:

Round64.H == Round63.G == Round62.F == Round61.E == Round60.D + Round60.H + s1 + ch + k[61] + w[61]
32'h00000000 == Round64.H + InitialState.H  // the check we want, all additions are mod 2^32
32'h00000000 == Round60.D + Round60.H + s1 + ch + k[61] + w[61] + InitialState.H
32'h00000000 == Round60.D + Round60.H + s1 + ch + 32'h90befffa + w[61] + 32'h5be0cd19  //K and InitialState are known
32'h136032ED == Round60.D + Round60.H + s1 + ch + w[61]

That allows you to remove 3 full rounds and 9 adders.


2) Pre calculating the first few stages of the first SHA-256 pass is also possible. The first pass is calculated from the 512-bit DATA string that getwork requests give us, and the MIDSTATE. Between pieces of work, all of that data remains constant, except for the nonce, which we are increasing. The nonce is at W[3] (0 based), and W[3] isn't used until Round4. So, you can run the first 3 rounds of SHA-256 on a controller (PC, microcontroller, whatever) before handing the "work" to the FPGA. The FPGA then picks up where you left off and never has to calculate those first 3 rounds.

That amounts to giving the FPGA Midstate, Data, and Midstate', where Midstate' is the 256-bit state as of your pre-calced Round3.


3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values as they are in the code on the public repo. What that code doesn't do, however, is just add K with the known W values, thus saving you an adder for the first 16 rounds.

After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.


4) The final optimization is this:

Code:
t1 := h + s1 + ch + k[i] + w[i]
Having to add K and W makes the adder tree larger. K is a constant and always known. W can be calculated at any point before a given round. H is also known at least one round ahead of time (as you've learned in the first optimization). So it's possible to do this calculation in the previous round:

Code:
pre-t1 = g + k[i+1] + w[i+1]

and then in the next round:

Code:
t1 := pre-t1 + s1 + ch

That shrinks the critical path down, allowing for higher clock rates. Note that I haven't implemented that particular optimization, so double check that I did my math right.
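
Since the author asks for the math to be double checked, here is a small software sketch that compares the plain round against the pre-t1 variant from optimization 4. It uses the standard SHA-256 round functions, but the K and W arrays are random placeholders (not the real constants or a real message schedule), and all function names are invented for this example. It models only the arithmetic, not timing.

```python
import random

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def big_s0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)

def big_s1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)

def ch(e, f, g):
    return (e & f) ^ (~e & MASK & g)

def maj(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)

def step(state, t1):
    """Apply one round given an already computed t1."""
    a, b, c, d, e, f, g, h = state
    t2 = (big_s0(a) + maj(a, b, c)) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def run_direct(state, k, w):
    """Reference: t1 := h + s1 + ch + k[i] + w[i] inside every round."""
    for i in range(64):
        a, b, c, d, e, f, g, h = state
        t1 = (h + big_s1(e) + ch(e, f, g) + k[i] + w[i]) & MASK
        state = step(state, t1)
    return state

def run_with_pre_t1(state, k, w):
    """Variant: h + k + w for round i+1 is formed during round i from g."""
    pre_t1 = (state[7] + k[0] + w[0]) & MASK  # first round has no predecessor
    for i in range(64):
        a, b, c, d, e, f, g, h = state
        t1 = (pre_t1 + big_s1(e) + ch(e, f, g)) & MASK
        if i + 1 < 64:
            # g of this round becomes h of the next round
            pre_t1 = (g + k[i + 1] + w[i + 1]) & MASK
        state = step(state, t1)
    return state

rng = random.Random(0)
init = tuple(rng.getrandbits(32) for _ in range(8))
k = [rng.getrandbits(32) for _ in range(64)]  # placeholder constants
w = [rng.getrandbits(32) for _ in range(64)]  # placeholder schedule

results_match = run_direct(init, k, w) == run_with_pre_t1(init, k, w)
print("pre-t1 variant matches direct rounds:", results_match)
```

The two loops agree because the round only shifts g into h, so the value of h needed in the next round is already available one round earlier. Whether this really shortens the critical path on a given FPGA is a separate question that only synthesis can answer.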
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
@fpgaminer: As you apparently seem to know the altera side of things rather well, and as the altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a copacobana-like miner design?
It seems like the Xilinx devices are far cheaper, for reasons I can't explain. The only problem is my code doesn't currently support Xilinx. I'm working on it, but until it's done there's no way to know for sure if the utilization and performance is similar.

Really? I had the opposite impression.
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.

Back-of-the-napkin:
You'd want to go with, obviously, the chips giving the lowest $ per MH/s figure. The price scaling on the chips is very non-linear. A quick scan over the Altera Cyclone 4 series shows that a Cyclone 4, C40-7N chip has the lowest (best) ratio. It's $82.96 in singles from Altera's website and would produce an estimated 40MH/s (if you can fit a half-mining core in it). That's about $2.07 per MH/s. You'd probably get a small discount if you're buying 128 to build a big array, or if their sales team can price match Xilinx's offerings. 128 of those babies would get you ~5GH/s and consume ~320 Watts.

For reference, Xilinx's SLX150-3N chips are ~$120 in some quantity, with an estimated performance of 160MH/s. That's $0.75 per MH/s.

Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know which frequency it reached yet.

Quote
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge.
I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
Wow, I'm surprised utilization is 97%, but I guess that's fairly close to the 90K LEs that my unoptimized version uses on Cyclone 3/4. You can apply the usual optimizations (last 3 rounds aren't needed, pre-calced W, etc) to get better utilization but I'm not sure you could cram much into whatever space that would save you. You can also play with the adder trees a bit, moving W + k into the prior rounds to improve MHz performance.

This is where you apparently know more about the algorithm than I do. I've just translated your code to VHDL.
Could you provide some details on your optimizations? Is the optimized verilog code available somewhere?
Getting rid of 3 rounds would very likely allow the clock frequency to be increased even further.
EDIT: Now that I think about it, I see how the last rounds could be removed, but I'm fairly certain that the synthesis tool was clever enough to have done that already.
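
For reference, the "last 3 rounds aren't needed" shortcut can be checked in software. The sketch below assumes the standard SHA-256 round structure; K and W are random placeholders rather than the real constants, and the helper names are made up. It shows that H after round 64 equals E after round 61 (1-based round numbers, as in the thread), and recomputes the comparison constant from the two values quoted in the thread.

```python
import random

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha_round(state, k, w):
    """One SHA-256 compression round on the working state (a..h)."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & MASK & g)
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t1 = (h + s1 + ch + k + w) & MASK
    t2 = (s0 + maj) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

rng = random.Random(1)
state = tuple(rng.getrandbits(32) for _ in range(8))
k = [rng.getrandbits(32) for _ in range(64)]  # placeholder constants
w = [rng.getrandbits(32) for _ in range(64)]  # placeholder schedule

e_after = {}  # E register after each round, keyed by 1-based round number
for i in range(64):
    state = sha_round(state, k[i], w[i])
    e_after[i + 1] = state[4]

round64_h = state[7]
shift_holds = round64_h == e_after[61]  # E -> F -> G -> H over three rounds

# Constant from the thread: 0 == x + k[61] + InitialState.H  (mod 2^32)
K61 = 0x90BEFFFA        # k[61] in 1-based numbering, as quoted in the thread
INITIAL_H = 0x5BE0CD19  # initial state word H, as quoted in the thread
target = (-(K61 + INITIAL_H)) & MASK

print("Round64.H == Round61.E:", shift_holds)
print("comparison constant: 0x%08X" % target)
```

The constant comes out as 0x136032ED, matching the value given in the thread. Whether a synthesis tool finds this simplification on its own is not something this sketch can answer.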
hero member
Activity: 560
Merit: 517
Quote
@fpgaminer: As you apparently seem to know the altera side of things rather well, and as the altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a copacobana-like miner design?
It seems like the Xilinx devices are far cheaper, for reasons I can't explain. The only problem is my code doesn't currently support Xilinx. I'm working on it, but until it's done there's no way to know for sure if the utilization and performance is similar.

Back-of-the-napkin:
You'd want to go with, obviously, the chips giving the lowest $ per MH/s figure. The price scaling on the chips is very non-linear. A quick scan over the Altera Cyclone 4 series shows that a Cyclone 4, C40-7N chip has the lowest (best) ratio. It's $82.96 in singles from Altera's website and would produce an estimated 40MH/s (if you can fit a half-mining core in it). That's about $2.07 per MH/s. You'd probably get a small discount if you're buying 128 to build a big array, or if their sales team can price match Xilinx's offerings. 128 of those babies would get you ~5GH/s and consume ~320 Watts.

For reference, Xilinx's SLX150-3N chips are ~$120 in some quantity, with an estimated performance of 160MH/s. That's $0.75 per MH/s.
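
The napkin math is easy to redo for other chips. A minimal sketch, using only the prices and estimated hash rates quoted in this post (they are the poster's estimates, not measurements); the function and dictionary names are made up for the example.

```python
def dollars_per_mhs(price_usd, est_mhs):
    """Cost efficiency: lower is better."""
    return price_usd / est_mhs

# (price in USD, estimated MH/s), both taken from the post above
chips = {
    "Cyclone 4 C40-7N": (82.96, 40.0),
    "SLX150-3N": (120.0, 160.0),
}

for name, (price, rate) in chips.items():
    print(f"{name}: ${dollars_per_mhs(price, rate):.2f} per MH/s")

# Size of the hypothetical array from the post
array_size = 128
array_ghs = array_size * chips["Cyclone 4 C40-7N"][1] / 1000.0
print(f"{array_size} x C40: {array_ghs:.2f} GH/s")
```

Swap in current prices and measured hash rates to compare other parts; the ranking depends entirely on whether the estimated rates hold up after routing.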

Quote
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge.
I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
Wow, I'm surprised utilization is 97%, but I guess that's fairly close to the 90K LEs that my unoptimized version uses on Cyclone 3/4. You can apply the usual optimizations (last 3 rounds aren't needed, pre-calced W, etc) to get better utilization but I'm not sure you could cram much into whatever space that would save you. You can also play with the adder trees a bit, moving W + k into the prior rounds to improve MHz performance.
full member
Activity: 121
Merit: 100
I am going to email them and request some more detail.
They aren't going to tell you how it works. They want to sell it. Smiley

Haha I know, just wondering about pricing n maybe could request a sample or something :3

Also See below.

http://opencores.org/project,sha_core
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
I'm doing some research on this topic and found http://www.heliontech.com/fast_hash.htm
thoughts?
Hm. Might be ~20% faster than mine, according to their datasheet, but will need more difficult handling.
I am going to email them and request some more detail.
They aren't going to tell you how it works. They want to sell it. Smiley

Quote
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
Very nice  Cool That's a Virtex 5, right? How much utilization are you getting? If you have some spare space you could update your VHDL code to support parameterized unrolling like the latest update and squeeze another hashing core in there.
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge.
I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
hero member
Activity: 560
Merit: 517
Quote
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
Very nice  Cool That's a Virtex 5, right? How much utilization are you getting? If you have some spare space you could update your VHDL code to support parameterized unrolling like the latest update and squeeze another hashing core in there.


Quote
With optimisation set for minimum LEs, it uses 22330. There are 22320 available.

Oh well, it fits with 4
Try commenting out line 107, the virtual_wire for "NONC". It isn't currently used and should save a few LEs.
full member
Activity: 121
Merit: 100
I'm doing some research on this topic and found http://www.heliontech.com/fast_hash.htm

thoughts?

I am going to email them and request some more detail.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
newbie
Activity: 17
Merit: 0
Has anyone run this miner on a Terasic DE0-Nano board? How many Mhash/s do you get?
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
June 2nd, 2011 - Flexible Unrolling Added: Smaller Device Support

Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at the equivalent cost of speed, which should allow it to run on many more FPGAs.

If you're interested in trying the code on a smaller FPGA, open the projects/DE2_115_Unoptimized_Pipelined project in Quartus. Then go to Assignments->Settings->Analysis & Synthesis Settings->Verilog HDL Input. You should see a CONFIG_LOOP_LOG2 macro setting, which you can set from 0 to 5. 0 gives full unrolling (largest, fastest), and 5 gives the smallest design. You will also need to go to Assignments->Device and choose your FPGA, and set the correct clock pin in Assignments->Pin Planner. Then just compile and program!

If you would like help adjusting the design for your specific chip&board, just let me know. You just need to know the pin location of a clock source, and which Altera FPGA it is (specifically, including package and speed grade).

+1

Well done!

No luck for DE0-Nano users. With the setting at 3:

Code:
Error: Fitter requires 1397 LABs to implement the project, but the device contains only 1395 LABs

With optimisation set for minimum LEs, it uses 22330. There are 22320 available.

Oh well, it fits with 4 Tongue
I bet that you could get rid of those 10 LEs somehow. However, this still wouldn't mean that the design can be successfully routed, and even if it can be, timing performance would be awful. It's likely that you'll get more MH/s with the smaller version.

@fpgaminer: As you apparently seem to know the altera side of things rather well, and as the altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a copacobana-like miner design? Smiley
full member
Activity: 196
Merit: 100
June 2nd, 2011 - Flexible Unrolling Added: Smaller Device Support

Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at the equivalent cost of speed, which should allow it to run on many more FPGAs.

If you're interested in trying the code on a smaller FPGA, open the projects/DE2_115_Unoptimized_Pipelined project in Quartus. Then go to Assignments->Settings->Analysis & Synthesis Settings->Verilog HDL Input. You should see a CONFIG_LOOP_LOG2 macro setting, which you can set from 0 to 5. 0 gives full unrolling (largest, fastest), and 5 gives the smallest design. You will also need to go to Assignments->Device and choose your FPGA, and set the correct clock pin in Assignments->Pin Planner. Then just compile and program!

If you would like help adjusting the design for your specific chip&board, just let me know. You just need to know the pin location of a clock source, and which Altera FPGA it is (specifically, including package and speed grade).

+1

Well done!

No luck for DE0-Nano users. With the setting at 3:

Code:
Error: Fitter requires 1397 LABs to implement the project, but the device contains only 1395 LABs

With optimisation set for minimum LEs, it uses 22330. There are 22320 available.

Oh well, it fits with 4 Tongue
hero member
Activity: 560
Merit: 517
June 2nd, 2011 - Flexible Unrolling Added: Smaller Device Support

Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at the equivalent cost of speed, which should allow it to run on many more FPGAs.

If you're interested in trying the code on a smaller FPGA, open the projects/DE2_115_Unoptimized_Pipelined project in Quartus. Then go to Assignments->Settings->Analysis & Synthesis Settings->Verilog HDL Input. You should see a CONFIG_LOOP_LOG2 macro setting, which you can set from 0 to 5. 0 gives full unrolling (largest, fastest), and 5 gives the smallest design. You will also need to go to Assignments->Device and choose your FPGA, and set the correct clock pin in Assignments->Pin Planner. Then just compile and program!

If you would like help adjusting the design for your specific chip&board, just let me know. You just need to know the pin location of a clock source, and which Altera FPGA it is (specifically, including package and speed grade).
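
To get a feel for the trade-off before compiling, here is a rough sketch of how the CONFIG_LOOP_LOG2 setting maps to round modules and hash rate. It assumes an idealized model in which the fully unrolled design produces one hash per clock and each step of folding halves the throughput; the function names are made up, and the achievable clock on a given chip will differ.

```python
def round_modules(loop_log2):
    """Total round modules for a CONFIG_LOOP_LOG2 value (0 = fully unrolled)."""
    if not 0 <= loop_log2 <= 5:
        raise ValueError("CONFIG_LOOP_LOG2 must be between 0 and 5")
    return 128 >> loop_log2

def hash_rate_mhs(clock_mhz, loop_log2):
    """Idealized hash rate: one hash every 2**loop_log2 clock cycles."""
    if not 0 <= loop_log2 <= 5:
        raise ValueError("CONFIG_LOOP_LOG2 must be between 0 and 5")
    return clock_mhz / (1 << loop_log2)

clock_mhz = 80.0  # example clock only, pick whatever your fit achieves
for setting in range(6):
    print(f"CONFIG_LOOP_LOG2={setting}: "
          f"{round_modules(setting)} round modules, "
          f"~{hash_rate_mhs(clock_mhz, setting):.2f} MH/s at {clock_mhz:.0f} MHz")
```

In practice a smaller design often closes timing at a higher clock, so the real numbers can be somewhat better than this model suggests for the folded settings.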
newbie
Activity: 1
Merit: 0
I also made two attempts (after converting the code to VHDL):

- a serialized one, where I want to fit the FFs into block RAMs

- a parallel one, with one or more skipped FF stages between each remaining stage

The first should be quite good for devices with lots of small block RAMs but few FFs (old Altera), or any other small device. I'm still working on it.

The second is great if your device is too small for 80k but you still want a full chain working, because some logic can be shared (64 unrolled stages need only as much logic as ~40 single stages).
With one skipped FF stage I only need 40k FF/LUT pairs on a V6 running at 70-80 MHz (not tweaked). The LUT count obviously doesn't decrease much here, so you need either 6/7-input LUTs or a very special device, as most have more FFs than LUTs.


I will try to get a version with around 15 MH/s on a BeMicro stick ($49) or maybe ~20 MH/s on a BeMicro SDK ($79).

This will make for very easy expansion and easy PC communication. One can easily plug 10 of these into one PC.


I will post the code once it fits well on a specific device.
sr. member
Activity: 520
Merit: 253
555
Care to share your code? Many hands make light work, etc etc Smiley
I will, once I've successfully found the first proof of work using it. Well, unless you would like to write the python part for it.

I could probably do this if I had a suitable FPGA to try your code...

I've been playing around with a Spartan 3E board for a couple of weeks, and I use RS232 for similar reasons as you. I've found pyserial a nice and simple way to communicate with the board, as I already know Python quite well.

I planned to implement a toy miner using the Opencores sha256 module, and use jgarzik's pyminer on the computer end. Unfortunately, I keep having trouble with some details. The output side of that module seems rather erratic (confirmed by other people's bug reports), and I'm still learning the very basics of Verilog. Also, it seems that open source miners for low-end FPGAs will soon be available anyway.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Any particular reason you didn't want to use the JTAG for the host communication?  It'd make it so much easier to use random boards...
Really? Which boards don't have a serial interface?
I chose to not stick with JTAG as I want the software to run outside of those bulky FPGA vendors' tools. In my case I want to run the high-level code on a small ARM board that's hooked up to the FPGA board via RS232, for which there are no drivers for my Xilinx JTAG cable available.
This also allows me to get rid of tcl and use python for the "PC"-side part.

Care to share your code? Many hands make light work, etc etc Smiley
I will, once I've successfully found the first proof of work using it. Well, unless you would like to write the python part for it.
newbie
Activity: 3
Merit: 0

There is also a great Pull request that was submitted a day or two ago. It allows the design to scale down to fit into smaller chips, which I know a lot of people have been waiting for. I'm just waiting for some free time to open up so I can dive in, test the new patch out, and merge it. Many thanks to udif for submitting such a wonderful improvement!

The verilog code was updated on my fork of fpgaminer's git, and seems to be working under the simulator. I will try on real HW later today.
You should now be able to fold the 2x64 pipe stages to 2xN stages where N is 1, 2, 4, ..., 64 (for N=64 it behaves as the original code).
Of course, folding the HW pipe into loops means that it will run 64/N times slower.

I was able to fit an EP3C25 at >90% capacity with N=8.
newbie
Activity: 8
Merit: 0
I did manage to synthesize a design for an xc5vlx110t-1ff1136 (which I had sitting around anyway) running at 120MH/s in the meantime. I chose to not use JTAG for communication, and implemented a simple RS232 interface instead.

The next step will be getting myself used to how all this mining business works and how to communicate with the mining pools.

Can someone provide me some checking values (256bits midstate, 96bits data, 32bits nonce) which result in a "golden ticket" (hash with the first 32bits being zero), so that I can verify that my design works correctly?
Any particular reason you didn't want to use the JTAG for the host communication?  It'd make it so much easier to use random boards...

Care to share your code? Many hands make light work, etc etc Smiley
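
On the request for checking values: one well-known input that produces a "golden ticket" is the Bitcoin genesis block header. Below is a minimal software reference, assuming the usual 80-byte header layout and double SHA-256. The helper names are made up, and this does not split the input into midstate/data/nonce or apply the word-wise byte swapping that getwork uses, so the values still need converting before they can be fed to an FPGA design.

```python
import hashlib
import struct

def block_digest(version, prev_hash_hex, merkle_root_hex, timestamp, bits, nonce):
    """Double SHA-256 of an 80-byte block header (fields are little-endian)."""
    header = struct.pack(
        "<L32s32sLLL",
        version,
        bytes.fromhex(prev_hash_hex)[::-1],    # displayed hashes are byte-reversed
        bytes.fromhex(merkle_root_hex)[::-1],
        timestamp,
        bits,
        nonce,
    )
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

def is_golden_ticket(digest):
    """True if the displayed hash starts with 32 zero bits.

    The displayed hash is the digest byte-reversed, so those bits are the
    last four bytes of the raw digest.
    """
    return digest[-4:] == b"\x00\x00\x00\x00"

# Genesis block header fields
digest = block_digest(
    version=1,
    prev_hash_hex="00" * 32,
    merkle_root_hex="4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b",
    timestamp=1231006505,
    bits=0x1D00FFFF,
    nonce=2083236893,
)

displayed_hash = digest[::-1].hex()
print(displayed_hash)
print("golden ticket:", is_golden_ticket(digest))
```

Changing the nonce by one should make the check fail, which gives a quick negative test as well.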