Pages:
Author

Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 26. (Read 432966 times)

newbie
Activity: 35
Merit: 0
I don't know the usage for the single unrolled core on a S6 150, but here's the usage from my design in the V6 240T using 2 cores.  I'm guessing you'll want to use all 180 of those DSP48s to reduce the logic usage.  The number of slices and LUTs used is going to be close to your maximum capacity, but with 372 (186 per core) DSP48s used.

Code:
Device Utilization Summary:

Slice Logic Utilization:
  Number of Slice Registers:               101,697 out of 301,440   33%
    Number used as Flip Flops:              98,581
    Number used as Latches:                      1
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:            3,115
  Number of Slice LUTs:                     88,763 out of 150,720   58%
    Number used as logic:                   67,920 out of 150,720   45%
      Number using O6 output only:          32,057
      Number using O5 output only:           1,667
      Number using O5 and O6:               34,196
      Number used as ROM:                        0
    Number used as Memory:                   9,892 out of  58,400   16%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            0
      Number used as Shift Register:         9,892
        Number using O6 output only:         7,362
        Number using O5 output only:             0
        Number using O5 and O6:              2,530
    Number used exclusively as route-thrus: 10,951
      Number with same-slice register load: 10,889
      Number with same-slice carry load:        62
      Number with other load:                    0

Slice Logic Distribution:
  Number of occupied Slices:                27,898 out of  37,680   74%
  Number of LUT Flip Flop pairs used:      105,799
    Number with an unused Flip Flop:        28,962 out of 105,799   27%
    Number with an unused LUT:              17,036 out of 105,799   16%
    Number of fully used LUT-FF pairs:      59,801 out of 105,799   56%
    Number of slice register sites lost
      to control set restrictions:               0 out of 301,440    0%

Specific Feature Utilization:
  Number of RAMB36E1/FIFO36E1s:                 40 out of     416    9%
    Number using RAMB36E1 only:                 40
    Number using FIFO36E1 only:                  0
  Number of RAMB18E1/FIFO18E1s:                  0 out of     832    0%
  Number of BUFG/BUFGCTRLs:                      5 out of      32   15%
    Number used as BUFGs:                        5
    Number used as BUFGCTRLs:                    0
  Number of ILOGICE1/ISERDESE1s:                 0 out of     720    0%
  Number of OLOGICE1/OSERDESE1s:                 0 out of     720    0%
  Number of BSCANs:                              0 out of       4    0%
  Number of BUFHCEs:                             0 out of     144    0%
  Number of BUFIODQSs:                           0 out of      72    0%
  Number of BUFRs:                               0 out of      36    0%
  Number of CAPTUREs:                            0 out of       1    0%
  Number of DSP48E1s:                          372 out of     768   48%
  Number of EFUSE_USRs:                          0 out of       1    0%
  Number of FRAME_ECCs:                          0 out of       1    0%
  Number of GTXE1s:                              4 out of      20   20%
    Number of LOCed GTXE1s:                      4 out of       4  100%
  Number of IBUFDS_GTXE1s:                       1 out of      12    8%
  Number of ICAPs:                               0 out of       2    0%
  Number of IDELAYCTRLs:                         0 out of      18    0%
  Number of IODELAYE1s:                          0 out of     720    0%
  Number of MMCM_ADVs:                           1 out of      12    8%
  Number of PCIE_2_0s:                           1 out of       2   50%
    Number of LOCed PCIE_2_0s:                   1 out of       1  100%
  Number of STARTUPs:                            1 out of       1  100%
  Number of SYSMONs:                             0 out of       1    0%
  Number of TEMAC_SINGLEs:                       0 out of       4    0%
hero member
Activity: 592
Merit: 501
We will stand and fight.
Hi guys.
I'm working hard around the XC6SLX150 -3N FPGA this week. I'm trying to design a FPGA computing unit for bitcoin mining and some other related project.
By the work of this thread:

https://bitcointalk.org/index.php?topic=22426.580
Modular FPGA Miner Hardware Design Development

I think, if we do not use expensive power-modules(instead of discrete comps, cheaper  but the design and test difficulty is upupup), find a  way to buy cheap FPGAs, well Design for manufacturability, have hundreds of people will buy it, etc......
After that, I'm quite sure  the daughter board will be able to be manufactured in 400$ for all costs.

But the question is, 2 of XC6SLX150 -3n will finally give us how much MH/s? If we want to make FPGA mining to be a feasible choice, the MH/s pre $ must close the GPU mining.
1 HD6870( now buy new HD5850s are difficult) is about 180$, provide a hashing power of 270MH/s, about 1.5Mhs/$. I think at least, a 1Mhs/$ is necessary for FPGA mining.
Can we optimize the XC6SLX150 to about 200Mh/s performance? Is it possible?

That means: 2 fully pipelined cores run at 100MHz, per FPGA. Note that XC6SLX150 has about 60% logic resource of XC6VLX240T and 180 DSP48A1s (XC6VLX240T has 768 of ESP48E1s).

If the above performance is possible, I can make the 400$ prize dual XC6SLX150 board come true in 1-2 month.
hero member
Activity: 556
Merit: 500
How about this universal board? http://www.hdl.co.jp/en/index.php/accessories/zkb-054.html It appears it'll take a socketed version of the chip and is only about 180$. If the chip can be sourced for 150-170$ It will still be expensive at 330$ but cheaper than 700$
You are a little off here.  This is not a board that can be used to mount FPGAs.  This board is designed specifically to aggregate the other FPGA dev boards this company sells onto one motherboard.  It won't help for finding a cheap host for raw FPGA parts.


I had a feeling it wasn't that easy. Definitely a challenge to make fpgas feasible bitcoin miners. I found a site that offers a board for about 500$ http://shop.ztex.de/product_info.php?products_id=64&language=en

Holy moly take a look at this 12x LX150s on one pcie board. http://www.dinigroup.com/new/DNBFC_S12_PCIe.php  Grin

Also this page seems like a good reference http://www.fpga-faq.com/FPGA_Boards.shtml
newbie
Activity: 54
Merit: 0
How about this universal board? http://www.hdl.co.jp/en/index.php/accessories/zkb-054.html It appears it'll take a socketed version of the chip and is only about 180$. If the chip can be sourced for 150-170$ It will still be expensive at 330$ but cheaper than 700$
You are a little off here.  This is not a board that can be used to mount FPGAs.  This board is designed specifically to aggregate the other FPGA dev boards this company sells onto one motherboard.  It won't help for finding a cheap host for raw FPGA parts.
hero member
Activity: 556
Merit: 500
I'm interested in getting into fpgas, however where can I source a spartan-6 lx150 for cheap? on the hardware comparison site in the comments it says
Quote
3N 484-pin chip is ~$150, 0.67Mhash/$

That's about what they cost in single quantity, and as FPGAs of that size go, that is cheap. That price is just for the chip; it would need to be soldered onto a suitable board to be usable. It's in a 484-pin ball grid array package with 1mm ball pitch, which is a bit too advanced for most hobbyists to solder down themselves. Unfortunately, off-the-shelf development boards with LX150 chips on them are pretty expensive, partly because they support features of the device which add substantial cost to the board (such as having lots of layers in order to route out all 484 pins) that we don't need for the mining application. I don't have the link handy or remember the manufacturer's name at the moment, but the cheapest off-the-shelf LX150-based board that I've seen recently costs around $600-$700 if I recall correctly.

There's another thread in which folks are working on a fairly low-cost LX150-based board that's optimized for mining:

https://bitcointalksearch.org/topic/modular-fpga-miner-hardware-design-development-22426

How about this universal board? http://www.hdl.co.jp/en/index.php/accessories/zkb-054.html It appears it'll take a socketed version of the chip and is only about 180$. If the chip can be sourced for 150-170$ It will still be expensive at 330$ but cheaper than 700$
hero member
Activity: 560
Merit: 517
Quote
Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:
More fantastic work, makomk!  Cool *applause*

Quote
I've been playing around with the xilinx-verilog port in the github repo and can confirm that it works just fine on the Xilinx Spartan-6 XC6LX9 microboard eval board from Avnet for $69.
Thank you for taking the time to share your experiences with all these mini eval boards, jonand. That's great information.

It's a shame that LX9 microboard uses a 324 landing. Would be neat to re-solder an LX150 to it, but the LX150 doesn't come in 324 package :/

Quote
But finally, my experiment is working @ 2250 Mhash/s and about 100w!  The cost is out of control, but since I had these cards laying around from other experiments, I figured I'd give it a shot.
*drools* For reference, an AMD 5850 only gets ~350MH/s for ~150W.  Tongue
newbie
Activity: 35
Merit: 0
Can you humor me with a guess of how much this hardware would cost, iidx?

2250 Mhash/s- Wow! I wish I had that

It's pretty unreasonable, I think each of the cards were $2000 when I bought them about 6 months ago for a project.  Xilinx has "generously" reduced the price to $1795 now...  Not sure what the raw chip prices are, maybe $500 each in volume.

Quote
Another group at my company has custom emulation platforms with more than 50 (!!) Virtex 5 parts each. I wish I could spend some quality with one of those and make it slave away in the Bitcoin mines, but sadly, that's not going to happen. I could get away with stuff like that when I was working for a little startup instead of a megacorp, but then we couldn't afford toys like that back then.

Wow, 50 devices!!  Maybe you can ask if you can do some performance testing for that group Cheesy

I actually have another board that has a Virtex 5 on it (ML555) that I thought about also trying to use.  However, it has a pretty small device and no heatsink, so it probably would be a bad idea and yield 125-150 MHash at best.
member
Activity: 98
Merit: 10
That ML605 hack sounds delightful! I love it!

Another group at my company has custom emulation platforms with more than 50 (!!) Virtex 5 parts each. I wish I could spend some quality with one of those and make it slave away in the Bitcoin mines, but sadly, that's not going to happen. I could get away with stuff like that when I was working for a little startup instead of a megacorp, but then we couldn't afford toys like that back then.
full member
Activity: 210
Merit: 100
Can you humor me with a guess of how much this hardware would cost, iidx?

2250 Mhash/s- Wow! I wish I had that
newbie
Activity: 35
Merit: 0
Hi Guys,

I had a bunch of spare stuff laying around at work, so I whipped up a mining configuration as an exercise.

Supplies:
6 Xilinx ML605 cards (XC6VLX240T, Virtex 6)
PCIe switch development kit (an external board that connects to your PC and has a ton of PCIe slots)
TI USB to GPIO pod (for the ML605 power supplies)

Starting with the Xilinx Verilog port of the code, I found that I could fit 2 instances of the LOOP = 0 core.  However, that wasn't enough for me.  I figured that if I used the DSP48s in the Virtex 6, I could fit at least 1 more in there.  With the 3 cores, I am using about 558 of the hardware multipliers.  Sadly, there isn't enough to fit a 4th in there.  I may be able to fit it in there if I use a few less hardware multipliers, but it will be tight.  I ended up running the cores at 125 Mhz, because that's the same speed my PCIe internal interface runs at.  There is more headroom available, 150 Mhz is probably doable, but power will start to become a concern.

It turned out that the ML605's power supplies could supply enough power to run 3 cores, but the digital power managers were not set to allow the full rated current.  I re-programmed the power managers to allow for the rated current in order to get the 3 core version of my design working.  The designs use about 16-18 watts of power and 16-17A on the VCCint rail (1v).  I had to supply additional cooling to make sure the power supplies didn't over heat (they have no cooling normally).

Next, I had to connect it to my PC for mining data.  Now, of course 6 serial ports wasn't going to be the most elegant solution (and my PC actually had no serial ports).  I used an off the shelf PCIe core in conjunction with the Xilinx hard IP to connect the 3 bitcoin cores to the PC.  Sadly, the PCIe core is a licensed product, so I won't be able to share the source here.

The hardest part for me was last - I had to figure out what data to get from a mining pool, what to do with it and how to get it in the card.  I found some open source C# mining libraries (I need to credit the guy, but I don't have the code in front of me), modified it and wrote a mining program to feed and poll all of the cards.  It was a pain in the ass to get that finally working, but through analyzing a bunch of different mining software I figured it out.

But finally, my experiment is working @ 2250 Mhash/s and about 100w!  The cost is out of control, but since I had these cards laying around from other experiments, I figured I'd give it a shot.

I'd be happy to contribute the changes to the modifications to usethe DSP48s, but I can't actually distribute the PCIe DMA/PIO engine since it's licensed (not from Xilinx).  I'm happy to distribute the source for the software too, since I didn't really find a C#/.NET windows version that suited my needs.

Questions and comments welcome!
member
Activity: 98
Merit: 10
I found the Spartan 6 LX150 boards that I mentioned previously:

http://www.hdl.co.jp/en/index.php/xilinx-series1/spartan-6.html

They start at around $700 for LX150 boards (less for smaller parts). I don't know of any cheaper off-the-shelf Spartan 6 LX150-based boards right now, but I'd love to hear about any.
member
Activity: 98
Merit: 10
I'm interested in getting into fpgas, however where can I source a spartan-6 lx150 for cheap? on the hardware comparison site in the comments it says
Quote
3N 484-pin chip is ~$150, 0.67Mhash/$

That's about what they cost in single quantity, and as FPGAs of that size go, that is cheap. That price is just for the chip; it would need to be soldered onto a suitable board to be usable. It's in a 484-pin ball grid array package with 1mm ball pitch, which is a bit too advanced for most hobbyists to solder down themselves. Unfortunately, off-the-shelf development boards with LX150 chips on them are pretty expensive, partly because they support features of the device which add substantial cost to the board (such as having lots of layers in order to route out all 484 pins) that we don't need for the mining application. I don't have the link handy or remember the manufacturer's name at the moment, but the cheapest off-the-shelf LX150-based board that I've seen recently costs around $600-$700 if I recall correctly.

There's another thread in which folks are working on a fairly low-cost LX150-based board that's optimized for mining:

https://bitcointalksearch.org/topic/modular-fpga-miner-hardware-design-development-22426
hero member
Activity: 556
Merit: 500
I'm interested in getting into fpgas, however where can I source a spartan-6 lx150 for cheap? on the hardware comparison site in the comments it says
Quote
3N 484-pin chip is ~$150, 0.67Mhash/$
newbie
Activity: 12
Merit: 0
Makomk's sha256-transform.v improved the verilog-xilinx port a bit. unroll=2 is now working on the sp605 board with at least a 63MHz hash_clk.

100 shares found and speed seems to stabilize somewhere in the 15Mh/s region.

Anyone else running LX45?

newbie
Activity: 12
Merit: 0
The teknohog xilinx-verilog port also works good with the Xilinx Spartan 6 LX45T eval board called "SP605".

The only trick is that TX and RX in the SiLabs USB-RS232 adapter is reversed so you need to swap the pins in the UCF file to make a null modem.

After finding 7 blocks I have a hash rate of 13.1 +-4.95 with a clock speed of 63MHz and unrolling = 3.


I'm not saying these boards are useful for any real life mining but if you need a working reference example this does the job.

You can test run these devices with the xilinx ISE webpack license (which reports some stats back to xilinx but has a $0 cost). The only thing you risk is a salesguy calling you and asking when you will be purchasing volume..  Tongue

Good luck, have fun!

newbie
Activity: 12
Merit: 0
I've been playing around with the xilinx-verilog port in the github repo and can confirm that it works just fine on the Xilinx Spartan-6 XC6LX9 microboard eval board from Avnet for $69.

Couldn't fit anything but unrolling = 5 so speed is a few Mhash/s only, but I really recommend anyone who want to have a look at FPGAs to get this little USB "dongle" sized board. It has built-in usb jtag-cable and usb-serial console so nothing extra is required to get it up and running teknohog's code.

Let me know if anyone wants the UCF file, I just took the pin numbers from the avnet UCF example and replaced the UCF from github.

It is also nice to replace the 7segment with a simple LED output to see that your serial comm is working.

I tried clocks up to >90MHz and it runs just fine.

As a courtesy to teknohog I used his account (in the miner.py) when mining so he got the reward for the few shares I found..

Oh, I actually used windows7 for this, I couldn't get the digilent xilinx cable drivers to work on my linux box. It was easier to install python on the windows laptop.


You can also look at it this way: Learn FPGA programming, get a job and make more money than you most likely ever will on mining bitcoins! Smiley

hero member
Activity: 686
Merit: 564
Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:

LOOP_LOG2=0: minimum device size ~75k LEs, maximum theoretical speed 110 MHash/s @ 110 MHz. Use the DE115_makomk_mod directory in fpgaminer's main git repository.
LOOP_LOG2=1: minimum device size ~50k LEs, max theoretical speed 50 MHash/s @ 100 MHz. Use DE2_115_Unoptimized_Pipelined directory in the partial-unroll-speed branch of my github repository.
LOOP_LOG2=2: minimum device size ~30k LEs, max theoretical speed 25 MHash/s @ 100 MHz. Use same partial-unroll-speed code, but edit the line after "`ifdef USE_EXPLICIT_ALTSHIFT_FOR_W" in sha256_transform.v to read "if(LENGTH >= 5) begin" rather than 4.
LOOP_LOG2=3: minimum device size ~20k LEs, max theoretical speed 13.75 MHash/s @ 110 MHz. Use partial-unroll-speed again, no modifications required.
LOOP_LOG2>3: not supported by either of these two code code versions, stick to the original partially-unrolled code in the DE2_115_Unoptimized_Pipelined directory of fpgaminer's git repository? (Bear in mind it may not be very area-efficient.)

It may be possible to get some of these up to 110 MHz or fit them onto slightly smaller devices (actual LE usage is currently about 74k, 46k, 26k?, and 16k respectively).

Also: remember that some or all of these may require extra cooling in order to operate safely.
hero member
Activity: 686
Merit: 564
If you haven't tweaked ISE's setting for the minimum size of shift register to infer yet, you probably need to do that. (By default two or more registers in a row are automatically replaced by a shift register.)

Hmmm. Actually, that doesn't seem to quite solve the problem. Try something like this patch. Causes slightly worse timing elsewhere on XC6SLX75 so it's actually a net loss there, but you may be able to avoid or work around this. (Edit: Also, KEEP is probably not the right constraint to use here; it will cause unneeded registers and logic and RAM we want to elimiate to be kept.)
member
Activity: 89
Merit: 10

About the interfaces/protocol:

The data would most likely end up going through a com port eventually.  Might it not be better to have an 8bit data width in the registers etc ?
Then you don't have to worry about dis/assembling the 32bit words either and the usual what are LSB and MSB, endian ARGH! Tongue

These huge 512/256 bit internal busses between slave, master and comm will eat quite a bit of routing resources.
I would suggest having an internal serial format using the system clock + 1 bit bus and a clock enable/data valid ctrl signal.
Sender has a tiny state machine/counter, while the receiver module is only a shift register with a clock enable.

To make up for the tiny tiny waste in waiting for new data your could have double buffers in the slaves.


hero member
Activity: 686
Merit: 564
I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?
If you haven't tweaked ISE's setting for the minimum size of shift register to infer yet, you probably need to do that. (By default two or more registers in a row are automatically replaced by a shift register.)
Pages:
Jump to: