Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 39.

TheSeven

Quote from: Pixie on June 12, 2011, 03:20:34 PM

Quote from: makomk on June 12, 2011, 10:51:59 AM

If you're doing HardCopy-style structured ASICs, in theory you could put a fastish 32-bit processor and Ethernet MAC on the ASIC itself. It'd probably only take up a smallish proportion of the chip and you'd just need boot flash and Ethernet PHY chips externally. Not sure how much sense this would make though.

Depending on expense, I'd look at adding a small XMOS processor to the board. Being a transputer in essence, it was designed to talk to other chips like ASICs and other XMOS chips. The ASIC is left doing its special magic, the XMOS handles everything else (including block submits etc.) and can be easily connected up into massive rigs as required.

I don't like this idea too much. I'd keep the ASIC/FPGA interface as simple as possible, but I'd prefer a bus for easy scalability. I²C springs to my mind.
This way you can keep the ASIC simple and don't waste precious space, don't waste the control processor circuitry when chaining multiple ASICs, and can use the same ASIC for both PCIe-based accelerator cards and standalone mining boards. The latter would get a simple ARM SoC with updatable firmware (based on linux?) and an ethernet interface.

Quote from: makomk on June 12, 2011, 04:12:05 PM

Edit 2: Also, a totally untested 90MHz/90 MHash/sec bitstream for the DE2-115 is now in a branch on my git repo. PowerPlay estimates 4.4W of heat, I think? Anyway, don't blame me if you blow up your expensive board. There's a reason I'm not including instructions here.

I'm fairly certain that this estimate is way too low.

makomk

Quote from: Pixie on June 12, 2011, 03:20:34 PM

Depending on expense, I'd look at adding a small XMOS processor to the board. Being a transputer in essence, it was designed to talk to other chips like ASICs and other XMOS chips. The ASIC is left doing its special magic, the XMOS handles everything else (including block submits etc.) and can be easily connected up into massive rigs as required.

They're kinda neat, fairly powerful, and the chips aren't massively expensive either. I actually have an XMOS XK-1 board here flashing an LED at me accusingly. The caveats are that USB is a bit tricky, requiring external components and using up nearly all available I/O ports on that core (so you really need a more expensive dual core chip for that), the chip has some slightly interesting power sequencing and reset requirements, it has no internal Flash so you need a separate SPI Flash chip for firmware, and it has no internal driver for a crystal oscillator. (Edit: oh, and you're limited to 64 kilobytes of RAM per core and your code needs to fit into that too.) An ARM microcontroller with integrated Ethernet MAC and USB might work out better.

Of course, there'll be some tricky board design and manufacture anyway for the ASIC, so that may not necessarily be a huge obstacle.

Edit 2: Also, a totally untested 90MHz/90 MHash/sec bitstream for the DE2-115 is now in a branch on my git repo. PowerPlay estimates 4.4W of heat, I think? Anyway, don't blame me if you blow up your expensive board. There's a reason I'm not including instructions here.

njloof

Quote from: fpgaminer on June 12, 2011, 02:52:48 PM

I made little to no modification to their code for this first commit. If you appreciate their hard work on this Open Source project, please send them your thanks and donations!

TheSeven: 14Jc8vWq1mPv7vWnP5VquZZgpLEtzW2vja
teknohog: 1HkL2iLLQe3KJuNCgKPc8ViZs83NJyyQDM

fpgaminer: 1NT4RyJMqtRuDRr6zHdXdKSpmX3SR5he6z

Thanks to the three of you for your work to date! Plonk, plonk, plonk.

Pixie

Quote from: makomk on June 12, 2011, 10:51:59 AM

If you're doing HardCopy-style structured ASICs, in theory you could put a fastish 32-bit processor and Ethernet MAC on the ASIC itself. It'd probably only take up a smallish proportion of the chip and you'd just need boot flash and Ethernet PHY chips externally. Not sure how much sense this would make though.

Depending on expense, I'd look at adding a small XMOS processor to the board. Being a transputer in essence, it was designed to talk to other chips like ASICs and other XMOS chips. The ASIC is left doing its special magic, the XMOS handles everything else (including block submits etc.) and can be easily connected up into massive rigs as required.

fpgaminer

June 12th, 2011 - Xilinx and VHDL Ports Added
With many thanks to TheSeven and teknohog, their code has been added to the public repo. TheSeven did a re-implementation in VHDL, with support for Xilinx and ISE. teknohog did a straight port of the Verilog code to simply support Xilinx and ISE. Both include Python miner control scripts, and serial port communication with the FPGA board.

I made little to no modification to their code for this first commit. If you appreciate their hard work on this Open Source project, please send them your thanks and donations!

TheSeven: 14Jc8vWq1mPv7vWnP5VquZZgpLEtzW2vja
teknohog: 1HkL2iLLQe3KJuNCgKPc8ViZs83NJyyQDM

Some notes on the current state of this project

As it stands now, the project makes lots of references to the DE2-115 board as the preferred mining platform. Obviously this isn't the case, nor is it meant to be an advertisement. It was simply the first device supported, and the one that currently has a binary release available. In the near future, I will merge full Xilinx compatibility changes into the main Verilog code and try to steer the project towards supporting many devices and boards in a Plug-and-Mine fashion (like the DE2-115 currently is), or at least Compile-Plug-and-Mine.

Also, the directory structure in the repo is not optimal, but that will improve with time as I settle on a structure that fits the project's many needs (multiple code variations, and multiple device specific implementations).

I have many other promising patches to merge, including a few of my own ;)

So keep watching this thread!

makomk

(Edit: Altera Quartus II claims FMax = 90.16 MHz on EP4CE115 for the xilinx-shiftreg branch, with one or two tweaks to the build config that may not be necessary. Took over 3 hours to build and used pretty much all the FPGA though, so not terribly useful - you would be better off adding another mining core.)

Been messing around some more with fpgaminer's code. Users of largish Altera FPGAs might want to try this branch, which skips the last 3 rounds in the fully-unrolled version and allows optimisations based on the fact that part of the data is constant. Xilinx users can additionally uncomment "`define USE_RAM_FOR_KS" and combine this with teknohog's serial miner, though this may not work too well. (There's also the xilinx-shiftreg branch, which only works for fully-unrolled miners.)

Note that I don't have an actual FPGA to test any of this on, so be sure to double-check the thermal results to make sure you're not going to damage your expensive hardware, and make sure it's actually submitting blocks successfully. Also, these are more size improvements than speed improvements, and most people that could benefit have probably got their own better version already.

Brief explanation: with the original code, which is what you get with USE_RAM_FOR_KS disabled, Xilinx's xst was doing something daft involving shift registers for K[s]. Without the xilinx-shiftreg changes, it also failed to use shift registers for W where it kinda made sense to do so; unfortunately with the changes Altera's Quartus tools no longer find the shift registers.

Quote from: TheSeven on June 11, 2011, 07:13:56 PM

I fully agree that ASIC is the long-term way to go, but this UART token ring thing seems to be rubbish to me. There are well-suited protocols for this, like for example I²C.

There are two possibilities:
- Build a PCIe mining accelerator card, with some PCIe to I²C (or whatever) bridge, possibly on a CPLD.
- Slap an ARM SoC and an ethernet adapter on the board as well and make it run autonomously.

If you're doing HardCopy-style structured ASICs, in theory you could put a fastish 32-bit processor and Ethernet MAC on the ASIC itself. It'd probably only take up a smallish proportion of the chip and you'd just need boot flash and Ethernet PHY chips externally. Not sure how much sense this would make though.

Quote from: TheSeven on June 11, 2011, 07:13:56 PM

How big was the 133MHz design? (How many KLEs?)
Could you share this design?

Unfortunately, OrphanedGland and a lot of the other posters in this thread can't reply to it anymore because they're too new. The forum admins have blocked users with a small number of previous posts from posting anywhere except the Newbie forum. Try this thread.

TheSeven

Quote from: vx609e on June 11, 2011, 10:31:13 AM

Hi,

IMO, an ASIC implementation is the way to go. We already have decent RTL (those who contributed to this know who they are and I thank you guys for this). With few modifications to the current RTL, we could easily daisy-chain many "cores" (the easiest implementation with the current state of the project is a token ring over UART...we only need to assign a specific address to each core).

I fully agree that ASIC is the long-term way to go, but this UART token ring thing seems to be rubbish to me. There are well-suited protocols for this, like for example I²C.

There are two possibilities:
- Build a PCIe mining accelerator card, with some PCIe to I²C (or whatever) bridge, possibly on a CPLD.
- Slap an ARM SoC and an ethernet adapter on the board as well and make it run autonomously.

Quote from: vx609e on June 11, 2011, 10:31:13 AM

Let's say each manufactured chip would yield 100 MHash/s. We daisy-chain 20 per board (a board with 20 chips on it is not a big deal). That's 2 GHash/s right there. PCB design and manufacturing would be pretty straightforward. I volunteer for that.

Good to know, as I have never dealt with this area before. Could you provide an estimate for the non-ASIC cost? (PCB design, prototyping, manufacturing and assembly, voltage regulators, clock generation, ...)

Quote from: vx609e on June 11, 2011, 10:31:13 AM

The big question: how do we finance an ASIC project? And even more importantly: how do we get it done?

1) Outsource FPGA2ASIC flow to http://www.icnexus.com.tw/product.php?id=25 (first company I found...there's gotta be many others). Get chips ASAP and limit the risks. With this forum, I'm sure we could get a small EE team together and do all the Synopsys, BIST, test scan, pad design, routing, etc. crap ourselves, but there are specialists out there that will do it for us...and chances of success will be much higher with that approach. Being a 100% digital chip (+ regulator and PLL obviously), the project couldn't be easier for these guys (or whatever company gets the contract)...not to mention they are already in the business of FPGA2ASIC conversion.

I've heard rumors that Altera would be doing their HardCopy process for as low as $150K for 1000 chips, which seems very low to me. No idea whether that's true though. We might want to request a quote.

Quote from: Bloody Bell on June 11, 2011, 04:10:03 PM

Quote from: vx609e on June 11, 2011, 10:31:13 AM

Let's say each manufactured chip would yield 100 MHash/s.

I am pretty sure they can do much more. If a single mid-range FPGA can house an entire pipeline and get 50 MH/s, any ASIC must be able to outperform that by at least a factor of ten.

I'd expect the chips to run at 200-300MHz, and one of my co-workers said that he tried synthesizing my VHDL design for the HardCopy process, and that 20 instances of it would fit on a single chip. That's 4-6GH/s per chip.
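As a rough check on those numbers, here is a small Python sketch. It assumes that each core is a fully unrolled pipeline finishing one hash per clock cycle, which is how the MHz and MHash/s figures elsewhere in this thread line up (e.g. 90MHz/90 MHash/sec). The function name is made up for illustration.

```python
def chip_hashrate_ghs(cores, clock_mhz):
    """Estimated chip throughput in GHash/s.

    Assumption: every core is a fully unrolled pipeline that finishes
    one hash per clock cycle, so 1 MHz of clock equals 1 MHash/s per core.
    """
    return cores * clock_mhz / 1000.0


# 20 cores per chip at the low and high end of the expected clock range
for clock_mhz in (200, 300):
    rate = chip_hashrate_ghs(20, clock_mhz)
    print(f"20 cores at {clock_mhz} MHz: {rate:.1f} GH/s")
```

If real silicon needs more than one cycle per hash, the result scales down by that factor.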

Quote from: Bloody Bell on June 11, 2011, 04:10:03 PM

I am also not sure that hundreds of people would commit the necessary amount. Buying a video card is a much lower risk, as it can be sold anytime and has other uses.

I fully agree on this point, this will probably be the biggest problem, and it sadly wouldn't be an issue for certain governments... :-\

Quote from: Bloody Bell on June 11, 2011, 04:10:03 PM

btw, does anyone know why the "Will fund ASIC board for mining community. Need Hardware devs." topic has been closed?

Link to that: http://forum.bitcoin.org/index.php?topic=14910.0

Quote from: OrphanedGland on June 11, 2011, 04:27:06 PM

Just coded a fully unrolled SHA256 in VHDL using two different approaches to maximize clock rate, a simple approach that involves precalculating H + K + W, and a more advanced approach that further pipelines each stage. Initial compiles targeted Cyclone IV using web edition Quartus (which sucks), with the simple version achieving 110MHz and the advanced version 133MHz. Will be interested to see the maximum clock rate that can be achieved on Stratix IV.

How big was the 133MHz design? (How many KLEs?)
Could you share this design?

Quote from: LazarusLong on June 11, 2011, 02:42:45 PM

Quote from: TheSeven on June 06, 2011, 10:45:38 AM

Quote from: tantive on June 06, 2011, 10:29:45 AM

I now have a bitfile for the atlys board (spartan 6 - lx45) with depth:=2 and 50mhz

The only problem is, that miner.py refuses to communicate over the serial port.
It detects the core, but when it starts "Measuring FPGA performance..." it produces a timeout: "Timed out waiting for FPGA to accept work"

@TheSeven: any idea how to debug or solve the problem? is the miner.py code working for all depths and frequencies?

You'll need to adjust the pin locations for clk_in, rx and tx in the UCF file, and adjust the clock divider for the serial port for the 50MHz frequency.
Replace "10000010001" with "0110110010" and "11000011001" with "01010001011" in uart.vhd.
And I should probably publish the new version of my miner, it now supports multiple pools, long polling, etc.

TheSeven, can you give some lines on how to calculate the dividers? Is there a formula?

The first one is (clock frequency / 115200), the second one is ((clock frequency / 115200) * 1.5).
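A quick Python sketch of that formula, for anyone who wants to generate the constants for their own clock. Rounding down at both steps is an assumption on my part, but it reproduces the values quoted above: 1041 and 1561 (binary 10000010001 and 11000011001), which is consistent with a 120 MHz clock, and 434 and 651 for 50 MHz. The function name is hypothetical and not part of uart.vhd or miner.py.

```python
BAUD_RATE = 115200  # the divisor used in the formula above


def uart_dividers(clock_hz, baud=BAUD_RATE):
    """Return the two divider values as (first, second).

    first  = clock frequency / 115200, rounded down
    second = 1.5 * first, rounded down
    """
    first = clock_hz // baud
    second = first * 3 // 2
    return first, second


for mhz in (120, 50):
    first, second = uart_dividers(mhz * 1_000_000)
    # Print decimal and binary forms; pad with leading zeros as needed
    # to match the width of the constants in the VHDL source.
    print(f"{mhz} MHz: {first} = {first:b}, {second} = {second:b}")
```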

fpgaminer

Quote

Hmmmm. It looks like Xilinx's synthesis tools aren't very good at inferring the right meaning from various constructs this uses. I'm now seeing if I can convince them to interpret it in more efficient ways, but to be honest it looks like they may just be slightly rubbish. It ought to be possible to cram a full hashing pipeline into a XC6SLX75 in theory, but the current code isn't even getting close.

Yeah, I'm still fighting with ISE. Using SmartXplorer I got ISE to P&R a fully unrolled mining core into my 6SLX150 chip at 50MHz. ~50% resource consumption all around, which is good. Two problems: A) results returned by the running core were erratic, and B) the chip ran very hot.

Slow progress, but progress nonetheless.

OrphanedGland

Just coded a fully unrolled SHA256 in VHDL using two different approaches to maximize clock rate, a simple approach that involves precalculating H + K + W, and a more advanced approach that further pipelines each stage. Initial compiles targeted Cyclone IV using web edition Quartus (which sucks), with the simple version achieving 110MHz and the advanced version 133MHz. Will be interested to see the maximum clock rate that can be achieved on Stratix IV.
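The VHDL itself isn't posted here, so the following Python model is only one reading of "precalculating H + K + W": since each round's h is simply the previous round's g, the sum h + K[t] + W[t] for the next round can be formed one stage early, which takes two adders off the critical path of the round itself. The function names are made up, and the test vectors are random stand-ins rather than the real SHA-256 constants or message schedule.

```python
import random

MASK = 0xFFFFFFFF  # SHA-256 operates on 32-bit words


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def big_sigma0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)


def big_sigma1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)


def ch(e, f, g):
    return (e & f) ^ (~e & g)


def maj(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)


def step(state, hkw):
    """One SHA-256 round, given the already-summed h + K[t] + W[t]."""
    a, b, c, d, e, f, g, _h = state
    t1 = (hkw + big_sigma1(e) + ch(e, f, g)) & MASK
    t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def rounds_plain(state, ks, ws):
    """Reference version: h + K + W is summed inside each round."""
    for k, w in zip(ks, ws):
        state = step(state, (state[7] + k + w) & MASK)
    return state


def rounds_precalc(state, ks, ws):
    """Same rounds, but each h + K + W sum is prepared one round early."""
    hkw = (state[7] + ks[0] + ws[0]) & MASK  # sum for round 0
    for t in range(len(ks)):
        g = state[6]
        state = step(state, hkw)
        # The g seen by round t is the h seen by round t+1, so the next
        # sum can be computed alongside round t instead of inside round t+1.
        if t + 1 < len(ks):
            hkw = (g + ks[t + 1] + ws[t + 1]) & MASK
    return state


rng = random.Random(0)
state = tuple(rng.getrandbits(32) for _ in range(8))
ks = [rng.getrandbits(32) for _ in range(64)]
ws = [rng.getrandbits(32) for _ in range(64)]
print(rounds_plain(state, ks, ws) == rounds_precalc(state, ks, ws))  # True
```

In hardware the pre-added value would sit in its own pipeline register next to the state registers; the model only shows that both orderings produce the same result.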

Bloody Bell

Quote from: vx609e on June 11, 2011, 10:31:13 AM

Let's say each manufactured chip would yield 100 MHash/s.

I am pretty sure they can do much more. If a single mid-range FPGA can house an entire pipeline and get 50 MH/s, any ASIC must be able to outperform that by at least a factor of ten.

Quote

there are specialists out there that will do it for us...and chances of success will be much higher with that approach.

Considering that the design is very simple and we don't need to push any limits (as we can simply use more chips instead), the manufacturer's team could probably do it relatively cheaply; it's a mostly automated process anyway.

Quote

2) Crowd funding with kickstarter.com -- If we can get 500 people to pre-order one 2 GHash/s board at 1000$ a piece (a truly good deal IMO), we get a 500k$ budget to do #1. We need 10,000 chips. I think the budget makes sense if we spend 250k$ on design, 100k$ on chips (10$ a piece), 50k$ for tape-out (might be included in design cost...we need to see with the contractor), 10k$ on PCBs and assembly + the rest for overhead. Once we get a real quote from the contractor, we can adjust the cost per board...All I'm putting here are ballpark figures to show the potential of this approach.

I think the one-time costs are higher. Unless we go for some structured ASIC, which indeed can be done from a few $100K. But the problem with structured ASICs is very similar to the FPGAs: we have to pay for all the unused stuff (memory blocks, hardware multipliers, etc.) that we don't need, increasing the price and lowering performance. And the project's return would still be threatened by others who start making real ASICs.

I am also not sure that hundreds of people would commit the necessary amount. Buying a video card is a much lower risk, as it can be sold anytime and has other uses.

Quote

So far in my career all I've done is deal with PCB, FPGA and ASIC designs...this project seems very realistic to me. But maybe I'm daydreaming...please bring me back to earth if I'm doing so.

I have only worked with FPGAs, but I don't think you are daydreaming.

btw, does anyone know why the "Will fund ASIC board for mining community. Need Hardware devs." topic has been closed?

comboy

I don't feel like doing this since it's not mine, but it would be really cool to put the Xilinx implementation on GitHub too, so that people can fork it for board-specific optimizations and improve it easily with pull requests. If, of course, they want to share their changes.

makomk

Hmmmm. It looks like Xilinx's synthesis tools aren't very good at inferring the right meaning from various constructs this uses. I'm now seeing if I can convince them to interpret it in more efficient ways, but to be honest it looks like they may just be slightly rubbish. It ought to be possible to cram a full hashing pipeline into a XC6SLX75 in theory, but the current code isn't even getting close.

LazarusLong

Quote from: TheSeven on June 06, 2011, 10:45:38 AM

Quote from: tantive on June 06, 2011, 10:29:45 AM

I now have a bitfile for the atlys board (spartan 6 - lx45) with depth:=2 and 50mhz

The only problem is, that miner.py refuses to communicate over the serial port.
It detects the core, but when it starts "Measuring FPGA performance..." it produces a timeout: "Timed out waiting for FPGA to accept work"

@TheSeven: any idea how to debug or solve the problem? is the miner.py code working for all depths and frequencies?

You'll need to adjust the pin locations for clk_in, rx and tx in the UCF file, and adjust the clock divider for the serial port for the 50MHz frequency.
Replace "10000010001" with "0110110010" and "11000011001" with "01010001011" in uart.vhd.
And I should probably publish the new version of my miner, it now supports multiple pools, long polling, etc.

TheSeven, can you give some lines on how to calculate the dividers? Is there a formula?

FlappySocks

Unless someone can come up with a working prototype, then I would say outsource it. A company looking for work might do it for a very cost effective price if we can come up with the basic design, software, and cash. They can make their profits up on repeat sales, and improved designs.

vx609e

Hi,

IMO, an ASIC implementation is the way to go. We already have decent RTL (those who contributed to this know who they are and I thank you guys for this). With few modifications to the current RTL, we could easily daisy-chain many "cores" (the easiest implementation with the current state of the project is a token ring over UART...we only need to assign a specific address to each core).

Let's say each manufactured chip would yield 100 MHash/s. We daisy-chain 20 per board (a board with 20 chips on it is not a big deal). That's 2 GHash/s right there. PCB design and manufacturing would be pretty straightforward. I volunteer for that.

The big question: how do we finance an ASIC project? And even more importantly: how do we get it done?

1) Outsource FPGA2ASIC flow to http://www.icnexus.com.tw/product.php?id=25 (first company I found...there's gotta be many others). Get chips ASAP and limit the risks. With this forum, I'm sure we could get a small EE team together and do all the Synopsys, BIST, test scan, pad design, routing, etc. crap ourselves, but there are specialists out there that will do it for us...and chances of success will be much higher with that approach. Being a 100% digital chip (+ regulator and PLL obviously), the project couldn't be easier for these guys (or whatever company gets the contract)...not to mention they are already in the business of FPGA2ASIC conversion.

2) Crowd funding with kickstarter.com -- If we can get 500 people to pre-order one 2 GHash/s board at 1000$ a piece (a truly good deal IMO), we get a 500k$ budget to do #1. We need 10,000 chips. I think the budget makes sense if we spend 250k$ on design, 100k$ on chips (10$ a piece), 50k$ for tape-out (might be included in design cost...we need to see with the contractor), 10k$ on PCBs and assembly + the rest for overhead. Once we get a real quote from the contractor, we can adjust the cost per board...All I'm putting here are ballpark figures to show the potential of this approach.
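The arithmetic behind 2) can be laid out as a few lines of Python. All inputs are the ballpark figures from this post; the only derived number is what remains for overhead, and the variable names are just for illustration.

```python
boards = 500
price_per_board = 1000   # USD per pre-ordered board
chips_per_board = 20
mhash_per_chip = 100     # assumed yield per chip, in MHash/s
chip_price = 10          # USD per chip

budget = boards * price_per_board                       # total raised
chips_needed = boards * chips_per_board                 # chips to order
board_ghash = chips_per_board * mhash_per_chip / 1000   # GHash/s per board

costs = {
    "design": 250_000,
    "chips": chips_needed * chip_price,
    "tape-out": 50_000,
    "PCBs and assembly": 10_000,
}
overhead = budget - sum(costs.values())  # whatever is left over

print(f"budget: {budget}$, chips needed: {chips_needed}, per board: {board_ghash} GHash/s")
for item, cost in costs.items():
    print(f"  {item}: {cost}$")
print(f"  left for overhead: {overhead}$")
```

Swapping in a real quote from a contractor only means changing the entries in `costs`.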

So far in my career all I've done is deal with PCB, FPGA and ASIC designs...this project seems very realistic to me. But maybe I'm daydreaming...please bring me back to earth if I'm doing so.

Feedback, suggestions and comments very welcome.

pdki

I think with a real ASIC hardware implementation of SHA-256 it should easily be possible to outrun GPUs by at least a factor of 100, because of better space efficiency and the simplicity of the logic involved.

Considering that
- you can manufacture one of these for ~2M€ and then get 1000s of these chips
- they will not consume much power
- they can be put on cheap boards, because no heavy IO is needed (graphics cards are expensive due to the heavy IO with RAM)

I am sure this will happen, if Bitcoin really establishes itself as a currency and USD exchange rates stay in the 10$ range. If not, this would be a cheap option for aggressors like governments to take over the network. Much easier and cheaper than trying to shut it down by law.

deftx

It's likely going to be impossible for these devices to reach economies of scale anywhere near GPUs. These particular devices are marketed towards higher-end audiences anyhow, so I imagine vendors have more room to charge higher prices.

I predict GPUs will almost always be more feasible because more are produced. The economic incentive to produce them for both gamers and miners will always be high, and drive the price down both due to efficiency and the price able to be charged.

There's always someone that can make that magic combination of components to drive the price down, so we'll see where it actually goes.

TheSeven

Quote from: makomk on June 09, 2011, 07:52:02 AM

Quote from: fpgaminer on May 19, 2011, 09:33:56 PM

Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at the equivalent cost of speed, which should allow it to run on many more FPGAs.

Hi. Having looked at the code I've got a question about the configurable loop unrolling. It appears from looking at sha256_transform.v that feedback is feeding the saved W and state into every stage of the hashing pipeline, not just the first, and I can't seem to see why this is necessary. What's more, if I'm reading what Quartus II is telling me correctly, doing this is costing me several MHz of clock speed and more importantly appears to be using fairly large amounts of logic resources. Is there any way to avoid this?

Edit: Ah, having actually read the comments I now understand. You're doing feedback separately at each stage of the pipeline, so each pipeline stage computes 2**DEPTH rounds and outputs at 1/(2**DEPTH) speed. Interesting. The trouble with this approach is that I don't think 512-bit wide muxes are exactly cheap.

This obviously doesn't make sense on devices where the fully unrolled design fits, but on smaller parts it's the only way to make it work at all.
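To put numbers on the size/speed trade-off, here is a small Python sketch. It takes the 128 round modules of the fully unrolled design from the post quoted above and assumes that the fully unrolled core produces one hash per clock, so every halving of the round modules halves the throughput. The function name is made up for illustration.

```python
FULL_ROUND_MODULES = 128  # fully unrolled design, per the quoted post


def unroll_tradeoff(loop_log2, clock_mhz):
    """Return (round modules instantiated, MHash/s) for a given setting.

    loop_log2 plays the role of CONFIG_LOOP_LOG2 (or DEPTH in the edit
    above): each stage is reused 2**loop_log2 times per hash.
    Assumption: loop_log2 = 0 yields one hash per clock cycle.
    """
    round_modules = FULL_ROUND_MODULES >> loop_log2
    mhash = clock_mhz / (1 << loop_log2)
    return round_modules, mhash


for loop_log2 in range(6):
    modules, mhash = unroll_tradeoff(loop_log2, clock_mhz=100)
    print(f"LOOP_LOG2={loop_log2}: {modules:3d} round modules, {mhash:g} MHash/s at 100 MHz")
```

The sketch ignores the cost of the feedback muxes mentioned above, so real designs will not shrink quite as fast as the module count suggests.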

rethaw

Quote from: TheSeven on June 09, 2011, 03:53:33 AM

Quote from: rethaw on June 08, 2011, 07:10:03 PM

Quote from: TheSeven on June 03, 2011, 05:07:41 PM

Now I have: http://dl.dropbox.com/u/23683845/fpgaminer-virtex5.zip
You'll need to adjust the line "constant DEPTH : integer := 6;" (2^n pipeline stages) in top.vhd.

Hi All. I am trying to implement the following on a Virtex 6. DCMs are no longer used on the Virtex 6 and have been replaced with MMCMs. So far I have swapped the DCM for an MMCM and am able to implement the design. But when I try to run the python script, it fails. I get a "Got bad message from FPGA: 240". I would appreciate any guidance you could provide. Thanks.

As you're probably not running at 120MHz you'll need to adjust the UART clock divider.
If you provide your clock frequency I can calculate the correct values for you.

Oh, and it would be interesting which Virtex 6 model this is, which frequency you can reach and how many LUTs/slices/FFs are used.

The device is the xc6vlx240t. I'm currently using the 200MHz clock on the device and have the MMCM set to 100MHz. The MMCM supports up to 800MHz.

Here's the utilization with a depth of 5.

Device utilization summary:
---------------------------

Selected Device : 6vlx240tff1156-3

Slice Logic Utilization:
 Number of Slice Registers:              50042  out of  301440    16%
 Number of Slice LUTs:                   86029  out of  150720    57%
    Number used as Logic:                86028  out of  150720    57%
    Number used as Memory:                   1  out of   58400     0%
       Number used as SRL:                   1

Slice Logic Distribution:
 Number of LUT Flip Flop pairs used:     86548
   Number with an unused Flip Flop:      36506  out of   86548    42%
   Number with an unused LUT:              519  out of   86548     0%
   Number of fully used LUT-FF pairs:    49523  out of   86548    57%
   Number of unique control sets:           12

IO Utilization:
 Number of IOs:                              3
 Number of bonded IOBs:                      3  out of     600     0%

Specific Feature Utilization:
 Number of BUFG/BUFGCTRLs:                   4  out of      32    12%
