
Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 12. (Read 432965 times)

hero member
Activity: 1118
Merit: 541
I have just pushed the experimental KC705 code to the repo.

Thanks!

I ordered my AC701 today and I'm playing with the eval software now. The clocks run a little bit slower than on the Kintex line, but it has nearly as many DSPs as the chip you're using. I have high hopes for a minimum of 600 MH/s and am shooting for 800 MH/s. The initial compile shows 92% DSP usage, 43% LUT usage, 67% memory LUT usage, and a clock of 345 MHz or so. I should be able to squeeze another core in there.

I was wondering, did it really take them 2 weeks to process & ship your unit to you after ordering? That's a big yes. They're not going to ship my card for 2 weeks after ordering :( Maybe they've got a large order queue? Maybe each card is made to order? No idea. It seems a rather long time to wait, though.

AJR,

If you're going to get into it I would highly recommend you get the 705 or the 701.

http://www.xilinx.com/products/boards-and-kits/EK-K7-KC705-G.htm
http://www.xilinx.com/products/boards-and-kits/EK-A7-AC701-G.htm

The 705 will have room for more hashers, but I believe the Artix chip may be more cost effective.


hero member
Activity: 767
Merit: 500
So, looking into this whole mining-with-FPGA system and this code you people are working on: what number of logic cells/gates is required for a full roll out? Also, what's the smallest unit you can get it running on (the bare minimum for a half roll out, or whatever you call it)?

I just want to dip my toe into FPGA mining with a cheap and nasty chipset ;)

Just tell me to piddle off elsewhere if it's the wrong spot to ask.

AJRGale,
    I think you'll want at least a Spartan6 LX150.  This is the cheapest device I would use.  I would only run a fully pipelined implementation -- one that can do one hash per clock cycle.  If you can get hold of a Kintex7 or Virtex7 board you'll be a lot better off, because you can instantiate more miners.
    fpgaminer has posted a lot of useful code on github.
    I don't speak Altera, so I'm not sure about specific devices.


So 150K gates? Like a Cyclone V? (I have no idea what the gates-to-logic-cells ratio really is.) ...So that means 75K gates for a half miner?

Either way, I can't find a Spartan6 LX150, but I can find a http://www.adafruit.com/products/451 "DE0-Nano - Altera Cyclone IV FPGA starter board".
A miner could run on it, but only the smallest one, from what I've read here, at 5 MH/s...

Maybe I should look at the code and work out how to use the dev suite; it might tell me what it needs to run. I have no idea what I'll be looking at, though :/
newbie
Activity: 12
Merit: 0
So, looking into this whole mining-with-FPGA system and this code you people are working on: what number of logic cells/gates is required for a full roll out? Also, what's the smallest unit you can get it running on (the bare minimum for a half roll out, or whatever you call it)?

I just want to dip my toe into FPGA mining with a cheap and nasty chipset ;)

Just tell me to piddle off elsewhere if it's the wrong spot to ask.

AJRGale,
    I think you'll want at least a Spartan6 LX150.  This is the cheapest device I would use.  I would only run a fully pipelined implementation -- one that can do one hash per clock cycle.  If you can get hold of a Kintex7 or Virtex7 board you'll be a lot better off, because you can instantiate more miners.
    fpgaminer has posted a lot of useful code on github.
    I don't speak Altera, so I'm not sure about specific devices.
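
To put rough numbers on "one hash per clock cycle": if each fully pipelined core finishes one hash attempt per clock, the hash rate is simply the number of cores times the clock frequency. Here is a minimal Python sketch of that arithmetic. The function name is made up for illustration, and the core counts are assumptions, not measurements; the clock figures are just ones mentioned in this thread.

```python
def hash_rate_mhs(cores: int, clock_mhz: float) -> float:
    """Estimate MH/s, assuming every core is fully pipelined
    and completes one hash attempt per clock cycle."""
    return cores * clock_mhz


# Illustrative only: the core counts here are assumptions.
print(hash_rate_mhs(1, 400.0))  # one core clocked at 400 MHz
print(hash_rate_mhs(2, 345.0))  # two cores clocked at 345 MHz
```

This is why a bigger device helps so much: every extra miner you can fit adds a full clock's worth of hashes per second, as long as timing still closes at the same frequency.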
hero member
Activity: 560
Merit: 517
Quote
fpgaminer: is there any advantage using "{a,b,c}<={x,y,z};" instead of "a<=x;b<=y;c<=z;"  ?
(My opinion is it only helps to make more readable code.)
No advantage, no.  As you pointed out, it would only be for readability.
hero member
Activity: 767
Merit: 500
So, looking into this whole mining-with-FPGA system and this code you people are working on: what number of logic cells/gates is required for a full roll out? Also, what's the smallest unit you can get it running on (the bare minimum for a half roll out, or whatever you call it)?

I just want to dip my toe into FPGA mining with a cheap and nasty chipset ;)

Just tell me to piddle off elsewhere if it's the wrong spot to ask.
newbie
Activity: 12
Merit: 0
IIDX,
 
    The addressing would be constant, so no decoding would be needed.  The address lines would be tied off to constants.
    The 2.0ns is the clk-to-out time for a data output.  Since all outputs are in parallel (each BRAM configured as x72, and grouped together to give very wide access), the individual BRAM bit delay would not change.  No demuxing of outputs would be necessary.
    The number of BRAMs needed is only half what you show, since you can use both sides (Port A & Port B) independently (assign each side a fixed, but different address).
    Yes, you are right though re getting the data from the BRAMs to the LUTs needed for the computation.  There is a routing delay which is probably too large.
    Obviously this is not the optimum solution; I am only bringing it up as a last resort if the available flip flops have run out.
Regards,
ihtfp
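
For anyone following the BRAM counts in this exchange, here is a rough Python sketch of the arithmetic. It takes the 128 stages, 11 BRAMs per stage, and 264 BRAMs in the LX130T as quoted in the thread. The function name is made up, and the sketch says nothing about routing delay, which is the real concern.

```python
import math

HASHER_STAGES = 128     # stage count used in IIDX's estimate
BRAMS_PER_STAGE = 11    # 11 BRAMs at x72 give one 792-bit word
BRAMS_IN_LX130T = 264   # device total quoted in this thread


def brams_needed(stages: int, per_stage: int, dual_port: bool) -> int:
    """BRAM primitives needed to hold one delay word per stage.

    With dual_port=True, Port A and Port B of each BRAM are used
    independently at fixed, different addresses, so one primitive
    serves two stages.
    """
    total = stages * per_stage
    return math.ceil(total / 2) if dual_port else total


print(brams_needed(HASHER_STAGES, BRAMS_PER_STAGE, dual_port=False))  # 1408
print(brams_needed(HASHER_STAGES, BRAMS_PER_STAGE, dual_port=True))   # 704
print(BRAMS_IN_LX130T)                                                # 264
```

Even halved, covering every stage would take far more BRAMs than the device has, so this only makes sense for a subset of the delay registers.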


newbie
Activity: 35
Merit: 0
I think the problem is linking 11 BRAMs together requires a lot of LUTs for address decode/routing since the BRAMs are arranged in columns throughout the chip.  Plus linking 11 together would probably result in a minimum period much higher than 2.0ns (2.0 ns is for 1 BRAM I think).

So, you would need 128 (hashers) * 11 (BRAMs) for one pipeline stage = 1408 total BRAMs.  Of course, you're not suggesting you use BRAM for all the delay.  However, I think the slices you would sacrifice to connect the BRAMs and create their address logic would be more expensive than just using the built in FFs or DMEMs (plus the speed hit).

I'm hoping by floor planning each hashing module I can get to quick speeds.  Currently the logic delay I am facing is only around ~2.0 ns, with the routes taking the rest.  So with some nice routing I would hopefully meet my target.

The V6LX130 isn't even as big as the S6 150, but at least it has DSP48s.

I may also need to cut down the PCIe link from 4x to 1x and reduce its performance settings to regain some of the space that is being used up.

IIDX

Looks good!  I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :(  So many 512- and 256-bit registers...


   If you are short on flip flops, have you considered using the BRAMs?  You would need 11 primitives (there are 264 in the LX130T) to make a 792-bit-wide memory.  You can set the BRAM to 'write first' mode, which will echo the data to the output.  The clk-to-out for unpipelined BRAM is ~2.0ns...slower than FF.
   Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses) and get enough storage for 48 stages of a fully unrolled algorithm.
   I've never tried this, but was just thinking of how to make use of all the unused BRAM lying around.  I usually run out of LUTs, but need to rethink whether this is worthwhile with the DSP48 implementation.


newbie
Activity: 13
Merit: 0
@ihtfp: thanks for the info. Now I know why he didn't use it. ;D
I got 5 of those, though. Shame they can't be fully utilised.
newbie
Activity: 12
Merit: 0
Looks good!  I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :(  So many 512- and 256-bit registers...


   If you are short on flip flops, have you considered using the BRAMs?  You would need 11 primitives (there are 264 in the LX130T) to make a 792-bit-wide memory.  You can set the BRAM to 'write first' mode, which will echo the data to the output.  The clk-to-out for unpipelined BRAM is ~2.0ns...slower than FF.
   Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses) and get enough storage for 48 stages of a fully unrolled algorithm.
   I've never tried this, but was just thinking of how to make use of all the unused BRAM lying around.  I usually run out of LUTs, but need to rethink whether this is worthwhile with the DSP48 implementation.
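
In case it helps, here is where the 11 primitives and 48 stages come from, as a small Python sketch. It assumes each stage has to delay one 512-bit tx_w plus one 256-bit tx_state (my reading of the "512 and 256 bit registers" comment), with each BRAM configured 72 bits wide. The function names are made up for illustration.

```python
import math

TX_W_BITS = 512        # assumed width of the tx_w delay register
TX_STATE_BITS = 256    # assumed width of the tx_state delay register
BRAM_WIDTH_BITS = 72   # each BRAM configured as x72
BRAMS_IN_LX130T = 264  # device total quoted above


def primitives_per_stage(bits: int, width: int = BRAM_WIDTH_BITS) -> int:
    """BRAMs that must sit side by side to hold `bits` in one word."""
    return math.ceil(bits / width)


def stages_supported(total_brams: int, per_stage: int, ports: int = 2) -> int:
    """Stages that fit when each port holds one stage's word."""
    return (total_brams // per_stage) * ports


per_stage = primitives_per_stage(TX_W_BITS + TX_STATE_BITS)
print(per_stage)                                      # 11
print(per_stage * BRAM_WIDTH_BITS)                    # 792 bits wide
print(stages_supported(BRAMS_IN_LX130T, per_stage))   # 48
```

So 768 bits round up to 11 BRAMs (792 bits), 264 BRAMs make 24 such groups, and using both ports doubles that to 48 stages.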

newbie
Activity: 12
Merit: 0
Hi, total newb here. ;D

Just wanna ask: since I got this FPGA for free (my friends bought it and decided not to use it for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

You can probably use it, but it will be slow, because 50k logic gates are not enough for fully unrolled pipelines. As far as I know, a Spartan-6 LX90T produces 90 MH/s, and it has almost twice as many gates.
I wouldn't use it.  This FPGA only has 28k flip flops.  The Spartan6 LX150 has 184k for comparison.   As Reggie0 said, you wouldn't be able to use fully unrolled logic.
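
A back-of-the-envelope check on the flip-flop budget, sketched in Python. The assumptions are mine: a fully unrolled miner does two SHA-256 passes of 64 rounds each, and each round registers roughly 512 + 256 bits. Real designs will differ and need registers for other things too, so treat this as a floor, not a measurement.

```python
SHA256_ROUNDS = 64          # rounds per SHA-256 pass
PASSES = 2                  # Bitcoin hashes the block header twice
BITS_PER_STAGE = 512 + 256  # assumed pipeline register width per round

# Flip-flop counts as quoted in this thread.
FLIP_FLOPS = {
    "Genesys Virtex-5": 28_000,
    "Spartan6 LX150": 184_000,
}


def pipeline_register_bits(rounds: int = SHA256_ROUNDS * PASSES,
                           bits: int = BITS_PER_STAGE) -> int:
    """Rough count of flip-flops used purely as pipeline registers."""
    return rounds * bits


def fits(device: str) -> bool:
    """True if the device has at least that many flip-flops."""
    return FLIP_FLOPS[device] >= pipeline_register_bits()


print(pipeline_register_bits())  # 98304
for name in FLIP_FLOPS:
    print(name, fits(name))
```

Under these assumptions the pipeline registers alone need about 98k flip-flops, several times what the 28k device offers, while the LX150 still has headroom.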

member
Activity: 107
Merit: 13
Hi, total newb here. ;D

Just wanna ask: since I got this FPGA for free (my friends bought it and decided not to use it for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

You can probably use it, but it will be slow, because 50k logic gates are not enough for fully unrolled pipelines. As far as I know, a Spartan-6 LX90T produces 90 MH/s, and it has almost twice as many gates.
member
Activity: 107
Merit: 13
fpgaminer: is there any advantage using "{a,b,c}<={x,y,z};" instead of "a<=x;b<=y;c<=z;"  ?
(My opinion is it only helps to make more readable code.)
newbie
Activity: 13
Merit: 0
Hi, total newb here. ;D

Just wanna ask: since I got this FPGA for free (my friends bought it and decided not to use it for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,
sr. member
Activity: 262
Merit: 250
I have just pushed the experimental KC705 code to the repo.  Here is the project.  This is a DSP48E1 based design, and I have compiled and run it at 400MH/s.  I

Great! Thank you. I thought it would be interesting to browse the DSP48 code to see how you achieve the impressive performance.
hero member
Activity: 560
Merit: 517
Quote
BTW, what does Xpower report for that design at 400 MHz?
Vivado said ~8-9W, but I don't have it set up with the right information for it to make an accurate measurement.  Using my Kill-a-Watt I estimate about 15W.

I hacked support into MPBM for this new firmware, and she's happily mining away now.  Die temperature is 62C using just the stock cooling on the KC705.  8)
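
For a rough efficiency figure, the numbers above can be turned into MH/s per watt (which is the same as MH per joule). This is only a sketch: the function name is made up, the 15 W is a wall-socket estimate that includes the whole board, and the 8 W case is the low end of the Vivado estimate, which is not set up accurately.

```python
def mh_per_joule(hash_rate_mhs: float, power_watts: float) -> float:
    """Hashing efficiency in MH per joule (MH/s divided by watts)."""
    return hash_rate_mhs / power_watts


# 400 MH/s at the ~15 W Kill-a-Watt estimate
print(round(mh_per_joule(400.0, 15.0), 1))
# 400 MH/s at the low end of the Vivado estimate (~8 W)
print(mh_per_joule(400.0, 8.0))
```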
newbie
Activity: 35
Merit: 0
Looks good!  I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :(  So many 512- and 256-bit registers...

BTW, what does Xpower report for that design at 400 MHz?
hero member
Activity: 560
Merit: 517
Quick Note: I'm trying to move over to my fpgaminer github account.  The links in the OP should have been updated, but there are also a lot of people still following the older repo.  I will continue to push updates to both repos for a while, but expect https://github.com/fpgaminer/Open-Source-FPGA-Bitcoin-Miner to receive the majority of my attention.
hero member
Activity: 560
Merit: 517
I have just pushed the experimental KC705 code to the repo.  Here is the project.  This is a DSP48E1 based design, and I have compiled and run it at 400MH/s.  Included with this new design is a UART interface, instead of JTAG, since the KC705 kit has an on-board USB-UART bridge.  See the README for more information on how to use the UART interface.  As an additional surprise, this code includes support for the Kintex's on-die temperature sensor.  Temperature readings are reported over UART, allowing external software to monitor the chip.  In the future I will add automatic shutdown on over-temp conditions.

Let me know if you run into any difficulty getting the project to compile with Vivado 2013.1 (or later).  I have never distributed a Vivado project before.  As usual, you will need an appropriate Xilinx license to compile the design.
member
Activity: 107
Merit: 13
-3 speed grade?

Whatever the highest speed grade available is I would assume. I haven't asked what the speed grade of the kit was.



OK, i've checked the link. It is assembled with -2 speedgrade. "AC701 evaluation board featuring the XC7A200T-2FBG676C FPGA"
hero member
Activity: 1118
Merit: 541
-3 speed grade?

Whatever the highest speed grade available is I would assume. I haven't asked what the speed grade of the kit was.


