Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 12.

senseless

hero member

Activity: 1118

Merit: 541

Quote from: fpgaminer on April 15, 2013, 12:40:10 AM

I have just pushed the experimental KC705 code to the repo.

Thanks!

I ordered my AC701 today and I'm playing with the eval software now. The clocks run a little slower than on the Kintex line, but it has nearly as many DSPs as the chip you're using. I have high hopes for a minimum of 600 MH/s and am shooting for 800 MH/s. The initial compile shows 92% DSP usage, 43% LUT usage, 67% memory LUT usage and a clock of 345 MHz or so. I should be able to squeeze another core in there.

~~I was wondering, did it really take them 2 weeks to process & ship your unit to you after ordering?~~ That's a big yes. They're not going to ship my card for 2 weeks after ordering :( Maybe they've got a large order queue? Maybe each card is made to order? No idea. Seems a rather long time to wait though.

AJR,

If you're going to get into it I would highly recommend you get the 705 or the 701.

http://www.xilinx.com/products/boards-and-kits/EK-K7-KC705-G.htm
http://www.xilinx.com/products/boards-and-kits/EK-A7-AC701-G.htm

The 705 will have room for more hashers, but I believe the Artix chip may be more cost effective.

AJRGale

hero member

Activity: 767

Merit: 500

Quote from: ihtfp on April 18, 2013, 11:56:34 AM

Quote from: AJRGale on April 17, 2013, 09:39:54 PM

So, looking into this whole mining with fpga system, and this code you people are working on, what is the required Logic cells/gates required for a full roll out? also whats the smallest unit you can get it running on? (the bare minimal for a half roll out (what ever you call it?))

I just want to dip my toe into the FPGA mining with a cheap and nasty chip set Wink

just tell me to piddle off else where if its the wrong spot to ask

AJRGale,
I think you'll want at least a Spartan6 LX150. This is the cheapest device I would use. I would only run a fully pipelined implementation -- one that can do one hash per clock cycle. If you can get a hold of a Kintex7 or Virtex7 board you'll be a lot better because you can instantiate more miners.
fpgaminer has posted a lot of useful code on github.
I don't speak Altera, so not sure on specific devices.

So 150K gates? Like a Cyclone V? (No idea what the gate to logic cell ratios really are.) ...So that means 75K gates for a half miner?

Either way, I can't find a Spartan6 LX150, but I can find a http://www.adafruit.com/products/451 "DE0-Nano - Altera Cyclone IV FPGA starter board".
A miner could run on it, buut only the smallest one, from what I've read in here, at 5 MH/s...

Maybe I should look at the code and work out how to use the dev suite; it might tell me what it needs to run. I have no idea what I'll be looking at though :/

ihtfp

newbie

Activity: 12

Merit: 0

Quote from: AJRGale on April 17, 2013, 09:39:54 PM

So, looking into this whole mining with fpga system, and this code you people are working on, what is the required Logic cells/gates required for a full roll out? also whats the smallest unit you can get it running on? (the bare minimal for a half roll out (what ever you call it?))

I just want to dip my toe into the FPGA mining with a cheap and nasty chip set Wink

just tell me to piddle off else where if its the wrong spot to ask

AJRGale,
I think you'll want at least a Spartan6 LX150. This is the cheapest device I would use. I would only run a fully pipelined implementation -- one that can do one hash per clock cycle. If you can get hold of a Kintex7 or Virtex7 board you'll be a lot better off, because you can instantiate more miners.
fpgaminer has posted a lot of useful code on github.
I don't speak Altera, so not sure on specific devices.
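
As a rough illustration of why a fully pipelined design matters: a core that finishes one hash per clock has a hash rate equal to its clock frequency, and each extra core adds the same again. The sketch below assumes a partially unrolled core needs proportionally more clocks per hash. The function name and the example numbers are made up for illustration and are not taken from the repo.

```python
def hashrate_mhs(clock_mhz, cores=1, clocks_per_hash=1):
    """Estimated hash rate in MH/s for `cores` identical hashing cores.

    clocks_per_hash=1 models a fully pipelined core (one hash per clock);
    larger values model a partially unrolled core (an assumption here).
    """
    return clock_mhz * cores / clocks_per_hash


# Hypothetical numbers, for illustration only.
full = hashrate_mhs(100)                     # fully pipelined, one core
half = hashrate_mhs(100, clocks_per_hash=2)  # half unrolled, one core
two_cores = hashrate_mhs(100, cores=2)       # two fully pipelined cores

print(full, half, two_cores)
```

On a bigger device the `cores` term is what grows, which is the point about instantiating more miners.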

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

fpgaminer: is there any advantage using "{a,b,c}<={x,y,z};" instead of "a<=x;b<=y;c<=z;" ?
(My opinion is it only helps to make more readable code.)

No advantage, no. As you pointed out, it would only be for readability.

AJRGale

hero member

Activity: 767

Merit: 500

So, looking into this whole mining-with-FPGA thing and the code you people are working on: how many logic cells/gates are required for a full roll out? Also, what's the smallest unit you can get it running on (the bare minimum for a half roll out, or whatever you call it)?

I just want to dip my toe into FPGA mining with a cheap and nasty chipset ;)

Just tell me to piddle off elsewhere if it's the wrong spot to ask.

ihtfp

newbie

Activity: 12

Merit: 0

IIDX,

The addressing would be constant, so no decoding would be needed. The address lines would be tied off to constants.
The 2.0ns is the clk-to-out time for a data output. Since all outputs are in parallel (each BRAM configured as x72, and grouped together to give very wide access), the individual BRAM bit delay would not change. No demuxing of outputs would be necessary.
The number of BRAMs needed is only half what you show, since you can use both sides (Port A & Port B) independently (assign each side a fixed, but different address).
Yes, you are right though re getting the data from the BRAMs to the LUTs needed for the computation. There is a routing delay which is probably too large.
Obviously this is not the optimum solution; I'm only bringing it up as a last resort if the available flip flops have run out.
Regards,
ihtfp
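
The BRAM counts traded back and forth in this exchange can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes each pipeline stage has to delay 512 + 256 = 768 bits (the tx_w and tx_state widths mentioned in the thread), that each BRAM port is used at x72, and that the "128" in iidx's estimate counts pipeline stages. The variable names are illustrative.

```python
import math

STATE_BITS = 512 + 256    # tx_w + tx_state per stage (assumed from the thread)
BRAM_WIDTH = 72           # each BRAM configured as x72
BRAMS_IN_DEVICE = 264     # figure quoted in the thread for the LX130T
STAGES = 128              # multiplier used in iidx's estimate

# BRAMs needed side by side to cover one stage: 11, i.e. 11 * 72 = 792 bits.
brams_per_stage = math.ceil(STATE_BITS / BRAM_WIDTH)

# iidx's estimate: one port per BRAM.
single_port_total = STAGES * brams_per_stage

# ihtfp's correction: Port A and Port B each hold a different stage.
dual_port_total = single_port_total // 2

# How many stages the whole device could hold using both ports.
stages_that_fit = (BRAMS_IN_DEVICE // brams_per_stage) * 2

print(brams_per_stage, single_port_total, dual_port_total, stages_that_fit)
```

Under these assumptions the numbers line up with both posts: 1408 BRAMs with one port each, half that with both ports, and room for 48 stages in the device.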

Quote from: iidx on April 17, 2013, 04:18:10 PM

I think the problem is linking 11 BRAMs together requires a lot of LUTs for address decode/routing since the BRAMs are arranged in columns throughout the chip. Plus linking 11 together would probably result in a minimum period much higher than 2.0ns (2.0 ns is for 1 BRAM I think).

So, you would need 128 (hashers) * 11 (BRAMs) for one pipeline stage = 1408 total BRAMs. Of course, you're not suggesting you use BRAM for all the delay. However, I think the slices you would sacrifice to connect the BRAMs and create their address logic would be more expensive than just using the built in FFs or DMEMs (plus the speed hit).

I'm hoping by floor planning each hashing module I can get to quick speeds. Currently the logic delay I am facing is only around ~2.0 ns, with the routes taking the rest. So with some nice routing I would hopefully meet my target.

The V6LX130 isn't even as big as the S6 150, but at least is has DSP48s.

I may also need to cut down the PCIe link from 4x to 1x and reduce its performance settings to regain some of the space that is being used up.

IIDX

Quote from: ihtfp on April 17, 2013, 12:56:49 PM

Quote from: iidx on April 15, 2013, 02:24:50 AM

Looks good! I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :( So many 512 and 256 bit registers...

If you are short on flip flops, have you considered using the BRAMs? You would need 11 primitives (there are 264 in the LX130T) to make a by 792 bit wide memory. You can set the BRAM to 'write first' mode, which will echo the data to the output. The clk-to-out for unpipelined BRAM is ~2.0ns...slower than FF.
Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses), you can get enough storage for 48 stages of a fully unrolled algorithm.
I've never tried this, but was just thinking of how to make use of all the unused BRAM laying around. I usually run out of LUTs, but need to rethink if this is worthwhile with the DSP48 implementation.

iidx

newbie

Activity: 35

Merit: 0

I think the problem is linking 11 BRAMs together requires a lot of LUTs for address decode/routing since the BRAMs are arranged in columns throughout the chip. Plus linking 11 together would probably result in a minimum period much higher than 2.0ns (2.0 ns is for 1 BRAM I think).

So, you would need 128 (hashers) * 11 (BRAMs) for one pipeline stage = 1408 total BRAMs. Of course, you're not suggesting you use BRAM for all the delay. However, I think the slices you would sacrifice to connect the BRAMs and create their address logic would be more expensive than just using the built in FFs or DMEMs (plus the speed hit).

I'm hoping by floor planning each hashing module I can get to quick speeds. Currently the logic delay I am facing is only around ~2.0 ns, with the routes taking the rest. So with some nice routing I would hopefully meet my target.

The V6LX130 isn't even as big as the S6 150, but at least it has DSP48s.

I may also need to cut down the PCIe link from 4x to 1x and reduce its performance settings to regain some of the space that is being used up.

IIDX

Quote from: ihtfp on April 17, 2013, 12:56:49 PM

Quote from: iidx on April 15, 2013, 02:24:50 AM

Looks good! I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :( So many 512 and 256 bit registers...

If you are short on flip flops, have you considered using the BRAMs? You would need 11 primitives (there are 264 in the LX130T) to make a by 792 bit wide memory. You can set the BRAM to 'write first' mode, which will echo the data to the output. The clk-to-out for unpipelined BRAM is ~2.0ns...slower than FF.
Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses), you can get enough storage for 48 stages of a fully unrolled algorithm.
I've never tried this, but was just thinking of how to make use of all the unused BRAM laying around. I usually run out of LUTs, but need to rethink if this is worthwhile with the DSP48 implementation.

anomalies

newbie

Activity: 13

Merit: 0

@ihtfp: thanks for the info.. now I know why he didn't use it. :D

I've got 5 of those though.. shame they can't be fully utilised.

ihtfp

newbie

Activity: 12

Merit: 0

Quote from: iidx on April 15, 2013, 02:24:50 AM

Looks good! I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :( So many 512 and 256 bit registers...

If you are short on flip flops, have you considered using the BRAMs? You would need 11 primitives (there are 264 in the LX130T) to make a 792-bit-wide memory. You can set the BRAM to 'write first' mode, which will echo the data to the output. The clk-to-out for unpipelined BRAM is ~2.0ns... slower than an FF.
Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses) and get enough storage for 48 stages of a fully unrolled algorithm.
I've never tried this, but was just thinking of how to make use of all the unused BRAM lying around. I usually run out of LUTs, but need to rethink whether this is worthwhile with the DSP48 implementation.

ihtfp

newbie

Activity: 12

Merit: 0

Quote from: anomalies on April 17, 2013, 08:36:43 AM

hi, total newbs here. Grin

just wanna ask since i got this fpga for free (my friends bought it and decide not use it for whatever reason), could i use this for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

Quote from: Reggie0 on April 17, 2013, 12:20:49 PM

Quote from: anomalies on April 17, 2013, 08:36:43 AM

hi, total newbs here. Grin

just wanna ask since i got this fpga for free (my friends bought it and decide not use it for whatever reason), could i use this for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

You can probably use it, but it will be slow, because 50k logic gates is not enough for fully unrolled pipes. As far as I know, the Spartan-6 LX90T produces 90 MH/s, and it has almost twice as many gates.

I wouldn't use it. This FPGA only has 28k flip flops. The Spartan6 LX150 has 184k for comparison. As Reggie0 said, you wouldn't be able to use fully unrolled logic.
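
A back-of-the-envelope check of the flip-flop numbers above. It assumes a fully unrolled double SHA-256 pipeline has 128 stages, each registering a 256-bit state and a 512-bit message window (the tx_state and tx_w widths mentioned elsewhere in this thread). Real designs trim this considerably, so treat the result as a rough estimate under those assumptions, not a measured figure.

```python
STAGES = 128               # 2 x 64 SHA-256 rounds, fully unrolled (assumption)
BITS_PER_STAGE = 256 + 512 # state + message window registered per stage (assumption)


def pipeline_ffs(stages=STAGES, bits=BITS_PER_STAGE):
    """Flip-flops used just to carry state and message words down the pipe."""
    return stages * bits


needed = pipeline_ffs()

# Flip-flop counts as quoted in the post above (rounded).
devices = {"Virtex-5 on the Genesys board": 28_000, "Spartan6 LX150": 184_000}

for name, ffs in devices.items():
    verdict = "fits" if ffs >= needed else "does not fit"
    print(f"{name}: {ffs} FFs vs ~{needed} needed -> {verdict}")
```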

Reggie0

member

Activity: 107

Merit: 13

Quote from: anomalies on April 17, 2013, 08:36:43 AM

hi, total newbs here. Grin

just wanna ask since i got this fpga for free (my friends bought it and decide not use it for whatever reason), could i use this for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

You can probably use it, but it will be slow, because 50k logic gates is not enough for fully unrolled pipes. As far as I know, the Spartan-6 LX90T produces 90 MH/s, and it has almost twice as many gates.

Reggie0

member

Activity: 107

Merit: 13

fpgaminer: is there any advantage using "{a,b,c}<={x,y,z};" instead of "a<=x;b<=y;c<=z;" ?
(My opinion is it only helps to make more readable code.)

anomalies

newbie

Activity: 13

Merit: 0

hi, total newb here. :D

just wanna ask: since I got this FPGA for free (my friends bought it and decided not to use it for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

kingcoin

sr. member

Activity: 262

Merit: 250

Quote from: fpgaminer on April 15, 2013, 12:40:10 AM

I have just pushed the experimental KC705 code to the repo. Here is the project. This is a DSP48E1 based design, and I have compiled and run it at 400MH/s. I

Great! Thank you. I thought it would be interesting to browse the DSP48 code to see how you achieve the impressive performance.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

BTW, what does Xpower report for that design at 400 MHz?

Vivado said ~8-9W, but I don't have it set up with the right information for it to make an accurate measurement. Using my Kill-a-Watt I estimate about 15W.

I hacked support into MPBM for this new firmware, and she's happily mining away now. Die temperature is 62 °C using just the stock cooling on the KC705. 8)
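
For comparing boards, the power figures above can be turned into MH/s per watt. A minimal sketch, assuming the 400 MH/s figure from the KC705 announcement, the ~15 W Kill-a-Watt estimate, and the midpoint of the ~8-9 W Vivado estimate. Note the Kill-a-Watt reading is wall power, so it presumably includes supply losses and the rest of the board, not just the FPGA.

```python
def mh_per_watt(hashrate_mhs, watts):
    """Mining efficiency in MH/s per watt."""
    return hashrate_mhs / watts


HASHRATE_MHS = 400.0   # from the KC705 announcement
WALL_WATTS = 15.0      # Kill-a-Watt estimate
VIVADO_WATTS = 8.5     # midpoint of the ~8-9 W Vivado estimate (assumption)

wall_eff = mh_per_watt(HASHRATE_MHS, WALL_WATTS)
vivado_eff = mh_per_watt(HASHRATE_MHS, VIVADO_WATTS)

print(f"wall: {wall_eff:.1f} MH/s/W, Vivado estimate: {vivado_eff:.1f} MH/s/W")
```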

iidx

newbie

Activity: 35

Merit: 0

Looks good! I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for tx_w and tx_state delays :( So many 512 and 256 bit registers...

BTW, what does Xpower report for that design at 400 MHz?

fpgaminer

hero member

Activity: 560

Merit: 517

Quick Note: I'm trying to move over to my fpgaminer github account. The links in the OP should have been updated, but there are also a lot of people still following the older repo. I will continue to push updates to both repos for a while, but expect https://github.com/fpgaminer/Open-Source-FPGA-Bitcoin-Miner to receive the majority of my attention.

fpgaminer

hero member

Activity: 560

Merit: 517

I have just pushed the experimental KC705 code to the repo. Here is the project. This is a DSP48E1 based design, and I have compiled and run it at 400MH/s. Included with this new design is a UART interface, instead of JTAG, since the KC705 kit has an on-board USB-UART bridge. See the README for more information on how to use the UART interface. As an additional surprise, this code includes support for the Kintex's on-die temperature sensor. Temperature readings are reported over UART, allowing external software to monitor the chip. In the future I will add automatic shutdown on over-temp conditions.

Let me know if you run into any difficulty getting the project to compile with Vivado 2013.1 (or later). I have never distributed a Vivado project before. As usual, you will need an appropriate Xilinx license to compile the design.

Reggie0

member

Activity: 107

Merit: 13

Quote from: senseless on April 14, 2013, 04:33:38 PM

Quote from: Reggie0 on April 14, 2013, 03:47:54 PM

-3 speed grade?

Whatever the highest speed grade available is I would assume. I haven't asked what the speed grade of the kit was.

OK, I've checked the link. It is assembled with a -2 speed grade part: "AC701 evaluation board featuring the XC7A200T-2FBG676C FPGA"

senseless

hero member

Activity: 1118

Merit: 541

Quote from: Reggie0 on April 14, 2013, 03:47:54 PM

-3 speed grade?

Whatever the highest speed grade available is I would assume. I haven't asked what the speed grade of the kit was.
