Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 10.

xbaby

newbie

Activity: 16

Merit: 0

Quote from: iidx on May 09, 2013, 12:26:23 PM

Oh, what speed grade did you use for the V6? All my boards with 130s and 240s (ml605) are -1, so if you used -3 that could explain the big difference in the quality of the results.

My board has two 130T devices, speed grade -2I, so your design with DSP48s really does have a speed and power advantage. BTW, can your design be synthesized with XST?

iidx

newbie

Activity: 35

Merit: 0

Oh, what speed grade did you use for the V6? All my boards with 130s and 240s (ml605) are -1, so if you used -3 that could explain the big difference in the quality of the results.

iidx

newbie

Activity: 35

Merit: 0

That's really interesting. I am not familiar with the ztex projects (they don't compile in my preferred synthesis tool, Synplify Pro). I would not have expected it to run on the V6 at 300 MHz with that usage without additional pipelining. Maybe I'll take a look at that project and see what the big difference is.

I used the old Verilog port to get to 300 MHz on mine (adding DSPs and pipelining), but the power usage is around 7W due to the use of the DSPs.

IIDX

Quote from: xbaby on May 09, 2013, 08:21:50 AM

Quote from: iidx on May 09, 2013, 01:25:56 AM

Quote from: fpgaminer on May 08, 2013, 07:27:46 PM

Quote

Thanks for your hint. I've already tried SmartXplorer with the 7 default built-in strategies, but can't get above 160MHz. So you mean I should use the cost table method to brute force it? Thanks.

Yup. For reference, the released bitstreams took days/weeks to compile.

xbaby,

You can also try to floorplan the DSP48s if you want to cut your runtime. To get the boards I have with V6 130Ts to run at 300 MHz, I had to constrain each of the DSP48s, otherwise there was no chance. This was based on the original verilog port, but I'm sure the problem with no pre-placement is the same.

Hi, thanks for your tips. I'm compiling the "X6000_ztex_comm4" project, which doesn't use any DSP48 blocks as far as I know. I also successfully compiled the same project for a V6 130T device (with minor fixes for the MMCM, FIFO and JTAG cores) and reached at most 300MHz, the same as yours, but with no DSP48s. The compile time for the V6 device is much shorter than for the Spartan-6 LX150. I guess the long-route resources of the Virtex-6 make the difference.

Next, I want to try different implementation options to reach a higher target, such as 350MHz.

BTW the power estimation given by ISE for the V6 130T @ 300MHz is about 10W.

xbaby

newbie

Activity: 16

Merit: 0

Quote from: iidx on May 09, 2013, 01:25:56 AM

Quote from: fpgaminer on May 08, 2013, 07:27:46 PM

Quote

Thanks for your hint. I've already tried SmartXplorer with the 7 default built-in strategies, but can't get above 160MHz. So you mean I should use the cost table method to brute force it? Thanks.

Yup. For reference, the released bitstreams took days/weeks to compile.

xbaby,

You can also try to floorplan the DSP48s if you want to cut your runtime. To get the boards I have with V6 130Ts to run at 300 MHz, I had to constrain each of the DSP48s, otherwise there was no chance. This was based on the original verilog port, but I'm sure the problem with no pre-placement is the same.

Hi, thanks for your tips. I'm compiling the "X6000_ztex_comm4" project, which doesn't use any DSP48 blocks as far as I know. I also successfully compiled the same project for a V6 130T device (with minor fixes for the MMCM, FIFO and JTAG cores) and reached at most 300MHz, the same as yours, but with no DSP48s. The compile time for the V6 device is much shorter than for the Spartan-6 LX150. I guess the long-route resources of the Virtex-6 make the difference.

Next, I want to try different implementation options to reach a higher target, such as 350MHz.

BTW the power estimation given by ISE for the V6 130T @ 300MHz is about 10W. Below is the resource usage:

Code:

Device Utilization Summary:

Slice Logic Utilization:
  Number of Slice Registers:                85,173 out of 160,000   53%
    Number used as Flip Flops:              85,172
    Number used as Latches:                      1
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:                0
  Number of Slice LUTs:                     57,385 out of  80,000   71%
    Number used as logic:                   34,910 out of  80,000   43%
      Number using O6 output only:          14,978
      Number using O5 output only:             539
      Number using O5 and O6:               19,393
      Number used as ROM:                        0
    Number used as Memory:                   9,759 out of  27,840   35%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            0
      Number used as Shift Register:         9,759
        Number using O6 output only:         9,759
        Number using O5 output only:             0
        Number using O5 and O6:                  0
    Number used exclusively as route-thrus: 12,716
      Number with same-slice register load: 12,452
      Number with same-slice carry load:       264
      Number with other load:                    0

Slice Logic Distribution:
  Number of occupied Slices:                15,859 out of  20,000   79%
  Number of LUT Flip Flop pairs used:       62,383
    Number with an unused Flip Flop:         1,382 out of  62,383    2%
    Number with an unused LUT:               4,998 out of  62,383    8%
    Number of fully used LUT-FF pairs:      56,003 out of  62,383   89%
    Number of slice register sites lost
      to control set restrictions:               0 out of 160,000    0%

kramble

sr. member

Activity: 384

Merit: 250

Quote from: AJRGale on May 09, 2013, 01:29:38 AM

Wow, now I'm in: 35MH/s = $5 a month... at a max of 5W? Now if I was going to replace my current setup, which is pulling 200W, I'd need 100 of these, and that's beating my 190MH/s setup! (35 x 100 = 3500MH/s!!) And that's just the DE0-Nanos!!

Now, where's my $10,000...

Yes, quite! In the 6 months I've been tinkering I've mined the glorious sum of 0.4BTC. I was sort of hoping to get up to a whole bitcoin eventually (I rather fancied one of those shiny physical coins as a keepsake), but that now seems rather forlorn.

Still, I already had the kit, which (as I explained way back up the thread), I obtained in a fit of enthusiasm for rekindling the electronics hobbyist days of my youth, and all I've invested is my time (of which I have a lot spare at the moment), and a little electricity. It was fun though, so no regrets.

Anyway, this is rather derailing fpgaminer's thread with my chattering, so I'll shut up now.

TTFN
Mark

AJRGale

hero member

Activity: 767

Merit: 500

Quote from: kramble on May 08, 2013, 08:12:52 AM

Quote from: Khertan on May 08, 2013, 01:32:04 AM

I use very aggressive fitter settings (effort multiplier of 40); that's 2 hours of fitting.

Thanks for the tip, I've been using the default settings so far but I'll give the more aggressive ones a try.

Makomk's code did eventually compile (for a 120MHz clock) and gave an fmax of 123MHz at 85C. This should be giving 30MHash/s, though I'm not convinced I'm seeing that in practice. Possibly the FPGA is running a bit too hot, though I'm not seeing any bad hashes. I'll have to run it a bit longer to be certain.

[EDIT] It's actually working perfectly. I cranked it up to 140MHz and it seems quite stable, pushing out 35MHash/sec! Not bad at all for a DE0-Nano. Cheers makomk :D

Regards
Mark

Wow, now I'm in: 35MH/s = $5 a month... at a max of 5W? Now if I was going to replace my current setup, which is pulling 200W, I'd need 100 of these, and that's beating my 190MH/s setup! (35 x 100 = 3500MH/s!!) And that's just the DE0-Nanos!!

Now, where's my $10,000...
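
A quick sanity check of that arithmetic, as a rough Python sketch. The 5W per board is the guess from the post above (not a measurement), and the variable names are made up for illustration.

```python
# Back-of-envelope comparison using only the figures quoted in the post.
# ASSUMPTION: 5 W per DE0-Nano is the poster's guess, not a measurement.
nano_rate_mhs = 35.0      # MH/s per DE0-Nano at 140MHz
nano_power_w = 5.0        # assumed worst-case power per board
rig_rate_mhs = 190.0      # existing setup
rig_power_w = 200.0

boards = 100
farm_rate_mhs = nano_rate_mhs * boards     # total hash rate of 100 boards
farm_power_w = nano_power_w * boards       # total power of 100 boards

nano_eff = nano_rate_mhs / nano_power_w    # MH/s per watt
rig_eff = rig_rate_mhs / rig_power_w

# Boards needed just to match the existing rig (ceiling division).
boards_to_match = -(-rig_rate_mhs // nano_rate_mhs)

print(farm_rate_mhs, farm_power_w, nano_eff, rig_eff, boards_to_match)
```

Under these assumptions 100 boards would draw about 500W rather than less than the current 200W, while six boards would already match the 190MH/s rig.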

iidx

newbie

Activity: 35

Merit: 0

Quote from: fpgaminer on May 08, 2013, 07:27:46 PM

Quote

Thanks for your hint. I've already tried SmartXplorer with the 7 default built-in strategies, but can't get above 160MHz. So you mean I should use the cost table method to brute force it? Thanks.

Yup. For reference, the released bitstreams took days/weeks to compile.

xbaby,

You can also try to floorplan the DSP48s if you want to cut your runtime. To get the boards I have with V6 130Ts to run at 300 MHz, I had to constrain each of the DSP48s, otherwise there was no chance. This was based on the original verilog port, but I'm sure the problem with no pre-placement is the same.
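
For anyone wondering what constraining each DSP48 might look like in practice, here is a minimal sketch: a Python script that emits UCF `LOC` constraints. The instance names (`hasher[...]/adder_dsp`), the column and row numbers and the helper name are all hypothetical. The real instance names come from your own netlist, and the valid `DSP48_XnYm` sites from the device view for your part.

```python
# Emit one UCF LOC constraint per DSP48 instance, walking up a DSP column.
# ASSUMPTIONS: instance names, column index and starting row are illustrative only.

def dsp_loc_constraints(instances, column=0, first_row=0):
    """Return UCF lines pinning each instance to consecutive DSP48 sites."""
    lines = []
    for offset, inst in enumerate(instances):
        site = f"DSP48_X{column}Y{first_row + offset}"
        lines.append(f'INST "{inst}" LOC = {site};')
    return lines

# Hypothetical hierarchy names; replace with the ones from your design.
instances = [f"hasher[{i}]/adder_dsp" for i in range(4)]
ucf = dsp_loc_constraints(instances, column=2, first_row=8)
print("\n".join(ucf))
```

Generating the constraints from a script makes it easy to try a different column or starting row without editing dozens of lines by hand.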

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

Thanks for your hint. I've already tried SmartXplorer with the 7 default built-in strategies, but can't get above 160MHz. So you mean I should use the cost table method to brute force it? Thanks.

Yup. For reference, the released bitstreams took days/weeks to compile.
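
The cost table brute force boils down to a loop like the one sketched below. This is only an illustration of the idea, not the script used for the released bitstreams. `run_one` is a hypothetical stand-in for a full implementation run that returns the achieved clock period, and `fake_run` just produces made-up numbers so the sketch can run on its own. To my knowledge ISE selects the placer cost table with the `-t` option (values 1 to 100), but check the command line tools guide for your version.

```python
# Brute-force sweep over placer cost tables, keeping the best timing result.
# ASSUMPTION: run_one() stands in for a full map/par/trce run and returns the
# achieved clock period in ns; fake_run() is a deterministic fake for illustration.
import random

def fake_run(cost_table):
    """Hypothetical stand-in for one implementation run (NOT real tool output)."""
    rng = random.Random(cost_table)          # same seed -> same "result"
    return round(rng.uniform(4.8, 6.6), 3)   # pretend period in ns

def sweep(run_one, cost_tables, target_ns):
    """Try each cost table; return (best_table, best_period, met_target)."""
    best_table, best_period = None, float("inf")
    for table in cost_tables:
        period = run_one(table)
        if period < best_period:
            best_table, best_period = table, period
        if best_period <= target_ns:
            break                            # stop at the first run that meets timing
    return best_table, best_period, best_period <= target_ns

best_table, best_period, met = sweep(fake_run, range(1, 101), target_ns=5.0)
print(best_table, best_period, met)
```

With real runs taking hours each, a sweep like this is where the days/weeks of compile time go.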

kramble

sr. member

Activity: 384

Merit: 250

Quote from: Khertan on May 08, 2013, 01:32:04 AM

I use very aggressive fitter settings (effort multiplier of 40); that's 2 hours of fitting.

Thanks for the tip, I've been using the default settings so far but I'll give the more aggressive ones a try.

Makomk's code did eventually compile (for a 120MHz clock) and gave an fmax of 123MHz at 85C. This should be giving 30MHash/s, though I'm not convinced I'm seeing that in practice. Possibly the FPGA is running a bit too hot, though I'm not seeing any bad hashes. I'll have to run it a bit longer to be certain.

[EDIT] It's actually working perfectly. I cranked it up to 140MHz and it seems quite stable, pushing out 35MHash/sec! Not bad at all for a DE0-Nano. Cheers makomk :D

Regards
Mark
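
The 30 and 35MHash/s figures both work out to clock / 4. A quick sketch, assuming this build uses LOOP_LOG2=2 and that one hash completes every 2^LOOP_LOG2 clocks (the function name is made up for illustration):

```python
def hash_rate_mhs(clock_mhz, loop_log2):
    """Estimated MH/s, ASSUMING one hash completes every 2**loop_log2 clocks."""
    return clock_mhz / (2 ** loop_log2)

for clock in (120, 140):
    print(clock, hash_rate_mhs(clock, loop_log2=2))
```

Under the same assumption, a 27.5MH/s result would correspond to a 110MHz clock.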

xbaby

newbie

Activity: 16

Merit: 0

Quote from: fpgaminer on May 07, 2013, 10:50:05 PM

Quote

I'd like to ask which optimization options I need to use to achieve a > 190MHz clock speed. Please help me, thanks very much.

The project won't "just compile" and achieve >190MHz. Getting timing that high requires using Xilinx's SmartXplorer to brute force it.

Thanks for your hint. I've already tried SmartXplorer with the 7 default built-in strategies, but can't get above 160MHz. So you mean I should use the cost table method to brute force it? Thanks.

Khertan

full member

Activity: 193

Merit: 100

Quote from: kramble on May 07, 2013, 05:02:09 PM

Quote from: Khertan on May 07, 2013, 04:35:27 PM

Two loops instead of 3 will improve the design by 33 percent. That's an incredible and awesome boost... which will give my DE0-Nano 10MH/s instead of 6.66MH/s (at 40MHz so as not to fry it). That's amazing!!!

I concluded that the power consumption is pretty much proportional to the hash rate. So for example 10MHash/sec will consume the same power (and get just as hot) whether running at 50MHz or using half the resources at 100MHz (to a first approximation anyway, as a faster clock should be slightly less efficient).

I've gone back to look at makomk's code (he's uploaded something recently to http://www.makomk.com/gitweb/?p=Open-Source-FPGA-Bitcoin-Miner.git;a=tree;h=refs/heads/de0-nano-usb;hb=de0-nano-usb ), so I thought I'd give it a try (swapping out his USB interface for my serial code). It's still compiling after 2 hours (only just failing to route at the last attempt, just one signal short!). It will be interesting to see how fast it will run (assuming it does finish compiling!)

PowerPlay estimates less power for two loops at 40MHz than for three at 50MHz, and I think you won't be able to get more than 100MHz with two loops.

I use very aggressive fitter settings (effort multiplier of 40); that's 2 hours of fitting.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

I'd like to ask which optimization options I need to use to achieve a > 190MHz clock speed. Please help me, thanks very much.

The project won't "just compile" and achieve >190MHz. Getting timing that high requires using Xilinx's SmartXplorer to brute force it.

kramble

sr. member

Activity: 384

Merit: 250

Quote from: Khertan on May 07, 2013, 04:35:27 PM

Two loops instead of 3 will improve the design by 33 percent. That's an incredible and awesome boost... which will give my DE0-Nano 10MH/s instead of 6.66MH/s (at 40MHz so as not to fry it). That's amazing!!!

I concluded that the power consumption is pretty much proportional to the hash rate. So for example 10MHash/sec will consume the same power (and get just as hot) whether running at 50MHz or using half the resources at 100MHz (to a first approximation anyway, as a faster clock should be slightly less efficient).

I've gone back to look at makomk's code (he's uploaded something recently to http://www.makomk.com/gitweb/?p=Open-Source-FPGA-Bitcoin-Miner.git;a=tree;h=refs/heads/de0-nano-usb;hb=de0-nano-usb ), so I thought I'd give it a try (swapping out his USB interface for my serial code). It's still compiling after 2 hours (only just failing to route at the last attempt, just one signal short!). It will be interesting to see how fast it will run (assuming it does finish compiling!)
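
The "power proportional to hash rate" observation matches the usual first-order model where dynamic power scales with the amount of switching logic times the clock. A toy sketch of the 50MHz versus 100MHz example, with a made-up hasher count and an arbitrary constant `k` (static power and the efficiency loss at higher clocks are ignored):

```python
# First-order model: dynamic power ~ k * (active hashers) * (clock).
# ASSUMPTIONS: FULL_PIPELINE and k are arbitrary illustrative values.
FULL_PIPELINE = 100   # hypothetical hasher count that yields one hash per clock

def hash_rate_mhs(hashers, clock_mhz):
    return clock_mhz * hashers / FULL_PIPELINE

def dynamic_power(hashers, clock_mhz, k=1.0):
    return k * hashers * clock_mhz

# Config A: 20 hashers at 50MHz. Config B: half the resources at 100MHz.
rate_a, power_a = hash_rate_mhs(20, 50), dynamic_power(20, 50)
rate_b, power_b = hash_rate_mhs(10, 100), dynamic_power(10, 100)
print(rate_a, rate_b, power_a, power_b)
```

Both configurations come out at the same hash rate and the same modelled power, which is the point being made above.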

Khertan

full member

Activity: 193

Merit: 100

Two loops instead of 3 will improve the design by 33 percent. That's an incredible and awesome boost... which will give my DE0-Nano 10MH/s instead of 6.66MH/s (at 40MHz so as not to fry it). That's amazing!!!

:p of course that's more for fun and to learn
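
To put numbers on the two-loop versus three-loop comparison, here is a small sketch. It assumes one hash takes 2 * loops clocks, which is inferred from the 6.66 and 10MH/s figures at 40MHz rather than taken from the source code:

```python
# ASSUMPTION (inferred from the quoted figures): one hash takes 2 * loops clocks.
def hash_rate_mhs(clock_mhz, loops):
    return clock_mhz / (2 * loops)

three = hash_rate_mhs(40, 3)             # roughly 6.67 MH/s
two = hash_rate_mhs(40, 2)               # 10.0 MH/s
cycles_saved = 1 - (2 * 2) / (2 * 3)     # fraction of clocks per hash removed
speedup = two / three - 1                # fractional gain in hash rate
print(round(three, 2), two, round(cycles_saved, 2), round(speedup, 2))
```

So under this model the clocks per hash drop by a third, which shows up as a 50 percent higher hash rate at the same clock.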

kramble

sr. member

Activity: 384

Merit: 250

Quote from: Khertan on May 07, 2013, 10:11:23 AM

I've tried to fit a 2-loop design with 32 hashers. It can be fitted in a DE0-Nano after some automagic Quartus area optimization, but with a much lower fmax (120MHz).
It fits with only 1xx LUTs free.

Unfortunately I messed things up: while converting to two loops I broke something in the cnt or feedback...

I was not able to get the LOOP_LOG2=2 code to fit myself but makomk achieved 27.5MH/s on a Nano (https://bitcointalksearch.org/topic/m.847182 (EDIT updated to a better link)), so I guess that with some expert tweaking it does indeed work. I decided to go a different route and try to fit 22 hashers (which nicely gives 66 stages in three rounds, so just discarding the last two to give the 64 needed) using a variant of sha256_transform from makomk's github (since the makomk branch in the official distribution does not work unless LOOP_LOG2=1). It did take a fair bit of tinkering in the simulator to get the timing right (and I ended up discarding makomk's pipelining of the K values since it was too confusing, so there is an opportunity for some further gain by putting it back in).

Interestingly this 66 round core generalized quite well, as I was able to use it on an EP4CE10 as 6 rounds of 11 hashers and on an LX9 as 11 rounds of 6 hashers (rather disappointing utilization, but I'm even more of a novice at Xilinx ISE than I am at Quartus). Anyway this was just playing around for the sake of it rather than a serious attempt to build a miner on these devices, though I did construct one of each using TQFP devices mounted on breakout adapters, which are currently hashing away at the majestic rates of ~~12.7MH/s~~ 11.7MH/s (140MHz) and 5MH/s (110MHz) respectively.

Best of luck
Mark
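
The three ways of splitting the 66 stages, and the quoted hash rates, can be cross-checked with a few lines of Python. The factor of 2 (one hash taking 2 * rounds clocks) is an assumption inferred from the quoted figures, not something read from the code, and the dictionary layout is just for illustration:

```python
# Ways to split the 66-stage core into (rounds x hashers), with estimated rates.
# ASSUMPTION (inferred from the quoted figures): a hash takes 2 * rounds clocks.
STAGES = 66          # 64 are needed, the last two results are discarded
configs = {
    "DE0-Nano": (3, 22, None),   # rounds, hashers, clock in MHz (not stated here)
    "EP4CE10": (6, 11, 140),
    "LX9": (11, 6, 110),
}
rates = {}
for name, (rounds, hashers, clock_mhz) in configs.items():
    assert rounds * hashers == STAGES    # every split covers all 66 stages
    if clock_mhz is not None:
        rates[name] = clock_mhz / (2 * rounds)
print(rates)
```

Under that assumption the EP4CE10 lands at about 11.7MH/s and the LX9 at 5MH/s, in line with the rates quoted above.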

Khertan

full member

Activity: 193

Merit: 100

Quote from: kramble on May 03, 2013, 03:32:43 PM

Given the tiny returns from mining on the Nano, my opinion was that it's not worth risking the boards at the higher speeds. I'm happy with my current setup (as described above) as nothing is getting above 60C, but it's your call on your own stuff.

Regards
Mark

I've tried to fit a 2-loop design with 32 hashers. It can be fitted in a DE0-Nano after some automagic Quartus area optimization, but with a much lower fmax (120MHz).
It fits with only 1xx LUTs free.

Unfortunately I messed things up: while converting to two loops I broke something in the cnt or feedback...

xbaby

newbie

Activity: 16

Merit: 0

I'm trying to compile "projects/X6000_ztex_comm4" myself for the device "xc6slx150, speed -3", under Xilinx ISE v13.4, using the code from GitHub without any modification.

Using the default compile options from "xilinx_fpgaminer.xise", with the design goal "Timing Performance", placement failed. After changing the goal to "Minimum Runtime", the project compiled successfully, but the timing constraints can't be met. According to the PAR report, the clock speed is only 153MHz (period 6.54ns). I'd like to ask which optimization options I need to use to achieve a > 190MHz clock speed. Please help me, thanks very much.

Code:

+-------------------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|                               |   Period    |       Actual Period       |      Timing Errors        |      Paths Analyzed       |
|           Constraint          | Requirement |-------------+-------------|-------------+-------------|-------------+-------------|
|                               |             |   Direct    | Derivative  |   Direct    | Derivative  |   Direct    | Derivative  |
+-------------------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|TS_CLK_100MHZ                  |     10.000ns|      9.689ns|     13.082ns|            0|          633|         1456|      3690036|
| TS_dynamic_clk_blk_clkfx      |      5.000ns|      6.541ns|          N/A|          633|            0|      3690036|            0|
+-------------------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

Slice Logic Utilization:
  Number of Slice Registers:                84,129 out of 184,304   45%
    Number used as Flip Flops:              84,129
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:                0
  Number of Slice LUTs:                     50,798 out of  92,152   55%
    Number used as logic:                   35,040 out of  92,152   38%
      Number using O6 output only:          15,507
      Number using O5 output only:             581
      Number using O5 and O6:               18,952
      Number used as ROM:                        0
    Number used as Memory:                   3,297 out of  21,680   15%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            0
      Number used as Shift Register:         3,297
        Number using O6 output only:           449
        Number using O5 output only:             0
        Number using O5 and O6:              2,848
    Number used exclusively as route-thrus: 12,461
      Number with same-slice register load: 12,036
      Number with same-slice carry load:       425
      Number with other load:                    0

Slice Logic Distribution:
  Number of occupied Slices:                15,049 out of  23,038   65%
  Nummber of MUXCYs used:                   22,144 out of  46,076   48%
  Number of LUT Flip Flop pairs used:       58,734
    Number with an unused Flip Flop:           959 out of  58,734    1%
    Number with an unused LUT:               7,936 out of  58,734   13%
    Number of fully used LUT-FF pairs:      49,839 out of  58,734   84%
    Number of slice register sites lost
      to control set restrictions:               0 out of 184,304    0%
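
For readers not used to reading these reports, the periods in the timing table convert to clock frequencies as sketched below (the variable names are made up; the two periods are taken from the TS_dynamic_clk_blk_clkfx row):

```python
# Convert the constraint periods from the PAR report into clock frequencies.
def fmax_mhz(period_ns):
    return 1000.0 / period_ns

required_ns = 5.000   # TS_dynamic_clk_blk_clkfx requirement
actual_ns = 6.541     # achieved period from the report

required_mhz = fmax_mhz(required_ns)     # 200.0
actual_mhz = fmax_mhz(actual_ns)         # about 152.9
slack_ns = required_ns - actual_ns       # negative means timing is not met
print(required_mhz, round(actual_mhz, 1), round(slack_ns, 3))
```

So the project constraint actually asks for 200MHz, and this run misses it by about 1.5ns.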

Khertan

full member

Activity: 193

Merit: 100

Quote from: kramble on May 03, 2013, 03:32:43 PM

Given the tiny returns from mining on the Nano, my opinion was that it's not worth risking the boards at the higher speeds. I'm happy with my current setup (as described above) as nothing is getting above 60C, but it's your call on your own stuff.

Regards
Mark

Thanks. Indeed, for bitcoin mining I won't risk burning my little Nano. I'm asking because I'm working on another project and I want to understand things so that I don't burn it.

I'll try to monitor the USB power draw and the temperature.

At 40MHz PowerPlay estimates 296mA... for the FPGA only, of course. But I've played with the settings to reduce power usage compared with your original code / project settings.
So it looks like PowerPlay underestimates power usage.

Thanks a lot for your explanation.

fpgaminer

hero member

Activity: 560

Merit: 517

For those with a VC707 devkit (Virtex 7), I've done a blind port of the KC705_experimental project:

https://github.com/fpgaminer/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/VC707_experimental

Bitstream: https://mega.co.nz/#!7x4nkS4b!O2aEv0Khp541jwY8FIwpiUeYstoXAOSyMqUKxhBMwKY

Completely untested. Let me know if it works, or doesn't!

gingernuts

member

Activity: 89

Merit: 10

Looking at Digikey right now, for the chips you could actually buy today,

The Small Kintex XC7K160T is $230 ish in -1 grade and $280 ish in -2 grade
The Biggest Artix XC7A200T is $200 ish in -1 grade and $270 ish in -2 grade and both of these can be developed with the free Webpack software

The Kintex used on the KC705, XC7K325T is $1000 ish in the -1 grade, and $1500 odd in the -2 grade (They have a $1200 one, but not in stock), and needs a full Vivado/ISE license to play with - even if I were to buy a KC705 dev-kit, I can't see how the 325T device is going to be good bang for the buck...

Interestingly, in a Kintex -> Artix migration guide Xilinx seem to reckon that a -1 grade Kintex is 1.6x as fast as a -1 Artix, so while the 7A200T looks like a winner in terms of price and slices/DSP modules, I'm wondering whether the Kintex XC7K160T might not be the best value overall...
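
A very rough way to compare these is "throughput per dollar". The sketch below uses the number in the part name (160, 200, 325) as a crude proxy for logic size, takes the 1.6x Kintex speed factor from the migration guide mentioned above, and assumes hash rate scales with size times speed. None of this is measured, and the dictionary keys are just labels for the -1 grade parts.

```python
# Rough "throughput per dollar" comparison from the Digikey prices in the post.
# ASSUMPTIONS: part-name number as a size proxy, 1.6x speed factor for Kintex,
# hash rate proportional to size * speed. Purely illustrative.
parts = {
    "XC7K160T-1": {"size": 160, "speed": 1.6, "price": 230},
    "XC7A200T-1": {"size": 200, "speed": 1.0, "price": 200},
    "XC7K325T-1": {"size": 325, "speed": 1.6, "price": 1000},
}
value = {name: p["size"] * p["speed"] / p["price"] for name, p in parts.items()}
best = max(value, key=value.get)
for name in sorted(value, key=value.get, reverse=True):
    print(f"{name}: {value[name]:.2f}")
```

With these crude inputs the small Kintex edges out the Artix, and the 325T comes out at roughly half the value of either.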
