Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 41. (Read 432967 times)

member
Activity: 73
Merit: 10
Quote
Yeah, I know about the random seed, but ... I dunno, that seems too far fetched that the random seed would let it go from completely un-routable, to routing in a few minutes. Then again, I haven't used ISE in a few years, so it may be a lot more temperamental than I remember.

I don't know about today, but back in the day the Xilinx router used simulated annealing, and if that algorithm gets caught in a local minimum, it gets stuck and never recovers. Minor irrelevant tweaks to the input could sometimes shake things out.
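
To illustrate why the seed can matter, here is a toy simulated-annealing placer. It is only a sketch: the 1-D slot model, the `wirelength` cost, the linear cooling schedule and all names are made up for illustration and say nothing about how ISE actually places or routes. The only source of variation between runs is the seed, so different seeds may settle at different final costs.

```python
import math
import random

def wirelength(order, nets):
    """Total 1-D wire length: sum of slot distances between connected cells."""
    slot = dict((cell, i) for i, cell in enumerate(order))
    return sum(abs(slot[a] - slot[b]) for a, b in nets)

def anneal(nets, n_cells, seed, steps=2000, t0=5.0):
    """Toy annealer: swap two cells at random, keep the swap if it helps,
    or with a temperature-dependent probability if it hurts."""
    rng = random.Random(seed)          # the seed is the only thing that varies
    order = list(range(n_cells))
    rng.shuffle(order)
    cost = wirelength(order, nets)
    for step in range(steps):
        temp = t0 * (1.0 - step / float(steps)) + 1e-9   # linear cooling (assumed)
        i, j = rng.randrange(n_cells), rng.randrange(n_cells)
        order[i], order[j] = order[j], order[i]
        new_cost = wirelength(order, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
        else:
            order[i], order[j] = order[j], order[i]      # undo the swap
    return cost

nets = [(i, i + 1) for i in range(7)]   # hypothetical netlist: a chain of 8 cells
for seed in (1, 2, 3):
    print(seed, anneal(nets, 8, seed))
```

Once the temperature is low, a run that has settled into a poor arrangement rarely accepts the uphill moves needed to leave it, which is the "stuck in a local minimum" behaviour described above.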
hero member
Activity: 560
Merit: 517
Quote
If nothing else changed in the design, it may have just been that the seed the tool started with when running its algorithm changed.
Yeah, I know about the random seed, but ... I dunno, that seems too far fetched that the random seed would let it go from completely un-routable, to routing in a few minutes. Then again, I haven't used ISE in a few years, so it may be a lot more temperamental than I remember.

I only did one thing different. I ran each stage one at a time, instead of telling it to P&R and have it automatically invoke all the necessary steps. Huh?

Quote
I've been trying to run the verilog version of the code through Synplify Pro
Ughhh. I've got an old version of Synplify Pro that refuses to synthesize the design. First, it didn't like the use of block names (huh?), and last I left it, it was optimizing out the entire design, for no particular reason. I might install their latest eval version on a different machine and see if that version has better luck.
legendary
Activity: 2940
Merit: 1090
All this seems to be about using dev-kit boards. Now that the designs have been tested on devkit boards, what is involved in doing it on presumably cheaper, maybe simpler "production" boards (no extra I/O types, just the one you actually want, or whatever other optimisations)?

Are the devkit ones the only ones you can simply plug into a usb port and play?

Although they might seem kind of expensive per MHash in upfront cost, low power usage is for some people not merely a savings of money on the power bill, but maybe even a case of keeping power usage low enough that landlords or employers or whoever won't see a drastic spike in the power bill and decide to no longer provide it "free"...

-MarkM-
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Make sure you connect all the inputs and outputs to something that can't be optimized away. For fpgaminer's Altera verilog design that would be this virtual wire thing, or for my VHDL design (and also the Xilinx verilog design) it would be the UART, which is itself connected to some I/O pins, so it can't be optimized away.
Oh, and I fully agree with your attitude towards verilog. That's why I translated it to VHDL :)
sr. member
Activity: 401
Merit: 250
Quote
As a side note, for reasons I cannot understand, my design passed P&R yesterday, even though it failed the last time I tried it.

Welcome to the wonderful world of P&R tools. ;) If nothing else changed in the design, it may have just been that the seed the tool started with when running its algorithm changed. I see this a lot: good timing constraints help guide the tools to better solutions, and the tools are becoming more deterministic, but there's still a random element to them at times. Some vendors are worse than others. *cough*Actel*cough*

I've been trying to run the verilog version of the code through Synplify Pro just to see how it looks for utilization on various devices, but have been having problems getting it to assign the parameter correctly... Seems to be ignoring my assignment and optimizing everything away. Haven't had a chance to try out the parameterized VHDL code yet, I'm a lot more comfortable with VHDL so even if it doesn't work right out of the gate with Synplify, I should be able to get it working. Verilog has too many idiosyncrasies for my tastes.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
@TheSeven: Just interested, but you fitted a complete unit (depth=6) into a v5lx110t?
Exactly, but with no additional registers, just the code that I have uploaded.

Quote
Gave it a try, but the default strategy already shows 250% LUTs after synthesis...
For the Virtex5 LX110?
I'm getting an area constraint ratio of ~200, but it fits at ~96% utilization in the end.
newbie
Activity: 10
Merit: 0
@TheSeven: Just interested, but you fitted a complete unit (depth=6) into a v5lx110t?
Gave it a try, but the default strategy already shows 250% LUTs after synthesis...
hero member
Activity: 686
Merit: 564
Quote
I don't know how much better the performance of the Stratix parts are. I'd just guess 1.5x faster, so you'd get ~1.5 MH/s per 1,000 LEs compared to the Cyclone series. Maybe more. Maybe less.

The Stratix IV architecture looks interesting. From the Altera docs on it, it seems you basically have the equivalent of a free full-adder attached to the output of each 4LUT. (Of course, it doesn't actually have 4LUTs as such, instead having 8-input 2-output ALMs that can be configured as 2 4LUTs, a 5LUT and a 3LUT, or a 6LUT.) Not that useful to me, since I neither have one nor the software to synthesise designs for one, but interesting nonetheless and should reduce LE usage compared to other FPGAs.

Quote
You'll probably gain more by cutting the pipeline stages into halves, as you seem to have lots of spare flipflops around. Sadly this is not the case on my Virtex5. :( You can usually do that by just adding a second register on the output, the synthesis tools will move over things from the preceding pipeline stage to this unused one. (You'll see this in the synthesis log as "register balancing".) I know that 190MHz are possible this way!

Ah yes, Xilinx under-equipped the Virtex 5 series with flipflops for some daft reason. Why do I keep getting the impression their FPGAs are designed to look good on paper as much as they are to actually function well?
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
And yes, I've run it solo before, against namecoind :) It found a couple blocks! :D
Guess what I tried with mine last night... :)
Quote
TheSeven: That is some great work you're doing there! I'm glad someone is making a lot of progress with Xilinx devices and Python mining scripts. I haven't had the chance to push your code into the public repo yet, but it is most certainly on my list of things to do.
Someone else also ported your code to Xilinx, while keeping it in Verilog. While I personally don't understand how one could ever choose Verilog over VHDL, you might possibly like that one better.
Quote
As a side note, for reasons I cannot understand, my design passed P&R yesterday, even though it failed the last time I tried it. This was a half-mining core (only one SHA-256 pass), at 50MHz. I haven't tested it or anything, but ... I'm just bewildered why it routed without any problems this time. I must have done something wrong the first time. Hopefully it isn't just messing with my head, and I can finally start mining on my SLX150. I'd also like to start testing the DSP48A1 slices, which are rated for 250MHz operation and will perform 48bit + 48bit addition 8)
What did you change? There must have been something...

Regarding those DSP slices, I'm not sure if they will pay off. 250MHz for the DSP slice alone is not that fast, regular LUT-based adders might be even faster! Also, you couldn't use the 48bit width, and would have only 2 of them to spend per round. This might reduce LUT utilization a bit, but probably won't help performance.
You'll probably gain more by cutting the pipeline stages into halves, as you seem to have lots of spare flipflops around. Sadly this is not the case on my Virtex5. :( You can usually do that by just adding a second register on the output, the synthesis tools will move over things from the preceding pipeline stage to this unused one. (You'll see this in the synthesis log as "register balancing".) I know that 190MHz are possible this way!
hero member
Activity: 560
Merit: 517
Quote
or possibly an Xtreme Data XD-PCIE3000 with three Stratix IVs per card?
What size Stratix IVs are they? I don't know how much better the performance of the Stratix parts are. I'd just guess 1.5x faster, so you'd get ~1.5 MH/s per 1,000 LEs compared to the Cyclone series. Maybe more. Maybe less.


Quote
* DATA is the block header for which a hash must be found. It does contain the unix timestamp. It also contains the current target value, so that's probably where the FPGA learns it (or it doesn't care at all and this is checked on the tcl-side). The nonce is set to 0x00000000.
The FPGA doesn't care, it just returns nonces that make a hash meet the Difficulty 1 target (H == 0). And no, it isn't checked on the tcl side either. All pools currently operate on Difficulty==1. For solo mining, the script will submit the data, bitcoind will check it, and return an error if it wasn't below the target. So, not too much harm done there.
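
For reference, the same difficulty-1 test can be sketched on the host side. This is an illustration only, not code from the miner scripts: `block_hash` and `meets_difficulty_1` are made-up names, and the genesis block header is used purely as well-known test data. "H == 0" is read here as "the last 32-bit word of the final SHA-256 state is zero", i.e. the last four bytes of the raw digest.

```python
import binascii
import hashlib

def block_hash(header):
    """Double SHA-256 of an 80-byte block header (raw digest, not byte-reversed)."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

def meets_difficulty_1(header):
    """True if the last state word (H) of the final hash is all zero bits."""
    return block_hash(header)[28:] == b"\x00" * 4

# Genesis block header: version, previous hash, merkle root, time, bits, nonce.
genesis = binascii.unhexlify(
    "01000000" + "00" * 32 +
    "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a" +
    "29ab5f49" + "ffff001d" + "1dac2b7c"
)
print(meets_difficulty_1(genesis))
```

Whether the hash also beats the real network target is a separate, stricter comparison, which is why bitcoind can still reject a nonce that passes this check.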

And yes, I've run it solo before, against namecoind :) It found a couple blocks! :D

TheSeven: That is some great work you're doing there! I'm glad someone is making a lot of progress with Xilinx devices and Python mining scripts. I haven't had the chance to push your code into the public repo yet, but it is most certainly on my list of things to do.

As a side note, for reasons I cannot understand, my design passed P&R yesterday, even though it failed the last time I tried it. This was a half-mining core (only one SHA-256 pass), at 50MHz. I haven't tested it or anything, but ... I'm just bewildered why it routed without any problems this time. I must have done something wrong the first time. Hopefully it isn't just messing with my head, and I can finally start mining on my SLX150. I'd also like to start testing the DSP48A1 slices, which are rated for 250MHz operation and will perform 48bit + 48bit addition 8)
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
trying to build a depth:=3 version right now.

slice luts: 54% (53% used as logic)
slice registers: 26%
occupied slices: 66%

estimates after synthesis.
Sounds like depth:=4 might be achievable
Quote
with a targeted 50mhz clock p&r takes forever and finally fails with setup violations.
problem is congestion/routing, not available resources in terms of FFs or LUTs...
Sounds like Spartan6 routing is just crap.
You might want to try depth:=2 and depth:=3 with doubled registers in the pipeline stages to allow for retiming and thus hitting higher frequencies, at the expense of a couple of flipflops, which you seem to have plenty of.
Quote
if you have the time, then just give it a try for xc6slx45-2csg324 with 50mhz and depth:=3
No, I'm busy synthesizing an XC6VLX760 design; this will take a while.
Quote
increasing the frequency is not an option, with depth:=2 the timing performance design goal reports just 55mhz after p&r.
This sounds like you might want to try the following:
- Split the sha256 rounds into two pipeline stages, as stated above (retiming)
- Experiment with various design strategies. For some reason "Runtime optimized" seems to yield the best results for this design. If you have the time, try SmartXplorer
- If all this doesn't work out, run it at 55MHz instead of 50, should bring it to 3.6MH/s :)
newbie
Activity: 10
Merit: 0
trying to build a depth:=3 version right now.

slice luts: 54% (53% used as logic)
slice registers: 26%
occupied slices: 66%

estimates after synthesis.
with a targeted 50mhz clock p&r takes forever and finally fails with setup violations.
problem is congestion/routing, not available resources in terms of FFs or LUTs...

if you have the time, then just give it a try for xc6slx45-2csg324 with 50mhz and depth:=3

increasing the frequency is not an option, with depth:=2 the timing performance design goal reports just 55mhz  after p&r.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Hm, sounds like a bug in the pool's RPC service. It's supposed to return True if it accepts the share. I'll probably have to try that one.
It's working fine with ContinuumPool, Swepool.net, Bitcoins.lc, BTC Guild, Eligius, Slush's pool and DeepBit.

EDIT:
Code:
Found long polling URL for BitClockers: http://pool.bitclockers.com:8332/LP
Mining: BitClockers:cd1aa9fa22321dd0489e32e7090a601bd9735152cf5d64fcdd05b7e7342d741d:112d8c994ded19371a1d932f
Found long polling URL for BTC Guild: http://btcguild.com:8332/LP
Found share: BitClockers:cd1aa9fa22321dd0489e32e7090a601bd9735152cf5d64fcdd05b7e7342d741d:112d8c994ded19371a1d932f:a580871a
BitClockers accepted share a580871a
Seems to work fine for me.
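
For what it's worth, the accept/reject decision boils down to reading the `result` field of the JSON-RPC reply to the share submission. A minimal sketch, assuming the usual getwork-style reply with `result`, `error` and `id` fields; `share_accepted` is a made-up helper and not the actual miner code:

```python
import json

def share_accepted(response_text):
    """Return True only if the pool's JSON-RPC reply reports a boolean true
    result and no error. Anything else counts as a rejected share."""
    reply = json.loads(response_text)
    return reply.get("error") is None and reply.get("result") is True

# Hypothetical replies, written out by hand for illustration.
print(share_accepted('{"result": true, "error": null, "id": 1}'))
print(share_accepted('{"result": false, "error": null, "id": 1}'))
```

A strict check like this would report "rejected" if a pool answered with the string "true" or the number 1 instead of a JSON boolean. Whether that is what happened in the report above is only a guess.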
newbie
Activity: 10
Merit: 0
another minor issue:
after a share is found (shown in green) it gets uploaded and I get a
"... rejected share ..." while the pool I am using (bitclockers) shows the share as valid.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
2.6.5 on openSuse 11.3

Changing it to
delta = (endtime - starttime).seconds - 0.0145
fixes the problem.
...and broke the hashrate calculation for everything that's taking more than 60 seconds to measure, so probably everything <0.8MH/s.
Quote
Congrats for PyFPGAMiner, it is really nice ;)

ATM my Atlys with 50MHz and depth:=2 is giving 3.2MH/s and I'm curious what performance I can reach.
I was thinking about using BlockRAM instead of Slice-FFs to squeeze in more logic and maybe ease the congestion problems of spartan 6 fpgas. A first glimpse showed that ISE is complaining about asynchronous reads in the current hw version.
I think it should help a lot to move all pipeline registers to BRAMs.
This is not likely to work out. You can only use one address of every dual-port BRAM, so the pipeline stages alone would use up almost all the BRAMs for the depth=2 version. (Even if it could use the BRAMs 100% efficiently it would need 88 BRAMs for depth=2)
BRAMs are also slower than slice flipflops, and as more signals would need to be routed to/from those centralized memories, congestion might get even worse.

At how much LUT/FF/Slice usage are you? I think you might be better off squeezing another depth=0 miner into it, or trying to increase the clock frequency.
newbie
Activity: 10
Merit: 0
2.6.5 on openSuse 11.3

Changing it to
delta = (endtime - starttime).seconds - 0.0145
fixes the problem.

Congrats for PyFPGAMiner, it is really nice ;)

ATM my Atlys with 50MHz and depth:=2 is giving 3.2MH/s and I'm curious what performance I can reach.
I was thinking about using BlockRAM instead of Slice-FFs to squeeze in more logic and maybe ease the congestion problems of spartan 6 fpgas. A first glimpse showed that ISE is complaining about asynchronous reads in the current hw version.
I think it should help a lot to move all pipeline registers to BRAMS.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Which python version is that? It's running fine for me with 2.7, but it's probably not 3.0-ready yet.
newbie
Activity: 10
Merit: 0
ok, gave it a try, after measuring the fpga performance it crashes:

miner.py line 385
 delta = (endtime - starttime).total_seconds() - 0.0145
AttributeError: 'datetime.timedelta' object has no attribute 'total_seconds'
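
`timedelta.total_seconds()` was only added in Python 2.7, which is why 2.6.5 raises this AttributeError. Below is a sketch of a replacement that also works on 2.6, using the equivalent formula given in the Python documentation. The helper name `timedelta_seconds` and the example timestamps are made up for illustration:

```python
from datetime import datetime, timedelta

def timedelta_seconds(td):
    """Length of td in seconds as a float, without relying on
    timedelta.total_seconds() (which Python 2.6 does not have)."""
    return (td.microseconds + (td.seconds + td.days * 24 * 3600) * 10 ** 6) / 1e6

# Made-up measurement window of 2 minutes and 0.5 seconds.
starttime = datetime(2011, 5, 1, 12, 0, 0)
endtime = starttime + timedelta(minutes=2, microseconds=500000)

delta = timedelta_seconds(endtime - starttime) - 0.0145
print(delta)
```

Unlike the integer `.seconds` attribute, this keeps the microseconds (and any whole days), so subtracting the 0.0145 s correction still makes sense.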
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
great, will have a look once the bitfile finally runs...

communication is working now, but "Measuring FPGA performance..." takes a lot longer than the specified 120 secs until it times out...

I calculated around 4200 secs if the hw achieves 1 MH/s ... can that be?
Yes, in the old miner version. The new one will take about a minute and it will also check whether the FPGA is working correctly.
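
The ~4200 s estimate is about what one pass over the full 32-bit nonce range costs at 1 MH/s, which is also the average work per difficulty-1 share. A quick check, assuming that this is what the old measurement had to wait for (the thread does not confirm it); the function name is made up:

```python
NONCE_SPACE = 2 ** 32   # number of possible 32-bit nonces

def seconds_per_nonce_sweep(hashes_per_second):
    """Time to try every nonce once at the given hash rate."""
    return NONCE_SPACE / float(hashes_per_second)

for rate in (1.0e6, 3.2e6):   # 1 MH/s from the question, 3.2 MH/s as reported for the Atlys
    print(rate, seconds_per_nonce_sweep(rate))
```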
newbie
Activity: 10
Merit: 0
great, will have a look once the bitfile finally runs...

communication is working now, but "Measuring FPGA performance..." takes a lot longer than the specified 120 secs until it times out...

I calculated around 4200 secs if the hw achieves 1 MH/s ... can that be?