
Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 47. (Read 119429 times)

legendary
Activity: 1029
Merit: 1000
Are you using licensed ISE? WebPack ISE only supports up to SLX75.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
I've been trying all day to get the Ztex code to build in ISE. One time it went all the way to PAR (6 hours), but then my flaky USB drive (all I have available right now; my system drive is an SSD and not very big) flaked on me and crashed ISE. I don't know why it doesn't remember the state of each process, because when I tried again it started from the beginning.

Now every time I try it fails MAP apparently due to not fitting but it doesn't say that. It just says "failed". If I look higher up at the Synth report detail there is a message about using more than 100% resources (LUT Mem slices). Not sure why but seems to not fit in SLX150 now and won't go past MAP stage. Ho hum...

Oh, original setting was for "Speed", but then I tried "Area", and the result was the same. Must be some tuning needed to get this to go.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
If there's any convenient way to do it, I'll give you some processing power of mine. I'm sure other people would offer the same.
Thanks. I'll have to think about how that might work. I use EC2 for my web serving so I'm familiar enough with it to make that option easy. For now I need to get hands-on here to learn how to use the tools and whether I can build a default hash core. There are several available now, but for the moment I'm trying to build the Ztex core.
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
If there's any convenient way to do it, I'll give you some processing power of mine. I'm sure other people would offer the same.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
Are you seriously running all those instances? I hope not for too long...

They're spot instances; it's about $7/hr to run 25 of them & they're started/stopped on demand.
Definitely worth it in terms of build time reduction.

-rph

How do you split up the job into multiple parts for each instance? I'm just running my first implementation now on my laptop: C2D T5450, 2GB RAM. Needless to say, it's quite slow. So far 3 hours and still 14,000 unrouted. I've dug up some docs on using the command line and could probably set up a fast spot instance for myself. I'm just not sure how it can work across multiple instances. It looks like it's the place and route (PAR) step that really needs the muscle.

Edit: Whoa. I guess I should have expected it to slow down as the last nets get harder to route.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
Take a look at the original fpgaminer code on github - it uses serial communication to communicate the nonce and 'golden hashes'.

VHDL https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/VHDL_Xilinx_Port
Verilog https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/Verilog_Xilinx_Port
Features: * Uses RS232 for communication with PC. * Compatible with ISE and Xilinx devices. * Python scripts act as the controller on the PC.

Enigma
Thanks! I'll do that. I only took a cursory look thru the Ztex core to get an idea how it gets data in and out. Is there any noticeable difference in performance/compactness between VHDL and Verilog? I haven't done either for several years and so I have to brush up but I always tended to favour the Verilog as I found it easier to follow and write. So that would be my preference. I was looking today at the interface code and it seems like it'll be easy to alter it to use serial I/O. My worry is about synthesis and placement being sub-optimal afterwards. Anyway, I should probably have my own thread now.
full member
Activity: 180
Merit: 100
Typically, what limits FPGA timing is the routing of the interconnects.  An FPGA is configurable, but not infinitely so.  There are only so many possible paths from one LUT to the next.  When people speak of PAR, that's the Placement and Routing of these interconnects.

Each interconnect introduces some type of delay - there is no such thing as a zero latency interconnect.  There is some path delay, some rise and fall time of the signal, etc.

The design max speed will be limited by the slowest of all the interconnects.  If PAR manages to place and route them all with 5ns delay (200MHz), but there is one single connection that has a 20ns delay (50MHz), then the max speed of the entire design will be 50MHz.  eldentyrell is manually placing and routing the entire design to try to avoid there being a weak link - automatic PAR is pretty good, but it isn't perfect.  I have no doubt that eldentyrell will be able to out-route the automated PAR, but it's a LOT of work.  I can't even imagine the number of hours he has put into this.
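
To put numbers on the "weak link" point, here is a minimal Python sketch. It assumes the clock period must be at least as long as the slowest path delay and ignores setup time, clock skew and jitter. The function name and the list of delays are made up for illustration; they are not taken from any tool report.

```python
def max_clock_mhz(path_delays_ns):
    """Highest clock (MHz) at which every path still fits in one period.

    Simplification: period >= slowest path delay, nothing else considered.
    """
    worst_ns = max(path_delays_ns)   # the single slowest connection
    return 1000.0 / worst_ns         # 1 / ns = GHz, so 1000 / ns = MHz


# 999 connections routed at 5 ns, one straggler at 20 ns
delays = [5.0] * 999 + [20.0]
print(max_clock_mhz(delays))         # 50.0
print(max_clock_mhz([5.0] * 999))    # 200.0
```

One slow connection out of a thousand is enough to cut the whole design from 200MHz to 50MHz, which is why hand placement aims at the worst path rather than the average.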

For reference, the ztex design is currently limited to about 200MHz, but is using just about the entire chip for one double SHA-256 core.  eldentyrell is up to (I think) about 160MHz, but is using far less of the chip - hopefully leaving room for another single SHA-256 round.  The work he has done is really impressive - I honestly didn't think he would get as far as he has.  He must be an incredibly capable FPGA designer.

Enigma
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
What limits the clock speed? Is it unreliable performance when it's too high? Can that be solved with higher voltage, like with overclocking?
full member
Activity: 180
Merit: 100
I've done my 2 Layer board design now. Just waiting before re-checking and sending it off to make a few. Size is 50mm x 50mm (2"x2") and is modular so many can plug together in a chain/tree. Wouldn't mind feedback from experts (I'm not one! Just a hobbyist) if they'd like to see design.

I'm wondering how much spare space is generally left over on the Ztex design and others. I want to add a couple 8 bit registers and shift the nonce data in/out serially so would have to modify a working hash core. I'm just going to embark on the details of this now. D/L and install Xilinx DS.



Take a look at the original fpgaminer code on github - it uses serial communication to communicate the nonce and 'golden hashes'.

VHDL https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/VHDL_Xilinx_Port
Verilog https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/Verilog_Xilinx_Port
Features: * Uses RS232 for communication with PC. * Compatible with ISE and Xilinx devices. * Python scripts act as the controller on the PC.

Enigma
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
I've done my 2 Layer board design now. Just waiting before re-checking and sending it off to make a few. Size is 50mm x 50mm (2"x2") and is modular so many can plug together in a chain/tree. Wouldn't mind feedback from experts (I'm not one! Just a hobbyist) if they'd like to see design.

I'm wondering how much spare space is generally left over on the Ztex design and others. I want to add a couple 8 bit registers and shift the nonce data in/out serially so would have to modify a working hash core. I'm just going to embark on the details of this now. D/L and install Xilinx DS.

hero member
Activity: 504
Merit: 500

So if this works out, running at 200MHz would yield ~300 MH/s, right? 150% of the device's operating frequency.

- Zed

If he can get the rings to run at 200, sure. Otherwise:

So the calculation is hash_rate = num_rings*clock_rate*0.5.

~241.5 @ 161MHz

A very worthwhile endeavour even at that rate though.
sr. member
Activity: 475
Merit: 265
Ooh La La, C'est Zoom!
is there any benefit of using the first ring to feed the second ring?

Not really.  And it would add more special cases... if I get to three rings, I'd have one ring that expects to feed somebody else, one ring that expects to be fed by somebody else, and one ring that expects to feed itself -- three different designs!  Increased debugging/design effort.

OK, makes sense.

I was thinking three identical rings with one input and some selector logic. Each ring would always be fed by the selector, and would always output to the selector. The selector could use:

  • In from External (new share)
  • In from Internal (from another ring, 1st sha256 complete)
  • Out to External (2nd sha256 complete)

The selector would need to know when there is an available ring to route the next share, and whether the share that is being routed has 0, 1, or 2 sha256 operations computed.

But the more I think about this, it really boils down to each ring computes the first hash, then feeds itself that result and computes the hash, which it then reports as complete. So the selector logic would add overhead (delay) and complexity, and provides nothing useful. Right?
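
As a software analogy only (this is not the FPGA design), the "ring feeds itself" idea looks like this in Python, assuming the usual Bitcoin double SHA-256. The helper names are made up for illustration; `hashlib` is the standard library module.

```python
import hashlib


def sha256_pass(data: bytes) -> bytes:
    """One trip around the 'ring': a single SHA-256."""
    return hashlib.sha256(data).digest()


def double_sha256(header: bytes) -> bytes:
    """Same unit used twice: the first result is fed straight back in."""
    first = sha256_pass(header)   # pass 1
    return sha256_pass(first)     # pass 2, same logic, no selector needed


# Dummy 80-byte input standing in for a block header
digest = double_sha256(b"\x00" * 80)
print(len(digest))                # 32
```

The second pass reuses exactly the same function as the first, so nothing outside the ring has to track how many passes a share has completed.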


Is there any benefit to, or possibility of, moving the blue ring "up" so that the part that jogs up and to the left is in the top left corner?

That's what I'm working on right now.  You'll notice I left a "divot" in the top row right where that funny chunk of empty black space is (I think that's where Xilinx puts the JTAG and configuration logic, which is why you can't use that area).

Cool. I saw the divot and it makes complete sense.


How about rotating the green part such that the part that jogs up and to the left is jogging down and to the right and then located in the lower right corner?

Yep, that's the other part I'm working on.

Sweet!

So if this works out, running at 200MHz would yield ~300 MH/s, right? 150% of the device's operating frequency.

- Zed
legendary
Activity: 1029
Merit: 1000
You have to be good at chess ;)
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
is there any benefit of using the first ring to feed the second ring?

Not really.  And it would add more special cases... if I get to three rings, I'd have one ring that expects to feed somebody else, one ring that expects to be fed by somebody else, and one ring that expects to feed itself -- three different designs!  Increased debugging/design effort.

Is there any benefit to, or possibility of, moving the blue ring "up" so that the part that jogs up and to the left is in the top left corner?

That's what I'm working on right now.  You'll notice I left a "divot" in the top row right where that funny chunk of empty black space is (I think that's where Xilinx puts the JTAG and configuration logic, which is why you can't use that area).

How about rotating the green part such that the part that jogs up and to the left is jogging down and to the right and then located in the lower right corner?

Yep, that's the other part I'm working on.
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
Would it be possible to have 2 rings which compute once per 2 cycles, like you have now, and one ring that computes once per 4 cycles? I imagine one that computes once per 4 cycles might be smaller, so you may be able to get it on there?

Well, it would be smaller, but significantly larger than half-size.  Remember, if you unroll less than 64 stages, you can't hardwire the K-values.  So the 4-cycles-per-hash ring would be unrolled only 32 stages.  Each stage would have to know to switch K-value on odd and even cycles, which adds logic, and I wouldn't be able to precompute nearly as much stuff.  It would also take a lot of effort to rework the design.  I don't think it's a net win.
Ah, fair enough. It may be something to keep in mind for if you really can't cram a third one on there though, assuming a 4-cycle is still smaller than a 2-cycle. Or go with your earlier idea of putting half of one on there, and then using in conjunction with another FPGA, perhaps.
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
Would it be possible to have 2 rings which compute once per 2 cycles, like you have now, and one ring that computes once per 4 cycles? I imagine one that computes once per 4 cycles might be smaller, so you may be able to get it on there?

Well, it would be smaller, but significantly larger than half-size.  Remember, if you unroll less than 64 stages, you can't hardwire the K-values.  So the 4-cycles-per-hash ring would be unrolled only 32 stages.  Each stage would have to know to switch K-value on odd and even cycles, which adds logic, and I wouldn't be able to precompute nearly as much stuff.  It would also take a lot of effort to rework the design.  I don't think it's a net win.
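
A rough Python sketch of the K-value point. The mapping is an assumption for illustration: with fewer than 64 unrolled stages, stage i is taken to handle rounds i, i + stages, and so on. The real design may interleave differently, and the function name is made up.

```python
ROUNDS = 64  # SHA-256 has 64 rounds, each with its own constant K[0..63]


def k_indices_per_stage(unrolled_stages):
    """For each hardware stage, list the K indices it must be able to supply."""
    passes = ROUNDS // unrolled_stages
    return [
        [stage + p * unrolled_stages for p in range(passes)]
        for stage in range(unrolled_stages)
    ]


full = k_indices_per_stage(64)
half = k_indices_per_stage(32)
print(full[5])   # [5]      one fixed constant, can be baked into the LUTs
print(half[5])   # [5, 37]  two constants, so the stage needs a mux plus a select
```

With 64 stages every stage sees exactly one constant. With 32 stages every stage has to choose between two, which is the extra logic that keeps the smaller ring well above half size.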
sr. member
Activity: 475
Merit: 265
Ooh La La, C'est Zoom!
Secondly, each ring computes a hash every two clock cycles -- each nonce goes through the ring twice before we know if it is a share or not.

I'm new enough to both Bitcoin and FPGA design (I know some folks who design, but do not design myself) that I'm probably missing something pretty obvious, but is there any benefit of using the first ring to feed the second ring?

Is there any benefit to, or possibility of, moving the blue ring "up" so that the part that jogs up and to the left is in the top left corner? How about rotating the green part such that the part that jogs up and to the left is jogging down and to the right and then located in the lower right corner?

Not knowing the architecture and layout of the target device is driving the second set of questions.

As I said I'm new to FPGA design, but I find it very interesting, and I'm interested in learning. If the questions are "stupid noob" questions, tell me and point me in a direction to go read so I can learn, and I'll go back to lurking. I understand the basic low level components, flip-flops, LUT, logic, etc., but not the FPGA design and layout specifics.

Thanks,

- Zed
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
Would it be possible to have 2 rings which compute once per 2 cycles, like you have now, and one ring that computes once per 4 cycles? I imagine one that computes once per 4 cycles might be smaller, so you may be able to get it on there?
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
New plot.  Two rings, 161MHz.  As you can see I'm getting closer to being able to cram that third ring in there.
Does it mean that you are getting 320 MH/s per chip?

No.

First off, I haven't yet succeeded in cramming in the third ring, so this is still hypothetical.  I want to be very clear about that, although as you can see I'm obviously making major progress in that direction.

Secondly, each ring computes a hash every two clock cycles -- each nonce goes through the ring twice before we know if it is a share or not.  This is because the "sweet spot" in unrolling is 64 stages -- unroll less than that and you can't hardwire the K-values into the LUTs.  Unrolling any more than that adds no advantage, and reduces the "granularity" -- greater chance of being left with lots of empty space but still not quite enough for another ring.

So the calculation is hash_rate = num_rings*clock_rate*0.5.
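
Plugging the thread's numbers into that formula, as a quick Python check. The function name is made up; the 0.5 is the "one hash every two clock cycles" factor from above.

```python
def hash_rate_mhs(num_rings, clock_mhz):
    """hash_rate = num_rings * clock_rate * 0.5, in MH/s when the clock is in MHz."""
    return num_rings * clock_mhz * 0.5


print(hash_rate_mhs(2, 161))   # 161.0  two rings, as in the current plot
print(hash_rate_mhs(3, 161))   # 241.5  hypothetical third ring, same clock
print(hash_rate_mhs(3, 200))   # 300.0  hypothetical third ring at 200MHz
```

So two rings at 161MHz come to 161 MH/s rather than 320, and the three-ring figures stay hypothetical until the third ring actually fits.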

donator
Activity: 532
Merit: 501
We have cookies
New plot.  Two rings, 161MHz.  As you can see I'm getting closer to being able to cram that third ring in there.
Does it mean that you are getting 320 MH/s per chip?