Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 47.

Dexter770221

legendary

Activity: 1029

Merit: 1000

Are you using a licensed ISE? ISE WebPack only supports up to the SLX75.

BkkCoins

hero member

Activity: 784

Merit: 1009

firstbits:1MinerQ

I've been trying all day to get the Ztex code to build in ISE. One time it went all the way to PAR (6 hours), but then my flaky USB drive (all I have available right now; my sys drive is an SSD and not so big) flaked on me and crashed ISE. I don't know why it doesn't remember the state of each process, because when I tried again it started from the beginning.

Now every time I try, it fails at MAP, apparently because the design doesn't fit, but it doesn't say that. It just says "failed". If I look higher up at the Synth report detail, there is a message about using more than 100% of resources (LUT Mem slices). Not sure why, but it seems not to fit in the SLX150 now and won't go past the MAP stage. Ho hum...

Oh, original setting was for "Speed", but then I tried "Area", and the result was the same. Must be some tuning needed to get this to go.

BkkCoins

hero member

Activity: 784

Merit: 1009

firstbits:1MinerQ

Quote from: BTCurious on January 05, 2012, 04:40:26 AM

If there's any convenient way to do it, I'll give you some processing power of mine. I'm sure other people would offer the same.

Thanks. I'll have to think about how that might work. I use EC2 for my web serving so I'm familiar enough with it to make that option easy. For now I need to get hands-on here to learn how to use the tools and whether I can build a default hash core. There are several available now, but for the moment I'm trying to build the Ztex core.

BTCurious

hero member

Activity: 714

Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.

If there's any convenient way to do it, I'll give you some processing power of mine. I'm sure other people would offer the same.

BkkCoins

hero member

Activity: 784

Merit: 1009

firstbits:1MinerQ

Quote from: rph on January 01, 2012, 02:47:17 PM

Quote from: BkkCoins on January 01, 2012, 01:24:31 AM

Are you seriously running all those instances? I hope not for too long...

They're spot instances; it's about $7/hr to run 25 of them & they're started/stopped on demand.
Definitely worth it in terms of build time reduction.

-rph

How do you split up the job into multiple parts for each instance? I'm just running my first implementation now on my laptop: C2D T5450, 2GB RAM. Needless to say, it's quite slow. So far 3 hours and still 14,000 unrouted. I've dug up some docs on using the command line and could probably set things up to get me onto a fast spot instance. I'm just not sure how it can work on multiple instances. It looks like it's place and route (PAR) that really needs the muscle.

Edit: Whoa. I guess I should have expected it to slow down as it gets harder to route toward the end.

BkkCoins

hero member

Activity: 784

Merit: 1009

firstbits:1MinerQ

Quote from: Enigma81 on January 04, 2012, 12:37:35 AM

Take a look at the original fpgaminer code on github - it uses serial communication to communicate the nonce and 'golden hashes'..

VHDL https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/VHDL_Xilinx_Port
Verilog https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/Verilog_Xilinx_Port
Features:
* Uses RS232 for communication with PC.
* Compatible with ISE and Xilinx devices.
* Python scripts act as the controller on the PC.

Enigma

Thanks! I'll do that. I only took a cursory look thru the Ztex core to get an idea how it gets data in and out. Is there any noticeable difference in performance/compactness between VHDL and Verilog? I haven't done either for several years and so I have to brush up but I always tended to favour the Verilog as I found it easier to follow and write. So that would be my preference. I was looking today at the interface code and it seems like it'll be easy to alter it to use serial I/O. My worry is about synthesis and placement being sub-optimal afterwards. Anyway, I should probably have my own thread now.

Enigma81

full member

Activity: 180

Merit: 100

Typically, what limits FPGA timing is the routing of the interconnects. An FPGA is configurable, but not infinitely so. There are only so many possible paths from one LUT to the next. When people speak of PAR, that's the Placement and Routing of these interconnects.

Each interconnect introduces some type of delay - there is no such thing as a zero latency interconnect. There is some path delay, some rise and fall time of the signal, etc.

The design's max speed will be limited by the slowest of all the interconnects. If PAR manages to place and route them all with 5ns delay (200MHz), but there is one single connection that has a 20ns delay (50MHz), then the max speed of the entire design will be 50MHz. eldentyrell is manually placing and routing the entire design to try to avoid a weak link - automatic PAR is pretty good, but it isn't perfect. I have no doubt that eldentyrell will be able to out-route the automated PAR, but it's a LOT of work. I can't even imagine the number of hours he has put into this.
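The weakest-link point can be put into numbers. The sketch below is only an illustration: the function name and the list of path delays are made up, and only the 5ns and 20ns figures come from the post above.

```python
def max_clock_mhz(path_delays_ns):
    """Return the highest clock (in MHz) that a set of path delays (in ns) allows.

    The clock period cannot be shorter than the slowest path, so
    f_max = 1 / max(delay). With delays in ns, 1000 / delay gives MHz.
    """
    worst_ns = max(path_delays_ns)
    return 1000.0 / worst_ns


# Hypothetical design: every path routed at 5 ns, except a single 20 ns path.
all_fast = max_clock_mhz([5.0] * 1000)
with_slow = max_clock_mhz([5.0] * 999 + [20.0])
print(all_fast, with_slow)  # 200.0 50.0
```

One slow path out of a thousand is enough to drag the whole design from 200MHz down to 50MHz, which is why hand placement targets the worst paths first.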

For reference, the ztex design is currently limited to about 200MHz, but is using just about the entire chip for one double SHA-256 core. eldentyrell is up to (I think) about 160MHz, but is using far less of the chip - hopefully leaving room for another single SHA-256 round. The work he has done is really impressive - I honestly didn't think he would get as far as he has. He must be an incredibly capable FPGA designer.

Enigma

BTCurious

hero member

Activity: 714

Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.

What limits the clock speed? Is it unreliable performance when it's too high? Can that be solved with higher voltage, like with overclocking?

Enigma81

full member

Activity: 180

Merit: 100

Quote from: BkkCoins on January 04, 2012, 12:11:33 AM

I've done my 2 Layer board design now. Just waiting before re-checking and sending it off to make a few. Size is 50mm x 50mm (2"x2") and is modular so many can plug together in a chain/tree. Wouldn't mind feedback from experts (I'm not one! Just a hobbyist) if they'd like to see design.

I'm wondering how much spare space is generally left over on the Ztex design and others. I want to add a couple 8 bit registers and shift the nonce data in/out serially so would have to modify a working hash core. I'm just going to embark on the details of this now. D/L and install Xilinx DS.

Take a look at the original fpgaminer code on github - it uses serial communication to communicate the nonce and 'golden hashes'..

VHDL https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/VHDL_Xilinx_Port
Verilog https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/Verilog_Xilinx_Port
Features:
* Uses RS232 for communication with PC.
* Compatible with ISE and Xilinx devices.
* Python scripts act as the controller on the PC.

Enigma

BkkCoins

hero member

Activity: 784

Merit: 1009

firstbits:1MinerQ

I've done my 2 Layer board design now. Just waiting before re-checking and sending it off to make a few. Size is 50mm x 50mm (2"x2") and is modular so many can plug together in a chain/tree. Wouldn't mind feedback from experts (I'm not one! Just a hobbyist) if they'd like to see design.

I'm wondering how much spare space is generally left over on the Ztex design and others. I want to add a couple 8 bit registers and shift the nonce data in/out serially so would have to modify a working hash core. I'm just going to embark on the details of this now. D/L and install Xilinx DS.
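For anyone following along, here is a rough software model of what shifting a nonce in through an 8-bit register could look like. It is a sketch under assumptions that are not in the post: bits are sent most significant first, bytes arrive high byte first, and the class and function names are invented for illustration.

```python
class ShiftRegister8:
    """Software model of an 8-bit shift register (MSB-first is an assumption)."""

    def __init__(self):
        self.value = 0

    def shift_in(self, bit):
        # Shift left by one, append the new bit, and drop whatever falls off the top.
        self.value = ((self.value << 1) | (bit & 1)) & 0xFF


def nonce_to_bits(nonce):
    """Yield the 32 bits of a nonce, most significant bit first."""
    for i in range(31, -1, -1):
        yield (nonce >> i) & 1


def receive_nonce(bits):
    """Rebuild a 32-bit nonce from a serial bit stream, one byte at a time."""
    reg = ShiftRegister8()
    nonce = 0
    for count, bit in enumerate(bits, start=1):
        reg.shift_in(bit)
        if count % 8 == 0:  # a full byte has arrived in the register
            nonce = (nonce << 8) | reg.value
    return nonce


sent = 0xDEADBEEF
received = receive_nonce(nonce_to_bits(sent))
print(hex(received))  # 0xdeadbeef
```

In hardware the same idea costs only a handful of flip-flops per register, so the extra area should be small next to the hash core itself.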

sadpandatech

hero member

Activity: 504

Merit: 500

Quote from: ZedZedNova on January 03, 2012, 08:34:14 PM

So if this works out, running at 200MHz would yield ~300 MH/s, right? 150% of the device's operating frequency.

- Zed

If he can get the rings to run at 200, sure. Otherwise,

So the calculation is hash_rate = num_rings*clock_rate*0.5.

~241.5 MH/s @ 161MHz

A very worthwhile endeavour even at that rate though.
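The quoted formula can also be turned around to ask what clock a given hash rate needs. This is just the algebra, with three rings assumed (that is the count the 241.5 figure implies); the function name is made up.

```python
def required_clock_mhz(target_mhs, num_rings):
    """Clock (MHz) needed for a target hash rate (MH/s).

    Rearranged from hash_rate = num_rings * clock_rate * 0.5.
    """
    return target_mhs / (num_rings * 0.5)


print(required_clock_mhz(300.0, 3))  # 200.0
print(required_clock_mhz(241.5, 3))  # 161.0
```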

ZedZedNova

sr. member

Activity: 475

Merit: 265

Ooh La La, C'est Zoom!

Quote from: eldentyrell on January 02, 2012, 05:07:22 PM

Quote from: ZedZedNova on January 02, 2012, 04:58:21 PM

is there any benefit of using the first ring to feed the second ring?

Not really. And it would add more special cases... if I get to three rings, I'd have one ring that expects to feed somebody else, one ring that expects to be fed by somebody else, and one ring that expects to feed itself -- three different designs! Increased debugging/design effort.

OK, makes sense.

I was thinking three identical rings with one input and some selector logic. Each ring would always be fed by the selector, and would always output to the selector. The selector could use:

In from External (new share)
In from Internal (from another ring, 1st sha256 complete)
Out to External (2nd sha256 complete)

The selector would need to know when there is an available ring to route the next share, and whether the share that is being routed has 0, 1, or 2 sha256 operations computed.

But the more I think about this, it really boils down to this: each ring computes the first hash, then feeds itself that result and computes the second hash, which it then reports as complete. So the selector logic would add overhead (delay) and complexity, and provide nothing useful. Right?
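The "hash it, then hash the result again" flow is easy to see in software. The sketch below uses Python's standard hashlib and a placeholder all-zero header with a made-up, very easy target. It hashes the full 80-byte header in one go and says nothing about how the FPGA rings actually split the work.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """Hash the data, then hash the 32-byte result again."""
    first = hashlib.sha256(data).digest()   # first pass
    return hashlib.sha256(first).digest()   # result fed back in for the second pass


def is_share(header76: bytes, nonce: int, target: int) -> bool:
    """Append a 4-byte nonce to a 76-byte header prefix and test the hash against a target."""
    header = header76 + nonce.to_bytes(4, "little")
    digest = double_sha256(header)
    # Bitcoin treats the digest as a little-endian 256-bit integer.
    return int.from_bytes(digest, "little") <= target


# Placeholder values, not real work: zeroed header prefix and an easy target.
prefix = bytes(76)
easy_target = 1 << 252
hits = [n for n in range(64) if is_share(prefix, n, easy_target)]
print(hits)
```

Nothing can be reported until the second pass finishes, which matches the conclusion that a selector between rings would not buy anything.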

Quote from: eldentyrell on January 02, 2012, 05:07:22 PM

Quote from: ZedZedNova on January 02, 2012, 04:58:21 PM

Is there any benefit to, or possibility of, moving the blue ring "up" so that the part that jogs up and to the left is in the top left corner?

That's what I'm working on right now. You'll notice I left a "divot" in the top row right where that funny chunk of empty black space is (I think that's where Xilinx puts the JTAG and configuration logic, which is why you can't use that area).

Cool. I saw the divot and it makes complete sense.

Quote from: eldentyrell on January 02, 2012, 05:07:22 PM

Quote from: ZedZedNova on January 02, 2012, 04:58:21 PM

How about rotating the green part such that the part that jogs up and to the left is jogging down and to the right and then located in the lower right corner?

Yep, that's the other part I'm working on.

Sweet!

So if this works out, running at 200MHz would yield ~300 MH/s, right? 150% of the device's operating frequency.

- Zed

Dexter770221

legendary

Activity: 1029

Merit: 1000

You have to be good at chess ;)

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: ZedZedNova on January 02, 2012, 04:58:21 PM

is there any benefit of using the first ring to feed the second ring?

Not really. And it would add more special cases... if I get to three rings, I'd have one ring that expects to feed somebody else, one ring that expects to be fed by somebody else, and one ring that expects to feed itself -- three different designs! Increased debugging/design effort.

Quote from: ZedZedNova on January 02, 2012, 04:58:21 PM

Is there any benefit to, or possibility of, moving the blue ring "up" so that the part that jogs up and to the left is in the top left corner?

That's what I'm working on right now. You'll notice I left a "divot" in the top row right where that funny chunk of empty black space is (I think that's where Xilinx puts the JTAG and configuration logic, which is why you can't use that area).

Quote from: ZedZedNova on January 02, 2012, 04:58:21 PM

How about rotating the green part such that the part that jogs up and to the left is jogging down and to the right and then located in the lower right corner?

Yep, that's the other part I'm working on.

BTCurious

hero member

Activity: 714

Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.

Quote from: eldentyrell on January 02, 2012, 05:03:49 PM

Quote from: BTCurious on January 02, 2012, 04:14:33 PM

Would it be possible to have 2 rings which compute once per 2 cycles, like you have now, and one ring that computes once per 4 cycles? I imagine one that computes once per 4 cycles might be smaller, so you may be able to get it on there?

Well, it would be smaller, but significantly larger than half-size. Remember, if you unroll less than 64 stages, you can't hardwire the K-values. So the 4-cycles-per-hash ring would be unrolled only 32 stages. Each stage would have to know to switch K-value on odd and even cycles, which adds logic, and I wouldn't be able to precompute nearly as much stuff. It would also take a lot of effort to rework the design. I don't think it's a net win.

Ah, fair enough. It may be something to keep in mind for if you really can't cram a third one on there though, assuming a 4-cycle is still smaller than a 2-cycle. Or go with your earlier idea of putting half of one on there, and then using in conjunction with another FPGA, perhaps.

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: BTCurious on January 02, 2012, 04:14:33 PM

Would it be possible to have 2 rings which compute once per 2 cycles, like you have now, and one ring that computes once per 4 cycles? I imagine one that computes once per 4 cycles might be smaller, so you may be able to get it on there?

Well, it would be smaller, but significantly larger than half-size. Remember, if you unroll less than 64 stages, you can't hardwire the K-values. So the 4-cycles-per-hash ring would be unrolled only 32 stages. Each stage would have to know to switch K-value on odd and even cycles, which adds logic, and I wouldn't be able to precompute nearly as much stuff. It would also take a lot of effort to rework the design. I don't think it's a net win.
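To see why a shallower unroll loses the hardwired K-values, it helps to list which round constants each stage would have to supply. The sketch below assumes data loops through the stages until all 64 rounds are done; the function name is invented, and it works with constant indices only, not the actual K-values.

```python
SHA256_ROUNDS = 64  # rounds in one SHA-256 compression


def k_indices_for_stage(stage, unroll_depth):
    """List the round-constant indices one pipeline stage must be able to supply.

    Stage s handles rounds s, s + unroll_depth, s + 2 * unroll_depth, ...
    up to round 63.
    """
    return list(range(stage, SHA256_ROUNDS, unroll_depth))


for depth in (64, 32):
    print(depth, k_indices_for_stage(5, depth))
# 64 [5]
# 32 [5, 37]
```

With 64 stages, each stage only ever sees one constant, so it can be baked into the LUTs. With 32 stages, every stage has to pick between two constants depending on the cycle, and that selection is the extra logic mentioned above.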

ZedZedNova

sr. member

Activity: 475

Merit: 265

Ooh La La, C'est Zoom!

Quote from: eldentyrell on January 02, 2012, 03:54:13 PM

Secondly, each ring computes a hash every two clock cycles -- each nonce goes through the ring twice before we know if it is a share or not.

I'm new enough to both Bitcoin and FPGA design (I know some folks who design, but I don't design myself) that I'm probably missing something pretty obvious, but is there any benefit of using the first ring to feed the second ring?

Is there any benefit to, or possibility of, moving the blue ring "up" so that the part that jogs up and to the left is in the top left corner? How about rotating the green part such that the part that jogs up and to the left is jogging down and to the right and then located in the lower right corner?

Not knowing the architecture and layout of the target device is driving the second set of questions.

As I said I'm new to FPGA design, but I find it very interesting, and I'm interested in learning. If the questions are "stupid noob" questions, tell me and point me in a direction to go read so I can learn, and I'll go back to lurking. I understand the basic low level components, flip-flops, LUT, logic, etc., but not the FPGA design and layout specifics.

Thanks,

- Zed

BTCurious

hero member

Activity: 714

Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.

Would it be possible to have 2 rings which compute once per 2 cycles, like you have now, and one ring that computes once per 4 cycles? I imagine one that computes once per 4 cycles might be smaller, so you may be able to get it on there?

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: DeepBit on January 02, 2012, 03:28:21 PM

Quote from: eldentyrell on January 02, 2012, 03:25:38 PM

New plot. Two rings, 161MHz. As you can see I'm getting closer to being able to cram that third ring in there.

Does it mean that you are getting 320 MH/s per chip?

No.

First off, I haven't yet succeeded in cramming in the third ring, so this is still hypothetical. I want to be very clear about that, although as you can see I'm obviously making major progress in that direction.

Secondly, each ring computes a hash every two clock cycles -- each nonce goes through the ring twice before we know if it is a share or not. This is because the "sweet spot" in unrolling is 64 stages -- unroll less than that and you can't hardwire the K-values into the LUTs. Unrolling any more than that adds no advantage, and reduces the "granularity" -- greater chance of being left with lots of empty space but still not quite enough for another ring.

So the calculation is hash_rate = num_rings*clock_rate*0.5.
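Plugging the thread's numbers into that formula, as a quick sketch (the function name is made up; the 0.5 factor is the one hash per two clock cycles described above):

```python
def hash_rate_mhs(num_rings, clock_mhz):
    """hash_rate = num_rings * clock_rate * 0.5, in MH/s for a clock in MHz."""
    return num_rings * clock_mhz * 0.5


print(hash_rate_mhs(2, 161))  # 161.0  (the current two-ring plot)
print(hash_rate_mhs(3, 161))  # 241.5  (if the third ring fits)
```

So two rings at 161MHz give 161 MH/s rather than 320, and the third ring is what would push it to about 241.5 MH/s at the same clock.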

DeepBit

donator

Activity: 532

Merit: 501

We have cookies

Quote from: eldentyrell on January 02, 2012, 03:25:38 PM

New plot. Two rings, 161MHz. As you can see I'm getting closer to being able to cram that third ring in there.

Does it mean that you are getting 320 MH/s per chip?

Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 47. (Read 119468 times)