
Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 38. (Read 119440 times)

legendary
Activity: 1162
Merit: 1000
DiabloMiner author
With ASICs it will be the same mess BTW :-) lots of wires for the round expander :-) plus lots of clock problems.

With (fully custom) ASICs, however, you can just match your exact routing needs with wires, which should take care of the routing problems.
I'm certainly not an expert in that area, but I'd expect the overhead of intermediate result storage (in a rolled design) to outweigh the routing overhead (in an unrolled design).
As I stated above, a rolled design might still be useful to increase yield by containing defects within smaller functional units.

That's only if you get a real ASIC. A structured ASIC (SASIC) still screws you the same way, since it's just a hardwired version of the FPGA.

I'm definitely not an ASIC developer, so correct me if I'm wrong here. From the small number of simple designs I've done and laid out in Cadence, routing in an ASIC is definitely not free, especially if you're really trying to push the boundaries of all your well, gate, pad, etc. keepouts to maximize density on the silicon. If you're not careful with your planning and design, your number of metal layers can jump way up, which definitely adds to the cost of the design, even aside from the possible performance penalties of haphazard routing. I've only ever done work at 90nm and above, so I don't know how difficult the routing would be at 45nm or whatever a BTC ASIC would end up getting designed at, but small rolled cores might be more effective in an ASIC as well as in an FPGA. Someone would actually have to look into it to know. Maybe Vladimir could shed some light on the subject.

ASIC gives you maximum flexibility in the design. The biggest problem with FPGAs is that designs must use the DSP blocks and the BRAM for storage of the constants. Routing is still a problem on ASICs, but _you_ design the routing. You no longer have to worry about routing around things as on FPGAs, and you no longer have to worry about paying for hardware you'll never use (for example, that high-speed serial IO fabric isn't cheap, and neither is the onboard Ethernet controller and such).

ASICs have a huge upfront design cost, but if we could sell 250k ASICs (approximately more chips than all the FPGAs currently in use for mining put together) it would be cheaper per MH/s over the next 10 years by an order of magnitude.
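The order-of-magnitude claim is easy to sanity-check with a toy amortization model. All figures below are hypothetical placeholders for illustration, not numbers from this thread:

```python
# Toy amortization model for ASIC vs FPGA mining cost per MH/s.
# All dollar and hashrate figures are hypothetical placeholders.
def cost_per_mhs(nre_usd, unit_cost_usd, units, mhs_per_unit):
    """Total cost (one-time NRE plus per-unit cost) divided by total MH/s delivered."""
    total_cost = nre_usd + unit_cost_usd * units
    total_mhs = mhs_per_unit * units
    return total_cost / total_mhs

# Hypothetical: $2M NRE, $20/chip at 250k units, 2000 MH/s per ASIC
asic = cost_per_mhs(2_000_000, 20, 250_000, 2000)
# Hypothetical: no NRE, $200/board, 200 MH/s per FPGA board
fpga = cost_per_mhs(0, 200, 250_000, 200)
print(asic, fpga)
```

Under these made-up numbers the ASIC comes out far more than 10x cheaper per MH/s; the point is only that a large enough run amortizes the NRE away.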
legendary
Activity: 1274
Merit: 1004
With ASICs it will be the same mess BTW :-) lots of wires for the round expander :-) plus lots of clock problems.

With (fully custom) ASICs, however, you can just match your exact routing needs with wires, which should take care of the routing problems.
I'm certainly not an expert in that area, but I'd expect the overhead of intermediate result storage (in a rolled design) to outweigh the routing overhead (in an unrolled design).
As I stated above, a rolled design might still be useful to increase yield by containing defects within smaller functional units.

That's only if you get a real ASIC. A structured ASIC (SASIC) still screws you the same way, since it's just a hardwired version of the FPGA.

I'm definitely not an ASIC developer, so correct me if I'm wrong here. From the small number of simple designs I've done and laid out in Cadence, routing in an ASIC is definitely not free, especially if you're really trying to push the boundaries of all your well, gate, pad, etc. keepouts to maximize density on the silicon. If you're not careful with your planning and design, your number of metal layers can jump way up, which definitely adds to the cost of the design, even aside from the possible performance penalties of haphazard routing. I've only ever done work at 90nm and above, so I don't know how difficult the routing would be at 45nm or whatever a BTC ASIC would end up getting designed at, but small rolled cores might be more effective in an ASIC as well as in an FPGA. Someone would actually have to look into it to know. Maybe Vladimir could shed some light on the subject.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
With ASICs it will be the same mess BTW :-) lots of wires for the round expander :-) plus lots of clock problems.

With (fully custom) ASICs, however, you can just match your exact routing needs with wires, which should take care of the routing problems.
I'm certainly not an expert in that area, but I'd expect the overhead of intermediate result storage (in a rolled design) to outweigh the routing overhead (in an unrolled design).
As I stated above, a rolled design might still be useful to increase yield by containing defects within smaller functional units.

That's only if you get a real ASIC. A structured ASIC (SASIC) still screws you the same way, since it's just a hardwired version of the FPGA.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
With ASICs it will be the same mess BTW :-) lots of wires for the round expander :-) plus lots of clock problems.

With (fully custom) ASICs, however, you can just match your exact routing needs with wires, which should take care of the routing problems.
I'm certainly not an expert in that area, but I'd expect the overhead of intermediate result storage (in a rolled design) to outweigh the routing overhead (in an unrolled design).
As I stated above, a rolled design might still be useful to increase yield by containing defects within smaller functional units.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
There's a small difference, though. There technically is enough room to fit 2 full hashes on a Spartan 6, but due to how the leftover space is arranged, it probably will never fit (so eldentyrell fit 1 and a half). However, a shitload of tiny rolled engines would easily fit into the weirdly shaped unused space. I think someone did the math and said they're almost at the equivalent of 2 full hashes.

Not quite. Due to the additional overhead of each core, it is only equivalent to ~1.3 fully unrolled cores hashes-per-clock-wise. What bumps this to >1.5 times the total hashing speed is the higher clock rate those little cores can run at.
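The arithmetic behind those two figures can be sketched as follows. The 85-core and 64-clocks-per-hash numbers come from this thread; the clock frequencies are made-up placeholders just to show how the >1.5x total can emerge:

```python
# 85 rolled cores, each producing 1 hash per 64 clocks, vs. one fully
# unrolled core producing 1 hash per clock.
rolled_hpc = 85 / 64   # ~1.33 unrolled-core equivalents per clock
# (per-core overhead eats some of that, hence the ~1.3 quoted above)

# The >1.5x total speed comes from the higher clock the small cores reach.
# Hypothetical frequencies, for illustration only:
f_rolled, f_unrolled = 200e6, 160e6   # Hz, placeholders
speedup = (rolled_hpc * f_rolled) / (1.0 * f_unrolled)
print(round(rolled_hpc, 2), round(speedup, 2))
```

So a ~25% clock advantage on top of ~1.33 hashes-per-clock equivalence would already put the total above 1.5x.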
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
You have all the clues... Use your head and just guess from the data you have - the screenshot from PlanAhead - I certify that it is the correct one... Try placing some BRAM and watch your timings... Why would you ask then?
I'm asking because I'm not fully up-to-speed on possible space-time tradeoffs on the current Xilinx platforms. When I worked on them professionally we had the information about the routing and bitstream format available directly from Xilinx (maybe under NDA, I'm not sure, it was years ago).

I also remember the comments from a poster who implemented the bitcoin hashers on a Virtex-6; the quick-and-dirty solution was to use DSP48s for some fraction of the adders in the SHA-256 mixing steps.

In theory at least it should be possible to fill every BRAM with multiple copies of the constants and use those constants at least in those hashing cells that are close to the BRAMs. As far as I understand your design you currently have just one class/macro of hashing cell, but have plans on implementing another class/macro to fill out the space that currently remains unused.

Overall, I'll venture to guess that the ultimate Spartan-6 bitstream will use the sea-of-hashers concept and the hashers will be a heterogeneous mixture: close-to-DSP48, close-to-BRAM and far-from-DSP-and-BRAM. I occasionally talk to my friends who do digital design and they always mention "don't leave any FPGA resource unused, even at the expense of partially mangling the original algorithm".

I guess the ultimate way to express all of the above is that the design tradeoffs form a multidimensional space of clock-freq * number-of-gates * time-to-market.

That's pretty much my analysis of this too. Everything that can lead to faster hashing is on the table, no matter how insane or ugly.
legendary
Activity: 2128
Merit: 1073
You have all the clues... Use your head and just guess from the data you have - the screenshot from PlanAhead - I certify that it is the correct one... Try placing some BRAM and watch your timings... Why would you ask then?
I'm asking because I'm not fully up-to-speed on possible space-time tradeoffs on the current Xilinx platforms. When I worked on them professionally we had the information about the routing and bitstream format available directly from Xilinx (maybe under NDA, I'm not sure, it was years ago).

I also remember the comments from a poster who implemented the bitcoin hashers on a Virtex-6; the quick-and-dirty solution was to use DSP48s for some fraction of the adders in the SHA-256 mixing steps.

In theory at least it should be possible to fill every BRAM with multiple copies of the constants and use those constants at least in those hashing cells that are close to the BRAMs. As far as I understand your design you currently have just one class/macro of hashing cell, but have plans on implementing another class/macro to fill out the space that currently remains unused.

Overall, I'll venture to guess that the ultimate Spartan-6 bitstream will use the sea-of-hashers concept and the hashers will be a heterogeneous mixture: close-to-DSP48, close-to-BRAM and far-from-DSP-and-BRAM. I occasionally talk to my friends who do digital design and they always mention "don't leave any FPGA resource unused, even at the expense of partially mangling the original algorithm".

I guess the ultimate way to express all of the above is that the design tradeoffs form a multidimensional space of clock-freq * number-of-gates * time-to-market.
sr. member
Activity: 266
Merit: 251
In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way, like a crooked cab driver does to an out-of-town tourist. But, to reiterate, a fully unrolled miner does not involve feedback from the right side to the left side.

TheSeven said it correctly - Spartan routing resources are ugly: no handy BENTQUADs etc., plus 50% of the slices are SLICEXs, which adds up to problems. With Artix my highest expectation is 2x Spartan... but I am afraid to make such predictions, because I've heard that 28-nm chips have even more problems with power distribution... I do not want to get into trouble again, like having an estimate of 500 MH/s per chip, then a target of 400 MH/s, and finishing with 300 MH/s.

About "there's no long lines" - I've already commented, but will try to draw exactly where the epic fail for the parallel expander is...

say computing w0+w1 and feeding to w9:

                                        ---+---------------------------------
                                   ---+---------------------------------
                              ---+--------------------------------
                          ---+-------------------------------
                     ---+------------------------------
                ---+-----------------------------
           ---+----------------------------
      ---+----------------------------
 ---+---------------------------
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16

How many wires? What's the biggest cross-section just for that? 9x32 bits :-)
The same happens when pushing w9 to w16... and w14 to w16...
Too lazy to calculate - but near a 512-bit cross-section...
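The taps behind that diagram are the standard SHA-256 message schedule: each new word w[t] pulls from w[t-16], w[t-15], w[t-7] and w[t-2], so a fully parallel expander has to keep 16 live 32-bit words (512 bits) flowing across every stage. A minimal sketch of the expander:

```python
MASK = 0xFFFFFFFF  # 32-bit wrap

def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK

def expand(w):
    """SHA-256 message schedule: 16 input words -> 64 words.
    Each step taps words 16, 15, 7 and 2 positions behind, which is why a
    parallel expander needs roughly a 16 * 32 = 512-bit cross-section."""
    w = list(w)
    for t in range(16, 64):
        s0 = rotr(w[t-15], 7) ^ rotr(w[t-15], 18) ^ (w[t-15] >> 3)
        s1 = rotr(w[t-2], 17) ^ rotr(w[t-2], 19) ^ (w[t-2] >> 10)
        w.append((w[t-16] + s0 + w[t-7] + s1) & MASK)
    return w

print(len(expand([0] * 16)), 16 * 32)  # -> 64 512
```

In a rolled core those four taps become reads from a small shift register or BRAM; in a parallel expander they become the wide wire bundles this post is complaining about.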

And in Spartan-6 it is difficult to pass more than a 256-bit cross-section the long way across an 8-slice height (there are 32 QUAD routes per switch - so 256 bits would use up the horizontal QUAD routes across the 8-slice height).

Then what will happen is that it falls back to DOUBLE routes and spreads wide outside of your round expander area, slowing down the interconnect for other parts of the design...

I started out with that :-( Plus, it is a question how this design would survive the reality that SHA-256 is a VERY TOUGH TEST for bit error rates: even small, infrequent errors are amplified by the avalanche effect through the rounds.

With unrolled rounds, however, it is true - no problem there - it works like a charm... an unrolled design is also more compact than a rolled one... and a rolled design within 240 slices is very difficult... even 248 would be easier, as in 240 I had to fight for each register and reuse parts of the logic to do other things... In my design the rounds only look similar; in reality there are 3 kinds of rounds with special cases, and they are different.

PS. You answered before I finished writing this post... Anyway, I think this will be helpful for those who try parallel rounds... With ASICs it will be the same mess BTW :-) lots of wires for the round expander :-) plus lots of clock problems.

PPS. So getting a fast and dense parallel design is a tough task - that's why I respect this work!

sr. member
Activity: 448
Merit: 250
In SHA-256 the round expander kills routing, as taking w[0], w[1] and w[9] requires a lot of routing, because you are basically pulling data from N rounds behind...

Oh yeah, I totally forgot about that.
Now you've got me almost convinced that such a sea of small blocks is the better way to do it.
Live and learn...
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Finally, I would say that implementing an FPGA design is mostly about placement and routing... Do not even start trying it if you are not prepared to spend weeks figuring all of that out - or use only simple designs, with clocks about 2-3 times lower than the chip's maximums... designs @ 50-100 MHz would be easy...

I completely agree. I currently have the most optimized OpenCL kernel for GPUs out there, and the most recent version took me 2 weeks of 6-8 hours a day of fiddling to get done, after 1+ year of working on previous versions.

FPGA design is about 2-3 times harder.
sr. member
Activity: 266
Merit: 251
This is most likely due to the Spartan6's awful long distance routing fabric,
This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

You have all the clues... Use your head and just guess from the data you have - the screenshot from PlanAhead - I certify that it is the correct one... Try placing some BRAM and watch your timings... Why would you ask then?

With the routing fabric it is the same... Open FPGA Editor and start placing routes manually; understand how QUAD, DOUBLE and SINGLE routes work within Spartan, what the cost of a switch-to-switch hop versus a switch-to-logic entry is, etc. It is interesting, believe me :-) The biggest pity, however, is that the P&R tool is far from ideal, and the fewer routing resources are left, the worse the design it produces. In SHA-256 the round expander kills routing, as taking w[0], w[1] and w[9] requires a lot of routing, because you are basically pulling data from N rounds behind... so you basically put in either an SRL or a BRAM to do that... near the end of the game... However, if you work really hard on it, Spartan has barely enough resources to route these parallel rounds - if you find the right placement schema to use the vertical and horizontal interconnect adequately. Also, interconnect works in one direction only, so if the rounds are placed in a smart way you get more efficient routing resource usage (i.e. A,B <---> C,D while A --> C and B <--- D are interconnected and placed into the same regions).

So I really respect the author's work fitting 1.5 parallel rounds into a Spartan 6 - it is tough and very nice work. And Spartan is probably showing its bad temper in the error rates. In the case of rolled rounds there are only single-round failures; in the case of unrolled rounds, if some part of the chip fails more frequently than the rest, you get higher performance degradation. In my experience during debug runs, it starts to degrade from the central slices out to the peripheral ones as you raise the clocks. It is interesting indeed whether the design's real performance would actually match the performance the tools display.

Finally, I would say that implementing an FPGA design is mostly about placement and routing... Do not even start trying it if you are not prepared to spend weeks figuring all of that out - or use only simple designs, with clocks about 2-3 times lower than the chip's maximums... designs @ 50-100 MHz would be easy...
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
This is most likely due to the Spartan6's awful long distance routing fabric,
This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way, like a crooked cab driver does to an out-of-town tourist. But, to reiterate, a fully unrolled miner does not involve feedback from the right side to the left side.

There's a small difference, though. There technically is enough room to fit 2 full hashes on a Spartan 6, but due to how the leftover space is arranged, it probably will never fit (so eldentyrell fit 1 and a half). However, a shitload of tiny rolled engines would easily fit into the weirdly shaped unused space. I think someone did the math and said they're almost at the equivalent of 2 full hashes.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
This is most likely due to the Spartan6's awful long distance routing fabric,
This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric.

That one has a particularly bad routing fabric though. Virtex, Kintex or even Artix are all much better.

And as pointed out above already, most of your other claims don't really apply here; especially for ASICs, I think a pipelined design is likely to perform better for several reasons. The only downside that I can think of right now is that a sea-of-small-cores approach has much better damage containment properties, thus increasing yield.
sr. member
Activity: 448
Merit: 250
This is most likely due to the Spartan6's awful long distance routing fabric,
This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way, like a crooked cab driver does to an out-of-town tourist. But, to reiterate, a fully unrolled miner does not involve feedback from the right side to the left side.
legendary
Activity: 2128
Merit: 1073
This is most likely due to the Spartan6's awful long distance routing fabric,
This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Well, the device itself (GPU, FPGA) does 2 SHA-256 hashes of 64 rounds each.
However, there is a VERY simple optimisation that removes 8 rounds
(4 at the beginning of the 1st and 4 at the end of the 2nd), which is done by GPUs (and most? FPGAs?)
(i.e. a 6.25% gain) and is not available with this at all
(so you also need to subtract 6.25% from any gain).
Maybe that is what he is referring to?

That is already "subtracted" from the results, and apparently both the MH/s and MH/J are still better for the rolled version. This is most likely due to the Spartan6's awful long distance routing fabric, which means that keeping things very close to each other pays off (which is one reason why 85 small, 64-clocks-per-hash cores together are faster than just three 2-clocks-per-hash cores, you can just clock them at much higher frequencies, and you can utilize more area on the chip).

That's an interesting hack. That's exactly the same reason why GPUs unroll the entire thing: so the intermediate values are kept in registers instead of pushed back to local or global RAM.
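The "85 small cores beat three big cores" comparison above is just a throughput formula. The core counts and clocks-per-hash come from the post; the clock frequencies below are hypothetical placeholders to illustrate the tradeoff:

```python
# Throughput model: hashrate = cores * f_clock / clocks_per_hash.
# Core counts and clocks-per-hash are from the post; the frequencies are
# hypothetical, chosen only to show how the small cores can come out ahead.
def hashrate(cores, f_hz, clocks_per_hash):
    return cores * f_hz / clocks_per_hash

small = hashrate(85, 210e6, 64)  # many tiny rolled cores at a high clock
big = hashrate(3, 80e6, 2)       # three 2-clocks-per-hash cores at a low clock
print(small > big)  # the sea of small cores wins under these assumptions
```

The rolled cores trade clocks-per-hash for clock frequency and area utilization, which is exactly the tradeoff the posts above describe.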
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Well, the device itself (GPU, FPGA) does 2 SHA-256 hashes of 64 rounds each.
However, there is a VERY simple optimisation that removes 8 rounds
(4 at the beginning of the 1st and 4 at the end of the 2nd), which is done by GPUs (and most? FPGAs?)
(i.e. a 6.25% gain) and is not available with this at all
(so you also need to subtract 6.25% from any gain).
Maybe that is what he is referring to?

That is already "subtracted" from the results, and apparently both the MH/s and MH/J are still better for the rolled version. This is most likely due to the Spartan6's awful long distance routing fabric, which means that keeping things very close to each other pays off (which is one reason why 85 small, 64-clocks-per-hash cores together are faster than just three 2-clocks-per-hash cores, you can just clock them at much higher frequencies, and you can utilize more area on the chip).
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
Well, the device itself (GPU, FPGA) does 2 SHA-256 hashes of 64 rounds each.
However, there is a VERY simple optimisation that removes 8 rounds
(4 at the beginning of the 1st and 4 at the end of the 2nd), which is done by GPUs (and most? FPGAs?)
(i.e. a 6.25% gain) and is not available with this at all
(so you also need to subtract 6.25% from any gain).
Maybe that is what he is referring to?
rjk
sr. member
Activity: 448
Merit: 250
1ngldh
On the personal front, I suggest stopping any effort on this pipelined architecture.

wha chew talkin' bout, Willis?

It sounds to me that he is saying several Spartan 6 each doing a small part of the work in parallel would do a better job than the same number of Spartan 6 each doing their own thing.


I think what he means is that partially rolled hashers would be faster than fully unrolled hashers, on this architecture. It coincides with the info from http://bitfury.org/
hero member
Activity: 489
Merit: 500
Immersionist
On the personal front, I suggest stopping any effort on this pipelined architecture.

wha chew talkin' bout, Willis?

It sounds to me that he is saying several Spartan 6 each doing a small part of the work in parallel would do a better job than the same number of Spartan 6 each doing their own thing.
