Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 44. (Read 119440 times)

full member
Activity: 227
Merit: 100
  Number of DSP48A1s:                           30 out of     180   16%
Aha! Interesting. When uncle Moshe (Gavrielov) gives you DSPs, make DSPeade. Wink

Thank you for providing an important puzzle piece on how Dr. Tyrell does it.

The multiplier in the DSP48-block is not needed in SHA-256, hence what he obviously uses is the 18-bit adder
BCOUT = B + D.
He uses 30 DSP blocks, 10 per red / green / blue SHA-256 instance.
For a 32 bit adder, two 18-bit adders BCOUT=B+D are needed.
Thus, he can implement five 32-bit adders per SHA instance.
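
The adder arithmetic above can be sketched in a few lines of Python. This is only a bit-level model, not HDL. The function names (`add18`, `add32_from_two_add18`) and the way the carry is passed into the upper half are assumptions made for illustration; they do not describe how the actual bitstream wires its DSP48A1 blocks.

```python
# Bit-level model only; names and the carry wiring are illustrative assumptions.
DSP_USED = 30            # from the map report quoted above
SHA_INSTANCES = 3        # red / green / blue
DSP_PER_32BIT_ADDER = 2  # two 18-bit adders per 32-bit sum

dsp_per_instance = DSP_USED // SHA_INSTANCES                      # 10
adders32_per_instance = dsp_per_instance // DSP_PER_32BIT_ADDER   # 5


def add18(b: int, d: int) -> int:
    """Model of one 18-bit adder, BCOUT = B + D, truncated to 18 bits."""
    assert 0 <= b < (1 << 18) and 0 <= d < (1 << 18)
    return (b + d) & 0x3FFFF


def add32_from_two_add18(x: int, y: int) -> int:
    """32-bit modular addition built from exactly two add18 calls."""
    assert 0 <= x < (1 << 32) and 0 <= y < (1 << 32)
    lo = add18(x & 0xFFFF, y & 0xFFFF)   # bit 16 of lo is the carry out
    carry = lo >> 16
    # Shift the upper halves left by one bit and use the spare LSBs to
    # inject the carry: 1 + carry overflows into bit 1 only when carry == 1.
    hi = add18(((x >> 16) << 1) | 1, ((y >> 16) << 1) | carry)
    return (((hi >> 1) & 0xFFFF) << 16) | (lo & 0xFFFF)
```

Note that the upper half still has to wait for the carry from the lower half in this model, so it says nothing about timing; it only shows that two 18-bit additions are enough for one 32-bit sum.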

So, why not just use [slow] 32-bit ripple adders everywhere? Why use a few [very fast] DSP adders in some places?

The answer is, IMHO, that he uses the fast DSP adders only where they feed into longlines.
Were he to use normal ripple adders where he feeds into longlines, the aggregate delay would limit
the design to a 5 ns clock cycle.
Using the fast DSP adders will allow this design, when properly fine-tuned, to march into 4 ns clock cycle
territory, for a total of approximately 125 MH/s per SHA-256 instance, or approximately 375 MH/s per Spartan6-150.
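
The clock-period arithmetic behind these estimates can be checked with a short calculation. The rate of one hash per two clock cycles per SHA-256 instance is not stated in the post; it is inferred here from the quoted numbers (4 ns and roughly 125 MH/s), so treat it as an assumption.

```python
# Assumption inferred from the quoted figures: 4 ns clock -> ~125 MH/s,
# i.e. each SHA-256 instance finishes one hash every two clock cycles.
HASHES_PER_CLOCK = 0.5
INSTANCES_PER_CHIP = 3   # red / green / blue


def mhs_per_instance(period_ns: float) -> float:
    """Hash rate in MH/s of one SHA-256 instance for a given clock period."""
    clock_mhz = 1000.0 / period_ns
    return clock_mhz * HASHES_PER_CLOCK


def mhs_per_chip(period_ns: float) -> float:
    """Hash rate in MH/s of one chip holding three instances."""
    return mhs_per_instance(period_ns) * INSTANCES_PER_CHIP
```

Under that assumption, `mhs_per_chip(5.0)` gives 300 MH/s and `mhs_per_chip(4.0)` gives 375 MH/s, so moving from 5 ns to 4 ns is worth about 25%.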

BFL Single, watch out below.



I remember nghzang mentioned that running the chips at 200MHz was not recommended (the chips got too hot), and he gave
out a bitstream with a "Use at your own risk" warning. Three loops on the same chip suggest that a far greater number of
registers is being used. Since each stage's toggle rate approaches 50% (the idea behind digest functions is that their toggle rate
must approach 50% in each stage to be effective, and that is the case in SHA256), I wonder how hot the chips will get at high
frequencies, approaching 180MHz or 190MHz...


Good Luck,
sr. member
Activity: 448
Merit: 250
That is what I meant. It seems this guy has found a way to speed up the hashrate using DSPs so what is so hard to understand Turbor and gigavps ?

He doesn't use DSPs throughout (because there are not enough DSPs to go around), but only at the most critical spots, i.e. where the adders feed into longlines. That's the brilliance of it. That's the design idea I had completely missed before.

I was asking why couldn't BFL also do this "trick" and a valid question indeed. One bitstream or FPGA "trick" likely could be applied on a range of different FPGA hardware because the basic operating principles are the same for all FPGAs etc.

Agreed.

I'm no expert but I understand ( reasonably well ) how FPGA works and this DSP trick allows you to do 3 loops of SHA256 in the same chip ( cheap Spartan 6 ones ) that previously only allowed us to do 2 loops etc.

No, using this DSP trick has nothing to do with being able to squeeze three SHA-256 instances into a FPGA.
You can do that with the plain old steam-powered ripple carry adders.
Using DSPs in a few strategic places, however, ensures that the critical path (a deadly combination of two 32-bit adder stages and one longline path) stays well below 5 ns, when otherwise (with ripple carry adders) it can barely achieve 5 ns.

hero member
Activity: 518
Merit: 500
Quote from: Inspector 2211
BFL Single, watch out below.

What makes you think this cannot similarly be applied to the single ( even after a hardware modification ) Huh

1. Consensus on this forum is, that the BFL Single uses Altera FPGAs of an unknown type (Stratix?) and one would first want to determine
    the exact FPGA being used, before speculating whether DSP blocks could be used to a similarly beneficial effect.
    Without knowing the exact FPGA make/model, it's way too premature to state that DSP blocks could be used there -
    maybe that particular FPGA make/model does not even have DSP blocks.

or

2. Maybe they are already using this trick, maybe that's their secret sauce which allows them to reach 830 MH/s with but two
    FPGAs.


Just my 2 cents.

That is what I meant. It seems this guy has found a way to speed up the hashrate using DSPs so what is so hard to understand Turbor and gigavps ?

I was asking why couldn't BFL also do this "trick" and a valid question indeed. One bitstream or FPGA "trick" likely could be applied on a range of different FPGA hardware because the basic operating principles are the same for all FPGAs etc.

I'm no expert but I understand ( reasonably well ) how FPGA works and this DSP trick allows you to do 3 loops of SHA256 in the same chip ( cheap Spartan 6 ones ) that previously only allowed us to do 2 loops etc.
sr. member
Activity: 448
Merit: 250
Quote from: Inspector 2211
BFL Single, watch out below.

What makes you think this cannot similarly be applied to the single ( even after a hardware modification ) Huh

1. Consensus on this forum is, that the BFL Single uses Altera FPGAs of an unknown type (Stratix?) and one would first want to determine
    the exact FPGA being used, before speculating whether DSP blocks could be used to a similarly beneficial effect.
    Without knowing the exact FPGA make/model, it's way too premature to state that DSP blocks could be used there -
    maybe that particular FPGA make/model does not even have DSP blocks.

or

2. Maybe they are already using this trick, maybe that's their secret sauce which allows them to reach 830 MH/s with but two
    FPGAs.

Just my 2 cents.
legendary
Activity: 1022
Merit: 1000
BitMinter
What makes you think this cannot similarly be applied to the single ( even after a hardware modification ) Huh

  Grin
vip
Activity: 1358
Merit: 1000
AKA: gigavps
Quote from: Inspector 2211
BFL Single, watch out below.

What makes you think this cannot similarly be applied to the single ( even after a hardware modification ) Huh

Bulanula,

Slow down. Please read his post more carefully. He is suggesting that $$$/MH is in competition with the BFL Single, and his math is pretty close. I am getting 830 MH/s for $600, or about $0.72/MH, which is pretty darn close.
hero member
Activity: 518
Merit: 500
Quote from: Inspector 2211
BFL Single, watch out below.

What makes you think this cannot similarly be applied to the single ( even after a hardware modification ) Huh
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
BFL Single, watch out below.

Oh yeah! Grin

750MH/s on X6500, at $550 bulk that's <0.74$/MH or >1.36MH/$. Wow! This can blow away GPUs! Smiley And probably LargeCoin as well...
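
The price-per-hashrate figures quoted in this thread are simple divisions, and a small sketch makes them easy to compare. The helper name `usd_per_mhs` is made up for illustration, and the prices and hash rates are the ones posters quote in this thread, not independently verified numbers.

```python
def usd_per_mhs(price_usd: float, mhs: float) -> float:
    """Cost in US dollars per MH/s of hashing capacity."""
    return price_usd / mhs


# Figures as quoted by posters in this thread.
x6500_cost = usd_per_mhs(550, 750)        # X6500 at the quoted bulk price
bfl_single_cost = usd_per_mhs(600, 830)   # BFL Single as quoted by gigavps

print(f"X6500:      ${x6500_cost:.3f}/MH, {1 / x6500_cost:.2f} MH/$")
print(f"BFL Single: ${bfl_single_cost:.3f}/MH, {1 / bfl_single_cost:.2f} MH/$")
```

Both come out a little over 70 cents per MH/s, which is why the two products are described as being in direct competition.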
sr. member
Activity: 448
Merit: 250
 Number of DSP48A1s:                           30 out of     180   16%
Aha! Interesting. When uncle Moshe (Gavrielov) gives you DSPs, make DSPeade. Wink

Thank you for providing an important puzzle piece on how Dr. Tyrell does it.

The multiplier in the DSP48-block is not needed in SHA-256, hence what he obviously uses is the 18-bit adder
BCOUT = B + D.
He uses 30 DSP blocks, 10 per red / green / blue SHA-256 instance.
For a 32 bit adder, two 18-bit adders BCOUT=B+D are needed.
Thus, he can implement five 32-bit adders per SHA instance.

So, why not just use [slow] 32-bit ripple adders everywhere? Why use a few [very fast] DSP adders in some places?

The answer is, IMHO, that he uses the fast DSP adders only where they feed into longlines.
Were he to use normal ripple adders where he feeds into longlines, the aggregate delay would limit
the design to a 5 ns clock cycle.
Using the fast DSP adders will allow this design, when properly fine-tuned, to march into 4 ns clock cycle
territory, for a total of approximately 125 MH/s per SHA-256 instance, or approximately 375 MH/s per Spartan6-150.

BFL Single, watch out below.
hero member
Activity: 784
Merit: 500
Is there a way to port it to Ztex or other FPGA boards?
legendary
Activity: 2128
Merit: 1073
 Number of DSP48A1s:                           30 out of     180   16%
Aha! Interesting. When uncle Moshe (Gavrielov) gives you DSPs, make DSPeade. Wink
hero member
Activity: 1596
Merit: 502
If you put this on Kickstarter or sell it or whatever, how much do you want for it?
Is it around $500, or more like $2500, or even $50,000?
How many hours did you spend roughly?
sr. member
Activity: 402
Merit: 250
Really cool work. From what I understand, this already offers around 30% more per cycle? That's simply awesome.
If I were a miner with any significant scale and investment in FPGAs, I would definitely throw some BTC in your direction, especially if that meant I got unlimited access to the bitstream Smiley

legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
I did put that in sarcasm brackets for a reason Smiley

Simply coz you give the impression that it's a "get paid lots or no one will be allowed to ever see it."
If it's a "I wrote and did it all from scratch without any help from looking at anything anyone else has ever done" then I guess that MAY be justified ...

If you haven't looked at sha256() optimisations then you are somewhere in the ball-park of 5% slower than it could be.

The 2 simplest and most effective optimisations are:
(ignoring the midstate as being the real first sha256())
The first 3 of 64 stages in the 1st of the double sha256() are only needed to be done once per 2^32 hashes (per full nonce range)
The last 3.5 stages of the 2nd of the double sha256() are not required since you already know the answer at that point.
There are quite a few other optimisations of W calculations that are constant over a full nonce range
Then there are the partial calculations of some of the W that are constant over a full nonce range
Quite a few parts of the early stages of the 2nd double sha256() are reduced to fixed constants also.

Edit: some of that may not be FPGA related but some of it certainly also is.
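
The first optimisation in the list can be illustrated with a small pure-Python SHA-256. This is a sketch for checking the idea, not mining code: the header bytes are made up, the helper names (`schedule`, `rounds`, `first_sha256`) are hypothetical, and only the first of the two SHA-256 passes is shown. It relies on the nonce being the last 4 bytes of the 80-byte header, so it lands in word W[3] of the second 64-byte block, while rounds 0 to 2 read only W[0] to W[2].

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def _first_primes(n):
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


def _icbrt(x):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


# SHA-256 constants: fractional bits of square / cube roots of primes.
PRIMES = _first_primes(64)
IV = tuple(math.isqrt(p << 64) & MASK for p in PRIMES[:8])
K = [_icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def schedule(block):
    """Expand a 64-byte block into the 64-word message schedule W."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w


def rounds(state, w, start, stop):
    """Run compression rounds start..stop-1 on the working variables."""
    a, b, c, d, e, f, g, h = state
    for t in range(start, stop):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)


def feed_forward(start_state, end_state):
    return tuple((s + e) & MASK for s, e in zip(start_state, end_state))


# Hypothetical header: 76 fixed bytes, followed by a 4-byte nonce.
HEADER_PREFIX = bytes(range(76))
PADDING = b"\x80" + b"\x00" * 39 + struct.pack(">Q", 80 * 8)

# Once per work unit: the midstate after the first 64-byte block.
MIDSTATE = feed_forward(IV, rounds(IV, schedule(HEADER_PREFIX[:64]), 0, 64))

# Once per nonce range: rounds 0..2 of block 2 never see the nonce.
_w_fixed = schedule(HEADER_PREFIX[64:] + b"\x00" * 4 + PADDING)
AFTER_3_ROUNDS = rounds(MIDSTATE, _w_fixed, 0, 3)


def first_sha256(nonce):
    """First SHA-256 pass for one nonce, resuming at round 3 of block 2."""
    w = schedule(HEADER_PREFIX[64:] + struct.pack("<I", nonce) + PADDING)
    state = feed_forward(MIDSTATE, rounds(AFTER_3_ROUNDS, w, 3, 64))
    return struct.pack(">8I", *state)
```

For any nonce, `first_sha256(nonce)` should match `hashlib.sha256(HEADER_PREFIX + struct.pack("<I", nonce)).digest()`, even though rounds 0 to 2 were computed only once. The early exit in the second pass and the constant parts of the W schedule mentioned in the list are not modelled here.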
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
I think I have created some confusion, and have inadvertently offended you (and others).  Please accept my apologies.

I didn't feel offended, and I still don't. But I have the impression that the bitcoin community in general is very generous as far as donations are concerned Smiley
It isn't so much the number of people, but rather the amounts of money some people have to spare...

Everything I wrote about "miners" was meant to refer only to the part of the code that runs on the CPU: fetching work from the pool and submitting shares.  I did not mean to imply that writing the OpenCL code that runs on the GPU itself is easy or trivial!  I know that is quite difficult, and no, I have never tried to write GPU hashing code.

Please understand that my response was in the context of what I interpreted (perhaps incorrectly) to be an accusation that any attempt to raise funds for my efforts would somehow be cheating the authors of cgminer/mpbm/etc.  The point I was trying to make is that (1) I am not using any of this software; I wrote my own and (2) if somebody does modify cgminer to act as a front end to my bitstream they won't be using the part of cgminer that was hard to write -- they'll only be using the CPU part.

You apparently have no idea what kind of effort that is, as much as others have no idea how hard it is to optimize an FPGA design.
Writing good miner software isn't trivial either (MPBM is approaching 10000 lines of code, and there's no OpenCL involved at all).


To get back to my original question: Do you think that it might be possible to community fund your effort? I wouldn't put too much hope on the FPGA board vendors here (at the current production volumes those are also people who'll never earn any adequate profits for the time that they've spent designing, testing, fixing and organizing things).
So if we do some fundraising to pay you semi-adequately, would you agree to completely open source this project?
And we might need a ballpark number of what you would consider an adequate reward...
hero member
Activity: 714
Merit: 500
Psi laju, karavani prolaze.

    • Assuming the bitcoin FPGA community (and possibly some board vendors) would want you to optimize this design until you're hitting real roadblocks (300MH/s maybe?), and release everything that's necessary to regenerate and further improve it under an open source license, roughly how much money would we need?


    Has this been overlooked?
    donator
    Activity: 980
    Merit: 1004
    felonious vagrancy, personified
    Edit: So you wrote the fully optimised CL code yourself also without taking that from someone else?
    And you worked out the 61 + 61 sha256 optimisation yourself also?
    (and all the other optimisations in there) for the stream you've done here?

    I think I have created some confusion, and have inadvertently offended you (and others).  Please accept my apologies.

    Everything I wrote about "miners" was meant to refer only to the part of the code that runs on the CPU: fetching work from the pool and submitting shares.  I did not mean to imply that writing the OpenCL code that runs on the GPU itself is easy or trivial!  I know that is quite difficult, and no, I have never tried to write GPU hashing code.

    Please understand that my response was in the context of what I interpreted (perhaps incorrectly) to be an accusation that any attempt to raise funds for my efforts would somehow be cheating the authors of cgminer/mpbm/etc.  The point I was trying to make is that (1) I am not using any of this software; I wrote my own and (2) if somebody does modify cgminer to act as a front end to my bitstream they won't be using the part of cgminer that was hard to write -- they'll only be using the CPU part.
    legendary
    Activity: 4592
    Merit: 1851
    Linux since 1997 RedHat 4
    [sarcasm]just make sure you don't use free miners like cgminer where many many hundreds of hours have been spent without the requirement of payment[/sarcasm]

    Duh.

    I wrote my own miner from scratch; it has longpoll and multipool support.  Just ask Luke-Jr, who has graciously suffered through the pool side of the debugging process Smiley

    I can tell you from first-hand experience that writing a miner requires about 1% of the effort I put into the HDL design.  That's not an exaggeration; I kept a (very coarse) log of how I spent my time and it really does work out to about 100:1.  I suspect ztex has had a similar experience.

    I don't mean any disrespect to the authors of cgminer/mpbm/etc.  They've done a great thing for the bitcoin mining community.  But these things aren't even in the same league in terms of time commitment.
    Yeah if you write a total piece of shit miner Tongue

    Edit: So you wrote the fully optimised CL code yourself also without taking that from someone else?
    And you worked out the 61 + 61 sha256 optimisation yourself also?
    (and all the other optimisations in there) for the stream you've done here?
    hero member
    Activity: 714
    Merit: 504
    ^SEM img of Si wafer edge, scanned 2012-3-12.
    *notices the topic title*
    Grats on your recent 10MH/s advancement Smiley
    donator
    Activity: 980
    Merit: 1004
    felonious vagrancy, personified
    Interesting that you think your design could be easily forward-ported to the new Xilinx 28nm FPGAs.

    Well, feature size isn't something you can detect using Verilog code...

    This surprises me a little bit, because I always thought your design was so highly Spartan-6 LX150 optimized/specific. How deep did you already look into the Artix architecture?

    Xilinx UG474 says that the 7-series slices (both M+L) are identical to the Virtex-6 slice, which is a strict superset of the Spartan-6 slice.  I verified this by looking at the diagram.  Then I opened up each of the Artix devices in fpga_editor to look at the geometry.  That's about the extent of my investigation.   Mostly stuff just switches faster, uses less power, more SLICEL's, and you get more routing -- but the routing is basically undocumented anyways.

    I have to say I am baffled by the bizarre shape of the Artix fabric.  One of their devices looks like a rectangle with a chunk hacked out of the right hand side and shoved over.  WTF?

    I do need the device to be at least 128 slices wide to get a "zero effort" port.  So, Artix200 or higher.  There's a huge hole in the middle of the Artix200, but (unlike the holes in the Spartan6) you get wires that run "over the top of" whatever circuitry is in the hole.  And there are still more than 128 columns even after leaving out the hole.

    If there is enough demand for Artix100 I may be able to re-arrange things to fit the narrower device -- we'll see.  I'm hoping the Artix200 comes out very quickly after the 100; if so it should attract the bitcoin miners (unless something crazy happens it should be cheaper $/LUT than the 100).

    Artix, but it doesn't look like the first chips will be available in less than 6-8 months :-(

    Yeah, I hear Xilinx's availability estimates are pretty much worthless.