Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 44.

BFL-Engineer

full member

Activity: 227

Merit: 100

Quote from: Inspector 2211 on March 10, 2012, 11:19:36 AM

Quote from: 2112 on March 10, 2012, 03:09:18 AM

Quote from: eldentyrell on March 08, 2012, 08:12:50 PM

Number of DSP48A1s: 30 out of 180 16%

Aha! Interesting. When uncle Moshe (Gavrielov) gives you DSPs, make DSPeade. ;)

Thank you for providing an important puzzle piece on how Dr. Tyrell does it.

The multiplier in the DSP48-block is not needed in SHA-256, hence what he obviously uses is the 18-bit adder
BCOUT = B + D.
He uses 30 DSP blocks, 10 per red / green / blue SHA-256 instance.
For a 32 bit adder, two 18-bit adders BCOUT=B+D are needed.
Thus, he can implement five 32-bit adders per SHA instance.

So, why not just use [slow] 32-bit ripple adders everywhere? Why use a few [very fast] DSP adders in some places?

The answer is, IMHO, that he uses the fast DSP adders only where they feed into longlines.
Were he to use normal ripple adders where he feeds into longlines, the aggregate delay would limit
the design to a 5 ns clock cycle.
Using the fast DSP adders will allow this design, when properly fine-tuned, to march into 4 ns clock cycle
territory, for approximately 125 MH/s per SHA-256 instance, or approximately 375 MH/s per Spartan6-150.

BFL Single, watch out below.

I remember ngzhang mentioned that going to 200MHz on these chips was not recommended (the chips got too hot), and he gave
out a bitstream with a "Use at your own risk" warning. Three loops on the same chip suggest that a far greater number of
registers is being used. Since the toggle rate of each stage approaches 50% (the idea behind digest functions is that their toggle rate
must approach 50% in each stage to be effective, and that is the case in SHA256), I wonder how hot the chips will get at high
frequencies approaching 180MHz or 190MHz...

Good Luck,
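The roughly 50% figure above can be illustrated in software, with a caveat: the sketch below measures the avalanche of the final double-SHA256 output (how many output bits flip when one input bit flips), which is only a proxy for the per-stage register toggle rate inside an FPGA pipeline. The all-zero 80-byte message and the helper names are assumptions made for illustration.

```python
import hashlib


def sha256d(data):
    """Double SHA-256, as used for bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def bit_diff(a, b):
    """Number of bit positions in which two equal-length byte strings differ."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def mean_flip_fraction(message):
    """Average fraction of the 256 output bits that change per single-bit input flip."""
    base = sha256d(message)
    nbits = len(message) * 8
    total = 0
    for i in range(nbits):
        flipped = bytearray(message)
        flipped[i // 8] ^= 1 << (i % 8)   # flip exactly one input bit
        total += bit_diff(base, sha256d(bytes(flipped)))
    return total / (nbits * 256)


# Hypothetical 80-byte input standing in for a block header.
fraction = mean_flip_fraction(bytes(80))
print(f"mean fraction of output bits flipped: {fraction:.3f}")
```

A value close to 0.5 is what one would expect from a good digest function; it says nothing directly about power draw, which also depends on clock frequency and routing.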

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: bulanula on March 10, 2012, 12:25:51 PM

That is what I meant. It seems this guy has found a way to speed up the hashrate using DSPs, so what is so hard to understand, Turbor and gigavps?

He doesn't use DSPs throughout (because there are not enough DSPs to go around), but only at the most critical spots, i.e. where the adders feed into longlines. That's the brilliance of it. That's the design idea I had completely missed before.

Quote from: bulanula on March 10, 2012, 12:25:51 PM

I was asking why BFL couldn't also do this "trick", which is a valid question indeed. One bitstream or FPGA "trick" could likely be applied to a range of different FPGA hardware, because the basic operating principles are the same for all FPGAs.

Agreed.

Quote from: bulanula on March 10, 2012, 12:25:51 PM

I'm no expert, but I understand (reasonably well) how FPGAs work, and this DSP trick allows you to do 3 loops of SHA256 in the same chip (cheap Spartan 6 ones) that previously only allowed us to do 2 loops.

No, using this DSP trick has nothing to do with being able to squeeze three SHA-256 instances into an FPGA.
You can do that with the plain old steam-powered ripple carry adders.
Using DSPs in a few strategic places, however, ensures that the critical path (a deadly combination of two 32-bit adder stages and one longline path) stays well below 5 ns, when otherwise (with ripple carry adders) it can barely achieve 5 ns.

bulanula

hero member

Activity: 518

Merit: 500

Quote from: Inspector 2211 on March 10, 2012, 12:23:58 PM

Quote from: bulanula on March 10, 2012, 11:58:50 AM

Quote from: Inspector 2211

BFL Single, watch out below.

What makes you think this cannot similarly be applied to the Single (even after a hardware modification)?

1. Consensus on this forum is, that the BFL Single uses Altera FPGAs of an unknown type (Stratix?) and one would first want to determine
   the exact FPGA being used, before speculating whether DSP blocks could be used to a similarly beneficial effect.
   Without knowing the exact FPGA make/model, it's way too premature to state that DSP blocks could be used there -
   maybe that particular FPGA make/model does not even have DSP blocks.

or

2. Maybe they are already using this trick, maybe that's their secret sauce which allows them to reach 830 MH/s with but two
   FPGAs.

Just my 2 cents.

That is what I meant. It seems this guy has found a way to speed up the hashrate using DSPs, so what is so hard to understand, Turbor and gigavps?

I was asking why BFL couldn't also do this "trick", which is a valid question indeed. One bitstream or FPGA "trick" could likely be applied to a range of different FPGA hardware, because the basic operating principles are the same for all FPGAs.

I'm no expert, but I understand (reasonably well) how FPGAs work, and this DSP trick allows you to do 3 loops of SHA256 in the same chip (cheap Spartan 6 ones) that previously only allowed us to do 2 loops.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: bulanula on March 10, 2012, 11:58:50 AM

Quote from: Inspector 2211

BFL Single, watch out below.

What makes you think this cannot similarly be applied to the Single (even after a hardware modification)?

1. Consensus on this forum is, that the BFL Single uses Altera FPGAs of an unknown type (Stratix?) and one would first want to determine
the exact FPGA being used, before speculating whether DSP blocks could be used to a similarly beneficial effect.
Without knowing the exact FPGA make/model, it's way too premature to state that DSP blocks could be used there -
maybe that particular FPGA make/model does not even have DSP blocks.

or

2. Maybe they are already using this trick, maybe that's their secret sauce which allows them to reach 830 MH/s with but two
FPGAs.

Just my 2 cents.

Turbor

legendary

Activity: 1022

Merit: 1000

BitMinter

Quote from: bulanula on March 10, 2012, 11:58:50 AM

What makes you think this cannot similarly be applied to the Single (even after a hardware modification)?

jamesg

vip

Activity: 1358

Merit: 1000

AKA: gigavps

Quote from: bulanula on March 10, 2012, 11:58:50 AM

Quote from: Inspector 2211

BFL Single, watch out below.

What makes you think this cannot similarly be applied to the Single (even after a hardware modification)?

Bulanula,

Slow down. Please read his post more carefully. He is suggesting that the $/MH is in competition with the BFL Single, and his math is pretty close. I am getting 830 MH/s for $600, or $0.72/MH, which is pretty darn close.

bulanula

hero member

Activity: 518

Merit: 500

Quote from: Inspector 2211

BFL Single, watch out below.

What makes you think this cannot similarly be applied to the Single (even after a hardware modification)?

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: Inspector 2211 on March 10, 2012, 11:19:36 AM

BFL Single, watch out below.

Oh yeah!

750MH/s on X6500, at $550 bulk that's <0.74$/MH or >1.36MH/$. Wow! This can blow away GPUs!

And probably LargeCoin as well...
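The price/performance figures traded in this thread are simple ratios, and a quick sketch makes them easy to re-check. The prices and hash rates below are the ones quoted by posters here (830 MH/s for $600 on the BFL Single, 750 MH/s for $550 on the X6500 in bulk); the dictionary layout and function names are just illustrative.

```python
def usd_per_mh(price_usd, mh_per_s):
    """Hardware cost in dollars per MH/s of hash rate."""
    return price_usd / mh_per_s


def mh_per_usd(price_usd, mh_per_s):
    """Hash rate in MH/s bought per dollar."""
    return mh_per_s / price_usd


# (price in USD, hash rate in MH/s), as quoted in the thread.
quotes = {
    "BFL Single": (600, 830),
    "X6500 (bulk)": (550, 750),
}

for name, (price, rate) in quotes.items():
    print(f"{name}: {usd_per_mh(price, rate):.3f} $/MH, "
          f"{mh_per_usd(price, rate):.2f} MH/$")
```

With these inputs both boards land at roughly 0.72 to 0.73 $/MH, which is why the two are described as being in direct competition.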

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: 2112 on March 10, 2012, 03:09:18 AM

Quote from: eldentyrell on March 08, 2012, 08:12:50 PM

Number of DSP48A1s: 30 out of 180 16%

Aha! Interesting. When uncle Moshe (Gavrielov) gives you DSPs, make DSPeade. ;)

Thank you for providing an important puzzle piece on how Dr. Tyrell does it.

The multiplier in the DSP48-block is not needed in SHA-256, hence what he obviously uses is the 18-bit adder
BCOUT = B + D.
He uses 30 DSP blocks, 10 per red / green / blue SHA-256 instance.
For a 32 bit adder, two 18-bit adders BCOUT=B+D are needed.
Thus, he can implement five 32-bit adders per SHA instance.

So, why not just use [slow] 32-bit ripple adders everywhere? Why use a few [very fast] DSP adders in some places?

The answer is, IMHO, that he uses the fast DSP adders only where they feed into longlines.
Were he to use normal ripple adders where he feeds into longlines, the aggregate delay would limit
the design to a 5 ns clock cycle.
Using the fast DSP adders will allow this design, when properly fine-tuned, to march into 4 ns clock cycle
territory, for approximately 125 MH/s per SHA-256 instance, or approximately 375 MH/s per Spartan6-150.

BFL Single, watch out below.
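The adder bookkeeping in the post above can be sketched in a few lines. This is a behavioural model only, not the actual design: the 16/16 split of the 32-bit operands and the carry-in trick (padding bit 0 so that the low half's carry ripples into the high half) are assumptions chosen to show that two 18-bit adds are enough for one 32-bit modular add.

```python
MASK18 = (1 << 18) - 1   # width of the adder described in the post


def add18(b, d):
    """Model of one 18-bit adder, BCOUT = B + D, wrapping at 18 bits."""
    return (b + d) & MASK18


def add32(x, y):
    """32-bit modular add built from two 18-bit adds (hypothetical 16/16 split)."""
    lo = add18(x & 0xFFFF, y & 0xFFFF)   # bit 16 of lo is the carry out
    carry = lo >> 16
    # Assumed carry-in trick: shift the high halves left by one and place
    # 1 and the carry in bit 0, so that 1 + carry ripples into bit 1.
    hi = add18(((x >> 16) << 1) | 1, ((y >> 16) << 1) | carry)
    return (((hi >> 1) & 0xFFFF) << 16) | (lo & 0xFFFF)


def adders_per_instance(dsp_blocks, instances, dsps_per_adder=2):
    """How many 32-bit adders each SHA-256 instance gets from the DSP budget."""
    return (dsp_blocks // instances) // dsps_per_adder


def clock_mhz(period_ns):
    """Clock frequency in MHz for a given clock period in nanoseconds."""
    return 1000.0 / period_ns


print(adders_per_instance(30, 3))      # 30 DSPs, 3 instances, 2 DSPs per adder
print(clock_mhz(5), clock_mhz(4))      # 5 ns and 4 ns clock periods
print(hex(add32(0xFFFFFFFF, 1)))       # wraps around modulo 2**32
```

A 4 ns period is 250 MHz, so if the 125 MH/s figure is read as per instance, it corresponds to one hash every two clock cycles. That ratio is inferred from the numbers in the post, not stated in it.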

BR0KK

hero member

Activity: 784

Merit: 500

Is there a way to port it to Ztex or other FPGA boards?

2112

legendary

Activity: 2128

Merit: 1074

Quote from: eldentyrell on March 08, 2012, 08:12:50 PM

Number of DSP48A1s: 30 out of 180 16%

Aha! Interesting. When uncle Moshe (Gavrielov) gives you DSPs, make DSPeade. ;)

pieppiep

hero member

Activity: 1596

Merit: 502

If you put this on Kickstarter or sell it or whatever, how much do you want for it?
Is it around $500, or more like $2500, or even $50,000?
How many hours did you spend roughly?

PulsedMedia

sr. member

Activity: 402

Merit: 250

Really cool work. From what I understand, this already offers around 30% more per cycle? That's simply awesome.
If I were a miner with FPGA investments at any significant scale, I would definitely throw some BTC in your direction, especially if that meant I got unlimited access to the bitstream.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

I did put that in sarcasm brackets for a reason

Simply coz you give the impression that it's a "get paid lots or no one will be allowed to ever see it."
If it's a "I wrote and did it all from scratch without any help from looking at anything anyone else has ever done" then I guess that MAY be justified ...

If you haven't looked at sha256() optimisations then you are somewhere in the ball-park of 5% slower than it could be.

The 2 simplest and most effective optimisations are:
(ignoring the midstate as being the real first sha256())
The first 3 of 64 stages in the 1st of the double sha256() are only needed to be done once per 2^32 hashes (per full nonce range)
The last 3.5 stages of the 2nd of the double sha256() are not required since you already know the answer at that point.
There are quite a few other optimisations of W calculations that are constant over a full nonce range
Then there are the partial calculations of some of the W that are constant over a full nonce range
Quite a few parts of the early stages of the 2nd of the double sha256() are also reduced to fixed constants.

Edit: some of that may not be FPGA related but some of it certainly also is.
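The first optimisation in the list above (the first 3 stages after the midstate do not depend on the nonce) can be checked with a small software model. This is only a sketch: the 76 fixed header bytes are made up, the helper names are hypothetical, and the round constants are derived from prime roots as the SHA-256 standard defines them instead of being typed out. The nonce sits in the fourth 32-bit word of the second 64-byte block, so rounds 0 to 2 only consume words that stay fixed over a nonce range.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << 40
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# SHA-256 constants: the first 32 fractional bits of the square roots (H0)
# and cube roots (K) of the first primes.
PRIMES = first_primes(64)
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def run_rounds(state, block, rounds):
    """Return working registers a..h after `rounds` rounds on a 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(rounds):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t])
        t2 = (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)


def compress(state, block):
    """Full 64-round compression: add the final registers to the input state."""
    regs = run_rounds(state, block, 64)
    return [(s + r) & MASK for s, r in zip(state, regs)]


def make_header(nonce):
    # Hypothetical 80-byte header: 76 fixed bytes, then the 4-byte nonce.
    return bytes(range(76)) + struct.pack("<I", nonce)


def second_block(header):
    # Bytes 64..79 of the header plus standard SHA-256 padding for 640 bits.
    return header[64:] + b"\x80" + b"\x00" * 39 + struct.pack(">Q", 640)


midstate = compress(H0, make_header(0)[:64])   # first block never sees the nonce

# Sanity check of the model against hashlib for one header.
header = make_header(12345)
digest = struct.pack(">8I", *compress(midstate, second_block(header)))
print("matches hashlib:", digest == hashlib.sha256(header).digest())

# Compare the working registers for two different nonces.
r3_a = run_rounds(midstate, second_block(make_header(0)), 3)
r3_b = run_rounds(midstate, second_block(make_header(12345)), 3)
r4_a = run_rounds(midstate, second_block(make_header(0)), 4)
r4_b = run_rounds(midstate, second_block(make_header(12345)), 4)
print("same after 3 rounds:", r3_a == r3_b)
print("same after 4 rounds:", r4_a == r4_b)
```

The registers agree after three rounds and diverge in the fourth, where the nonce word first enters. The other items in the list (skipping the tail of the second hash, precomputing parts of W) are not modelled here.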

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: eldentyrell on March 09, 2012, 06:51:11 PM

I think I have created some confusion, and have inadvertently offended you (and others). Please accept my apologies.

I didn't feel offended, and I still don't. But I have the impression that the bitcoin community in general is very generous as far as donations are concerned.

It isn't so much the number of people, but rather the amounts of money some people have to spare...

Quote from: eldentyrell on March 09, 2012, 06:51:11 PM

Everything I wrote about "miners" was meant to refer only to the part of the code that runs on the CPU: fetching work from the pool and submitting shares. I did not mean to imply that writing the OpenCL code that runs on the GPU itself is easy or trivial! I know that is quite difficult, and no, I have never tried to write GPU hashing code.

Please understand that my response was in the context of what I interpreted (perhaps incorrectly) to be an accusation that any attempt to raise funds for my efforts would somehow be cheating the authors of cgminer/mpbm/etc. The point I was trying to make is that (1) I am not using any of this software; I wrote my own and (2) if somebody does modify cgminer to act as a front end to my bitstream they won't be using the part of cgminer that was hard to write -- they'll only be using the CPU part.

You apparently have no idea what kind of effort that is, as much as others have no idea how hard it is to optimize an FPGA design.
Writing good miner software isn't trivial either (MPBM is approaching 10000 lines of code, and there's no OpenCL involved at all).

To get back to my original question: Do you think that it might be possible to community fund your effort? I wouldn't put too much hope on the FPGA board vendors here (at the current production volumes those are also people who'll never earn any adequate profits for the time that they've spent designing, testing, fixing and organizing things).
So if we do some fundraising to pay you semi-adequately, would you agree to completely open source this project?
And we might need a ballpark number of what you would consider an adequate reward...

kakobrekla

hero member

Activity: 714

Merit: 500

The dogs bark, the caravans pass.

Quote from: TheSeven on March 09, 2012, 05:30:06 PM

Assuming the bitcoin FPGA community (and possibly some board vendors) would want you to optimize this design until you're hitting real roadblocks (300MH/s maybe?), and release everything that's necessary to regenerate and further improve it under an open source license, roughly how much money would we need?

Has this been overlooked?

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: kano on March 09, 2012, 06:11:24 PM

Edit: So you wrote the fully optimised CL code yourself also without taking that from someone else?
And you worked out the 61 + 61 sha256 optimisation yourself also?
(and all the other optimisations in there) for the stream you've done here?

I think I have created some confusion, and have inadvertently offended you (and others). Please accept my apologies.

Everything I wrote about "miners" was meant to refer only to the part of the code that runs on the CPU: fetching work from the pool and submitting shares. I did not mean to imply that writing the OpenCL code that runs on the GPU itself is easy or trivial! I know that is quite difficult, and no, I have never tried to write GPU hashing code.

Please understand that my response was in the context of what I interpreted (perhaps incorrectly) to be an accusation that any attempt to raise funds for my efforts would somehow be cheating the authors of cgminer/mpbm/etc. The point I was trying to make is that (1) I am not using any of this software; I wrote my own and (2) if somebody does modify cgminer to act as a front end to my bitstream they won't be using the part of cgminer that was hard to write -- they'll only be using the CPU part.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: eldentyrell on March 09, 2012, 05:10:06 PM

Quote from: kano on March 09, 2012, 12:37:23 AM

[sarcasm]just make sure you don't use free miners like cgminer where many many hundreds of hours have been spent without the requirement of payment[/sarcasm]

Duh.

I wrote my own miner from scratch; it has longpoll and multipool support. Just ask Luke-Jr, who has graciously suffered through the pool side of the debugging process

I can tell you from first-hand experience that writing a miner requires about 1% of the effort I put into the HDL design. That's not an exaggeration; I kept a (very coarse) log of how I spent my time and it really does work out to about 100:1. I suspect ztex has had a similar experience.

I don't mean any disrespect to the authors of cgminer/mpbm/etc. They've done a great thing for the bitcoin mining community. But these things aren't even in the same league in terms of time commitment.

Yeah if you write a total piece of shit miner :P

Edit: So you wrote the fully optimised CL code yourself also without taking that from someone else?
And you worked out the 61 + 61 sha256 optimisation yourself also?
(and all the other optimisations in there) for the stream you've done here?

BTCurious

hero member

Activity: 714

Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.

*notices the topic title*
Grats on your recent 10MH/s advancement

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: BTC-engineer on March 09, 2012, 05:24:49 PM

Interesting that you think your design could be easily forward-ported to the new Xilinx 28nm FPGAs.

Well, feature size isn't something you can detect using Verilog code...

Quote from: BTC-engineer on March 09, 2012, 05:24:49 PM

This surprises me a little bit, because I always thought your design was so highly Spartan 6 LX150 optimized/specific. How deeply did you already look into the Artix architecture?

Xilinx UG474 says that the 7-series slices (both M+L) are identical to the Virtex-6 slice, which is a strict superset of the Spartan-6 slice. I verified this by looking at the diagram. Then I opened up each of the Artix devices in fpga_editor to look at the geometry. That's about the extent of my investigation. Mostly stuff just switches faster, uses less power, more SLICEL's, and you get more routing -- but the routing is basically undocumented anyways.

I have to say I am baffled by the bizarre shape of the Artix fabric. One of their devices looks like a rectangle with a chunk hacked out of the right hand side and shoved over. WTF?

I do need the device to be at least 128 slices wide to get a "zero effort" port. So, Artix200 or higher. There's a huge hole in the middle of the Artix200, but (unlike the holes in the Spartan6) you get wires that run "over the top of" whatever circuitry is in the hole. And there are still more than 128 columns even after leaving out the hole.

If there is enough demand for Artix100 I may be able to re-arrange things to fit the narrower device -- we'll see. I'm hoping the Artix200 comes out very quickly after the 100; if so it should attract the bitcoin miners (unless something crazy happens it should be cheaper $/LUT than the 100).

Quote from: BTC-engineer on March 09, 2012, 05:24:49 PM

Artix, but it doesn't look like the first chips will be available in less than 6-8 months :-(

Yeah, I hear Xilinx's availability estimates are pretty much worthless.

Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 44. (Read 119468 times)