1GH/s, 20w, $700 (was $500) — Butterflylabs, is it for real? (Part 2) - page 29.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

lulz - OK I need to repeat the question ...

But firstly, I know the sha256 code very well - otherwise I wouldn't have mentioned P() ... here's a fully unrolled C sha256, optimised as far as is needed for gcc's -O2, that I generated myself quite a while back ... and yes it works.
It is generated and optimised by code I wrote to produce that entire file.
You cannot actually optimise it any better in C and gain anything but an extremely minor performance increase when you use -O2 with gcc on this.

http://pastebin.com/sxdVSJF1

Also, as I said, 122, not 128 (or 178), coz: the 1st 64 is constant over a nonce range (commonly called the midstate); you don't need to do the first 3 of the 2nd 64 inside the loop (they can go in with the midstate too); and with bitcoin you never need the last 3 of the 3rd 64. That is 3 x 64 = 192 rounds in total, minus 64, minus 3, minus 3 = 122.
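To make the midstate point concrete, here is a minimal Python sketch (it is not the generated C file linked above). It assumes the standard 80-byte block header with a 4-byte little-endian nonce at the end; the all-zero header and the names `scan_nonces` and `double_sha256` are made up for illustration. It leans on hashlib's `copy()` to reuse the state after the first 64 bytes, so it only shows the "1st 64 is constant" saving, not the 3 + 3 round trimming.

```python
import hashlib
import struct

# Round count from the post: three 64-round compressions, minus the constant
# first one, minus 3 precomputable rounds, minus 3 rounds never needed.
ROUNDS_IN_LOOP = 3 * 64 - 64 - 3 - 3


def double_sha256(data: bytes) -> bytes:
    """Plain double SHA-256 with no reuse, as a reference."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def scan_nonces(header76: bytes, nonces):
    """Yield (nonce, hash) pairs, hashing the first 64 bytes only once."""
    assert len(header76) == 76
    midstate = hashlib.sha256(header76[:64])  # constant over the nonce range
    tail = header76[64:]                      # the 12 bytes before the nonce
    for nonce in nonces:
        h = midstate.copy()                   # resume from the saved state
        h.update(tail + struct.pack("<I", nonce))
        yield nonce, hashlib.sha256(h.digest()).digest()


# Hypothetical all-zero header prefix, for illustration only.
header76 = bytes(76)
results = list(scan_nonces(header76, range(4)))
```

Each result should match a from-scratch `double_sha256(header76 + nonce_bytes)`; the only difference is how much work is repeated per nonce.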

My question is how do FPGAs work internally (if that youtube video really does answer this, oh well, but I've yet to see anything useful on youtube in my life that couldn't be replaced by a TINY web page of text so I ignore youtube links)

As I asked before, do they execute in a manner like dominoes where the clock process advances data through the FPGA in steps?

My googling on the subject suggests this is correct - but I was curious if anyone here knew that much about the internal workings of FPGA and could confirm or otherwise explain that.

Also, that would mean that inside the FPGA there would be something like the discrete steps to hash a nonce range, and thus I was wondering if they really do hash multiple nonces at the same time, each one 1 or 2 discrete steps behind the previous nonce.
Thus if the clock steps data through at X cycles per second, and the process is 1300 steps (a random number of my choosing), you aren't waiting 1300 cycles for each nonce calculation; you are actually waiting (assuming each nonce is 2 steps apart) 2 cycles for each nonce result, with a startup time of 1300 cycles before the first nonce result comes out.
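The arithmetic in that last paragraph can be checked with a few lines of Python. This is only a sketch of the pipelining model described above, using the made-up 1300-step depth and the assumed 2-step spacing; the function names are hypothetical and nothing here reflects a real FPGA design.

```python
def cycles_unpipelined(nonces: int, depth: int) -> int:
    """One nonce at a time: each waits for the full depth of the circuit."""
    return nonces * depth


def cycles_pipelined(nonces: int, depth: int, interval: int) -> int:
    """A new nonce enters every `interval` cycles; the first result
    appears after `depth` cycles, then one more every `interval` cycles."""
    if nonces == 0:
        return 0
    return depth + (nonces - 1) * interval


DEPTH = 1300      # the "random number of my choosing" from the post
INTERVAL = 2      # assumed spacing between consecutive nonces
N = 1_000_000     # arbitrary batch size for the comparison

slow = cycles_unpipelined(N, DEPTH)
fast = cycles_pipelined(N, DEPTH, INTERVAL)
speedup = slow / fast
```

For a million nonces this gives 1,300,000,000 cycles one at a time against 2,001,298 cycles pipelined, a ratio of roughly 650, which is about half the number of steps when the spacing is 2.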

heavyb

full member

Activity: 217

Merit: 100

Anyone have tracking numbers or the singles in hand yet? I am anxiously waiting to hear about this; if it is legit I am going to buy.

makomk

hero member

Activity: 686

Merit: 564

Quote from: DeathAndTaxes on January 31, 2012, 07:46:58 AM

It is executed 64 times. So a fully looped version of SHA-256 code (in C# or on a FPGA it doesn't matter) would be something like this in pseudocode:

Initialize all variables, and inputs, round = 1
while round <=64
(
perform SHA-256 round
round++
)
record output

Which is actually a reasonably sensible way of implementing SHA-256 in an FPGA if you just want to have an efficient way to hash arbitrary pieces of data rather than do something like Bitcoin mining or password cracking. If you're only hashing a single message there's no parallelism that can be exploited - you can't start work on any chunk of the message until you've completely hashed the previous chunks - and no reason to unroll the hashing. That's partly why off the shelf SHA-256 cores aren't much use for Bitcoin mining.

DeathAndTaxes

donator

Activity: 1218

Merit: 1079

Gerald Davis

Quote from: kano on January 31, 2012, 04:45:40 AM

Actually - every time I see comments about unrolling the sha256 code I wonder how you would do it any way but unrolled.

This is 1 round of the SHA-256 loop.

It is executed 64 times. So a fully looped version of SHA-256 code (in C# or on a FPGA it doesn't matter) would be something like this in pseudocode:

Initialize all variables, and inputs, round = 1
while round <=64
(
perform SHA-256 round
round++
)
record output

Unrolling in any programming language is the process of converting a looping structure to a flat structure. Even many high level programming languages do it for speed/optimization.

Fully unrolled involves no looping structure at all. In Bitcoin, since there is a double hash, fully unrolled means input -> flat logic -> double hash output.
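For anyone who wants to see the rolled form actually run, here is a minimal Python sketch of the looped pseudocode above. It is a plain textbook SHA-256, not anyone's miner: the helper names (`sha256_round`, `compress`, `icbrt`, `first_primes`) are made up, the constants are derived from the first 64 primes with integer roots instead of being typed in, rounds are counted from 0 rather than 1, and `math.isqrt` needs Python 3.8 or later. Unrolling would mean writing the body of the `while` loop out 64 times with the index fixed in each copy.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Round constants: first 32 fractional bits of the cube roots of the primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]
# Initial state: first 32 fractional bits of the square roots of 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_round(state, k, w):
    """One SHA-256 round: the body that gets repeated 64 times."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    t1 = (h + s1 + ch + k + w) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    return [(t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g]


def compress(state, block):
    """Process one 64-byte block with a rolled 64-iteration loop."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):  # message schedule expansion
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    work = list(state)
    rnd = 0
    while rnd < 64:          # rolled: one round body reused 64 times
        work = sha256_round(work, K[rnd], w[rnd])
        rnd += 1
    return [(s + x) & MASK for s, x in zip(state, work)]


def sha256(message):
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", len(message) * 8)
    state = H0
    for off in range(0, len(padded), 64):
        state = compress(state, padded[off:off + 64])
    return struct.pack(">8I", *state)
```

The output can be compared against `hashlib.sha256(...).digest()` for any input; an 80-byte input exercises two blocks, which is the shape of the first hash in Bitcoin's double hash.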

The only 'rolled' option I can think of is to make a very small part of the FPGA just be P() and use it 122 times (in 2 loops) yet that would be senseless since I'd imagine it would be a MUCH slower way to do it? ...

Quote from: kano on January 31, 2012, 05:08:41 AM

... and an FPGA question

Is the FPGA process somewhat similar to something like dominoes falling where the data steps through the FPGA and presents an answer at the other end?
If so, could that stepping process effectively always be active at each step - i.e. input data is fed into the step process once each step (or each 2nd step if there is an overlap issue), so thus there would be output data happening each step time (or each 2x step time)?
Is that how it actually works? Or is it a once through per input then when it outputs, another input?
(yes I guess I really know nothing about how these things actually process internally)

Coz if it does step but the current implementation is once through per input, but it was actually possible to do one input per step (or per 2 steps), that would effectively multiply the processing power almost by the number of steps (or half the number of steps) in the process.

Energizer

sr. member

Activity: 273

Merit: 250

Quote from: kano on January 31, 2012, 05:08:41 AM

... and an FPGA question

Is the FPGA process somewhat similar to something like dominoes falling where the data steps through the FPGA and presents an answer at the other end?
If so, could that stepping process effectively always be active at each step - i.e. input data is fed into the step process once each step (or each 2nd step if there is an overlap issue), so thus there would be output data happening each step time (or each 2x step time)?
Is that how it actually works? Or is it a once through per input then when it outputs, another input?
(yes I guess I really know nothing about how these things actually process internally)

Coz if it does step but the current implementation is once through per input, but it was actually possible to do one input per step (or per 2 steps), that would effectively multiply the processing power almost by the number of steps (or half the number of steps) in the process.

It seems you are new to FPGA programming. You would find this video helpful: "Intro to FPGAs for Software Engineers":

http://www.youtube.com/watch?v=gsTpLtEEobE&feature=related

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

... and an FPGA question

Is the FPGA process somewhat similar to something like dominoes falling where the data steps through the FPGA and presents an answer at the other end?
If so, could that stepping process effectively always be active at each step - i.e. input data is fed into the step process once each step (or each 2nd step if there is an overlap issue), so thus there would be output data happening each step time (or each 2x step time)?
Is that how it actually works? Or is it a once through per input then when it outputs, another input?
(yes I guess I really know nothing about how these things actually process internally)

Coz if it does step but the current implementation is once through per input, but it was actually possible to do one input per step (or per 2 steps), that would effectively multiply the processing power almost by the number of steps (or half the number of steps) in the process.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: kano on January 31, 2012, 04:45:40 AM

Quote from: DiabloD3 on January 31, 2012, 04:08:21 AM

Quote from: RandyFolds on January 30, 2012, 06:39:12 PM

To the FPGA guys here: Why is it 'rolled' and not 'furled'? It seems way more appropriate.

Because it's always been unrolling loops. Ears are unfurled, loops are unrolled.

Actually - every time I see comments about unrolling the sha256 code I wonder how you would do it any way but unrolled.

The only 'rolled' option I can think of is to make a very small part of the FPGA just be P() and use it 122 times (in 2 loops) yet that would be senseless since I'd imagine it would be a MUCH slower way to do it? ...

The "dumb" way is to have one function (in the case of an FPGA, one circuit), and rotate the variables/registers in and out of the function. You have the code compiled/circuit implemented exactly once. This would actually be superior for FPGA _if_ they had enough registers, but they don't.

This is extremely slow for basically any implementation, and it also screws over the fact we essentially have 5 or more parallel ops at any given time in the way Bitcoin can optimize* the first, oh, 250 ops (depending on how certain things are implemented, of course).

* In OpenCL, due to all the shortcuts calculating stuff in the host, it starts out as two unrelated chains that eventually merge. The ability to pack VLIW5 here is pretty goddamned handy, makes optimization a much easier job.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: DiabloD3 on January 31, 2012, 04:08:21 AM

Quote from: RandyFolds on January 30, 2012, 06:39:12 PM

To the FPGA guys here: Why is it 'rolled' and not 'furled'? It seems way more appropriate.

Because it's always been unrolling loops. Ears are unfurled, loops are unrolled.

Actually - every time I see comments about unrolling the sha256 code I wonder how you would do it any way but unrolled.

The only 'rolled' option I can think of is to make a very small part of the FPGA just be P() and use it 122 times (in 2 loops) yet that would be senseless since I'd imagine it would be a MUCH slower way to do it? ...

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: RandyFolds on January 30, 2012, 06:39:12 PM

To the FPGA guys here: Why is it 'rolled' and not 'furled'? It seems way more appropriate.

Because it's always been unrolling loops. Ears are unfurled, loops are unrolled.

kimmeriets

legendary

Activity: 1064

Merit: 1000

Quote from: bulanula on December 22, 2011, 08:37:33 AM

We need to know what the chip under the hood is.

Couple of reasons :

-maybe these were sourced from Libya / Egypt and there may be some ethical issues there

I wonder what ethical issues you have with Libya and Egypt? :) Don't make me laugh.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

... and what is the general opinion of Sony due to this?

So what should the general opinion of BFL be?

antirack

hero member

Activity: 489

Merit: 500

Immersionist

You guys are just too spoiled I guess ;)

Companies intentionally promise stuff that they cannot deliver and they know it up front that they will not. Look at Sony. They do it all the time and they are certainly not an exception. One example:

When they announced the PSP years ago, they said WE GONNA RULE THE WORLD WITH THIS SHIT AND GRAN TURISMO WILL BE AVAILABLE FROM DAY ONE. THROW AWAY YOUR STUPID NINTENDOS. They used this title for months and months to push the PSP to the masses, always saying 'ok it didn't come out at launch day but it will be out soon, get ready'. Even after a year they still said 'coming soon'.

That stupid title (I mean how difficult can it be to port a game that existed for years on other consoles) was delayed for more than 5 years. It didn't even come out for the original PSP after all.

Sony Releases Stupid Piece Of Shit That Doesn't Fucking Work (Onion News Network)
http://www.youtube.com/watch?v=8AyVh1_vWYQ

DeathAndTaxes

donator

Activity: 1218

Merit: 1079

Gerald Davis

Quote from: makomk on January 30, 2012, 06:18:31 PM

Quote from: DeathAndTaxes on January 29, 2012, 12:38:43 PM

Look, the Stratix III is a 65nm chip. Altera isn't even making it anymore. They want to sell Stratix IV and soon they will need to sell Stratix V. They want that inventory gone. Gone from their website, gone from wholesalers, gone from retail outlets. When Stratix V is in full production it becomes the high end part and the Stratix IV becomes the value segment. The Stratix III is just a third wheel.

I'm not sure it works that way, though - there's always customers out there who've built designs around older chips and for whatever reason either can't or don't want to move over to the latest and greatest. From what I've seen the FPGA vendors are really slow at taking older chips out of production for this exact reason.

They are slow, which is why the Stratix III is still around despite being released almost 5 years ago. Still, you'll notice the Stratix II isn't available. Companies tend not to like to keep 3+ generations going at the same time. As the Stratix V nears volume production the company will try to transition customers to the newer products. For large customers they will even provide incentives, demo units, and design assistance to move their bitstreams to newer products.

RandyFolds

sr. member

Activity: 448

Merit: 250

Quote from: DiabloD3 on January 30, 2012, 03:40:29 PM

Quote from: RandyFolds on January 30, 2012, 02:17:43 PM

Quote from: n0ne on January 30, 2012, 05:29:28 AM

Quote from: RandyFolds on January 30, 2012, 02:42:15 AM

But they avoided that issue. Interesting info though. So what's this further unexplained delay?

*yawn* Reported. Tired of all the trolling. We can't even have a conversation in here without you butting in and being a dick. Don't you have a job, a hobby or something?

Nice. Did you put "I'm a whiny little bitch" in the comments box?

Go fuck yourself, fella. Feel free to report this post as well.

Ahh, a fine afternoon in off-topic, indeed.

...and an education in LUTs, bitstreams, and chip stats. It really can't be beat.

To the FPGA guys here: Why is it 'rolled' and not 'furled'? It seems way more appropriate.

makomk

hero member

Activity: 686

Merit: 564

Quote from: DeathAndTaxes on January 29, 2012, 12:38:43 PM

Look, the Stratix III is a 65nm chip. Altera isn't even making it anymore. They want to sell Stratix IV and soon they will need to sell Stratix V. They want that inventory gone. Gone from their website, gone from wholesalers, gone from retail outlets. When Stratix V is in full production it becomes the high end part and the Stratix IV becomes the value segment. The Stratix III is just a third wheel.

I'm not sure it works that way, though - there's always customers out there who've built designs around older chips and for whatever reason either can't or don't want to move over to the latest and greatest. From what I've seen the FPGA vendors are really slow at taking older chips out of production for this exact reason.

Quote from: Inspector 2211 on January 29, 2012, 02:49:54 PM

Case in point: The CEO (and probably sole proprietor) of ZTEX did.

Before the 1.15x module, with its 8 Amp core voltage supply, came the 1.15d module, and the originally recommended supply for it
http://www.ztex.de/usb-fpga-1/pwr-1.0.e.html only sported a 3 Amp core voltage supply!

He found out the hard way that all these unrolled loops of SHA-2 cause something like 50% of all flip-flops on the FPGA to switch simultaneously, thus blowing even the most conservative power estimations out of the water.

That board wasn't really designed for Bitcoin mining in the first place - he's been selling them on eBay for ages as generic FPGA development boards.

ZodiacDragon84

sr. member

Activity: 266

Merit: 250

The king and the pawn go in the same box @ endgame

to /b/, or not to /b/, that is the question?

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: RandyFolds on January 30, 2012, 02:17:43 PM

Quote from: n0ne on January 30, 2012, 05:29:28 AM

Quote from: RandyFolds on January 30, 2012, 02:42:15 AM

But they avoided that issue. Interesting info though. So what's this further unexplained delay?

*yawn* Reported. Tired of all the trolling. We can't even have a conversation in here without you butting in and being a dick. Don't you have a job, a hobby or something?

Nice. Did you put "I'm a whiny little bitch" in the comments box?

Go fuck yourself, fella. Feel free to report this post as well.

Ahh, a fine afternoon in off-topic, indeed.

bulanula

hero member

Activity: 518

Merit: 500

Yeah. I am pretty sure they launched this like last year and with some impossible specs to ward off any potential competitors from developing anymore.

But that fully custom ASIC miner that gets 10 ghash/s at 10W and costs $100 is still on the horizon so BFL beware! :D

RandyFolds

sr. member

Activity: 448

Merit: 250

Quote from: n0ne on January 30, 2012, 05:29:28 AM

Quote from: RandyFolds on January 30, 2012, 02:42:15 AM

But they avoided that issue. Interesting info though. So what's this further unexplained delay?

*yawn* Reported. Tired of all the trolling. We can't even have a conversation in here without you butting in and being a dick. Don't you have a job, a hobby or something?

Nice. Did you put "I'm a whiny little bitch" in the comments box?

Go fuck yourself, fella. Feel free to report this post as well.

P4man

hero member

Activity: 518

Merit: 500

Quote from: DeathAndTaxes on January 30, 2012, 11:40:12 AM

AMD doesn't
a) publish guaranteed specs based on simulations
b) provide official word on launch dates
c) indicate product will ship in 4-6 weeks for 4 months.

Often data will leak about a product and product will either not live up to that leak or meet the launch timeline.

It is far different if the company puts out an official spec or date and then fails to live up to their own promise.

AMD does most of the above to its customers, which is not us, but OEMs and OBMs. Not to pick on AMD, nVidia does the same, and I happen to know almost literally what you wrote above happened for both Tegra2 and 3. I know that from someone at a large mobile phone manufacturer. Of course large companies are a bit more experienced and therefore somewhat more careful in their wording, if for no other reason than huge contractual damages, but essentially they drop the ball the same way BFL did quite often.
