
Topic: 1GH/s, 20w, $500 — Butterflylabs, is it a scam? - page 32.

legendary
Activity: 2128
Merit: 1073
Sorry, Virtex-6 is very different from Spartan-6. In fact, Spartan-6 has much less routing resource than Virtex-6, so we all hit place & route difficulties even though the fully pipelined design takes only about 65% of the FPGA's CLBs. Also, about half of the LUTs in a Spartan-6 are SLICEXs, and those LUTs are like shit.
I mostly agree with you, except where you call SLICEXs shit.

I don't know your HDL code or your design, but I've taken a look at the ones published by ZTEX and fpgaminer. Neither balances the use of the available resources.

Three points worth checking:

1) use the adders in the DSP48s
2) use the BlockRAM as ROM to store the SHA-256 round constants, instead of spreading them all over the implementation in LUTs
3) stream the pipeline stages: no matter how many stages, produce one result per clock (in honor of Mr. Cray)  
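
To illustrate point 3, here is a toy software model in C (not HDL; stage_fn() is a made-up stand-in for one stage's logic, and the whole thing only models the timing behavior): once the pipeline fills, one result comes out on every clock, no matter how many stages there are.

#include <stdio.h>

#define STAGES 64  /* pipeline depth; the one-result-per-clock rate holds for any depth */

/* Made-up stand-in for the combinational work of one pipeline stage. */
static unsigned stage_fn(unsigned x, int s) { return x * 2654435761u + (unsigned)s; }

int main(void) {
    unsigned pipe[STAGES] = {0};
    int valid[STAGES] = {0};
    for (int clock = 0; clock < STAGES + 8; clock++) {
        for (int s = STAGES - 1; s > 0; s--) {   /* shift every in-flight item forward */
            pipe[s] = stage_fn(pipe[s - 1], s);
            valid[s] = valid[s - 1];
        }
        pipe[0] = stage_fn((unsigned)clock, 0);  /* and accept a new input EVERY clock */
        valid[0] = 1;
        if (valid[STAGES - 1])                   /* after STAGES clocks of fill latency */
            printf("clock %2d: result %08x\n", clock, pipe[STAGES - 1]); /* one result per clock */
    }
    return 0;
}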

Both Virtex-6 and Spartan-6 have some DSP48s and BlockRAM. Obviously they are different. But the two things I learned about FPGA design in school were:

1) you don't leave available resources unused, even if you have to mangle your source design.
2) watch the floor-plan for unnecessarily long wires.

I had a long break from FPGA design, so I can't come up with quick specific answers. But I downloaded the ISE evaluation, played with it a little, and will most likely buy an ML605 or KC705 next year when they become available.
hero member
Activity: 592
Merit: 501
We will stand and fight.
for (i = 0; i < 4; i++)
{
   print i
}

print 0
print 1
print 2
print 3

Rolled (looped): 80 clocks = 1 hash.
Unrolled: 1 clock = 1 hash.
What you write about unrolling is both 100% true and 100% off the subject of digital design.

What you write about the SHA-256 implementation is simply incorrect. The full SHA-256 of a fixed 256-bit string takes exactly 64 rounds, not 80. But that is the minor thing. The more important thing is that all the published rolled designs are sub-optimal in that they aren't fully pipelined.

Example: a 32-way unrolled single SHA-256 always takes 2 clocks between accepting an input and delivering a result; 16-way: 4 clocks; 8-way: 8 clocks; and so forth.

What I remember from my digital design class is the example of Seymour Cray. At the time it took 3 clocks to do a floating-point add, and everyone was convinced that adding N floating-point numbers must take 3*N clocks. Well, Seymour Cray proved them wrong: his Cray-1 machine added 64 floats in 66 clocks and made history.

It seems like Bitcoin hasn't yet met somebody with a full understanding of digital design. There was a guy who came and went after only 8 posts without openly publishing his FPGA design:

https://bitcointalksearch.org/topic/m.454555

So Bitcoin is still waiting for somebody to fully exploit the apparent parallelism in its hashing algorithm.


Sorry, Virtex-6 is very different from Spartan-6. In fact, Spartan-6 has much less routing resource than Virtex-6, so we all hit place & route difficulties even though the fully pipelined design takes only about 65% of the FPGA's CLBs. Also, about half of the LUTs in a Spartan-6 are SLICEXs, and those LUTs are like shit.
legendary
Activity: 2128
Merit: 1073
for (i = 0; i < 4; i++)
{
   print i
}

print 0
print 1
print 2
print 3

Rolled (looped): 80 clocks = 1 hash.
Unrolled: 1 clock = 1 hash.
What you write about unrolling is both 100% true and 100% off the subject of digital design.

What you write about the SHA-256 implementation is simply incorrect. The full SHA-256 of a fixed 256-bit string takes exactly 64 rounds, not 80. But that is the minor thing. The more important thing is that all the published rolled designs are sub-optimal in that they aren't fully pipelined.
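
For concreteness, here is the skeleton of one SHA-256 compression in C. The rotation of the eight working variables and the 64-round count are the real structure; round_fn() is a deliberately fake placeholder, NOT the actual Ch/Maj/Sigma round logic or message schedule:

#include <stdint.h>

static uint32_t round_fn(uint32_t x, int t) { return x + (uint32_t)t; } /* placeholder, not the real round math */

void sha256_compress_skeleton(uint32_t state[8], const uint32_t w[16]) {
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
    for (int t = 0; t < 64; t++) {            /* exactly 64 rounds, not 80 */
        uint32_t t1 = h + round_fn(e ^ f ^ g, t) + w[t % 16];
        uint32_t t2 = round_fn(a ^ b ^ c, t);
        h = g; g = f; f = e; e = d + t1;      /* the working variables rotate each round */
        d = c; c = b; b = a; a = t1 + t2;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;  /* feed-forward */
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}

An unrolled design lays those 64 round bodies out in space; a fully pipelined one also puts registers between them, so a new candidate enters every clock instead of every 64/n clocks for an n-way unrolled, non-pipelined core.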

Example: a 32-way unrolled single SHA-256 always takes 2 clocks between accepting an input and delivering a result; 16-way: 4 clocks; 8-way: 8 clocks; and so forth.

What I remember from my digital design class is the example of Seymour Cray. At the time it took 3 clocks to do a floating-point add, and everyone was convinced that adding N floating-point numbers must take 3*N clocks. Well, Seymour Cray proved them wrong: his Cray-1 machine added 64 floats in 66 clocks and made history.

It seems like Bitcoin hasn't yet met somebody with a full understanding of digital design. There was a guy who came and went after only 8 posts without openly publishing his FPGA design:

https://bitcointalksearch.org/topic/m.454555

So Bitcoin is still waiting for somebody to fully exploit the apparent parallelism in its hashing algorithm.
donator
Activity: 1218
Merit: 1079
Gerald Davis
Their design achieved 1.4 Gbps, which is only 1400 Mbps / 512 bits per block / 2 hashes ≈ 1.4 MH/s.

Per core.
Each core uses 7% of their FPGA, which is a Virtex-II, a stone-age FPGA.
The point here is not absolute performance; that document is, I believe, 6 years old. The point is that using a GPP they achieved a significant speedup (actually, die-size savings, which can be traded for the same) over partially or fully unrolled approaches. I think that validates that a hybrid approach can work.

But they didn't achieve a significant speedup compared to Bitcoin's unrolled designs.

Also, you can't claim die savings because they used 7% of the die. By using 7% of the die they were able to operate at a higher frequency; had they used more of the die space, they would have had to clock the chip lower. The choice between using 7% or 97% of the die space is already accounted for in the throughput. They went with a fast & light design, using the power of the embedded microprocessor to handle inter-block overhead (something which doesn't exist in Bitcoin).

The Virtex-II Pro (XC2VP7) has 11K LUTs and their output is ~1.4 MH/s.
The Spartan-6 (LX150) has 150K LUTs, and the output achieved on it is 190 MH/s.

The ZTEX design runs on a chip with ~13x the LUTs but achieves ~135x the performance, so even on a relative basis the paper's design isn't superior. No doubt using a Virtex-7 one could get more than 1.4 MH/s; however, you aren't getting a Virtex-7 for $500.
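
For what it's worth, that relative comparison spelled out in code, using only the figures quoted in this post:

#include <stdio.h>

int main(void) {
    double v2p_luts = 11e3,  v2p_mhs = 1.4;    /* paper's Virtex-II Pro design */
    double s6_luts  = 150e3, s6_mhs  = 190.0;  /* ZTEX Spartan-6 LX150 design  */
    printf("LUT ratio:    %.1fx\n", s6_luts / v2p_luts);  /* ~13.6x  */
    printf("perf ratio:   %.1fx\n", s6_mhs / v2p_mhs);    /* ~135.7x */
    printf("per-LUT gain: %.1fx\n",
           (s6_mhs / s6_luts) / (v2p_mhs / v2p_luts));    /* ~10x    */
    return 0;
}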
hero member
Activity: 518
Merit: 500
Their design achieved 1.4 Gbps, which is only 1400 Mbps / 512 bits per block / 2 hashes ≈ 1.4 MH/s.

Per core.
Each core uses 7% of their FPGA, which is a Virtex-II, a stone-age FPGA.
The point here is not absolute performance; that document is, I believe, 6 years old. The point is that using a GPP they achieved a significant speedup (actually, die-size savings, which can be traded for the same) over partially or fully unrolled pure-FPGA approaches. I think that validates that a hybrid approach can work.

Anyway, I'll just leave it at that, and we shall see in a few weeks.
donator
Activity: 1218
Merit: 1079
Gerald Davis
Yes, I'm aware (I do have large feet :D). But in the case of SHA hashing, both the while/do loop and your "print" statement are slightly more complicated, and I do not see why it could not be beneficial to optimize chips for specific tasks. To use your example, you could have one FPGA that only does the "print i"s but is very efficient at it, can be clocked very high, and can do a lot of them in parallel; and one chip that feeds it the "i"s and does all the control logic, but has to be clocked at lower speeds.

Because the sub-rounds of a SHA-256 hash are highly interdependent, it would require a significant amount of inter-chip communication. Also, there is no control logic needed: there is nothing unpredictable about SHA-256 that would result in an unexpected outcome, branching, or other control logic. You could, for example, have the A to D sub-rounds handled by one chip and the E to H sub-rounds handled by another chip, but at 1 GH/s and 160 rounds per hash the two chips would need to communicate 160 billion times per second, which would require latency in the picoseconds and bandwidth in the terabit range (a rough check follows below).
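
As a back-of-envelope check of that bandwidth figure, taking this post's own round count at face value and assuming each round boundary exchanges both 128-bit half-states (the 256 bits per exchange is an assumption, not a measured figure):

#include <stdio.h>

int main(void) {
    double hashes_per_s    = 1e9;  /* 1 GH/s, the rate discussed above            */
    double rounds_per_hash = 160;  /* this post's own per-hash round count        */
    double bits_per_round  = 256;  /* ASSUMED: A-D half (128b) + E-H half (128b)  */
    double bps = hashes_per_s * rounds_per_hash * bits_per_round;
    printf("inter-chip traffic: ~%.0f Tbit/s\n", bps / 1e12);  /* ~41 Tbit/s */
    return 0;
}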

It *could* be done but it wouldn't be easier, simpler, or cheaper than just unrolling a loop and processing an entire hash per clock.

Quote
If you think that sounds silly, have a look at this paper:
http://ce.et.tudelft.nl/publicationfiles/1194_657_SHA2.pdf

Well, no, the PowerPC core is used to handle data between blocks. This is a stream-hash design (like hashing a 50 KB document). If you are hashing a 50 KB document, it is still done in 512-bit blocks: the output of one block is hashed together with the next 512 bits of the document until you finish the document. Handling input and storing the intermediate values between blocks is significant overhead. For example, hashing a 50 KB document takes roughly 800 blocks, and you can't do them in "parallel" with multiple chips because the input for block x is the output of block x-1. They "solved" that problem by using an embedded microprocessor.
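
The chaining dependency described above, as a minimal C sketch (the compression function is a trivial stub here, not real SHA-256; it only illustrates why block x must wait for block x-1):

#include <stdint.h>
#include <stddef.h>

/* Trivial stub standing in for the real SHA-256 compression function. */
static void compress_stub(uint32_t state[8], const uint8_t block[64]) {
    for (int i = 0; i < 64; i++)
        state[i % 8] += block[i];  /* NOT real SHA-256, just a placeholder */
}

/* Each 512-bit (64-byte) block consumes the state left by the previous
   one, so the blocks of one message cannot be spread across chips.
   (Final-block padding is omitted for brevity.) */
static void stream_hash(uint32_t state[8], const uint8_t *msg, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++)
        compress_stub(state, msg + 64 * i);  /* block i needs the state after block i-1 */
}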

However, that problem doesn't even exist in Bitcoin. The block header doesn't change for a long time (4 billion nonces at even 500 MH/s is ~8 seconds), and even when it does change, it is independent of all prior hashes. All the control work happens outside the hash and can be done by the host PC. At >3 GH/s it takes <6% CPU time on a single-core Sempron to handle all the block-header creation, hash validation, OS overhead, and communication with the pool. Ballpark: a quad-core i5 acting as a host could handle at least 100 GH/s.

That paper was interesting, but even using the internal PowerPC core it gets roughly 1/130th the performance of ZTEX's design, for example. Their design achieved 1.4 Gbps, which is only 1400 Mbps / 512 bits per block / 2 hashes ≈ 1.4 MH/s. They also used an FPGA which is about 7x the size of the one used by ZTEX and costs about 20x as much. That paper should make you think about the performance claims made by Butterfly Labs.
hero member
Activity: 518
Merit: 500
What do you mean by "do only or mostly unrolling"?

An FPGA doesn't do unrolling. Unrolling is simply a method of converting loop logic into flat logic.

For example, this is a loop (it would be considered rolled logic):
for (i = 0; i < 4; i++)
{
   print i
}


However, that logic can be expressed identically using this flat (unrolled) logic:
print 0
print 1
print 2
print 3


Yes, I'm aware (I do have large feet :D). But in the case of SHA hashing, both the while/do loop and your "print" statement are slightly more complicated, and I do not see why it could not be beneficial to optimize chips for specific tasks. To use your example, you could have one FPGA that only does the "print i"s but is very efficient at it, can be clocked very high, and can do a lot of them in parallel; and one chip that feeds it the "i"s and does all the control logic, but has to be clocked at lower speeds.

If you think that sounds silly, have a look at this paper:
http://ce.et.tudelft.nl/publicationfiles/1194_657_SHA2.pdf

They seem to do precisely that: separate a small critical path of the hashing function into SHA coprocessors that are fed by a (virtual) PowerPC core, in this case implemented on the same FPGA as the coprocessors. It doesn't take too much faith to imagine that it makes more sense to use an actual PowerPC chip or some other GPP to feed 32 FPGAs with "print i" instructions.

donator
Activity: 1218
Merit: 1079
Gerald Davis
No, the issue is that implementing the unrolling logic costs a lot of die space and limits clocks.

That is not true. Unrolled versions are more complicated because routing can be tough, especially near the maximum capacity of the chip, but an unrolled version is going to be faster than a rolled version because of the loop overhead.

Quote
For all I know it might make sense to have one FPGA do only or mostly unrolling, and one, running at higher clock speeds, only hashing.

What do you mean by "do only or mostly unrolling"?

An FPGA doesn't do unrolling. Unrolling is simply a method of converting loop logic into flat logic.

For example, this is a loop (it would be considered rolled logic):
for (i = 0; i < 4; i++)
{
   print i
}


However, that logic can be expressed identically using this flat (unrolled) logic:
print 0
print 1
print 2
print 3

In assembly code (or GPU opcodes, or an FPGA bitstream) the latter version is more efficient. A loop can easily be changed to, say, 5 iterations or 30 iterations, but SHA-256 never changes, so it is a perfect candidate for unrolling.

So no, an FPGA doesn't "do unrolling". You unroll the algorithm to make it more efficient and load a design onto the FPGA which involves no loops and can be completed in a single cycle. The SHA-256 algorithm is a loop with 80 iterations. If you implemented it rolled on an FPGA, it would take 80 clock cycles to complete 1 hash. So you unroll the loop (which requires more LUTs) and complete the entire hash in one cycle.

Rolled (looped): 80 clocks = 1 hash.
Unrolled: 1 clock = 1 hash.

Now, the rolled version IS smaller, so you could put multiple rolled versions on 1 chip. However, loops have lots of overhead, so in the same die space you can't get 80 rolled versions. Maybe you can only get 60 (likely much, much less, maybe 20, but let's be generous and say 60).

60 rolled versions x 1 hash per 80 clocks = 0.75 hashes per clock.
1 unrolled version x 1 hash per 1 clock = 1 hash per clock.

You will never get >80 rolled versions on the same FPGA. Why? Loops always have overhead.
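
The same arithmetic as a toy C model, using the generous counts from this post (the 60-copy figure is the post's hypothetical, not a measurement):

#include <stdio.h>

int main(void) {
    int rounds_per_hash = 80;  /* this post's per-hash iteration count      */
    int rolled_copies   = 60;  /* generous hypothetical from the text above */
    double rolled_rate   = rolled_copies / (double)rounds_per_hash; /* hashes per clock */
    double unrolled_rate = 1.0;                                     /* one hash per clock */
    printf("60 rolled copies: %.2f hashes/clock\n", rolled_rate);   /* 0.75 */
    printf("1 unrolled copy:  %.2f hashes/clock\n", unrolled_rate); /* 1.00 */
    return 0;
}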

Quote
What I've read is that using 50% of the die space for the unrolling is fairly typical, so using 50% of the chips could make a lot of sense. Ever wondered why even the "single" board has 2 FPGAs (assuming that's what they are)?

This sentence is illogical, and I think that is because unrolling doesn't mean what you think it means.
There is no "unrolling" involved in creating or checking a hash.  
There is no "unrolling" in SHA-256.

Unrolling is simply the process of converting looping logic into flat logic.

http://en.wikipedia.org/wiki/Loop_unwinding
hero member
Activity: 518
Merit: 500
About this public demonstration they will be doing on or before the 25th: will it be in Kansas City, in their apartment ("lab")? How soon after they do it will they send the test unit to Inaba, as they said they would?

Is there any information about this other than the fact that they said they would do it in public on or before the 25th of November?

Also, in the USA the 24th is a big holiday (Thanksgiving), and the 25th is also a day off work for many.

Thanks.



I think they will do it in the restaurant. Free fajitas, anyone? :)
hero member
Activity: 518
Merit: 500

Why not just have the ASIC or FPGA do everything then?

Because different chips are good at different things. I know AMD drivers use the CPU to unroll shader loops before sending them to the GPU. I assume they have a reason for that, and latency isn't killing them (even if game rendering is by itself probably far more latency-bound than hashing; it's no good having 100 FPS if the frames lag 1 second :) ).

I wouldn't be one bit surprised if it turns out the BFL "rig box" has some CPU or ASIP for that very purpose, and the single boards do not.

Quote
The extra cost, complexity, and tolerances required for high-speed inter-chip communication aren't trivial problems. Given that a single hash is so easy to compute, you don't need two chips working on it; you can just have two chips working on two hashes independently.

Yes, but probably  slower.

Quote
I think you misunderstand.  It isn't an issue that unrolled = slower.


No, the issue is that implementing the unrolling logic costs a lot of die space and limits clocks. For all I know it might make sense to have one FPGA do only or mostly unrolling, and one, running at higher clock speeds, only hashing. What I've read is that using 50% of the die space for the unrolling is fairly typical, so using 50% of the chips could make a lot of sense. Ever wondered why even the "single" board has 2 FPGAs (assuming that's what they are)?

Quote
Still, no hard feelings, man. I think we are just going to have to agree to disagree.

Agreed. :)
donator
Activity: 1218
Merit: 1079
Gerald Davis
All that matters is that the ASIx can feed the FPGA enough work to keep it busy. Why are you so sure this is latency-bound and couldn't work asynchronously?

Why not just have the ASIC or FPGA do everything then? If your communication is asynchronous, you still need to be concerned about latency. The "work" from one chip needs to be buffered for the other chip, and that buffer needs low latency to the receiving chip, otherwise the FPGA is going to be delayed in loading. The FPGA is already delayed in loading with current designs, but a few clock cycles of delay are negligible over a 10+ second operating window.

The extra cost, complexity, and tolerances required for high-speed inter-chip communication aren't trivial problems. Given that a single hash is so easy to compute, you don't need two chips working on it; you can just have two chips working on two hashes independently.

Quote
Yes, but every paper I've seen says this is a substantial tradeoff: you trade substantial clock speed and FPGA die real estate for the unrolling. Just imagine you could offload that to another chip and use 100% of the FPGA, at higher speeds, for the hashing. Surely there is something to be gained?

I think you misunderstand.  It isn't an issue that unrolled = slower.  The FPGA has no concept of rolled or not.  It just "runs" the logic on each clock tick.

small % of die used = higher clock.
larger % of die used = lower clock.

This is due to routing, not rolling. The more "full" the FPGA, the harder routing becomes, and eventually the speed needs to be throttled. Although the speed is throttled, the relationship between die usage and speed is non-linear, so a slower, "fuller" design will have higher throughput than a faster, lighter design.

Depending on the amount of logic needed, there are situations where not unrolling can be faster, but that is just because FPGAs don't come in custom LUT counts. For example, say your unrolled design takes 110K LUTs and can't be reduced. On a 150K-LUT chip that is a lot of wasted die space. By rolling some of the logic and fitting 2x 70K designs on a single chip, you would get higher performance. Still, as I said, this is a limit of having only preset FPGA "sizes". If there were a 120K-LUT FPGA, it would likely be even more economical.

Still, no hard feelings, man. I think we are just going to have to agree to disagree. Unless shown real proof (not scammer claims) that a multi-chip intra-hash design or a rolled design at a higher clock is more efficient, I am going to stick with my (albeit limited) electrical-engineering knowledge.

Hell, I would even read a whitepaper, but marketing doublespeak just doesn't convince me. I wish this were real, but there is too much smoke for there to be no fire. Interesting that BF Labs (or is it Butterfly Labs?) ignored my question about their website carrying the wrong corporate name. Things like having the wrong legal name are just too damn suspicious.

hero member
Activity: 518
Merit: 500


Um, latency doesn't matter if all the work is done inside one chip, but if you are talking about 2 chips working together intra-hash, then latency is massive. Massive. Remember, a single hash takes less than a millionth of a second to complete; any latency at all would slow that down.

AFAICS, that is simply not true, or not necessarily true. You assume one chip would be idle waiting for the output of the other chip, but there is no need for that. You have to have enough bandwidth between the chips, but even that is likely very minor. I don't see why it would matter if the FPGA processes the unrolled loops even 100,000 ns after the ASIC (or ASIP) preprocessed it. All that matters is that the ASIx can feed the FPGA enough work to keep it busy. Why are you so sure this is latency-bound and couldn't work asynchronously?

Quote
Actually, most FPGA designs, cryptography or not, involve unrolled loops.

Yes, but every paper I've seen says this is a substantial tradeoff: you trade substantial clock speed and FPGA die real estate for the unrolling. Just imagine you could offload that to another chip and use 100% of the FPGA, at higher speeds, for the hashing. Surely there is something to be gained?

Quote
If you have multiple chips working together intra-hash, latency would need to be in the picosecond range; otherwise chips will sit idle waiting for data from other chips.

Asynchronous. :)
donator
Activity: 1218
Merit: 1079
Gerald Davis
So what? Latency is no issue; throughput is. Consider that the miner is connected over USB and then Ethernet (or worse) to the internet. Who cares about nanosecond latency if you can greatly improve throughput?

Latency needs to be put in context.
With the FPGA doing a portion (or all) of the nonce range, latency on I/O to the FPGA is a non-issue.

If the FPGA does the full nonce range (2^32), even operating at 1 GH/s it takes over 4 seconds to complete, so a latency of a few milliseconds is a non-issue.

However, it was your claim that multiple chips work together intra-hash. A single hash at 1 GH/s takes ~1 nanosecond to complete, so yes, even 1 nanosecond of latency would be a killer. Intra-hash communication at GH/s speeds would require latency in the picosecond range, and that isn't a trivial task (see the arithmetic below).
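
The two time scales, side by side (only this post's numbers; nothing new):

#include <stdio.h>

int main(void) {
    double rate  = 1e9;           /* 1 GH/s                */
    double range = 4294967296.0;  /* full 2^32 nonce range */
    printf("full nonce range: %.1f seconds\n", range / rate);  /* ~4.3 s */
    printf("single hash:      %.1f ns\n", 1e9 / rate);         /* 1 ns   */
    /* A millisecond of I/O latency is noise against a ~4 s work unit;
       a nanosecond of inter-chip latency doubles the time per hash. */
    return 0;
}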

Quote
That doesn't mean it's the most efficient way to do it, or the best use of your FPGA, particularly if the loop unrolling greatly hurts your clock speed, as seems to be the case. Even if not, the space used by it could otherwise be used for what the FPGA does best, and loop unrolling probably isn't that.

Actually, most FPGA designs, cryptography or not, involve unrolled loops. Loops are very expensive; CPUs get around that by simply brute-forcing their way through, but it is hardly efficient or streamlined. Most compilers will do some level of loop unrolling even in high-level programming languages. The theoretical max performance on an LX150 is 200 MH/s, and one FPGA designer has gotten 190 MH/s, so pretty damn efficient. While the chip does run slower as the number of gates used increases, the relationship is non-linear. If FPGA X runs at 100 MHz with 90% of gates used, cutting that down to 45% of gates used (and keeping the logic rolled) doesn't get you double the speed. Maybe you get 30% more speed; however, it now takes multiple cycles to complete a single operation, so you are behind even though the chip is running faster.
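
Plugging in the hypothetical numbers from the paragraph above (the 130 MHz and 2-cycles-per-result figures are assumptions chosen to match the text's "30% more speed" and "multiple cycles"):

#include <stdio.h>

int main(void) {
    double full_mhz  = 100.0, full_cycles  = 1.0;  /* 90% full, unrolled design              */
    double light_mhz = 130.0, light_cycles = 2.0;  /* 45% full, rolled; both figures assumed */
    printf("unrolled: %.0f M results/s\n", full_mhz / full_cycles);    /* 100 */
    printf("rolled:   %.0f M results/s\n", light_mhz / light_cycles);  /*  65 */
    return 0;
}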

Most FPGA designs are unrolled unless cost constraints simply make that prohibitive. Larger FPGAs tend to have a non-linear cost structure, so using a rolled design (at lower performance) on a smaller FPGA is sometimes necessary simply for cost reasons. If your target price point is $59, you likely have no choice but to accept lower performance at a higher clock and use a rolled design.

Quote
Look, you are a clever guy, but I don't think you're a cryptographer or a chip designer. I'd be careful about being so sure of all this.

I have a double degree in computer science and electrical engineering. While I may not be an FPGA expert, I know enough to say that your "solution" can't be the reason BF Labs gets higher performance from multiple boards.

Either it is:
a) a scam
b) some other reason (chip binning for example).
legendary
Activity: 980
Merit: 1008
Hmm, VirtualBox but not directly under WINE? Wonder if that helps narrow down the language it's written in?
Unfortunately it doesn't, because (1) WINE doesn't work with USB devices, and (2) WINE doesn't work with Windows drivers, only Windows applications. So WINE is a dead end.
VirtualBox, on the other hand, emulates a computer on which Windows is running, so there is an actual copy of Windows running, and it supports USB pass-through. So there's no reason why VirtualBox wouldn't work, since all the calculations are performed by the device connected via USB and not by the emulated computer (which is really slow).

I'm glad they announced a public demonstration on the 25th though. At least we'll know approximately by then if it's a scam. Either they demonstrate on or around the 25th, or they are packing their bags right now and will be nowhere to be found on the 25th.
hero member
Activity: 518
Merit: 500

That doesn't make any sense. A single hash is incredibly fast: literally a millionth of a second or less to complete. If you tried to have inter-chip communication between the ASIC and the FPGA, that communication bus would have to be on the gigabit-per-second scale with nanosecond latency.

So what? Latency is no issue; throughput is. Consider that the miner is connected over USB and then Ethernet (or worse) to the internet. Who cares about nanosecond latency if you can greatly improve throughput?

Quote
There is plenty of room even in the Spartan LX150 to completely unroll the loop INSIDE the FPGA.  

That doesn't mean it's the most efficient way to do it, or the best use of your FPGA, particularly if the loop unrolling greatly hurts your clock speed, as seems to be the case. Even if not, the space used by it could otherwise be used for what the FPGA does best, and loop unrolling probably isn't that.

Look, you are a clever guy, but I don't think you're a cryptographer or a chip designer. I'd be careful about being so sure of all this.
sr. member
Activity: 349
Merit: 250
Send them a resume, give them personal information?

No one can meet them in real life or confirm where they are located. I thought about trying this but I doubt it would work.

I will pay the guy $200 (in bitcoin) just to have coffee with Inaba and have him take Inaba into the "Lab". Assuming Inaba can take photos of both the guy and the lab.

This offer still stands.

Anyone who thinks it's not a scam, put your money where your mouth is... go pre-order the board.

I can't wait to see your faces a month from now.


If you really have faith, pay in bitcoin :)

If you're dumb enough to be scammed, you deserve it.

I can't take someone who confuses something as obvious as Thermalright and Thermaltake seriously.

There's only one way to definitively settle this. Psychic Hotline!
hero member
Activity: 504
Merit: 504
Decent Programmer to boot!
Send them a resume, give them personal information?

No one can meet them in real life or confirm where they are located. I thought about trying this but I doubt it would work.

I will pay the guy $200 (in bitcoin) just to have coffee with Inaba and have him take Inaba into the "Lab". Assuming Inaba can take photos of both the guy and the lab.

This offer still stands.

Anyone who thinks it's not a scam, put your money where your mouth is... go pre-order the board.

I can't wait to see your faces a month from now.


If you really have faith, pay in bitcoin :)

If you're dumb enough to be scammed, you deserve it.

I can't take someone who confuses something as obvious as Thermalright and Thermaltake seriously.


Have you heard about the new Thermal-left computers? Really rare, only 5000BTC! Okay, I see what you're getting at.
full member
Activity: 168
Merit: 100
Send them a resume, give them personal information?

No one can meet them in real life or confirm where they are located. I thought about trying this but I doubt it would work.

I will pay the guy $200 (in bitcoin) just to have coffee with Inaba and have him take Inaba into the "Lab". Assuming Inaba can take photos of both the guy and the lab.

This offer still stands.

Anyone who thinks it's not a scam, put your money where your mouth is... go pre-order the board.

I can't wait to see your faces a month from now.


If you really have faith, pay in bitcoin :)

If you're dumb enough to be scammed, you deserve it.

I can't take someone who confuses something as obvious as Thermalright and Thermaltake seriously.
full member
Activity: 168
Merit: 100
Anyone who thinks it's not a scam, put your money where your mouth is... go pre-order the board.

I can't wait to see your faces a month from now.
sr. member
Activity: 448
Merit: 250
Someone should apply to their job postings. I would do it, but I can't even bullshit tech crap, let alone sound like I really know it.