
Topic: Block Erupter: Dedicated Mining ASIC Project (Open for Discussion) - page 5.

sr. member
Activity: 800
Merit: 250
I came in late, is the project still going forward despite all the GLBSE hoo ha?

Check the ASICMINER thread. Everything is still going; friedcat is just waiting for Nefario to send the list of shareholders.
hero member
Activity: 602
Merit: 500
I came in late, is the project still going forward despite all the GLBSE hoo ha?
hero member
Activity: 868
Merit: 1000
Anyone know which way friedcat decided to go with this? i.e. sell hardware, shares, self-mine?

See this topic: ASICMINER https://bitcointalksearch.org/topic/asicminer-entering-the-future-of-asic-mining-by-inventing-it-99497

The first 16 TH/s (roughly) will be used for mining for the company; then the mining farm will be extended while the ASICs are sold to the general public.
full member
Activity: 157
Merit: 103
Anyone know which way friedcat decided to go with this? i.e. sell hardware, shares, self-mine?
legendary
Activity: 2128
Merit: 1073
That's why I'm definitely using USB direct for all ASIC USB devices - not serial-USB, which adds more overhead on top of it (and timing issues)
Thank you for the writeup. I'm not really familiar with building clusters using USB, I always worked with real serial HDLC/RS-232/RS-422 controllers or with Ethernet multicast.

The only real USB experience I had was with FTDI USB controllers. Neither ngzhang nor Enterpoint bothered to route all available signals from the serial chip to the FPGAs, so the high-bandwidth low-latency modes of transmission couldn't be used with them.

Hopefully the ASIC controller designers won't make the same mistakes and will allow you to use isochronous or bulk modes when the bus utilization becomes non-negligible.
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
only to find all the windows problems were driver related - not my code.
I'm presuming that you had problems with the usbser.sys from Microsoft. Did you also have problems with the Prolific/FTDI drivers?

I think (though not 100% sure) the serial-USB device's existence is decided by the firmware so it can be fixed after the fact anyway?
On the LPC1343, like the ModMiner, yes. ngzhang used hard serial-USB chips (Prolific or FTDI) in his designs. Same with Enterpoint (FTDI).

I've only done the MMQ so far. Though the code is 'done' and in my git, I haven't sent the pull request to cgminer yet coz I've had to rebase it, and there seems to be a bug in the ~2k lines of code I've changed that I still have to track down :)

Most likely I'll try Icarus/Prolific next and find more obstacles :P
(unless I get side-tracked on something else ... an ASIC device shows up? :D)

The Windows driver workaround, in the MMQ case, was to use http://sourceforge.net/projects/libwdi/files/zadig/ to force it to use WinUSB (on WinXP)
So it's not insurmountable - but best if not every Windows end user has to do that.

I've bitched about Serial-USB for a long time but only recently got around to doing this USB direct implementation
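For illustration, here's a minimal libusb-1.0 sketch of what "USB direct" means in practice - the VID/PID, endpoint address and 64-byte work format are placeholders for the sake of the example, not the real MMQ or ASIC values:

Code:
/* Direct USB access via libusb-1.0: no tty layer, no serial framing.
 * VID/PID (0x1fc9:0x0003) and endpoint 0x01 are hypothetical. */
#include <libusb-1.0/libusb.h>
#include <stdio.h>

int main(void)
{
	libusb_context *ctx;
	libusb_device_handle *h;
	unsigned char work[64] = {0};
	int sent, ret;

	if (libusb_init(&ctx))
		return 1;

	h = libusb_open_device_with_vid_pid(ctx, 0x1fc9, 0x0003);
	if (!h) {
		fprintf(stderr, "device not found\n");
		libusb_exit(ctx);
		return 1;
	}

	libusb_claim_interface(h, 0);

	/* one bulk OUT transfer carries the whole work item, 1s timeout */
	ret = libusb_bulk_transfer(h, 0x01, work, sizeof(work), &sent, 1000);
	printf("ret=%d sent=%d\n", ret, sent);

	libusb_release_interface(h, 0);
	libusb_close(h);
	libusb_exit(ctx);
	return 0;
}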

Firstly, I've only been messing with USB for a few weeks, so if anything below is way off - let me know.

Guessing at the early figures and considering around 50GH/s from a single device using 1-diff shares, and that USB has a standard transaction time of 0.125ms for 480Mbit/s USB 2.0, there already isn't a lot of space (and the txn time is higher for 12Mbit/s full speed: 1ms)

50GH/s is 11.6 1-diff shares a second on average, so just dealing with 6 transactions per share (send work, verify, request, receive, request, finished)
you're using up almost 1% of the USB bus for a single device (0.87%)
There's of course more overhead (device status e.g. temperature or anything else available to be monitored) but 6 is pretty much the minimum.
Add 10 of these devices ... and I've no idea how well USB works running at ~9% capacity (and how that affects other USB devices)
Also, if the device is idle for even 1ms waiting for work, that's more than 1% of its work time lost
That's why I'm definitely using USB direct for all ASIC USB devices - not serial-USB, which adds more overhead on top of it (and timing issues)
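As a sanity check on those numbers, a throwaway C sketch - the 6-transactions-per-share protocol and the 0.125ms high-speed transaction time are the assumptions from the paragraphs above, not measurements:

Code:
#include <stdio.h>

int main(void)
{
	double ghs = 50.0;                        /* device hash rate, GH/s */
	double shares = ghs * 1e9 / 4294967296.0; /* 1-diff shares/s = rate / 2^32 */
	double txn_ms = 0.125;                    /* USB 2.0 high-speed txn time */
	double txns_per_share = 6.0;              /* send work, verify, request,
	                                             receive, request, finished */

	double busy_ms = shares * txns_per_share * txn_ms; /* bus ms used per second */
	printf("%.2f shares/s -> %.2f%% of the bus per device\n",
	       shares, busy_ms / 10.0);
	/* prints: 11.64 shares/s -> 0.87% of the bus per device */
	return 0;
}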

Down the track, once the first-version ASIC devices have been optimised more for hashing performance (e.g. passing the share difficulty down to the firmware if that isn't done already ... or even going as far as implementing something like Stratum in the firmware), this will reduce the bandwidth usage of a single device. But then again, it shouldn't be that far down the track before 50GH/s per USB device increases substantially.
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
...
Edit: Note to self: Kano is swapping the standard terminology: step vs. round. Using standard terminology, the first SHA-256 hash in Bitcoin consists of 2 steps of 64 rounds each.
...
Yep I mixed them around - oh well - fortunately it was obvious :D
Thanks for correcting me.
legendary
Activity: 2128
Merit: 1073
1) Firstly, the double sha256 is a total of 3 rounds (with 64 steps each) - just the whole first round is constant across a full nonce range
(its output is commonly known as the midstate), which you only need to compute once per nonce range.
2) Secondly, the first 3 steps of the 2nd round are constant across a full nonce range.
3) Thirdly, some of the W values are also constant across a full nonce range (easy to work out which)
4) Then finally, as you said, you don't need to complete the last 3 steps of the 3rd round.
Thanks. I'm quoting this because it is a very nice reference for the state-of-the-art GPU/FPGA optimizations. I remembered the 4) on your list the most because it most clearly shows the shift-register structure inherent to the SHA-256.

Edit: Note to self: Kano is swapping the standard terminology: step vs. round. Using standard terminology, the first SHA-256 hash in Bitcoin consists of 2 steps of 64 rounds each.

In ASIC terms it would be risky to implement any of 2, 3 or 4
While you may gain a few % overall (6 out of 128 steps, plus the W optimisations), it also means you can only sha256 an exact BTC block header.
If BTC continues to use sha256 but makes any changes to the block header, then that wouldn't be a problem if none of steps 2, 3 or 4 were implemented in the silicon, since you could change the firmware to deal with a different header.
At least for the chip discussed in this thread it appears that the block header structure is fixed:
0-31    writing midstate
32-43   writing data
44-47   reading nonce
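Assuming those are byte addresses and the conventional midstate-plus-tail split, loading work into such a chip would look roughly like this sketch - sha256_transform() stands in for any standard single-block compression, it is not this chip's documented API, and real chips often add endianness quirks on top:

Code:
/* Hedged sketch: mapping an 80-byte block header onto the register
 * map above (0-31 midstate, 32-43 data, 44-47 nonce). */
#include <stdint.h>
#include <string.h>

void sha256_transform(uint32_t state[8], const uint8_t block[64]); /* any standard impl */

static const uint32_t H0[8] = {
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

void load_work(uint8_t regs[48], const uint8_t header[80])
{
	uint32_t mid[8];

	/* regs 0-31: midstate = one compression of header bytes 0-63;
	 * constant for the whole nonce range, so computed on the host */
	memcpy(mid, H0, sizeof(mid));
	sha256_transform(mid, header);
	memcpy(regs, mid, 32);

	/* regs 32-43: the 12 header bytes the chip still needs
	 * (merkle-root tail, ntime, nbits); the chip iterates the nonce */
	memcpy(regs + 32, header + 64, 12);

	/* regs 44-47: the winning nonce is read back from here */
}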
legendary
Activity: 2128
Merit: 1073
only to find all the windows problems were driver related - not my code.
I'm presuming that you had problems with the usbser.sys from Microsoft. Did you also have problems with the Prolific/FTDI drivers?

I think (though not 100% sure) the serial-USB device's existence is decided by the firmware so it can be fixed after the fact anyway?
On the LPC1343, like the ModMiner, yes. ngzhang used hard serial-USB chips (Prolific or FTDI) in his designs. Same with Enterpoint (FTDI).
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
1) Firstly, the double sha256 is a total of 3 rounds (with 64 steps each) - just the whole first round is constant across a full nonce range
(its output is commonly known as the midstate), which you only need to compute once per nonce range.
2) Secondly, the first 3 steps of the 2nd round are constant across a full nonce range.
3) Thirdly, some of the W values are also constant across a full nonce range (easy to work out which)
4) Then finally, as you said, you don't need to complete the last 3 steps of the 3rd round.

In ASIC terms it would be risky to implement any of 2, 3 or 4
While you may gain a few % overall (6 out of 128 steps, plus the W optimisations), it also means you can only sha256 an exact BTC block header.
If BTC continues to use sha256 but makes any changes to the block header, then that wouldn't be a problem if none of steps 2, 3 or 4 were implemented in the silicon, since you could change the firmware to deal with a different header.
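To make 2) and 3) concrete: in the 2nd round (the compression whose input contains the nonce), only one of the sixteen input words actually changes across a nonce range. A sketch, assuming the standard 80-byte header layout:

Code:
#include <stdint.h>

/* Input words of the 2nd compression of the first sha256.
 * Since steps 0-2 consume only W[0..2], which are constant here,
 * the state after those 3 steps is constant too (point 2 above). */
void second_block_w(uint32_t W[16], const uint32_t tail[3], uint32_t nonce)
{
	W[0] = tail[0];    /* last merkle-root word: constant per work item */
	W[1] = tail[1];    /* ntime: constant */
	W[2] = tail[2];    /* nbits: constant */
	W[3] = nonce;      /* the only word that varies */
	W[4] = 0x80000000; /* sha256 padding: always constant */
	for (int i = 5; i < 15; i++)
		W[i] = 0;
	W[15] = 640;       /* message length in bits (80 bytes): constant */
}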
legendary
Activity: 2128
Merit: 1073
seventy-two 32-bit registers
Fixed that first part for you... but just wasn't up to trying to edit the rest for conceptual logic failures.
The 72 misconception is really getting boring.

FIPS-180-2 defines SHA-256 in terms of two arrays of 32-bit words:
H[8] and W[64]. 8+64=72. Yet a quick comparison with SHA-1 shows that the same "alternative implementation" can be used for SHA-256.

In the case of SHA-1 the "original implementation" is H[5] and W[80], while the "alternative implementation" is H[5] and W[16]. Thus: 85 vs. 21.

In the case of SHA-256 we have 72 vs. 24 (H[8] and W[16]).

The further observation is that the "arrays" or "circular queues" in the FIPS-180-2 definition aren't really accessed randomly or in any variable order. Therefore both H and W can be converted to 32-bit wide shift registers, but with unusual feedback functions.
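In C terms, applying the SHA-1-style "alternative implementation" to SHA-256 gives W as a 16-word circular queue; the four feedback taps here are the same w-2, w-7, w-15, w-16 terms that come up later in this page:

Code:
/* W as a 16-word circular queue (24 words of state total with H[8])
 * instead of the textbook W[64] array. */
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define s0(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ ((x) >> 3))
#define s1(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ ((x) >> 10))

/* For round t >= 16: w[t & 15] currently holds W[t-16]; after the
 * update it holds W[t]. Only the four tap positions are read. */
static inline uint32_t next_w(uint32_t w[16], int t)
{
	w[t & 15] += s1(w[(t - 2) & 15]) + w[(t - 7) & 15] + s0(w[(t - 15) & 15]);
	return w[t & 15];
}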

The above is just for pure SHA-256, without any Bitcoin-specific optimizations. At least two people claimed to be able to apply some unspecified optimizations to the Bitcoin hash expressed as a boolean function:

1) killerstorm
https://bitcointalksearch.org/topic/sha-256-as-a-boolean-function-55888

2) Gareth (BitInstant)
https://bitcointalksearch.org/topic/m.557579

but nothing came out of it. By now pretty much everyone knows that one can shave the last 3 rounds from the 2nd SHA-256 in Bitcoin: instead of looking for a zero 32-bit word at the most significant position in H, take advantage of the fact that H is a shift register and the last 3 rounds simply shift the would-be most-significant word. So look for the negation of a specific constant value (-0x5be0cd19, the initial H7, i.e. 0xa41f32e7).
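Spelled out in C (rounds counted 1-based; 0xa41f32e7 is -0x5be0cd19, the initial H7, mod 2^32):

Code:
/* After round 61 of 64, the e register already holds what rounds
 * 62-64 will shift unchanged into h; the final addition makes
 * hash[7] = e_61 + 0x5be0cd19, so hash[7] == 0 is one compare. */
#include <stdint.h>

int maybe_share(uint32_t e_after_round_61)
{
	return e_after_round_61 == 0xa41f32e7u; /* == -0x5be0cd19 mod 2^32 */
}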
 
Software optimization isn't the same as hardware optimization. ASIC design should not be thought of as "let's make a chip that can do this calculation at 2000MHz over and over and over..." - that's counter-intuitive. You've locked yourself into thinking in terms of GPU design, which need not apply to other processes. The reason GPUs (and yes, CPUs too) are designed this way is because they are multi-function chips. There are operations they know how to do, and they process things according to instructions. That's fine for generalized applications. In the case of a GPU you've got a hard limit / goal of producing a video frame every so many fractions of a second... sha2 just doesn't need that level of coordination. You aren't having to handle a variety of instructions - it's a single process that doesn't change.

Besides which, we're not actually talking about that much data. sha2... we only need to work with 512 bits at a time. At the very least we had better be unrolling the chunk processing so that it isn't looping... that's hardware design 101.

I'm puzzled by this part. I never mentioned CPUs or GPUs. I tried to pattern my argument after the sort of arguments that were being made around 1980 during the http://en.wikipedia.org/wiki/Mead_%26_Conway_revolution . Perhaps you were mixing me up with someone else?
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
Oh yeah - and make sure the MCU guy stops it from producing a serial device on Windows
(doesn't matter on linux, but that will stop it on linux also of course)

All code for ASICs should be using USB direct, not serial-USB
And having the serial-USB can cause problems on Windows (and usually means a manual driver fix)

I've been screwing around with this for the last few weeks on an MMQ-FPGA converting it from serial-USB to USB
only to find all the windows problems were driver related - not my code.
Lucky I've had access in IRC to the guy who does libusb, to help me sort it out :)

My reason for doing this was to prepare for the ASIC devices from each of the companies - and I'm glad I did do it in advance - coz the problems have been rather annoying.
Of course there will be other issues when dealing with ASICs, but I can't do anything about them until I have the devices.

I think (though not 100% sure) the serial-USB device's existence is decided by the firmware so it can be fixed after the fact anyway?
sr. member
Activity: 420
Merit: 250
The argument for the sea-of-hashes design can be derived from the classic analysis made by Mead & Conway and contemporaries.

Consider a circular sea-of-gates big enough to implement many copies of the Bitcoin double-SHA256.

SHA256 is basically ~~a pair of 32-bit wide shift registers~~ seventy-two 32-bit registers


Fixed that first part for you... but just wasn't up to trying to edit the rest for conceptual logic failures.

Software optimization isn't the same as hardware optimization. ASIC design should not be thought of as "let's make a chip that can do this calculation at 2000MHz over and over and over..." - that's counter-intuitive. You've locked yourself into thinking in terms of GPU design, which need not apply to other processes. The reason GPUs (and yes, CPUs too) are designed this way is because they are multi-function chips. There are operations they know how to do, and they process things according to instructions. That's fine for generalized applications. In the case of a GPU you've got a hard limit / goal of producing a video frame every so many fractions of a second... sha2 just doesn't need that level of coordination. You aren't having to handle a variety of instructions - it's a single process that doesn't change.

Besides which, we're not actually talking about that much data. sha2... we only need to work with 512 bits at a time. At the very least we had better be unrolling the chunk processing so that it isn't looping... that's hardware design 101.


legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Hey friedcat, if you give me one of these, I'll make sure DiabloMiner supports it.
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
So I see what's going on in here :)

My post in the shareholder thread that seems like it should be in here:
https://bitcointalksearch.org/topic/m.1350329
legendary
Activity: 2128
Merit: 1073
That's because the routing really sucks on Spartan-6 FPGAs. I'm not convinced an ASIC would have the same problem.
The argument for the sea-of-hashes design can be derived from the classic analysis made by Mead & Conway and contemporaries.

Consider a circular sea-of-gates big enough to implement many copies of the Bitcoin double-SHA256.

SHA256 is basically a pair of 32-bit wide shift registers with somewhat convoluted feedback logic. The feedback logic is active (doing the actual computation) whereas the D-type flip-flops and connections are passive (they just shuffle the signal around). Let X be the average connection length in this design.

Now think about unrolling the above design over a plane. You'll need the values of the feedback terms from the neighbouring cells w-2, w-7, w-15, w-16. Your average connection length rises (2+7+15+16)/4 = 10 times, to about 10X. So the passive losses in the interconnect rose by about an order of magnitude. You could compensate for this by removing some D-type flip-flop stages and slowing down the clock. By definition you can't really remove the active logic gates that compute the feedback terms. As an extreme you can have a purely combinatorial SHA-256 hasher doing everything in a single cycle of a rather slow clock.

I'm not aware of any neat analytical solution for the above optimization problem. But numerical experiments show that racing combinatorial signals over vast expanses of silicon is a losing game. The speed of light in a MOS transmission line is much less than the speed of light in vacuum. This analysis can be made without actual place-and-route; it is sufficient to have an estimated distribution of the interconnection lengths that create a planar graph for the logic. I don't recall if sphere-of-gates instead of sea-of-gates is a win, but sphere-of-gates has an obvious thermal problem even if we could somehow manufacture it.

In summary: wafer-scale integration was attempted several times in the past without an obvious win. Check out the history before you follow that trail.
hero member
Activity: 560
Merit: 500
Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short then stuff doesn't function correctly, or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.

I agree it's easier... but it isn't better. Which was of course, my entire point.

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.

I can confirm that Bitfury's bitstream runs at ~305 Mhash/s in the normal voltage range as per the Xilinx specs.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.
That's because the routing really sucks on Spartan-6 FPGAs. I'm not convinced an ASIC would have the same problem.

It would, but heat and (lack of) voltage (lowered to prevent more heat, which causes stability issues) would become much more apparent on more complex designs. Spartan-6s just make the problem seem a magnitude or two worse than it is.
mrb
legendary
Activity: 1512
Merit: 1028
firefop, I think you are confusing 2 aspects which are orthogonal to each other: the die size (large or small) is mostly irrelevant to the type of design (tiny hashing cores or large unrolled cores).

For one, even a large unrolled core would fit in a chip smaller than BFL's SC (56.25mm² at 65nm). So no matter what design you choose (unrolled or not), you can put as many cores as you want to target whatever die area you want.

You guys claim that routing is not an issue on ASICs, but this is incorrect too. It is less of an issue compared to FPGAs, but it still is one, especially for SHA-256 where you have 256 bits of state to manipulate. If you are familiar with the algorithm, you should know that this state (A..H) is rotated in the main loop, so the 256 bits are used all over the place and create routing challenges. This is less of an issue with a non-unrolled design, as the state can be kept close to the tiny core.
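For reference, one SHA-256 round in plain C (straight out of FIPS 180-2) shows the rotation being described: only a and e are freshly computed, the other six words just shift along, yet all 256 bits of a..h move every clock:

Code:
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define S0(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
#define S1(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
#define CH(e, f, g)  (((e) & (f)) ^ (~(e) & (g)))
#define MAJ(a, b, c) (((a) & (b)) ^ ((a) & (c)) ^ ((b) & (c)))

/* s[0..7] = a..h */
void sha256_round(uint32_t s[8], uint32_t k, uint32_t w)
{
	uint32_t t1 = s[7] + S1(s[4]) + CH(s[4], s[5], s[6]) + k + w;
	uint32_t t2 = S0(s[0]) + MAJ(s[0], s[1], s[2]);

	s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; /* h, g, f: pure shifts */
	s[4] = s[3] + t1;                      /* e: active logic */
	s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; /* d, c, b: pure shifts */
	s[0] = t1 + t2;                        /* a: active logic */
}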
hero member
Activity: 686
Merit: 564
No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.
That's because the routing really sucks on Spartan-6 FPGAs. I'm not convinced an ASIC would have the same problem.