
Topic: Block Erupter: Dedicated Mining ASIC Project (Open for Discussion) - page 6. (Read 58619 times)

sr. member
Activity: 420
Merit: 250
Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short, then stuff doesn't function correctly, or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.

I agree it's easier... but it isn't better. Which was, of course, my entire point.

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 MH/s compared to 220-240 MH/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.

No. You said it yourself, "on an LX150" - the correct way to do this would be to use dozens or hundreds of chips and have them process in a single stage... impractical on an FPGA, but perfectly doable for an ASIC.
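The rolled-vs-unrolled disagreement above can be made concrete with a toy throughput model. This is a minimal sketch; all figures (clock rates, core counts, cycles per hash) are illustrative assumptions echoing numbers mentioned in this thread, not measured silicon:

```python
# Toy model: a fully unrolled pipeline retires one hash per clock per
# stack, while small rolled cores each spend many clocks on one hash.
def unrolled_hashrate(f_mhz, stacks=1):
    return f_mhz * stacks  # MH/s

def rolled_hashrate(f_mhz, cores, cycles_per_hash):
    return f_mhz * cores / cycles_per_hash  # MH/s

# One unrolled stack at 220 MHz vs. 100 rolled cores at 335 MHz taking
# ~110 cycles per hash (figures echoed from this thread):
print(unrolled_hashrate(220))          # 220 MH/s
print(rolled_hashrate(335, 100, 110))  # ~304 MH/s
```

The point of the rolled design is that the tiny cores pack far more densely and close timing more easily, so the core count, not the per-core rate, carries the throughput.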

mrb
legendary
Activity: 1512
Merit: 1028
Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short, then stuff doesn't function correctly, or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.

I agree it's easier... but it isn't better. Which was, of course, my entire point.

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 MH/s compared to 220-240 MH/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.
sr. member
Activity: 420
Merit: 250
Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short, then stuff doesn't function correctly, or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.

I agree it's easier... but it isn't better. Which was, of course, my entire point.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
I think you're suggesting that unrolled cores are the answer. They aren't. You run into timing problems, and you also pay for that silicon to be produced no matter how sparse or packed it is. The best option seems to be iterative rolled up cores that take ~110 cycles to do a nonce, but you have ~100 times more cores.

Plus, it increases yields, as the controller hardware can just test which cores work and ignore known-broken ones (i.e., intentionally binning parts à la modern GPU design).

I'm not familiar with the term "unrolled cores"... I assume it's a bastardization related to how you might unroll loop iterations on an x86 CPU?

Why would there be timing problems? We'd have a relatively slow cycle time... and thus plenty of time to check and/or error-correct.


Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short, then stuff doesn't function correctly, or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.
sr. member
Activity: 420
Merit: 250
I think you're suggesting that unrolled cores are the answer. They aren't. You run into timing problems, and you also pay for that silicon to be produced no matter how sparse or packed it is. The best option seems to be iterative rolled up cores that take ~110 cycles to do a nonce, but you have ~100 times more cores.

Plus, it increases yields, as the controller hardware can just test which cores work and ignore known-broken ones (i.e., intentionally binning parts à la modern GPU design).

I'm not familiar with the term "unrolled cores"... I assume it's a bastardization related to how you might unroll loop iterations on an x86 CPU?

Why would there be timing problems? We'd have a relatively slow cycle time... and thus plenty of time to check and/or error-correct.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Just wanted to toss in my thoughts on the project:

First, get over the "smaller is better" idea. Yes, smaller gaps are nicer for lower power consumption, but it isn't essential. Most miners aren't going to care if the unit is a square foot or 3 square inches... as long as it does the work, and we don't have to modify the cooling.

What you should really look at is... using very large silicon, with the gate structure being shallow and very, very wide. What if you were able to process an entire nonce in a few cycles through a massive ASIC gate array... one that's only as deep as it needs to be to generate a single hash.

The amount of silicon wouldn't raise the price that much, since you'd simply be making the process much more modular than current designs and duplicating it over a much larger number of chips. You would raise costs by needing a custom enclosure and heatsink for the large hardware... but you could recover some of that by using a larger process (90 nm?).

The issue with this design is you need to have the software already optimized before making the hardware.

The downfall of designs at EVERY other ASIC manufacturer seems to be using 'as small as possible' chips, then having to run them at high clock rates doing repetitive incremental work, creating a need for custom cooling and stupidity like cooling the bottom of the PC board with a MOSFET cooler (yah, BFL, I said it). The design goals should be exactly the opposite: load an entire nonce range, process the entire nonce range, then flush the output and start with a new nonce range.


I think you're suggesting that unrolled cores are the answer. They aren't. You run into timing problems, and you also pay for that silicon to be produced no matter how sparse or packed it is. The best option seems to be iterative rolled up cores that take ~110 cycles to do a nonce, but you have ~100 times more cores.

Plus, it increases yields, as the controller hardware can just test which cores work and ignore known-broken ones (i.e., intentionally binning parts à la modern GPU design).
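The binning point above can be sketched in a few lines of pseudo-driver logic. Purely illustrative; `test_core` stands in for whatever self-test the real controller would run:

```python
# Sketch of the yield-binning idea: probe every core once and keep only
# the ones that pass. test_core is a placeholder for the controller's
# real self-test (e.g. hashing a known header and checking the result).
def build_core_mask(test_core, n_cores):
    return [i for i in range(n_cores) if test_core(i)]

# Pretend cores 3 and 17 came out of the fab broken:
good_cores = build_core_mask(lambda i: i not in {3, 17}, 32)
# The chip still ships with 30 usable cores instead of being discarded.
```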
sr. member
Activity: 420
Merit: 250
Just wanted to toss in my thoughts on the project:

First, get over the "smaller is better" idea. Yes, smaller gaps are nicer for lower power consumption, but it isn't essential. Most miners aren't going to care if the unit is a square foot or 3 square inches... as long as it does the work, and we don't have to modify the cooling.

What you should really look at is... using very large silicon, with the gate structure being shallow and very, very wide. What if you were able to process an entire nonce in a few cycles through a massive ASIC gate array... one that's only as deep as it needs to be to generate a single hash.

The amount of silicon wouldn't raise the price that much, since you'd simply be making the process much more modular than current designs and duplicating it over a much larger number of chips. You would raise costs by needing a custom enclosure and heatsink for the large hardware... but you could recover some of that by using a larger process (90 nm?).

The issue with this design is you need to have the software already optimized before making the hardware.

The downfall of designs at EVERY other ASIC manufacturer seems to be using 'as small as possible' chips, then having to run them at high clock rates doing repetitive incremental work, creating a need for custom cooling and stupidity like cooling the bottom of the PC board with a MOSFET cooler (yah, BFL, I said it). The design goals should be exactly the opposite: load an entire nonce range, process the entire nonce range, then flush the output and start with a new nonce range.
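The "load a nonce range, process it, flush" flow described above can be modeled in software. A sketch with a dummy 76-byte header prefix; the double-SHA256 and little-endian nonce packing follow the standard Bitcoin block header layout:

```python
import hashlib
import struct

def scan_nonce_range(header76, target, start, count):
    """Model of the 'load a nonce range, process it, flush' flow: append
    each 32-bit nonce to a fixed 76-byte header prefix, double-SHA256 the
    result, and report nonces whose hash is at or below the target."""
    for nonce in range(start, start + count):
        digest = hashlib.sha256(
            hashlib.sha256(header76 + struct.pack("<I", nonce)).digest()
        ).digest()
        if int.from_bytes(digest[::-1], "big") <= target:
            yield nonce

# Illustrative use with a dummy header and a deliberately easy target:
hits = list(scan_nonce_range(b"\x00" * 76, 2 ** 252, 0, 1000))
```

The hardware equivalent would run the whole range in parallel across the wide array and only raise an output line for the hits.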
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Will there be any way to purchase units using ASICMINER shares?
legendary
Activity: 1778
Merit: 1008
but the chips discussed above are tiny. TINY. 6 mm x 6 mm is smaller than most of my fingernails.

And that's the packaged chip size. The naked die could be many times smaller. I wonder what the rationale is for making such small chips? Small chips mean higher yield, but on a mature process like this, yield wouldn't be a serious issue for anything below 100 mm². Packaging, testing, assembly, cooling, PCB costs, etc. would make this a bad trade-off, I would think.

Friedcat can you say how small the actual die is and why you designed it so small?
The main reason is that it makes the iteration cycle shorter and reduces the potential risk/complexity. This is our first ASIC project, and our main target is a successful production. Any technical challenge should be avoided in the first place, rather than confronted at the cost of a lot of time.

reasonable. get the process down and understand it inside and out first with something small and simple, THEN do the bigger, more complex stuff. very reasonable approach.

...i want more shares now...
donator
Activity: 848
Merit: 1005
but the chips discussed above are tiny. TINY. 6 mm x 6 mm is smaller than most of my fingernails.

And that's the packaged chip size. The naked die could be many times smaller. I wonder what the rationale is for making such small chips? Small chips mean higher yield, but on a mature process like this, yield wouldn't be a serious issue for anything below 100 mm². Packaging, testing, assembly, cooling, PCB costs, etc. would make this a bad trade-off, I would think.

Friedcat can you say how small the actual die is and why you designed it so small?
The main reason is that it makes the iteration cycle shorter and reduces the potential risk/complexity. This is our first ASIC project, and our main target is a successful production. Any technical challenge should be avoided in the first place, rather than confronted at the cost of a lot of time.
legendary
Activity: 980
Merit: 1040
but the chips discussed above are tiny. TINY. 6 mm x 6 mm is smaller than most of my fingernails.

And that's the packaged chip size. The naked die could be many times smaller. I wonder what the rationale is for making such small chips? Small chips mean higher yield, but on a mature process like this, yield wouldn't be a serious issue for anything below 100 mm². Packaging, testing, assembly, cooling, PCB costs, etc. would make this a bad trade-off, I would think.

Friedcat can you say how small the actual die is and why you designed it so small?
hero member
Activity: 938
Merit: 1002
Also, AFAIK, with a decreasing nm figure the mask set becomes more expensive, and the design process takes longer to keep the chance of failure to a minimum...

True, but this doesn't affect J/GHash. However, I think that at least for the first generation of ASICs, production costs will be more important. That's why I like ASICMINER's model.
donator
Activity: 994
Merit: 1000
Jutarul, arklan: what makes you think BFL will use an old 130nm process like friedcat did?
Power efficiency increases with the square of the transistor junction area.
Do the math at 90nm, or 65nm.
I will win the bet ;)

If you disagree with me, please do bet against me on betsofbtco.in!

If your plan was to make an advertisement for your bet, you succeeded. If you have information on what process and supplier BFL is using, please post the corresponding reference links. The processing technology indeed has a huge impact on power efficiency; however, as friedcat has indicated, design optimization is even more important. Also, AFAIK, with a decreasing nm figure the mask set becomes more expensive, and the design process takes longer to keep the chance of failure to a minimum... (?)
mrb
legendary
Activity: 1512
Merit: 1028
Jutarul, arklan: what makes you think BFL will use an old 130nm process like friedcat did?
Power efficiency increases with the square of the transistor junction area.
Do the math at 90nm, or 65nm.
I will win the bet ;)

If you disagree with me, please do bet against me on betsofbtco.in!
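Taking the square-law claim above at face value, the "do the math" step looks like this. This is a sketch of the poster's own simplification; real efficiency gains also depend on voltage, leakage, and design, as the replies point out:

```python
# If power efficiency scaled with the square of the feature size, a
# shrink from 130 nm would give these gains (the poster's assumption):
def square_law_gain(old_nm, new_nm):
    return (old_nm / new_nm) ** 2

print(round(square_law_gain(130, 90), 2))  # 2.09
print(square_law_gain(130, 65))            # 4.0
```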
legendary
Activity: 1778
Merit: 1008
Update

Chip Specification
Technology Summary:
...
Power Consumption: 4.2 J/GHash
...

Very good. Very good. More evidence that I will win my bet
wow. You sure are a glass-is-almost-full kinda guy :)
1 / 4.2 J/GHash ≈ 238 MHash/J
So what you're saying is that BFL, who have not provided any evidence that they actually participated in chip design and optimization (if they have please correct me and post a link) outperform ASICMINER by almost 50%? You know that friedcat et al. spent a lot of time on optimizing that figure?

i've no idea what the impact of this would really be - but the chips discussed above are tiny. TINY. 6 mm x 6 mm is smaller than most of my fingernails. what impact would a large chip size have on the whole hash/joule thing? the FPGAs in the original BFL Single are what, 3 or 4 times that size? larger, fewer chips = better power, maybe?

...yea, i can't even convince myself on that one.
donator
Activity: 994
Merit: 1000
Update

Chip Specification
Technology Summary:
...
Power Consumption: 4.2 J/GHash
...

Very good. Very good. More evidence that I will win my bet
wow. You sure are a glass-is-almost-full kinda guy :)
1 / 4.2 J/GHash ≈ 238 MHash/J
So what you're saying is that BFL, who have not provided any evidence that they actually participated in chip design and optimization (if they have please correct me and post a link) outperform ASICMINER by almost 50%? You know that friedcat et al. spent a lot of time on optimizing that figure?
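The "1/4.2 = 238" line above skips a unit step; spelled out, inverting 4.2 J/GHash involves a factor of 1000 (1 GHash = 1000 MHash):

```python
# 4.2 J/GHash means each joule buys 1/4.2 GHash = 1000/4.2 MHash.
def mhash_per_joule(j_per_ghash):
    return 1000.0 / j_per_ghash

print(round(mhash_per_joule(4.2)))  # 238
```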
mrb
legendary
Activity: 1512
Merit: 1028
Update

Chip Specification
Technology Summary:
  130 nm
  1 Poly
  6 Metal
  1 Top Metal
  Logic Process
Core Voltage: 1.2 V
I/O Voltage: 3.3 V
Core Frequency: 335 MHz
Core Frequency Range: 255-378 MHz
PLL Multiplier: 28
Power Consumption: 4.2 J/GHash
Number of Pads: 40
  22 Data
  18 Power
Package Type: QFN40
Packaged Chip Size: 6 mm x 6 mm

Chip Interface
Data Pins (22 in total):
clk                    i
soft-reset             i
reset                  i
cs                     i
addr[6]                i
data[8]                i/o
w_valid                i
w_allow                o
r_allow                o
r_req                  i

Address Allocation:
0-31    writing midstate
32-43   writing data
44-47   reading nonce
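The register map above suggests a simple host-side access sequence. A hypothetical sketch: `write_reg`/`read_reg` are placeholders for the real bus primitives, and the nonce byte order is an assumption a datasheet would need to confirm:

```python
# Hypothetical host-side access sequence for the register map above.
# write_reg/read_reg are placeholders for the real bus primitives, and
# the nonce byte order is an assumption, not taken from the spec.
MIDSTATE_BASE, DATA_BASE, NONCE_BASE = 0, 32, 44

def load_work(write_reg, midstate32, data12):
    """Write the 32-byte SHA-256 midstate to addresses 0-31 and the 12
    remaining header bytes to addresses 32-43, one byte per address."""
    for i, b in enumerate(midstate32):
        write_reg(MIDSTATE_BASE + i, b)
    for i, b in enumerate(data12):
        write_reg(DATA_BASE + i, b)

def read_nonce(read_reg):
    """Assemble a 32-bit nonce from addresses 44-47 (assumed little-endian)."""
    return int.from_bytes(bytes(read_reg(NONCE_BASE + i) for i in range(4)),
                          "little")
```

A dict-backed fake bus is enough to exercise the sequence; the real chip would additionally need the w_valid/w_allow/r_allow/r_req handshake lines driven around each transfer.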


Very good. Very good. More evidence that I will win my bet
legendary
Activity: 1270
Merit: 1000
I think there is confusion between MH/s (megahashes per second) and MHz (megahertz, clock frequency).

No.

Current implementations are made so that each clock does one "round" of the SHA-256 algorithm, but in multiple stages, usually 128, to make one complete double-SHA256 Bitcoin computation (sometimes called a "stack").

So 1 chip with one complete stack performs 1 complete computation each clock in such implementations, and therefore MHz = MH/s.

That was my question: is there only one stack in this chip?

If you had 2 stacks in the chip, you would have MH/s = 2 × MHz.

This is only true for most of the designs in use. Bitfury, at least, takes the sea-of-hashers approach, where each hashing core needs ~68 cycles to compute a hash, but it turned out to allow more effective device utilisation.

As far as I understand the discussion between bitfury and friedcat, the sea-of-hashers approach was at least taken into account.

Maybe friedcat could produce a preliminary datasheet that would be sufficient to design a PCB. I am curious whether the PLL multiplier is fixed, in which case fine-tuning each chip would be a little more complicated than just setting a register.
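A small aside on the PLL question: if the multiplier simply scales an external reference clock (an assumption; nothing in the posted spec says how the PLL is arranged), the published figures imply the reference frequency:

```python
# Spec figures quoted in this thread: 335 MHz core clock, PLL multiplier 28.
# Assuming core = reference x multiplier, the implied reference is ~12 MHz.
CORE_MHZ, PLL_MULT = 335, 28
ref_mhz = CORE_MHZ / PLL_MULT
print(round(ref_mhz, 2))  # 11.96
```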
hero member
Activity: 728
Merit: 540
I think there is confusion between MH/s (megahashes per second) and MHz (megahertz, clock frequency).

No.

Current implementations are made so that each clock does one "round" of the SHA-256 algorithm, but in multiple stages, usually 128, to make one complete double-SHA256 Bitcoin computation (sometimes called a "stack").

So 1 chip with one complete stack performs 1 complete computation each clock in such implementations, and therefore MHz = MH/s.

That was my question: is there only one stack in this chip?

If you had 2 stacks in the chip, you would have MH/s = 2 × MHz.

legendary
Activity: 980
Merit: 1040
Core Frequency: 335 MHz
Core Frequency Range: 255-378 MHz

Do I understand this correctly if I conclude that one chip will mine @ 335 MH/s? Or 378 MH/s if it's a good one?

Yes.

255-378 MHz is the result of the back-end simulation under 1.2 V. If you over-volt, it will probably be significantly higher, but the stability is hard to say. Exactly how high a frequency we could push them to can only be answered when the chips are out.

I think there is confusion between MH/s (megahashes per second) and MHz (megahertz, clock frequency).