Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 37. (Read 119440 times)

legendary
Activity: 938
Merit: 1000
What's a GPU?
I have started a new religion where the world ends in 2 hours and 58 mins, anyone interested?

I'm down.
hero member
Activity: 896
Merit: 1000
Buy this account on March-2019. New Owner here!!
I have started a new religion where the world ends in 2 hours and 58 mins, anyone interested?
rjk
sr. member
Activity: 448
Merit: 250
1ngldh
My clock shows somewhat under 4 hours.
I'm guessing that on whatever platform you are using, the javascript doesn't have access to the time zone setting on your computer. Either that, or the time zone setting is wrong on your computer.
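For what it's worth, a countdown that pins its target to an absolute UTC instant sidesteps this whole class of bug. A minimal sketch (hypothetical code, not the site's actual script, assuming the 11:59 PM PDT deadline discussed elsewhere in the thread):

// Hypothetical sketch, not the site's actual code: pin the deadline to a
// UTC instant so the visitor's locale/timezone settings can't skew it.
// 11:59 PM PDT on 5/31/2012 = 06:59 UTC on 6/1/2012 (PDT is UTC-7).
var target = Date.UTC(2012, 5, 1, 6, 59, 0); // JS months are 0-based: 5 = June
var msLeft = target - Date.now();
var hours  = Math.floor(msLeft / 3600000);
var mins   = Math.floor((msLeft % 3600000) / 60000);
document.write(hours + "h " + mins + "m remaining");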
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
My clock shows somewhat under 4 hours.
rjk
sr. member
Activity: 448
Merit: 250
1ngldh
Four hours to go?
My clock shows 11 hours :(
Same here, the javascript code says midnight GMT minus 7 hours, so I'm not sure what's wrong lol
hero member
Activity: 504
Merit: 500
With the last few things you said I believe I know roughly what you will be offering. That's awesome!  Please do say you were on the dev list for one of Enterpoint's units? ;p
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
I just visited the site and it seems the counter will reach 0 at about 5:08 a.m. PDT tomorrow (Thursday).
The day and hour and minute of your birth, by any chance?
Because 5/31 5:08 a.m. looks pretty arbitrary to me, otherwise.

Edit: Hitting F5 changed it to the afternoon now.  5:06 p.m. - still pretty arbitrary. Any meaning to it?

No.  It's supposed to be 11:59 PM PDT.  If it's showing something else, it's because I really suck at writing javascript -- not a hidden message ;)
sr. member
Activity: 448
Merit: 250
TargetDate = "5/31/2012 12:00 PM GMT-7";

D'oh.  That was supposed to be:


TargetDate = "5/31/2012 11:59 PM GMT-7";


… which I have just changed it to.  So, if you're wondering why the clock jumped backward by 11 hours and 59 minutes, this is why.

I just visited the site and it seems the counter will reach 0 at about 5:08 a.m. PDT tomorrow (Thursday).
The day and hour and minute of your birth, by any chance?
Because 5/31 5:08 a.m. looks pretty arbitrary to me, otherwise.

Edit: Hitting F5 changed it to the afternoon now.  5:06 p.m. - still pretty arbitrary. Any meaning to it?
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
About your anti-theft technology. Will buying an IP core involve shipping FPGA boards to you for reprogramming?

Absolutely not.

Or do you have a tool that your future customers can use to basically scan the available Spartan 6 chips and send you some info which you use to make custom bitstreams that only work on those particular devices?

There's only one bitstream per board-type (i.e. clock input pin, clock frequency, and host interface).  Everybody using board XYZ gets the same bitstream.
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
TargetDate = "5/31/2012 12:00 PM GMT-7";

D'oh.  That was supposed to be:


TargetDate = "5/31/2012 11:59 PM GMT-7";


… which I have just changed it to.  So, if you're wondering why the clock jumped backward by 11 hours and 59 minutes, this is why.
hero member
Activity: 489
Merit: 500
Immersionist
About your anti-theft technology. Will buying an IP core involve shipping FPGA boards to you for reprogramming?

Or do you have a tool that your future customers can use to basically scan the available Spartan 6 chips and send you some info which you use to make custom bitstreams that only work on those particular devices?

I am just trying to understand the logistics of your solution. There are people who would have to update a larger number of boards if it ever happens.

I know I should wait for the announcement, but maybe my questions will help in planning your release or add another answer or two to your FAQ.
sr. member
Activity: 266
Merit: 251
Hrm, looks like I picked the wrong week for a heads-down sprint to make a (self-imposed) deadline.  I'll keep my response short since at the moment I can't afford to get drawn into a long fascinating/distracting conversation...

Check PM.

Also - I am waiting for your release :-) And about your protection... this is very interesting, if you can really protect that bitstream!

Because, you know - what I actually missed is that I targeted too high a clock... That is a possible mistake...

How long would it take you to place a slightly different round design the way you did?

I will tell you approximate numbers: the round will be a bit bigger. The ABCDEFGH round part fits into 64-72 slices depending on location.
The W-round fits into 32-40 slices, again depending on location.

Fitting a fully expanded miner into 6 clock regions: I see feasibility there, but definitely, re-starting by building a tool set for that is a big pain, as the chip would not last that long. How long would it take you to implement such a round? The density is very high. If your placer is good enough for that, then this parallel round alone might work at 240 MHz thanks to the round's pipelined design - but only if you manage to place it, since the "standard" placer spreads such a design across the full chip... I tried manually, but only partially, and then threw it away because it is fucking difficult, especially since you have to deal with the DSPs, BRAMs, etc. in tricky ways.

Then my idea: there are 2 places left for rolled miners in the bottom left/right part, plus 9 rightmost, 7 mid-right, 5 mid-left, and 9 leftmost additional rolled-round spaces, + 2 DSP-rounds in the leftmost part, + 2 rolled rounds in the top part.
So 2+9+7+5+9+2 = 34 rounds can still be implemented.

Say, having one parallel round gives us 1 x clock; having 0.52 more makes it 1.52 x clock = 364 MH/s per chip @ 240 MHz. That is higher than I can get with a rolled design alone.
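(A quick sanity check of that estimate - my reading of the numbers, not necessarily the poster's exact model: the 0.52 looks like 34 rolled cores each finishing a hash every ~65 clocks, since 34/65 ≈ 0.52.)

// Back-of-the-envelope check under the assumptions above (hypothetical
// model): one fully pipelined parallel round at 1 hash/clock, plus 34
// rolled cores at roughly 1/65 hash/clock each.
var clockMHz = 240;
var parallel = 1.0;       // fully unrolled pipeline: 1 hash per clock
var rolled   = 34 / 65;   // ~0.52 hashes per clock from the rolled rounds
console.log(((parallel + rolled) * clockMHz).toFixed(0) + " MH/s");
// prints ~366 MH/s, in the same ballpark as the ~364 quoted above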



donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
Hrm, looks like I picked the wrong week for a heads-down sprint to make a (self-imposed) deadline.  I'll keep my response short since at the moment I can't afford to get drawn into a long fascinating/distracting conversation...

About "there's no long lines" - I've already commented, but will try to draw it, where epic fail for parallel expander is exactly....  And in Spartan-6 there's difficult to pass more than 256-bit cross-section in 8 slices height long-way (there's 32 QUAD routes per each switch - so 256-bits would use QUAD routes in horizontal case for 8 slices height).

Then I guess it's a good thing my stages are three times that height!  They're tall-and-skinny (4x24 = 96, minus 8 totally empty = 88 slices per single round) for a reason.  Also, as you point out:

interconnect works in one direction only, so if rounds are placed in a smart way, you'll get more efficient use of routing resources (i.e. A,B <---> C,D, while A --> C and B <--- D are interconnected and placed into the same regions).

Indeed.  This is why I chose a ring-shaped design.  The innermost 8-slice-tall tracks are moving in opposite directions, so they don't compete for QUAD lines -- which doubles my QUAD budget.  I suppose you can now guess which part of each 4x24 region is the message expander ;)

So I really respect the author's work of fitting 1.5 parallel rounds into a Spartan-6 - it is tough and very nice work.

Thanks.  Your results are very impressive too!

I have to say, at times I find myself wishing for the reduced headaches of the sea-of-tiny-hashers approach.  But ultimately I went with an unrolled design for anti-theft reasons (I'll explain in a week) and also to let me hardwire the k-values into the adder LUTs (a three-input-plus-constant adder is a lot smaller than a four-input adder).  Also, the sea-of-tiny-hashers approach yields more benefit if you're willing to do not only algorithmic placement (which both of us do) but also algorithmic routing (which I don't do).  I decided to stick to heuristic routing (except for a very few cases) to preserve portability -- I have a massive pile of Virtex-II Pros that I got almost for free, and I might be able to pick up a bunch of deeply discounted Virtex-5s as well.  Although the slice design has changed a bit during the 10-year span from v2pro to s6, if you don't hardwire the routing it's possible to have an AxB region of Virtex-II Pro "emulate" an XxY region of Spartan-6 very efficiently (although, of course, A>X and B>Y).

By the way, when I first saw your announcement, I took a look at your timing report -- 441 lines of generic boilerplate, and all but 9 lines of the actual report redacted (".... Dropped other traces report ....") and there were no carry chains on the lone path you decided to leave in the report!  At that point I was pretty suspicious.  On the other hand, after reading your postings, you clearly know what you're talking about -- the obnoxious "missing slices in columns 66+67" problem is something most people aren't aware of.  So now I'm leaning back towards believing it.  Anyways, I know you posted the redacted timing report in order to bolster your credibility, but because of the way you've edited it, it actually might have the opposite effect.

sr. member
Activity: 266
Merit: 251
About "there's no long lines" - I've already commented, but will try to draw it, where epic fail for parallel expander is exactly....

say computing w0+w1 and feeding to w9:

                                        ---+---------------------------------
                                   ---+---------------------------------
                              ---+--------------------------------
                          ---+-------------------------------
                     ---+------------------------------
                ---+-----------------------------
           ---+----------------------------
      ---+----------------------------
 ---+---------------------------
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16

How many wires? The biggest cross-section just for that? 9x32 bits :-)
The same happens when pushing w9 to w16... and w14 to w16...
Too lazy to calculate exactly - but nearly a 512-bit cross-section...

I've thought about this - it actually prevented me from falling asleep last night.

A fully unrolled, pipelined miner would use 125 single columns, or maybe 125 double columns. The latter case leads to the dreaded U-turn.
The "current" sixteen w values, 32 bits each, would be percolated down the 125 columns as well. At 4 slices per 32 bits, that's 64 slices.
I see this, for instance, from ZTEX's source code.

Now, in order to retrieve w[i-16], 32 VERTICAL wires are needed.
In order to retrieve w[i-15], another 32 VERTICAL wires are needed, but, as you correctly point out, w[i-16] and w[i-15] can be combined, so only 32 wires are needed. w[i-7] can be added to the sum of w[i-16] and w[i-15], so still only 32 vertical wires are needed. Likewise for w[i-2].

When I said that in a fully unrolled miner no long lines are needed, I meant no HORIZONTAL long lines.
For instance, in ZTEX's Verilog code, references are made to the current stage and to the prior stage only.

OK, I don't know off the top of my head how many vertical wires are available in a Spartan-6, but I just tried to make the case that only 32 are needed. If 32 are not available in a single column, then two columns per stage have to be used, which leads to the dreaded U-turn.

Well - interconnect goes between switches. There are 2 slices per switch (slice L/M and slice X), so the chip is roughly 70-72 switches wide (including BRAM/DSP) and 192 switches tall - which means the horizontal QUAD interconnect is about 2.5 times larger than the vertical. So it is wise to use the horizontal interconnect for the W round... and I tried such designs; however, it starts consuming switches when you want to make the U-turn... because you have to add some registers there, etc... painful and tough...
sr. member
Activity: 448
Merit: 250
About "there's no long lines" - I've already commented, but will try to draw it, where epic fail for parallel expander is exactly....

say computing w0+w1 and feeding to w9:

                                        ---+---------------------------------
                                   ---+---------------------------------
                              ---+--------------------------------
                          ---+-------------------------------
                     ---+------------------------------
                ---+-----------------------------
           ---+----------------------------
      ---+----------------------------
 ---+---------------------------
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16

How many wires? The biggest cross-section just for that? 9x32 bits :-)
The same happens when pushing w9 to w16... and w14 to w16...
Too lazy to calculate exactly - but nearly a 512-bit cross-section...

I've thought about this - it actually prevented me from falling asleep last night.

A fully unrolled, pipelined miner would use 125 single columns, or maybe 125 double columns. The latter case leads to the dreaded U-turn.
The "current" sixteen w values, 32 bits each, would be percolated down the 125 columns as well. At 4 slices per 32 bits, that's 64 slices.
I see this, for instance, from ZTEX's source code.

Now, in order to retrieve w[i-16], 32 VERTICAL wires are needed.
In order to retrieve w[i-15], another 32 VERTICAL wires are needed, but, as you correctly point out, w[i-16] and w[i-15] can be combined, so only 32 wires are needed. w[i-7] can be added to the sum of w[i-16] and w[i-15], so still only 32 vertical wires are needed. Likewise for w[i-2].

When I said that in a fully unrolled miner no long lines are needed, I meant no HORIZONTAL long lines.
For instance, in ZTEX's Verilog code, references are made to the current stage and to the prior stage only.

OK, I don't know off the top of my head how many vertical wires are available in a Spartan-6, but I just tried to make the case that only 32 are needed. If 32 are not available in a single column, then two columns per stage have to be used, which leads to the dreaded U-turn.
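(For reference, the dependency pattern drawn above is just the standard SHA-256 message schedule. A minimal sketch - plain software reference code, not anyone's FPGA implementation - showing why each new word taps w[i-16], w[i-15], w[i-7], and w[i-2], i.e. the 32-bit buses being counted:)

// Reference SHA-256 message expansion (FIPS 180-4), not anyone's FPGA code:
// each w[i] pulls from w[i-16], w[i-15], w[i-7] and w[i-2], which is the
// 32-bit-per-tap cross-section the diagram above is counting.
function rotr(x, n) { return ((x >>> n) | (x << (32 - n))) >>> 0; }
function sigma0(x) { return (rotr(x, 7) ^ rotr(x, 18) ^ (x >>> 3)) >>> 0; }
function sigma1(x) { return (rotr(x, 17) ^ rotr(x, 19) ^ (x >>> 10)) >>> 0; }

function expand(w) { // w: array holding the first 16 32-bit message words
  for (var i = 16; i < 64; i++) {
    w[i] = (w[i - 16] + sigma0(w[i - 15]) + w[i - 7] + sigma1(w[i - 2])) >>> 0;
  }
  return w;
}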
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
I'm definitely not an ASIC developer, so correct me if I'm wrong here. From the small number of simple designs I've done and laid out in Cadence, routing in an ASIC is definitely not free, especially if you're really trying to push the boundaries of all your well, gate, pad, etc. keepouts to maximize density on the silicon. If you're not careful with your planning and design, your number of metal layers can jump way up, which definitely adds to the cost of the design even outside of the possible performance penalties from haphazard routing. I've only ever done work at 90nm and above, so I don't know how difficult the routing would be at 45nm or whatever a BTC ASIC would end up getting designed at, but small rolled cores might be more effective in an ASIC as well as in an FPGA. Someone would actually have to look into it to know. Maybe Vladimir could shed some light onto the subject.

ASIC gives you maximum flexibility in the design. The biggest problem with FPGAs is the fact that FPGA designs must use the DSP blocks and the BRAM for storage of the constants. Routing is still a problem on ASICs, but _you_ design the routing. You no longer have to worry about routing around things as on FPGAs, and you no longer have to worry about paying for hardware you'll never use (for example, that high-speed serial IO fabric isn't cheap, and neither is the onboard Ethernet controller and such).

ASIC has a huge upfront design cost, but if we could sell 250k ASICs (or, approximately more chips than all the FPGAs currently in use for mining put together) it would be cheaper per mhash over the next 10 years by an order of magnitude.
Definitely agree on that. I was more questioning whether a few fully unrolled cores are inherently more efficient than more rolled-up cores. Routing is much more flexible on an ASIC, and things that have been giving eldentyrell fits, like turning the corner, don't have to be an issue, but that alone doesn't mean that unrolled is a better choice than rolled. Is there some inherent advantage to a fully unrolled core that would make it the de facto choice if someone were designing an ASIC?

Unrolled ASIC design seems to be a waste. You have a lot of dependencies, and the dependencies come in nearly identical sets (the only real difference is just shuffling the output to feed it back into the next stage). Hell, unrolled GPU kernels? They're not even unrolled; they just optimize the ordering and parallelization (i.e., what FPGA coders would consider a function of routing).
legendary
Activity: 1274
Merit: 1004
I'm definitely not an ASIC developer, so correct me if I'm wrong here. From the small number of simple designs I've done and laid out in Cadence, routing in an ASIC is definitely not free, especially if you're really trying to push the boundaries of all your well, gate, pad, etc. keepouts to maximize density on the silicon. If you're not careful with your planning and design, your number of metal layers can jump way up, which definitely adds to the cost of the design even outside of the possible performance penalties from haphazard routing. I've only ever done work at 90nm and above, so I don't know how difficult the routing would be at 45nm or whatever a BTC ASIC would end up getting designed at, but small rolled cores might be more effective in an ASIC as well as in an FPGA. Someone would actually have to look into it to know. Maybe Vladimir could shed some light onto the subject.

ASIC gives you maximum flexibility in the design. The biggest problem with FPGAs is the fact that FPGA designs must use the DSP blocks and the BRAM for storage of the constants. Routing is still a problem on ASICs, but _you_ design the routing. You no longer have to worry about routing around things as on FPGAs, and you no longer have to worry about paying for hardware you'll never use (for example, that high-speed serial IO fabric isn't cheap, and neither is the onboard Ethernet controller and such).

ASIC has a huge upfront design cost, but if we could sell 250k ASICs (or, approximately more chips than all the FPGAs currently in use for mining put together) it would be cheaper per mhash over the next 10 years by an order of magnitude.
Definitely agree on that. I was more questioning whether a few fully unrolled cores are inherently more efficient than more rolled-up cores. Routing is much more flexible on an ASIC, and things that have been giving eldentyrell fits, like turning the corner, don't have to be an issue, but that alone doesn't mean that unrolled is a better choice than rolled. Is there some inherent advantage to a fully unrolled core that would make it the de facto choice if someone were designing an ASIC?