Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards

hero member
Activity: 686
Merit: 564
I never really understood the difference, but what I have been able to grasp is that FPGAs have flip-flops per LUT. In CPLDs only the immediate neighbours can be addressed directly. In FPGAs you can also address more distant ones.
Also, Wikipedia claims that CPLDs don't actually use LUTs to implement logic, which makes sense given that they're descended from PALs and those were purely sum-of-products. (The first ones were pretty much just several PALs glued together with some routing logic, from what I can tell.)

This is referred to as wide routing, and signals take longer to propagate through these lines. (Whether those are only hard-wired lines or are buffered somehow, I don't know.)
All modern ones have active routing that buffers the signal somehow, though the original ones did have hard-wired lines.
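To make the sum-of-products point concrete, here is a minimal Python sketch of how a PAL-style array evaluates a function as an OR of AND terms. The data layout (`XOR2_TERMS` as a list of dicts) and the function name are illustrative assumptions, not how any particular CPLD stores its configuration.

```python
# Hypothetical model of a PAL/CPLD-style sum-of-products array.
# Each product term maps an input index to the value that input must have.

def sum_of_products(product_terms, inputs):
    """Return 1 if any product term (AND of literals) is satisfied, else 0."""
    return int(any(
        all(inputs[index] == wanted for index, wanted in term.items())
        for term in product_terms
    ))

# 2-input XOR written as (a AND NOT b) OR (NOT a AND b).
XOR2_TERMS = [
    {0: 1, 1: 0},  # a AND NOT b
    {0: 0, 1: 1},  # NOT a AND b
]

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, sum_of_products(XOR2_TERMS, (a, b)))
```

A LUT, by contrast, stores the output for every input combination directly instead of listing product terms.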
legendary
Activity: 1666
Merit: 1057
Marketing manager - GO MP
That was just a general description of PLDs, and it is valid down the line from SPLDs, CPLDs and FPGAs

I never really understood the difference, but what I have been able to grasp is that FPGAs have flip-flops per LUT. In CPLDs only the immediate neighbours can be addressed directly. In FPGAs you can also address more distant ones.
This is referred to as wide routing, and signals take longer to propagate through these lines. (Whether those are only hard-wired lines or are buffered somehow, I don't know.)

If you have a LUT at a corner or edge, you need to use those far routings to reach the same number of other LUTs as an inner one.
The trick usually is to use them for other tasks than the inner resources, like I/O (they usually have access to a pin).
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
Ah right, that makes sense, thanks! :)
And the flipflops, they're synced to the clock signal, I assume then.

Anyway, the LUTs integrated as RAMs makes sense. I'm wondering though how the wiring works, physically. And also what a corner turn is, but I guess that's about it not being easy to switch direction in the wiring, due to implementation details. I'll look it up myself when I have some time though. Unless you feel like explaining :)
legendary
Activity: 1666
Merit: 1057
Marketing manager - GO MP
I'm trying to derive the theory from reading the FPGA threads, and some rudimentary knowledge. I'm curious if I'm close, can anyone let me know if I'm wrong?

FPGAs (Field programmable gate arrays) are collections of logic gates on a chip. The gate types themselves can be changed, and the wiring between them can be changed, so you can make your own high speed chip layout.

Every timing tick, all gates "do their process", and update their outputs, based on what they just had as inputs.

Corner turns: Is this like, all processing is flowing to the right, and then at the end of the chip you need to do some wiring or tricks to continue processing to the left?

Almost,

The logic elements are nothing but very small RAMs, which are referred to as LUTs (look-up tables).
It is the same thing as writing out a truth table. So you can realize, for example, the XORs in SHA-2 with it.
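As a rough illustration of the "LUT is a tiny RAM holding a truth table" idea, here is a small Python sketch. It builds the table for a 3-input XOR (the kind of bitwise operation that SHA-256's sigma functions perform on three rotated copies of a word) and reads it back by address. The function names are made up for this example; a real FPGA loads these bits from the bitstream.

```python
# Hypothetical software model of a LUT: a list of 2**n output bits,
# addressed by the input bits (input 0 is the least significant address bit).

def make_lut(func, n_inputs):
    """Precompute func for every input combination, like filling a small RAM."""
    table = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table

def lut_read(table, *inputs):
    """Evaluate the logic function by looking up the stored bit."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]

def init_value(table):
    """Pack the table into one integer, bit i holding the output for address i."""
    return sum(bit << i for i, bit in enumerate(table))

xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)

if __name__ == "__main__":
    print(xor3)                   # the eight stored bits
    print(hex(init_value(xor3)))  # the same table as a single constant
    print(lut_read(xor3, 1, 0, 1))
```

Changing the stored bits changes the gate, which is why the same hardware can act as XOR, AND, a majority function, and so on.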

The wiring connects a number of flip-flops to the LUTs, with feedback wires back to the same LUT and on to other LUTs. That is what lets the FPGA do useful computation.
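A minimal Python sketch of that LUT plus flip-flop feedback loop, assuming two flip-flops (`q0`, `q1`) whose outputs feed two LUTs that compute the next state. The tables and names are invented for illustration. The point is that the LUTs are purely combinational and only the flip-flops change on the clock tick, all at once.

```python
# Hypothetical 2-bit counter made of two LUTs and two flip-flops.
# LUT address = q0 + 2*q1, i.e. the current flip-flop outputs fed back as inputs.

NEXT_Q0 = [1, 0, 1, 0]  # next q0 = NOT q0
NEXT_Q1 = [0, 1, 1, 0]  # next q1 = q0 XOR q1

def clock_tick(state):
    """One clock edge: both LUTs read the old state, then both flip-flops update."""
    q0, q1 = state
    address = q0 | (q1 << 1)
    return (NEXT_Q0[address], NEXT_Q1[address])

def run(ticks, state=(0, 0)):
    """Return the list of flip-flop states seen over a number of clock ticks."""
    history = [state]
    for _ in range(ticks):
        state = clock_tick(state)
        history.append(state)
    return history

if __name__ == "__main__":
    for q0, q1 in run(4):
        print(q0 + 2 * q1)
```

Reading the states as binary numbers, the circuit counts 0, 1, 2, 3 and wraps around.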

All in all pretty wasteful but the high integration makes up for it.
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
I'm trying to derive the theory from reading the FPGA threads, and some rudimentary knowledge. I'm curious if I'm close, can anyone let me know if I'm wrong?

FPGAs (Field programmable gate arrays) are collections of logic gates on a chip. The gate types themselves can be changed, and the wiring between them can be changed, so you can make your own high speed chip layout.

Every timing tick, all gates "do their process", and update their outputs, based on what they just had as inputs.

Corner turns: Is this like, all processing is flowing to the right, and then at the end of the chip you need to do some wiring or tricks to continue processing to the left?
member
Activity: 89
Merit: 10

Kudos to you for doing this manual routing! It's one hell of a job :D

a couple of ideas:

Perhaps you could use the dedicated buses between the DSP blocks to help out with the speed of the "corner turns".
Or put RAM blocks between there with a few cycles of pipeline delay to separate the "trees".

It might also be worth investigating smaller devices to get a better "fit", since the cost per LUT is fairly constant.
They are also easier to cool and might run faster and/or have more headroom for overclocking.
hero member
Activity: 504
Merit: 500
TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

If you leave half the chip empty, then yes -- automatic placement will get pretty much the same frequency as hand placement.

If the chip is nearly full, automatic placement won't come even close to hand-placement -- if it can finish at all.

It's been three weeks since PAR has been able to finish my design with the RLOCs turned off at any frequency -- I even tried at 1 MHz!  If the placement is crap there simply aren't enough wires to get from point A to point B.

It also helps to do your layout keeping in mind the wiring structure of the Spartan switchboxes: fast-path between slices in the same CLB, the 1x 2x 4x routing lines, and anything longer than that is too slow to be worth thinking about (unless it's a pure register-to-register path with no combinational logic).

Mmm, talk nerdy to me! Subbed just to follow. Though I am curious to know what you did to improve the corner turns, and do you already have solutions in mind to improve them further?
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

If you leave half the chip empty, then yes -- automatic placement will get pretty much the same frequency as hand placement.

If the chip is nearly full, automatic placement won't come even close to hand-placement -- if it can finish at all.

It's been three weeks since PAR has been able to finish my design with the RLOCs turned off at any frequency -- I even tried at 1 MHz!  If the placement is crap there simply aren't enough wires to get from point A to point B.

It also helps to do your layout keeping in mind the wiring structure of the Spartan switchboxes: fast-path between slices in the same CLB, the 1x 2x 4x routing lines, and anything longer than that is too slow to be worth thinking about (unless it's a pure register-to-register path with no combinational logic).
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
Emphasis on "time" :)
FWIW, I don't think it's feasible to do this directly in Verilog/VHDL.  I had to write a Java library to generate the (totally illegible) Verilog.

I am reloading this thread like a crazed weasel.  I'm quite eager to see what kind of results you get from the manual placement.

Corner turns now running at 80 MHz.

I know this sounds completely nuts, but there is a very, very, very remote chance of being able to cram three full pipelines onto the chip.  If that doesn't work I can put a "half-pipeline" in the empty space, although this takes a bit more than 50% of the area of a full pipeline (since I can't hardwire the K-values into the LUT equations anymore).  So I think the clock:hashes ratio will be at least 2clocks:2.5hashes.

All of my results are on the less-expensive -2 chip (not the more-expensive -3).

The next week will probably be pretty quiet, I have a major non-bitcoin deadline I have to deal with.  Progress should resume after 9-Nov.
rph
full member
Activity: 176
Merit: 100
ztex reached 200MHz on -3 with ISE 13.2 - it builds in about 40 minutes total (map + par) with -xt 5 -t 19.
I bet he used a lot of CPU time to find those settings... :D

TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

-rph
hero member
Activity: 714
Merit: 504
^SEM img of Si wafer edge, scanned 2012-3-12.
Oh wow. I know next to nothing about FPGAs*, but this looks very awesome nonetheless.

*I know that they're field-programmable gate arrays, and they're basically logic components which can be reconnected in different ways, making different chips depending on what the designer wants.
staff
Activity: 4284
Merit: 8808
Emphasis on "time" :)
FWIW, I don't think it's feasible to do this directly in Verilog/VHDL.  I had to write a Java library to generate the (totally illegible) Verilog.

I am reloading this thread like a crazed weasel.  I'm quite eager to see what kind of results you get from the manual placement.
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
Heh, it was a matter of time until somebody hand-placed it. Hardcore.

Emphasis on "time" :)

FWIW, I don't think it's feasible to do this directly in Verilog/VHDL.  I had to write a Java library to generate the (totally illegible) Verilog.
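For readers wondering what "a library that generates Verilog" might look like, here is a very small Python sketch. It is not the author's Java library. It only shows the general idea of emitting one placed primitive per bit from a loop. `LUT3` and the `RLOC` attribute are Xilinx names, but the signal names (`a`, `b`, `c`, `out`), the coordinates, and the "four LUTs per slice" packing rule used here are assumptions made for illustration. A real relatively placed design normally needs further constraints as well.

```python
# Hypothetical generator: one LUT3 per bit, each tagged with an RLOC attribute.
# INIT 8'h96 is the truth table of a 3-input XOR.

def xor3_column(name, width, x=0, y0=0, luts_per_slice=4):
    """Return Verilog text instantiating `width` XOR LUTs stacked in one column."""
    lines = []
    for bit in range(width):
        # Assumed packing: consecutive bits share a slice, four LUTs at a time.
        rloc = f"X{x}Y{y0 + bit // luts_per_slice}"
        lines.append(
            f'(* RLOC = "{rloc}" *) '
            f"LUT3 #(.INIT(8'h96)) {name}_{bit} "
            f"(.O(out[{bit}]), .I0(a[{bit}]), .I1(b[{bit}]), .I2(c[{bit}]));"
        )
    return "\n".join(lines)

if __name__ == "__main__":
    print(xor3_column("xor3", 8))
```

Even this toy version shows why the output gets illegible fast: a full SHA-256 pipeline would need many thousands of such lines, each with hand-chosen coordinates.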
rph
full member
Activity: 176
Merit: 100
Heh, it was a matter of time until somebody hand-placed it. Hardcore.

-rph
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
But I have verified that the circuit you see works (i.e. mines actual shares)... though at 45 MHz since the corner turn routes are a ridiculous 22 ns.

Oh, and, to Xilinx: whatever you did to columns 66 and 67 drives me berserk.  I can't use the entire pair of columns because slices are randomly MISSING for no apparent reason.
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
Excuse the nagging, but are you planning to share your work? :D

Undecided, for now.  But honestly, at 45MH/sec it doesn't matter just yet.
legendary
Activity: 1666
Merit: 1057
Marketing manager - GO MP
Excuse the nagging, but are you planning to share your work? :D
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
And 3 months ago someone said that there's no way to achieve 200MH/s ;)

In fairness, I still do not have the "corner turn" routes running at that speed.  But I have verified that the circuit you see works (i.e. mines actual shares)... though at 45 MHz since the corner turn routes are a ridiculous 22 ns.

So, no celebrations just yet...
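The link between the 22 ns route and the 45 MHz clock is just the reciprocal: the slowest register-to-register path sets the minimum clock period. A small Python sketch of that arithmetic follows. The function names are invented here, and the "one hash per clock per pipeline" default is an assumption, though it is consistent with the 45 MHz and 45 MH/s figures quoted in this thread.

```python
# Back-of-the-envelope timing arithmetic (illustrative helper names).

def fmax_mhz(critical_path_ns):
    """Highest clock frequency whose period still covers the slowest path."""
    return 1000.0 / critical_path_ns

def max_path_ns(target_mhz):
    """Longest path delay that a given target clock can tolerate."""
    return 1000.0 / target_mhz

def hashrate_mhs(clock_mhz, pipelines=1, hashes_per_clock=1.0):
    """Assumes each pipeline finishes hashes_per_clock hashes every cycle."""
    return clock_mhz * pipelines * hashes_per_clock

if __name__ == "__main__":
    print(round(fmax_mhz(22.0), 2))  # clock limit imposed by a 22 ns route
    print(max_path_ns(200.0))        # path budget needed for a 200 MHz clock
    print(hashrate_mhs(45.0))        # one pipeline at 45 MHz
```

So getting from 45 MHz to the 200 MHz range means shrinking the worst route from 22 ns to about 5 ns.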
legendary
Activity: 1029
Merit: 1000
And 3 months ago someone said that there's no way to achieve 200MH/s ;)
donator
Activity: 980
Merit: 1004
felonious vagrancy, personified
@big-chip-small-board:  So you get 100Mhash/s

As you can see, there is room for at least one more copy of the pipeline on the chip.

from this design at what size fpga?

You're looking at an LX150.
