Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 50.

makomk

hero member

Activity: 686

Merit: 564

Quote from: ElectricMucus on November 01, 2011, 04:07:30 PM

I never really understood the difference, but what I have been able to grasp is that FPGAs have flip-flops per LUT. In CPLDs only the immediate neighbours can be addressed directly. In FPGAs you can also address more distant ones.

Also, Wikipedia claims that CPLDs don't actually use LUTs to implement logic, which makes sense given that they're descended from PALs and those were purely sum-of-products. (The first ones were pretty much just several PALs glued together with some routing logic, from what I can tell.)

Quote from: ElectricMucus on November 01, 2011, 04:07:30 PM

This is referred to as wide routing, and signals take longer to propagate through these lines. (Whether those are only hard-wired lines or are buffered somehow, I don't know.)

All modern ones have active routing that buffers the signal somehow, though the original ones did have hard-wired lines.

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

That was just a general description of PLDs, and it is valid all the way down the line, from SPLDs through CPLDs to FPGAs.

I never really understood the difference, but what I have been able to grasp is that FPGAs have flip-flops per LUT. In CPLDs only the immediate neighbours can be addressed directly. In FPGAs you can also address more distant ones.
This is referred to as wide routing, and signals take longer to propagate through these lines. (Whether those are only hard-wired lines or are buffered somehow, I don't know.)

If you have a LUT at a corner or edge, you need to use those far routes to reach the same number of other LUTs as an inner one can.
The usual trick is to use them for different tasks than the inner resources, such as I/O (they usually have access to a pin).
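To put a rough number on the corner/edge point: below is a toy sketch, assuming the fabric is a plain square grid and that "reachable by short routing" means "within a small Manhattan distance". Real switchbox topologies are more complicated than this, and the grid size and radius are made up for illustration.

```python
def reachable(cols, rows, x, y, radius):
    """Count the other grid cells within Manhattan distance `radius` of (x, y)."""
    count = 0
    for cx in range(cols):
        for cy in range(rows):
            d = abs(cx - x) + abs(cy - y)
            if 0 < d <= radius:
                count += 1
    return count

GRID = 9  # hypothetical 9x9 array of logic cells
corner = reachable(GRID, GRID, 0, 0, 2)
edge = reachable(GRID, GRID, 4, 0, 2)
centre = reachable(GRID, GRID, 4, 4, 2)
print(corner, edge, centre)
```

In this model a corner cell sees fewer than half as many neighbours as a cell in the middle, so it has to lean on the longer (slower) lines to make up the difference.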

BTCurious

hero member

Activity: 714

Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.

Ah right, that makes sense, thanks!

And the flipflops, they're synced to the clock signal, I assume then.

Anyway, implementing the LUTs as RAMs makes sense. I'm wondering though how the wiring works, physically. And also what a corner turn is, but I guess it's about it not being easy to switch direction in the wiring, due to implementation details. I'll look it up myself when I have some time though. Unless you feel like explaining.

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Quote from: BTCurious on November 01, 2011, 02:47:04 AM

I'm trying to derive the theory from reading the FPGA threads, and some rudimentary knowledge. I'm curious if I'm close, can anyone let me know if I'm wrong?

FPGAs (Field programmable gate arrays) are collections of logic gates on a chip. The gate types themselves can be changed, and the wiring between them can be changed, so you can make your own high speed chip layout.

Every timing tick, all gates "do their process", and update their outputs, based on what they just had as inputs.

Corner turns: Is this like, all processing is flowing to the right, and then at the end of the chip you need to do some wiring or tricks to continue processing to the left?

Almost,

The logic elements are nothing but very small RAMs, which are referred to as LUTs (look-up tables).
It is the same thing as writing out a logic table, so you can realize, for example, the XORs in SHA-2 with them.
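As a minimal sketch of the "LUT is just a small RAM" idea (in Python, purely for illustration; the helper names `make_lut` and `lut_read` are made up): the function's inputs form an address, and the stored bit at that address is the output. The three-input XOR and the majority function used here both appear in SHA-256.

```python
def make_lut(func, n_inputs):
    """Precompute the truth table (the "RAM contents") of an n-input boolean function."""
    table = []
    for addr in range(2 ** n_inputs):
        # Input i is bit i of the address.
        bits = [(addr >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table

def lut_read(table, *bits):
    """Evaluate the LUT: the input bits form an address, the stored bit is the output."""
    addr = sum(b << i for i, b in enumerate(bits))
    return table[addr]

xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)
maj = make_lut(lambda a, b, c: (a & b) ^ (a & c) ^ (b & c), 3)

print(xor3)                     # the 8 stored bits for a 3-input XOR
print(lut_read(xor3, 1, 0, 1))  # look up one input combination
```

Changing what the "gate" does is then just a matter of writing different bits into the table, which is what configuring the FPGA amounts to for the logic part.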

The LUTs are connected to a number of flip-flops, which have feedback wires to themselves and to other LUTs. That is what lets the FPGA be used for useful computation.
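A toy sketch of how flip-flops and feedback fit together, assuming three flip-flops and one XOR (the part a LUT would compute). This is not taken from the miner; it only shows that on each clock edge every flip-flop latches a value computed from the old state, all at once.

```python
def tick(state):
    """One clock edge: every flip-flop latches a value computed from the OLD state."""
    q0, q1, q2 = state
    feedback = q1 ^ q2         # combinational logic, i.e. what a LUT would compute
    return (feedback, q0, q1)  # all three flip-flops update together

state = (1, 0, 0)
history = [state]
for _ in range(7):
    state = tick(state)
    history.append(state)

for step, s in enumerate(history):
    print(step, s)
```

With this particular feedback the three bits walk through all seven non-zero states before repeating, so even a tiny LUT-plus-flip-flop loop already behaves like a state machine.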

All in all pretty wasteful but the high integration makes up for it.

BTCurious

hero member

Activity: 714

Merit: 504

I'm trying to derive the theory from reading the FPGA threads, and some rudimentary knowledge. I'm curious if I'm close, can anyone let me know if I'm wrong?

FPGAs (Field programmable gate arrays) are collections of logic gates on a chip. The gate types themselves can be changed, and the wiring between them can be changed, so you can make your own high speed chip layout.

Every timing tick, all gates "do their process", and update their outputs, based on what they just had as inputs.

Corner turns: Is this like, all processing is flowing to the right, and then at the end of the chip you need to do some wiring or tricks to continue processing to the left?

pusle

member

Activity: 89

Merit: 10

Kudos to you for doing this manual routing! It's one hell of a job :D

a couple of ideas:

Perhaps you could use the dedicated busses between the DSP blocks to help out with the speed of the "corner turns".
Or put RAM blocks between there with a few cycles pipe delay to separate the "trees"

It might also be worth investigating smaller devices to get a better "fit", since the cost per LUT is fairly constant.
They are also easier to cool and might run faster and/or have more headroom for overclocking

sadpandatech

hero member

Activity: 504

Merit: 500

Quote from: eldentyrell on October 30, 2011, 04:17:22 PM

Quote from: rph on October 30, 2011, 12:02:21 AM

TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

If you leave half the chip empty, then yes -- automatic placement will get pretty much the same frequency as hand placement.

If the chip is nearly full, automatic placement won't come even close to hand-placement -- if it can finish at all.

It's been three weeks since PAR has been able to finish my design with the RLOCs turned off at any frequency -- I even tried at 1 MHz! If the placement is crap there simply aren't enough wires to get from point A to point B.

It also helps to do your layout keeping in mind the wiring structure of the Spartan switchboxes: fast-path between slices in the same CLB, the 1x 2x 4x routing lines, and anything longer than that is too slow to be worth thinking about (unless it's a pure register-to-register path with no combinational logic).

MMM, talk nerdy to me! Subbed just to follow. I am curious, though: what did you do to improve the corner turns, and do you already have solutions in mind to improve them further?

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: rph on October 30, 2011, 12:02:21 AM

TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

If you leave half the chip empty, then yes -- automatic placement will get pretty much the same frequency as hand placement.

If the chip is nearly full, automatic placement won't come even close to hand-placement -- if it can finish at all.

It's been three weeks since PAR has been able to finish my design with the RLOCs turned off at any frequency -- I even tried at 1 MHz! If the placement is crap there simply aren't enough wires to get from point A to point B.

It also helps to do your layout keeping in mind the wiring structure of the Spartan switchboxes: fast-path between slices in the same CLB, the 1x 2x 4x routing lines, and anything longer than that is too slow to be worth thinking about (unless it's a pure register-to-register path with no combinational logic).

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: gmaxwell on October 28, 2011, 05:29:49 PM

Quote from: eldentyrell on October 28, 2011, 04:14:38 PM

Emphasis on "time"

FWIW, I don't think it's feasible to do this directly in Verilog/VHDL. I had to write a Java library to generate the (totally illegible) Verilog.

I am reloading this thread like a crazed weasel. I'm quite eager to see what kind of results you get from the manual placement.

Corner turns now running at 80 MHz.

I know this sounds completely nuts, but there is a very, very, very remote chance of being able to cram three full pipelines onto the chip. If that doesn't work I can put a "half-pipeline" in the empty space, although this takes a bit more than 50% of the area of a full pipeline (since I can't hardwire the K-values into the LUT equations anymore). So I think the clock:hashes ratio will be at least 2 clocks : 2.5 hashes.

All of my results are on the less-expensive -2 chip (not the more-expensive -3).

The next week will probably be pretty quiet, I have a major non-bitcoin deadline I have to deal with. Progress should resume after 9-Nov.

rph

full member

Activity: 176

Merit: 100

ztex reached 200MHz on -3 with ISE 13.2 - it builds in about 40 minutes total (map + par) with -xt 5 -t 19.
I bet he used a lot of CPU time to find those settings... :D

TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

-rph

BTCurious

hero member

Activity: 714

Merit: 504

Oh wow. I know next to nothing about FPGAs*, but this looks very awesome nonetheless.

*I know that they're field-programmable gate arrays, and they're basically logic components which can be reconnected in different ways, making different chips depending on what the designer wants.

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: eldentyrell on October 28, 2011, 04:14:38 PM

Emphasis on "time"

FWIW, I don't think it's feasible to do this directly in Verilog/VHDL. I had to write a Java library to generate the (totally illegible) Verilog.

I am reloading this thread like a crazed weasel. I'm quite eager to see what kind of results you get from the manual placement.

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: rph on October 28, 2011, 01:14:09 AM

Heh, it was a matter of time until somebody hand-placed it. Hardcore.

Emphasis on "time"

FWIW, I don't think it's feasible to do this directly in Verilog/VHDL. I had to write a Java library to generate the (totally illegible) Verilog.
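For readers wondering what "generating the Verilog" might look like, here is a rough sketch. It is in Python rather than Java, and it is not the actual library (which has not been shown in this thread). It assumes the Xilinx `LUT6` primitive and the `RLOC` attribute in its Verilog attribute form; the helper names, signal names and the one-LUT-per-row layout are made up for illustration. A real design would also need to pick the LUT within each slice and group the instances so the relative locations mean something.

```python
def lut6_instance(name, init, inputs, output, x, y):
    """Return Verilog text for one LUT6 carrying a relative-location constraint."""
    if len(inputs) != 6:
        raise ValueError("LUT6 takes exactly six inputs")
    ports = ", ".join(f".I{i}({sig})" for i, sig in enumerate(inputs))
    return (
        f'(* RLOC = "X{x}Y{y}" *)\n'
        f"LUT6 #(.INIT(64'h{init:016X})) {name} (.O({output}), {ports});"
    )

def xor3_init():
    """INIT value for a LUT6 computing I0 ^ I1 ^ I2 and ignoring I3..I5."""
    init = 0
    for addr in range(64):
        if bin(addr & 0b111).count("1") % 2:
            init |= 1 << addr
    return init

def xor_column(width, x):
    """Emit one LUT6 per bit, stacked vertically in column x (hypothetical layout)."""
    lines = []
    for bit in range(width):
        ins = [f"a[{bit}]", f"b[{bit}]", f"c[{bit}]", "1'b0", "1'b0", "1'b0"]
        lines.append(lut6_instance(f"xor_{bit}", xor3_init(), ins, f"s[{bit}]", x, bit))
    return "\n".join(lines)

print(xor_column(4, 0))
```

Even this trivial case shows why the output is illegible: every LUT gets its own instance, truth table and coordinates, and a full SHA-256 pipeline needs thousands of them.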

rph

full member

Activity: 176

Merit: 100

Heh, it was a matter of time until somebody hand-placed it. Hardcore.

-rph

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: eldentyrell on October 27, 2011, 04:06:39 PM

But I have verified that the circuit you see works (i.e. mines actual shares)... though at 45 MHz since the corner turn routes are a ridiculous 22ns.

Oh, and, to Xilinx: whatever you did to columns 66 and 67 drives me berserk. I can't use the entire pair of columns because slices are randomly MISSING for no apparent reason.

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: ElectricMucus on October 27, 2011, 04:07:40 PM

Excuse the nagging, but are you planning to share your work? :D

Undecided, for now. But honestly, at 45 MH/s it doesn't matter just yet.

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Excuse the nagging, but are you planning to share your work? :D

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: Dexter770221 on October 27, 2011, 03:48:16 PM

And 3 months ago someone said that there's no way to achieve 200MH/s ;)

In fairness, I still do not have the "corner turn" routes running at that speed. But I have verified that the circuit you see works (i.e. mines actual shares)... though at 45 MHz since the corner turn routes are a ridiculous 22ns.
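For anyone following the arithmetic: the clock period cannot be shorter than the slowest register-to-register path, so a 22ns route caps the clock at about 45 MHz. A quick sketch (the function names are made up for illustration, and setup time and clock skew are ignored):

```python
def fmax_mhz(critical_path_ns):
    """Highest clock frequency (MHz) whose period still covers the slowest path."""
    return 1000.0 / critical_path_ns

def max_path_ns(target_mhz):
    """Longest path delay (ns) that still fits in one period at the target clock."""
    return 1000.0 / target_mhz

print(round(fmax_mhz(22), 2))  # the 22ns corner turn routes
print(max_path_ns(200))        # path budget at a 200 MHz target
```

Read the other way around, a 200 MHz target leaves 5ns per path, so the corner turns would have to get more than four times faster.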

So, no celebrations just yet...

Dexter770221

legendary

Activity: 1029

Merit: 1000

And 3 months ago someone said that there's no way to achieve 200MH/s ;)

eldentyrell

donator

Activity: 980

Merit: 1004

felonious vagrancy, personified

Quote from: pusle on October 27, 2011, 02:23:40 AM

@big-chip-small-board: So you get 100Mhash/s

As you can see, there is room for at least one more copy of the pipeline on the chip.

Quote from: pusle on October 27, 2011, 02:23:40 AM

from this design at what size fpga?

You're looking at an LX150.

Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards - page 50. (Read 119468 times)