Cyclone V now shipping! - page 2. | Bitcointalksearch.org

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

7 ns also failed in both slow models, but only by a small margin. 7.3 ns should work (running now, but I have to drive to work now).

Is it still only failing on the JTAG clock or is it also failing on the hashing clock?

The code in my repo for Altera targets is based around my virtual_wire module, which isn't really the best (but it's easy to use on Altera dev boards where you already have a USB-Blaster).

You could try the DE2_115_makomk_serial project put together by teknohog, which uses a UART core. I'm designing a newer UART core with more functionality, etc, but that's not done yet and the one by teknohog is perfectly sufficient. You just need to make sure the makomk code in there is up-to-date (I haven't checked yet).

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 09, 2012, 09:56:11 AM

IMHO, these timing simulations are not 100% accurate - the real-life error rate you get with an actual clock, that's where the rubber meets the road.
The fast model works up to 4 ns (250 MHz) - that's very promising and maybe that's what you get in the real world.
Or maybe not.
We'll never know until someone builds an actual board.

We can probably get an idea from how accurate the simulations are on older Cyclones. For example, with the Cyclone IV on my DE2-115 dev board, I find that I can go only about 5-10MHz over the Fmax reported by the 85C Slow Model before I start seeing a few invalid blocks being reported by the pool I'm using. That's with a small 23mm heat sink on the FPGA with a fan blowing on it. According to my IR thermometer, the heat sink is under 40C.

You might try bringing up the timing advisor (Tools->Advisors->Timing Advisor) and changing settings to match some of the recommendations it makes if you have not already done so. You might pick up a few tens or hundreds of picoseconds of slack that way. Another thing worth trying if you have some patience and spare compute cycles is to run the Design Space Explorer on the design and see what it can come up with. Make sure you do a test run with it first before you let it run for days on end so you don't wind up wasting your time like I have!

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: Jason on April 09, 2012, 09:19:30 AM

Quote from: Inspector 2211 on April 08, 2012, 10:21:42 PM

In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Strange, Quartus ignores the jtag clock for me. You might try setting a false path for the jtag clock in your constraints file in order to tell the system to ignore the jtag clock. It should report Fmax for the main clock under the slow timing model section of the timing analyzer output.

So you were able to hit 125MHz so far -- that's pretty good and suggests that the Cyclone V can hit at least 250MH/s. It will need to hit at least 300MH/s to make it competitive with the LX150 on a MH/$ basis, but it does seem at least possible that the Cyclone V may displace the LX150 as the new MH/$ leader.

A mining board with 2 Cyclone Vs can presumably be produced for about the same cost as a BFL single, but at 600MH/s, it's still less than three quarters of the BFL's mining rate. I wonder if the reduced power consumption (should use less than one quarter of what the BFL uses) would entice many people to buy such a board in lieu of BFL's offering? I'm guessing not, but I'm sure that won't stop someone from making them. Maybe this is what ngzhang has up his sleeve for his Icarus replacement?

7 ns also failed in both slow models, but only by a small margin. 7.3 ns should work (running now, but I have to drive to work now).

IMHO, these timing simulations are not 100% accurate - the real-life error rate you get with an actual clock, that's where the rubber meets the road.
The fast model works up to 4 ns (250 MHz) - that's very promising and maybe that's what you get in the real world.
Or maybe not.
We'll never know until someone builds an actual board.

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 08, 2012, 10:21:42 PM

In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Strange, Quartus ignores the jtag clock for me. You might try setting a false path for the jtag clock in your constraints file in order to tell the system to ignore the jtag clock. It should report Fmax for the main clock under the slow timing model section of the timing analyzer output.

So you were able to hit 125MHz so far -- that's pretty good and suggests that the Cyclone V can hit at least 250MH/s. It will need to hit at least 300MH/s to make it competitive with the LX150 on a MH/$ basis, but it does seem at least possible that the Cyclone V may displace the LX150 as the new MH/$ leader.

A mining board with 2 Cyclone Vs can presumably be produced for about the same cost as a BFL single, but at 600MH/s, it's still less than three quarters of the BFL's mining rate. I wonder if the reduced power consumption (should use less than one quarter of what the BFL uses) would entice many people to buy such a board in lieu of BFL's offering? I'm guessing not, but I'm sure that won't stop someone from making them. Maybe this is what ngzhang has up his sleeve for his Icarus replacement?
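
A rough sketch of the arithmetic behind these figures, assuming each fully unrolled core produces one hash per clock cycle and that a chip holds two cores (as discussed elsewhere in this thread). The function names are illustrative only and do not come from any of the posted code.

```python
def hash_rate_mhs(cores: int, hashes_per_cycle: float, fmax_mhz: float) -> float:
    """Hash rate in MH/s for one chip running `cores` hashers at `fmax_mhz`."""
    return cores * hashes_per_cycle * fmax_mhz


def required_fmax_mhz(target_mhs: float, cores: int, hashes_per_cycle: float = 1.0) -> float:
    """Clock in MHz that each core must reach for the chip to hit `target_mhs`."""
    return target_mhs / (cores * hashes_per_cycle)


# Two fully unrolled cores at the 125 MHz (8 ns period) that passed timing.
print(hash_rate_mhs(2, 1.0, 125.0))

# Clock needed to reach the 300 MH/s per-chip target with two cores.
print(required_fmax_mhz(300.0, 2))

# A board with two chips, each assumed to reach 300 MH/s.
print(2 * hash_rate_mhs(2, 1.0, 150.0))
```

Under those assumptions, 125 MHz corresponds to 250 MH/s per chip, the 300 MH/s target needs 150 MHz, and a two-chip board lands at 600 MH/s.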

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: lame.duck on April 09, 2012, 05:59:44 AM

No, the JTAG clock isn't the limiting factor; I had only set the PLL to 120 MHz, which seems insufficient to get optimal synthesis results.
I was compiling for the smallest available part offered by Quartus, not the smallest Cyclone V (I mistakenly mixed it up with the Xilinx Artix-7, where all sub-100k parts were dropped). By the way, how did you select a (the) Cyclone V E device? I could only choose a GX part, but the number of ALUTs seems to be the same.

5CGXBC7C6F23C7

6 ns ... passes "fast 1100mV 0C model", fails both slow models
4 ns ... passes "fast 1100mV 0C model", fails both slow models
3 ns ... fails all 3 models
7 ns ... running as we speak
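
For reference, a small sketch that converts the clock periods tried in this thread into frequencies. The helper name is made up for illustration; the conversion itself is simply f (MHz) = 1000 / period (ns).

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    if period_ns <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / period_ns


# Clock periods mentioned in the thread, in nanoseconds.
for period in (8.0, 7.0, 6.0, 4.0, 3.0):
    print(f"{period:g} ns -> {period_ns_to_mhz(period):.1f} MHz")
```

So the 8 ns constraint that compiled cleanly is 125 MHz, and the 4 ns constraint that only the fast model passes is 250 MHz.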

lame.duck

legendary

Activity: 1270

Merit: 1000

No, the JTAG clock isn't the limiting factor; I had only set the PLL to 120 MHz, which seems insufficient to get optimal synthesis results.
I was compiling for the smallest available part offered by Quartus, not the smallest Cyclone V (I mistakenly mixed it up with the Xilinx Artix-7, where all sub-100k parts were dropped). By the way, how did you select a (the) Cyclone V E device? I could only choose a GX part, but the number of ALUTs seems to be the same.

Dexter770221

legendary

Activity: 1029

Merit: 1000

So, poor JTAG was a limitation? Good to know. We have to remember that this code is over 6 months old, from when Spartan hit 90MH/s ;)

Makomk achieved 27.5 MH/s on the DE0-Nano; with the code that you're trying he only got a little above 13 MH/s.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: lame.duck on April 06, 2012, 05:49:53 AM

Just out of curiosity I compiled a single DE2_115_makomk_mod with Quartus 11.1 (no SP) for the smallest Cyclone V GX, speed grade 8. The report said 40% device usage, but the reported Fmax was quite low, under 100 MHz. Maybe there is a bug in the timing analyzer, since the reported Fmax was a little higher at 85°C. I think I will try it again with SP2 applied, even if the 'Not in stock' will last for a while (I guess).

I tried that also and got 97 or 98 MHz Fmax, but the signal this pertains to is altera_reserved_tck, which (as far as I understand) is the JTAG clock, not the system clock.
In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Jason

member

Activity: 114

Merit: 10

Quote from: Dexter770221 on April 06, 2012, 03:25:44 AM

Cyclone V ALMs have a different structure than Cyclone IV. In one ALM you have four 4-input LUTs + two 1-bit adders (with dedicated fast carry chains). So theoretically it should be possible to put one stage into 160 ALMs (if I'm calculating right). 160*128=20480 ALMs for one fully unrolled core. The 5CEA7 has over 56k ALMs; that's enough for two cores + some other logic for receiving, transmitting and distributing work.

Nice. Looks like the C5 is a bigger step up over the C4 than the C4 was over the C2. I'm downloading 11.1, 11.1SP1, and 11.1SP2 now so I can try out some builds myself, but it looks like you're right about fitting two fully unrolled loops in there -- the question now is what kind of Fmax can be achieved?

Dexter770221

legendary

Activity: 1029

Merit: 1000

The smallest you can choose from, or the smallest in the Cyclone V family? I can only choose from different variants of the 5CEA7 part.
40% is what I've calculated in my head just looking at the ALM structure ;)

So, two cores are possible. But that clock is very low :(

Maybe throwing some flip-flops (registers) at the output of the ALMs will help a little bit... Since it's a pipelined design, it will not hurt performance.

lame.duck

legendary

Activity: 1270

Merit: 1000

Just out of curiosity I compiled a single DE2_115_makomk_mod with Quartus 11.1 (no SP) for the smallest Cyclone V GX, speed grade 8. The report said 40% device usage, but the reported Fmax was quite low, under 100 MHz. Maybe there is a bug in the timing analyzer, since the reported Fmax was a little higher at 85°C. I think I will try it again with SP2 applied, even if the 'Not in stock' will last for a while (I guess).

Dexter770221

legendary

Activity: 1029

Merit: 1000

Cyclone V ALMs have a different structure than Cyclone IV. In one ALM you have four 4-input LUTs + two 1-bit adders (with dedicated fast carry chains). So theoretically it should be possible to put one stage into 160 ALMs (if I'm calculating right). 160*128=20480 ALMs for one fully unrolled core. The 5CEA7 has over 56k ALMs; that's enough for two cores + some other logic for receiving, transmitting and distributing work.
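
The ALM budget above can be checked with a few lines. This is only a sketch of the post's own estimate: the 160 ALMs per stage is the poster's guess, the 128 stages are assumed here to be two SHA-256 passes of 64 rounds each, and 56,000 is used as a lower bound for "over 56k ALMs". All names are illustrative.

```python
ALMS_PER_STAGE = 160    # poster's estimate for one pipeline stage
STAGES_PER_CORE = 128   # assumed: 2 SHA-256 passes x 64 rounds
DEVICE_ALMS = 56_000    # lower bound for "over 56k ALMs" on the 5CEA7


def alms_per_core(alms_per_stage: int = ALMS_PER_STAGE,
                  stages: int = STAGES_PER_CORE) -> int:
    """ALMs needed for one fully unrolled core."""
    return alms_per_stage * stages


def cores_that_fit(device_alms: int, core_alms: int) -> int:
    """Whole cores that fit, ignoring routing and support logic."""
    return device_alms // core_alms


core = alms_per_core()
count = cores_that_fit(DEVICE_ALMS, core)
spare = DEVICE_ALMS - count * core

print(f"one core: {core} ALMs ({core / DEVICE_ALMS:.0%} of the device)")
print(f"cores that fit: {count}, ALMs left for other logic: {spare}")
```

This ignores routing congestion entirely, which is the main concern raised elsewhere in the thread, so it gives an upper bound on what fits rather than a prediction.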

Jason

member

Activity: 114

Merit: 10

I'm a bit puzzled by how you come up with 300MH/s. Makomk's variation with about 78K LEs has an Fmax around 110MHz on the C4. Squeezing two of these onto a C5 is going to be tough, and I don't see how you could do it without compromising the routing. I suppose the 28nm fabric would mitigate that somewhat, but it seems a big stretch to get the Fmax all the way up to 150MHz. The 60nm C4 hasn't proven to be much faster than the 90nm C2 based on my limited experimentation compiling LOOP_LOG2=1 designs for both FPGAs, although I did notice a significant reduction in power usage with the C4.

I guess it would be easy enough to download the latest version of Quartus to see what's possible on the C5. Has anyone else tried this already?

As for ISE, it's the reason my company recently switched from Xilinx to Altera products for their high speed networking products.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

I would be surprised if you could get Fmax over 100MHz due to the difficulty in efficiently routing such a design.

I wouldn't be surprised at all if a C5 with 150K LEs will do ~300MH/s (dual hasher design). In fact, I'd be disappointed if it didn't. Even a C4-150 is likely to do 200MH/s using a dual-core makomk core, a UART for communication, and removing the first three rounds. Cyclone chips have a lot better routing than Spartan-6. Immensely so. And it's a real shame, because the Spartan-6 has a lot better support for adders. Quartus also produces far more predictable results than ISE.

I would have preferred the Cyclone IV to have won the mining race against Spartan-6, honestly, but it was just too expensive. On the other hand, as frustrating as battling ISE is, there's always some amount of glory in winning battles against it :P

Jason

member

Activity: 114

Merit: 10

Thanks for the reference. Unfortunately, the URL to Makomk's code in the message you referenced does not exist, though perhaps his code is reflected by the DE2-115-makomk-mod branch of fpgaminer's code. I just compiled that code with LOOP_LOG2 set to 0 and found that it compiles to 77,724 LEs with Fmax = 109.84MHz using the provided project settings, so 75K LEs can probably be achieved by optimizing for density (though this would reduce Fmax). The only problem with the theory that the 5CEA7 (Cyclone V with 150K LEs) could challenge the LX150 is that even if you did manage to fit two fully unrolled loops into it (it's quite difficult to fit the last 10%), I would be surprised if you could get Fmax over 100MHz due to the difficulty in efficiently routing such a design. And in terms of MH/$, I just don't see it happening given the significantly lower cost of the LX150s ($160 vs. $240 for qty 1). Depending on Altera's pricing on the 5CEA9 Cyclone Vs (300K LEs), there may be more potential there (whenever they become available).
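
To put numbers on the two concerns above, here is a minimal sketch using only the figures quoted in this post (77,724 LEs per core, a 150K LE device, and $160 vs. $240 at quantity 1). The function names are hypothetical, and the break-even ratio is a plain price ratio that ignores board cost and power.

```python
def le_utilization(les_per_core: int, cores: int, device_les: int) -> float:
    """Fraction of the device's LEs consumed by `cores` hashers."""
    return cores * les_per_core / device_les


def breakeven_speedup(price_new: float, price_old: float) -> float:
    """How many times faster the pricier chip must hash to match MH/$."""
    return price_new / price_old


# Two cores as compiled (77,724 LEs each) on a 150K LE part.
print(f"as compiled:       {le_utilization(77_724, 2, 150_000):.1%}")

# Two cores if density optimization reaches 75K LEs each.
print(f"density optimized: {le_utilization(75_000, 2, 150_000):.1%}")

# Speedup the $240 Cyclone V needs over the $160 LX150 to tie on MH/$.
print(f"break-even speedup: {breakeven_speedup(240.0, 160.0):.2f}x")
```

As compiled, two cores exceed the device, and even at 75K LEs each they leave nothing for I/O or control logic. On price alone, the Cyclone V would need 1.5 times the LX150's hash rate to match it on MH/$.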

Dexter770221

legendary

Activity: 1029

Merit: 1000

Quote from: Jason on April 04, 2012, 10:11:32 PM

Quote from: Dexter770221 on March 31, 2012, 09:30:02 AM

That's why you can put a fully unrolled core into a Cyclone 75k part and you need a 150k part from Xilinx to do the same.

Got a reference for this please? I have not seen any publicly available HDL code for the Cyclone IV that needs fewer than about 100K LEs (fully unrolled). This includes the code on fpgaminer's github (which includes some of Makomk's work) as well as Ztex's code.

https://bitcointalksearch.org/topic/m.446363

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

Got a reference for this please? I have not seen any publicly available HDL code for the Cyclone IV that needs fewer than about 100K LEs (fully unrolled). This includes the code on fpgaminer's github (which includes some of Makomk's work) as well as Ztex's code.

The makomk-based code fits into ~80K-90K LEs unrolled, if I recall correctly. Though it has been a while since I've compiled that code, so it might be less.

Jason

member

Activity: 114

Merit: 10

Quote from: Dexter770221 on March 31, 2012, 09:30:02 AM

That's why you can put a fully unrolled core into a Cyclone 75k part and you need a 150k part from Xilinx to do the same.

Got a reference for this please? I have not seen any publicly available HDL code for the Cyclone IV that needs fewer than about 100K LEs (fully unrolled). This includes the code on fpgaminer's github (which includes some of Makomk's work) as well as Ztex's code.

shad

full member

Activity: 148

Merit: 100

Quote from: rjk on March 31, 2012, 10:19:50 AM

Quote from: shad on March 31, 2012, 09:15:40 AM

Quote from: rjk on March 31, 2012, 08:46:23 AM

Don't forget that eldentyrell was able to squeeze 3 unrolled cores onto a 150k part

3 unrolled cores? AFAIK it's 3 cores with 0.5 hash per cycle.

Yes, that's what I meant, but it's better than 2 cores with 0.5 hash per cycle.

AFAIK other Spartan-6 LX150 miners have 1 core with 1 hash per cycle;
only eldentyrell has 3 cores with 0.5 hash per cycle.

Yes, it's true that 1.5 is better than 1.
I am sure there are more people trying to work out this nice 3-core solution; I think it's only a matter of time ;)

antirack

hero member

Activity: 489

Merit: 500

Immersionist

Quote from: Dexter770221 on March 31, 2012, 09:30:02 AM

Either way, >2MH/$ from FPGAs is coming ;)

How much and when can I get it? :D
