Cyclone V now shipping! | Bitcointalksearch.org

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Quote from: Inspector 2211 on April 10, 2012, 11:03:55 PM

Quote from: ElectricMucus on April 10, 2012, 07:51:26 PM

Isn't the whole unrolling discussion really quite naive?

The optimal solution will be, depending on the layout of the chip, partly unrolled and partly not, depending on which place inside the chip the operation takes place. It isn't really productive to even talk about unrolling, since depending on the amount of elements which can be used for memory, logic and routing a particular approach will be best.
Of course computing a solution which fits this paradigm would probably be too much, however if the tools permit programming the FPGA on a low level it should be at least possible to use close to 100% of all resources for something useful.

As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.

Not if the looping is free. If there are abundant local resources at a specific place because of chip layout constraints the penalty used for looping and memory would decrease up to the point where it would be more efficient. But to be honest I have no idea how this should be implemented and don't even know if it is at all possible with available tools.
Nevertheless I think it is worth a thought and if FPGAs prevail over a long period as the status quo something will come up. I am certain of that.

Another speculative note: the I/O ports of the FPGA might eventually be used to obtain additional routing, this would of course restrict the layout but it should be worth it for cases where a slow rate data stream would consume unnecessary resources if routed over wide distances.

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 11, 2012, 07:12:58 PM

Nah, I cannot afford funding a Hardcopy device, and even if I barely could, I wouldn't invest all of my money into a Bitcoin miner. So many things can go wrong - difficulty can explode, the exchange rate can plummet, MtGox could be shut down by the Japanese government, etc. etc.

I agree. The only way it makes sense is to spread the risk around. Problem is finding 10 people to invest $75K, or even 100 people to invest $7.5K and then wait at least 6 months before seeing any return on their investments.

I am curious to see what level of performance can be achieved and will see how many miners will fit on a Stratix IV/V myself soon and report back my findings. I think others have tried with some Xilinx products, but until they have a hardcopy equivalent, those results are less interesting.

Quote

The LargeCoin folks up in Vancouver seem to be doing something like that, however.
Maybe even full-custom ASICs, judging from the low estimated power draw of their 20 GH/s box.

I'm surprised someone would even attempt to do that for something as (relatively) obscure as LargeCoin.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Nah, I cannot afford funding a Hardcopy device, and even if I barely could, I wouldn't invest all of my money into a Bitcoin miner. So many things can go wrong - difficulty can explode, the exchange rate can plummet, MtGox could be shut down by the Japanese government, etc. etc.

The LargeCoin folks up in Vancouver seem to be doing something like that, however.
Maybe even full-custom ASICs, judging from the low estimated power draw of their 20 GH/s box.

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 11, 2012, 06:09:58 PM

So, can you make the latest code available somewhere?
I'm trying to instantiate two fully unrolled instances of your "slightly older code" on the Cyclone V GX 7 target architecture - so far, unsuccessfully.
One instance is placed and routed just fine, and achieves 140 MH/s in the conservative "slow" simulation and a whopping 250 MH/s in the optimistic "fast" simulation.

I still think the fastest you're going to get will be just a bit over the slow simulation based on my experience with the Cyclone II and Cyclone IV.

Dunno if you're interested, but if you'd like to get some idea for what's possible with Altera Hardcopy and you have modified the code to allow you to compile multiple instances of Makomk's code, you might try targeting a Stratix IV, EP4SGX530HH35 which is a prototype for the Hardcopy HC4GX25 ASIC (530K LEs). You could also try targeting one of the bigger Stratix V devices as a prototype for the Hardcopy V ASICs (up to 930K LEs) -- but I can't seem to find on the Altera website a list detailing which Stratix V is a prototype for which Hardcopy V like I can for the Hardcopy IV series. Don't bother trying unless you have at least 8GB RAM on the machine you're using as the bigger FPGAs really use up a lot of memory during compilations.

I'm guessing somewhere in the 2.4 GH/s range is possible with the Hardcopy IV, and considerably more with the Hardcopy V, although cooling the die may be challenging.

By my back-of-the-envelope calculations, it's going to take around $0.75 million in capital to launch such a project -- including the setup fees for the hardcopy ($200K), 500 ASICs ($500K), and a little more for the design and production of 500 basic mining boards. In order to beat BFL in terms of MH/$, it would almost certainly need to use the Hardcopy V ASIC which would conservatively give over 2 MH/$ performance if the ASIC could be adequately cooled. At today's difficulty/valuation levels, each one of those boards should be able to mine around 2 bitcoins per day, or $300/month, with a projected payback period of 5 months -- not bad. Any fearless investors out there? ;)
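
The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the board budget is simply assumed to be whatever remains of the $0.75 million after the two quoted figures.

```python
# Rough check of the Hardcopy funding estimate (figures taken from the post above).
total_capital = 750_000   # USD, "around $0.75 million"
setup_fee = 200_000       # USD, Hardcopy setup fees
asic_cost = 500_000       # USD for 500 ASICs
boards = 500

# Assumption: the "little more" for board design/production is the remainder.
board_budget = total_capital - setup_fee - asic_cost

cost_per_board = total_capital / boards
income_per_month = 300    # USD per board at the quoted difficulty/exchange rate
payback_months = cost_per_board / income_per_month

# "Over 2 MH/$" implies at least this hash rate per board.
min_mhs_per_board = 2 * cost_per_board

print(board_budget, cost_per_board, payback_months, min_mhs_per_board)
```

Under these assumptions each board costs about $1,500, which gives the quoted 5-month payback at $300/month, and 2 MH/$ would correspond to roughly 3 GH/s per board.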

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: makomk on April 11, 2012, 05:10:39 PM

Quote from: Dexter770221 on March 31, 2012, 06:04:53 AM

makomk achieved 27.7MH/s from a CycloneIV 22k part. My guess is that it is a 220MHz core rolled 8 times. A fully unrolled core fits into a 75k Cyclone. That gives two cores on a 150k part, and probably at 300MHz (28nm vs. 60nm), so 600MH/s may be possible...

110 MHz at 4 clock cycles per hash, actually. The design scales down reasonably well to smaller devices.

Quote from: Jason on April 05, 2012, 09:42:21 AM

Thanks for the reference. Unfortunately, the URL to Makomk's code in the message you referenced does not exist, though perhaps his code is reflected by the DE2-115-makomk-mod branch of fpgaminer's code. I just compiled that code with LOOP_LOG2 set to 0 and found that it compiles to 77,724 LEs/Fmax=109.84MHZ with the provided project settings, so probably 75K LEs can be achieved by optimizing for density (though this would reduce Fmax).

That's slightly older code. It's probably better than the newer versions with LOOP_LOG2=0 but it gives invalid results if you change it to anything else.

So, can you make the latest code available somewhere?
I'm trying to instantiate two fully unrolled instances of your "slightly older code" on the Cyclone V GX 7 target architecture - so far, unsuccessfully.
One instance is placed and routed just fine, and achieves 140 MH/s in the conservative "slow" simulation and a whopping 250 MH/s in the optimistic "fast" simulation.

makomk

hero member

Activity: 686

Merit: 564

Quote from: Dexter770221 on March 31, 2012, 06:04:53 AM

makomk achieved 27.7MH/s from a CycloneIV 22k part. My guess is that it is a 220MHz core rolled 8 times. A fully unrolled core fits into a 75k Cyclone. That gives two cores on a 150k part, and probably at 300MHz (28nm vs. 60nm), so 600MH/s may be possible...

110 MHz at 4 clock cycles per hash, actually. The design scales down reasonably well to smaller devices.

Quote from: Jason on April 05, 2012, 09:42:21 AM

Thanks for the reference. Unfortunately, the URL to Makomk's code in the message you referenced does not exist, though perhaps his code is reflected by the DE2-115-makomk-mod branch of fpgaminer's code. I just compiled that code with LOOP_LOG2 set to 0 and found that it compiles to 77,724 LEs/Fmax=109.84MHZ with the provided project settings, so probably 75K LEs can be achieved by optimizing for density (though this would reduce Fmax).

That's slightly older code. It's probably better than the newer versions with LOOP_LOG2=0 but it gives invalid results if you change it to anything else.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: pieppiep on April 11, 2012, 02:34:50 AM

Quote from: Inspector 2211 on April 10, 2012, 11:03:55 PM

As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.

Interesting.
So if you can fit one fully unrolled instance in a device so you can get 1 hash/clock and fit another half unrolled instance that gets 1 hash/2 clocks, the second one holds back the speed of the first one.
Would it be possible to clock the first at a speed a little faster than the second one? Or would this give difficulties to combine the 2 parts to have 1 output?

Yes, you can, and that would be a fallback strategy for the Cyclone V GX 7 in case one cannot fit two unrolled double-SHAs, however it'll hurt the $ per MH/s number.

Jason

member

Activity: 114

Merit: 10

I too looked over Wondermine's code and I am skeptical that it will challenge either the Ztex code or Makomk's modifications of Fpgaminer's code in terms of MH/s. Still, having said that, I wish him good luck as if he does manage it, we'll all benefit and learn something in the process.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

So if you can fit one fully unrolled instance in a device so you can get 1 hash/clock and fit another half unrolled instance that gets 1 hash/2 clocks, the second one holds back the speed of the first one.
Would it be possible to clock the first at a speed a little faster than the second one? Or would this give difficulties to combine the 2 parts to have 1 output?

The inefficiencies of a rolled hasher are (usually) in area consumption, not in timing performance.

But to answer your question, yes you can clock different hashers at different speeds. Async FIFOs are used to cross the clock domains.

lame.duck

legendary

Activity: 1270

Merit: 1000

Unfortunately those numbers have nothing to do with a real working hasher, as a single hashing core needs 1/3 more LEs, and I wonder if he gets the LE count back to the announced 1250 LEs. In fact, in another run I turned most area optimisations on and got only slightly better results. Besides that, his design has no communication module and the control logic seems incomplete to me, as there is no logic to distribute the different nonces to the hashing cores.

Btw., as far as I know the makomk design aims at the C7 grade device, and it would be worth testing what speed is possible with a C6 grade device. At least for the EP3C25 C7 grade device I got a bitstream reaching 117 MHz, which should be sufficient to run at the targeted 120 MHz (= 30 MHash).

pieppiep

hero member

Activity: 1596

Merit: 502

Quote from: Inspector 2211 on April 10, 2012, 11:03:55 PM

As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.

Interesting.
So if you can fit one fully unrolled instance in a device so you can get 1 hash/clock and fit another half unrolled instance that gets 1 hash/2 clocks, the second one holds back the speed of the first one.
Would it be possible to clock the first at a speed a little faster than the second one? Or would this give difficulties to combine the 2 parts to have 1 output?

Dexter770221

legendary

Activity: 1029

Merit: 1000

wondermine has released new code. See for yourself what he achieved:
https://bitcointalksearch.org/topic/m.844304

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: Jaryu on April 10, 2012, 11:26:29 PM

so basically what would be the... rough estimation of performance in MH/s for a chip if you can get the 2 instances fully loaded into it?

Impossible to say at this point, because
1) two instances don't even seem to fit
2) the difference between the slow estimate at 140 MH/s/instance and the fast estimate at 250 MH/s/instance is just too great.

The maximum seems to be 500 MH/s for two instances, but that's subject to too many assumptions to be realistic.
But then again, maybe wondermine comes up with a better implementation than fpgaminer.

Jaryu

member

Activity: 90

Merit: 10

so basically what would be the... rough estimation of performance in MH/s for a chip if you can get the 2 instances fully loaded into it?

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: ElectricMucus on April 10, 2012, 07:51:26 PM

Isn't the whole unrolling discussion really quite naive?

The optimal solution will be, depending on the layout of the chip, partly unrolled and partly not, depending on which place inside the chip the operation takes place. It isn't really productive to even talk about unrolling, since depending on the amount of elements which can be used for memory, logic and routing a particular approach will be best.
Of course computing a solution which fits this paradigm would probably be too much, however if the tools permit programming the FPGA on a low level it should be at least possible to use close to 100% of all resources for something useful.

As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.

Inspector 2211

sr. member

Activity: 448

Merit: 250

OK, after 15 1/2 hours Quartus failed to fit two instances of the double-SHA, albeit at the "optimize for speed" setting.
I have now restarted it with the "optimize for space" setting. The tension is almost unbearable...

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Isn't the whole unrolling discussion really quite naive?

The optimal solution will be, depending on the layout of the chip, partly unrolled and partly not, depending on which place inside the chip the operation takes place. It isn't really productive to even talk about unrolling, since depending on the amount of elements which can be used for memory, logic and routing a particular approach will be best.
Of course computing a solution which fits this paradigm would probably be too much, however if the tools permit programming the FPGA on a low level it should be at least possible to use close to 100% of all resources for something useful.

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 09, 2012, 11:03:52 PM

I have not tried to fit two instances of the miner yet.

That has the potential to reduce the Fmax you can achieve, although if one fully unrolled miner uses only 40% of the device, the effect should be small.

Quote

If you have the time, please elucidate the difference between LOOP_LOG2=0 and LOOP_LOG2=1.
I can't wrap my head around that - both of them are supposed to be fully unrolled ?!?
Which one is the "preferred", "believed to be faster" setting?

LOOP_LOG2=0 is a fully unrolled miner -- 2 full SHA-256 instances. It achieves 1 clock cycle per bitcoin hash. LOOP_LOG2=1 divides the work of each SHA-256 hasher in two so that they take 2 clock cycles per bitcoin hash but use significantly fewer LEs. You want to use LOOP_LOG2=0 whenever you can for best performance.

Quote

(I do understand that a setting of 2 means that there is only one SHA-256 instance and its output has to be fed back in front.
No need to set LOOP_LOG2 to 2 or higher on the Cyclone V.)

LOOP_LOG2 does not affect the number of SHA-256 instances. There are always two of them. It affects the amount of unrolling that is present in each of the SHA-256 instances.

LOOP_LOG2=0: fully unrolled (2 fully unrolled SHA-256 hashers, 1 clock cycle per output)
LOOP_LOG2=1: partially unrolled (2 clock cycles per output)
LOOP_LOG2=2: partially unrolled (4 clock cycles per output)
etc.
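
As a rough illustration of the list above, the relationship between clock frequency, LOOP_LOG2 and hash rate can be sketched in Python. The function name is hypothetical and not part of fpgaminer's code; it only encodes "one hash every 2^LOOP_LOG2 clock cycles".

```python
def hash_rate_mhs(clock_mhz: float, loop_log2: int) -> float:
    """Hash rate in MH/s for one miner: one hash every 2**loop_log2 clocks."""
    if loop_log2 < 0:
        raise ValueError("loop_log2 must be >= 0")
    return clock_mhz / (2 ** loop_log2)


# Show clocks per hash and the resulting hash rate at 110 MHz.
for loop_log2 in range(3):
    clocks_per_hash = 2 ** loop_log2
    print(loop_log2, clocks_per_hash, hash_rate_mhs(110.0, loop_log2))
```

With a 110 MHz clock and LOOP_LOG2=2 this gives 27.5 MH/s, which lines up with the figure reported elsewhere in this thread for the small Cyclone IV part (110 MHz at 4 clock cycles per hash).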

Quote

Duly noted, but convenient I/O is not my main focus now - rather, I now want to focus on getting two instances fully placed and routed and timed.

Makes sense. Who knows what form the I/O will take anyway until someone has designed/built some hardware around the Cyclone V.

You should be able to put two instances on the chip fairly easily by creating a new top-level entity and instantiating two fpgaminer instances. You'll probably have to parameterize the virtual_wire instance IDs in order to avoid collisions, but that may only be a problem at runtime so you might also be able to ignore it.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

If you have the time, please elucidate the difference between LOOP_LOG2=0 and LOOP_LOG2=1.
I can't wrap my head around that - both of them are supposed to be fully unrolled ?!?

Your confusion may arise from some typos I made in the comments of my code awhile back. I apologize for that (and hope most of those typos have been fixed).

The LOOP_LOG2 parameter determines how many times the entire Bitcoin SHA-256 pipeline is folded in half. Each folding cuts performance in half, but also cuts resource consumption in half (1).

LOOP_LOG2=0 -> Fully unrolled, one hash per clock cycle.
LOOP_LOG2=1 -> Half unrolled, one hash per 2 clock cycles.
LOOP_LOG2=2 -> Quarter unrolled, one hash per 4 clock cycles.
...etc...

Quote

Duly noted, but convenient I/O is not my main focus now - rather, I now want to focus on getting two instances fully placed and routed and timed.

Mmmkay, I was mentioning it mostly because it probably has better timing and uses far fewer resources (those virtual_wires are nothing to sneeze at).

(1) It should be noted that LOOP_LOG2=0, fully unrolled, has special advantages over the other settings due to constant optimization, dropping the last three rounds, etc. So if you were to graph LOOP_LOG2 vs. area consumption it would be linear from 1 onward, but not from 0 to 1.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: fpgaminer on April 09, 2012, 08:23:38 PM

Is it still only failing on the JTAG clock or is it also failing on the hashing clock?

It never "failed" on the JTAG clock, because I never put a time constraint on the JTAG clock.
At 7.3 ns clock cycle, the design passes.
Fmax is quoted as 139 MHz and 141 MHz for Slow 0C and Slow 85C, respectively.
(Yes, for some reason the Cyclone V is expected to run slightly faster at a higher temperature.)
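
For reference, the clock-period constraints tried in this thread map to frequencies as follows. This is just a small conversion sketch (the helper name is made up), not anything produced by Quartus or TimeQuest.

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


# Clock periods that were tried as constraints in this thread.
for period in (8.0, 7.3, 7.0, 6.0, 4.0, 3.0):
    print(f"{period} ns -> {period_ns_to_mhz(period):.1f} MHz")
```

A 7.3 ns period is about 137 MHz, just under the quoted slow-model Fmax of 139 to 141 MHz, while 7 ns would need about 143 MHz, which is consistent with that run failing by a small margin.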

I have not tried to fit two instances of the miner yet.

If you have the time, please elucidate the difference between LOOP_LOG2=0 and LOOP_LOG2=1.
I can't wrap my head around that - both of them are supposed to be fully unrolled ?!?
Which one is the "preferred", "believed to be faster" setting?

(I do understand that a setting of 2 means that there is only one SHA-256 instance and its output has to be fed back in front.
No need to set LOOP_LOG2 to 2 or higher on the Cyclone V.)

Quote from: fpgaminer on April 09, 2012, 08:23:38 PM

You could try the DE2_115_makomk_serial project put together by teknohog, which uses a UART core. I'm designing a newer UART core with more functionality, etc, but that's not done yet and the one by teknohog is perfectly sufficient. You just need to make sure the makomk code in there is up-to-date (I haven't checked yet).

Duly noted, but convenient I/O is not my main focus now - rather, I now want to focus on getting two instances fully placed and routed and timed.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

7 ns also failed in both slow models, but only by a small margin. 7.3 ns should work (running now, but I have to drive to work now).

Is it still only failing on the JTAG clock or is it also failing on the hashing clock?

The code in my repo for Altera targets is based around my virtual_wire module, which isn't really the best (but it's easy to use on Altera dev boards where you already have a USB-Blaster).

You could try the DE2_115_makomk_serial project put together by teknohog, which uses a UART core. I'm designing a newer UART core with more functionality, etc, but that's not done yet and the one by teknohog is perfectly sufficient. You just need to make sure the makomk code in there is up-to-date (I haven't checked yet).

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 09, 2012, 09:56:11 AM

IMHO, these timing simulations are not 100% accurate - the real-life error rate you get with an actual clock, that's where the rubber meets the road.
The fast model works up to 4 ns (250 MHz) - that's very promising and maybe that's what you get in the real world.
Or maybe not.
We'll never know until someone builds an actual board.

We can probably get an idea from how accurate the simulations are on older Cyclones. For example, with the Cyclone IV on my DE2-115 dev board, I find that I can go only about 5-10MHz over the Fmax reported by the 85C Slow Model before I start seeing a few invalid blocks being reported by the pool I'm using. That's with a small 23mm heat sink on the FPGA with a fan blowing on it. According to my IR thermometer, the heat sink is under 40C.

You might try bringing up the timing advisor (Tools->Advisors->Timing Advisor) and changing settings to match some of the recommendations it makes if you have not already done so. You might pick up a few tens or hundreds of picoseconds of slack that way. Another thing worth trying if you have some patience and spare compute cycles is to run the Design Space Explorer on the design and see what it can come up with. Make sure you do a test run with it first before you let it run for days on end so you don't wind up wasting your time like I have!

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: Jason on April 09, 2012, 09:19:30 AM

Quote from: Inspector 2211 on April 08, 2012, 10:21:42 PM

In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Strange, Quartus ignores the jtag clock for me. You might try setting a false path for the jtag clock in your constraints file in order to tell the system to ignore the jtag clock. It should report Fmax for the main clock under the slow timing model section of the timing analyzer output.

So you were able to hit 125MHz so far -- that's pretty good and suggests that the Cyclone V can hit at least 250MH/s. It will need to hit at least 300MH/s to make it competitive with the LX150 on a MH/$ basis, but it does seem at least possible that the Cyclone V may displace the LX150 as the new MH/$ leader.

A mining board with 2 Cyclone Vs can presumably be produced for about the same cost as a BFL single, but at 600MH/s, it's still less than three quarters of the BFL's mining rate. I wonder if the reduced power consumption (should use less than one quarter of what the BFL uses) would entice many people to buy such a board in lieu of BFL's offering? I'm guessing not, but I'm sure that won't stop someone from making them. Maybe this is what ngzhang has up his sleeve for his Icarus replacement?

7 ns also failed in both slow models, but only by a small margin. 7.3 ns should work (running now, but I have to drive to work now).

IMHO, these timing simulations are not 100% accurate - the real-life error rate you get with an actual clock, that's where the rubber meets the road.
The fast model works up to 4 ns (250 MHz) - that's very promising and maybe that's what you get in the real world.
Or maybe not.
We'll never know until someone builds an actual board.

Jason

member

Activity: 114

Merit: 10

Quote from: Inspector 2211 on April 08, 2012, 10:21:42 PM

In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Strange, Quartus ignores the jtag clock for me. You might try setting a false path for the jtag clock in your constraints file in order to tell the system to ignore the jtag clock. It should report Fmax for the main clock under the slow timing model section of the timing analyzer output.

So you were able to hit 125MHz so far -- that's pretty good and suggests that the Cyclone V can hit at least 250MH/s. It will need to hit at least 300MH/s to make it competitive with the LX150 on a MH/$ basis, but it does seem at least possible that the Cyclone V may displace the LX150 as the new MH/$ leader.

A mining board with 2 Cyclone Vs can presumably be produced for about the same cost as a BFL single, but at 600MH/s, it's still less than three quarters of the BFL's mining rate. I wonder if the reduced power consumption (should use less than one quarter of what the BFL uses) would entice many people to buy such a board in lieu of BFL's offering? I'm guessing not, but I'm sure that won't stop someone from making them. Maybe this is what ngzhang has up his sleeve for his Icarus replacement?

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: lame.duck on April 09, 2012, 05:59:44 AM

No, the jtag clock isn't the limiting factor; I had only set the PLL to 120 MHz, which seems not sufficient to get optimal synthesis results.
I was compiling for the smallest available part offered by Quartus, not the smallest Cyclone V (I mistakenly mixed it up with the Xilinx Artix-7, where all sub-100k parts were dropped). Btw., how did you select a (the) Cyclone V E device? I could only choose a GX part, but the number of ALUTs seems to be the same.

5CGXBC7C6F23C7

6 ns ... passes "fast 1100mV 0C model", fails both slow models
4 ns ... passes "fast 1100mV 0C model", fails both slow models
3 ns ... fails all 3 models
7 ns ... running as we speak

lame.duck

legendary

Activity: 1270

Merit: 1000

No, the jtag clock isn't the limiting factor; I had only set the PLL to 120 MHz, which seems not sufficient to get optimal synthesis results.
I was compiling for the smallest available part offered by Quartus, not the smallest Cyclone V (I mistakenly mixed it up with the Xilinx Artix-7, where all sub-100k parts were dropped). Btw., how did you select a (the) Cyclone V E device? I could only choose a GX part, but the number of ALUTs seems to be the same.

Dexter770221

legendary

Activity: 1029

Merit: 1000

So, poor JTAG was a limitation? Good to know. We have to remember that this code is over 6 months old, from when Spartan hit 90MH/s ;)

Makomk achieved 27.5 MH/s on the DE0-nano; with the code that you're trying he only got a little above 13 MH/s.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: lame.duck on April 06, 2012, 05:49:53 AM

Just out of curiosity I compiled a single DE2_115_makomk_mod with Quartus 11.1 (no SP) for the smallest CycloneV GX Speedgrade 8. The report said 40% device usage, but the reported Fmax was quite low, under 100 MHz. Maybe there is a bug in the timing analyzer, since the reported Fmax was a little higher at 85°C. I think I will try it again with SP2 applied, even if the 'Not in stock' will last for a while (I guess).

I tried that also and got 97 or 98 MHz Fmax, but the signal this pertains to is altera_reserved_tck, which (as far as I understand) is the JTAG clock, not the system clock.
In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Jason

member

Activity: 114

Merit: 10

Quote from: Dexter770221 on April 06, 2012, 03:25:44 AM

CycloneV ALMs have a different structure than Cyclone IV. In one ALM you have four 4-input LUTs + two 1-bit adders (with dedicated fast carry chains). So theoretically it should be possible to put one stage into 160 ALMs (if I'm calculating right). 160*128=20480 ALMs for one fully unrolled core. The 5CEA7 has over 56k ALMs; that's enough for two cores + some other logic for receiving, transmitting and distributing work.

Nice. Looks like the C5 is a bigger step up over the C4 than the C4 was over the C2. I'm downloading 11.1, 11.1SP1, and 11.1SP2 now so I can try out some builds myself, but it looks like you're right about fitting two fully unrolled loops in there -- the question now is what kind of Fmax can be achieved?

Dexter770221

legendary

Activity: 1029

Merit: 1000

Smallest you can choose from, or smallest from Cyclone V family? I can only choose from different variants of 5CEA7 part.
40% is what I've calculated in my head just looking at the ALM structure ;)

So, two cores possible. But that clock is very low :(

Maybe throwing some flip-flops (registers) at the output of the ALM will help a little bit... Since it's a pipelined design it will not hurt performance.

lame.duck

legendary

Activity: 1270

Merit: 1000

Just out of curiosity I compiled a single DE2_115_makomk_mod with Quartus 11.1 (no SP) for the smallest CycloneV GX Speedgrade 8. The report said 40% device usage, but the reported Fmax was quite low, under 100 MHz. Maybe there is a bug in the timing analyzer, since the reported Fmax was a little higher at 85°C. I think I will try it again with SP2 applied, even if the 'Not in stock' will last for a while (I guess).

Dexter770221

legendary

Activity: 1029

Merit: 1000

CycloneV ALMs have a different structure than Cyclone IV. In one ALM you have four 4-input LUTs + two 1-bit adders (with dedicated fast carry chains). So theoretically it should be possible to put one stage into 160 ALMs (if I'm calculating right). 160*128=20480 ALMs for one fully unrolled core. The 5CEA7 has over 56k ALMs; that's enough for two cores + some other logic for receiving, transmitting and distributing work.
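
The estimate above can be reproduced with a short Python sketch. The constant names are made up here, the 128 is read as the 2 x 64 rounds of a double SHA-256, and 56,000 is used only as a lower bound for the "over 56k ALMs" figure.

```python
ALMS_PER_ROUND = 160    # estimated ALMs for one SHA-256 round (one pipeline stage)
ROUNDS_PER_CORE = 128   # assumed: 2 x 64 rounds for the double SHA-256
DEVICE_ALMS = 56_000    # "over 56k ALMs" on the 5CEA7, taken as a lower bound

alms_per_core = ALMS_PER_ROUND * ROUNDS_PER_CORE
one_core_share = alms_per_core / DEVICE_ALMS
two_core_share = 2 * alms_per_core / DEVICE_ALMS

print(alms_per_core, f"{one_core_share:.1%}", f"{two_core_share:.1%}")
```

On these numbers one core takes a bit under 40% of the device and two cores a bit under 75%, which would leave some room for I/O and work-distribution logic, assuming the 160-ALM-per-stage estimate holds up after fitting.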

Jason

member

Activity: 114

Merit: 10

I'm a bit puzzled by how you come up with 300MH/s. Makomk's variation with about 78K LEs has an Fmax around 110MHz on the C4. Squeezing two of these onto a C5 is going to be tough, and I don't see how you could do it without compromising the routing. I suppose the 28nm fabric would mitigate that somewhat, but it seems a big stretch to get the Fmax all the way up to 150MHz. The 60nm C4 hasn't proven to be much faster than the 90nm C2 based on my limited experimentation compiling LOOP_LOG2=1 designs for both FPGAs, although I did notice a significant reduction in power usage with the C4.

I guess it would be easy enough to download the latest version of Quartus to see what's possible on the C5. Has anyone else tried this already?

As for ISE, it's the reason my company recently switched from Xilinx to Altera products for their high speed networking products.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

I would be surprised if you could get Fmax over 100MHZ due to the difficulty in efficiently routing such a design.

I wouldn't be surprised at all if a C5 with 150K LEs will do ~300MH/s (dual hasher design). In fact, I'd be disappointed if it didn't. Even a C4-150 is likely to do 200MH/s using a dual-core makomk core, a UART for communication, and removing the first three rounds. Cyclone chips have a lot better routing than Spartan-6. Immensely so. And it's a real shame, because the Spartan-6 has a lot better support for adders. Quartus also produces far more predictable results than ISE.

I would have preferred the Cyclone IV to have won the mining race against Spartan-6, honestly, but it was just too expensive. On the other hand, as frustrating as battling ISE is, there's always some amount of glory in winning battles against it :P

Jason

member

Activity: 114

Merit: 10

Thanks for the reference. Unfortunately, the URL to Makomk's code in the message you referenced does not exist, though perhaps his code is reflected by the DE2-115-makomk-mod branch of fpgaminer's code. I just compiled that code with LOOP_LOG2 set to 0 and found that it compiles to 77,724 LEs/Fmax=109.84MHz with the provided project settings, so probably 75K LEs can be achieved by optimizing for density (though this would reduce Fmax). The only problem with the theory that the 5CEA7 (Cyclone V with 150K LEs) could challenge the LX150 is that even if you did manage to fit two fully unrolled loops into it (it's quite difficult to fit the last 10%), I would be surprised if you could get Fmax over 100MHz due to the difficulty in efficiently routing such a design. And in terms of MH/$, I just don't see it happening given the significantly lower cost of the LX150s ($160 vs. $240 for qty 1). Depending on Altera's pricing on the 5CEA9 Cyclone Vs (300K LEs), there may be more potential there (whenever they become available).
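The MH/$ argument can be made concrete with a small sketch. It uses only the two qty-1 prices quoted above; the function names are hypothetical, and the 300 MH/s input is the estimate under discussion in this thread, not a measured result.

```python
LX150_PRICE_USD = 160.0   # qty 1, as quoted above
C5_PRICE_USD = 240.0      # 5CEA7, qty 1, as quoted above


def break_even_ratio(price_a, price_b):
    """How many times faster part A must hash than part B to match its MH/$."""
    return price_a / price_b


def mh_per_dollar(mhs, price_usd):
    """Hash rate per dollar of chip cost (board and power costs ignored)."""
    return mhs / price_usd


print(break_even_ratio(C5_PRICE_USD, LX150_PRICE_USD))
print(mh_per_dollar(300, C5_PRICE_USD))
```

At these prices the Cyclone V has to hash 1.5 times faster than an LX150 just to break even on chip cost, and even a 300 MH/s design would come to 1.25 MH/$ before board and power costs.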

Dexter770221

legendary

Activity: 1029

Merit: 1000

Quote from: Jason on April 04, 2012, 10:11:32 PM

Quote from: Dexter770221 on March 31, 2012, 09:30:02 AM

That's why you can fit a fully unrolled core into a 75k Cyclone part, while you need a 150k part from Xilinx to do the same.

Got a reference for this please? I have not seen any publicly available HDL code for the Cyclone IV that needs fewer than about 100K LEs (fully unrolled). This includes the code on fpgaminer's github (which includes some of Makomk's work) as well as Ztex's code.

https://bitcointalksearch.org/topic/m.446363

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

Got a reference for this please? I have not seen any publicly available HDL code for the Cyclone IV that needs fewer than about 100K LEs (fully unrolled). This includes the code on fpgaminer's github (which includes some of Makomk's work) as well as Ztex's code.

The makomk-based code fits into ~80K-90K LEs unrolled, if I recall correctly. Though it has been a while since I've compiled that code, so it might be less.

Jason

member

Activity: 114

Merit: 10

Quote from: Dexter770221 on March 31, 2012, 09:30:02 AM

That's why you can fit a fully unrolled core into a 75k Cyclone part, while you need a 150k part from Xilinx to do the same.

Got a reference for this please? I have not seen any publicly available HDL code for the Cyclone IV that needs fewer than about 100K LEs (fully unrolled). This includes the code on fpgaminer's github (which includes some of Makomk's work) as well as Ztex's code.

shad

full member

Activity: 148

Merit: 100

Quote from: rjk on March 31, 2012, 10:19:50 AM

Quote from: shad on March 31, 2012, 09:15:40 AM

Quote from: rjk on March 31, 2012, 08:46:23 AM

Don't forget that eldentyrell was able to squeeze 3 unrolled cores onto a 150k part

3 unrolled cores? AFAIK it's 3 cores with 0.5 hash per cycle each.

Yes, that's what I meant, but it's better than 2 cores with 0.5 hash per cycle.

AFAIK the other Spartan-6 LX150 miners have 1 core with 1 hash per cycle;
only eldentyrell has 3 cores with 0.5 hash per cycle each.

Yes, it's true that 1.5 is better than 1.
I am sure more people are trying to work out this nice 3-core solution; I think it's only a matter of time ;)

antirack

hero member

Activity: 489

Merit: 500

Immersionist

Quote from: Dexter770221 on March 31, 2012, 09:30:02 AM

Either way, >2MH/$ from FPGAs is coming ;)

How much and when can I get it? :D

rjk

sr. member

Activity: 462

Merit: 250

1ngldh

Quote from: shad on March 31, 2012, 09:15:40 AM

Quote from: rjk on March 31, 2012, 08:46:23 AM

Don't forget that eldentyrell was able to squeeze 3 unrolled cores onto a 150k part

3 unrolled cores? AFAIK it's 3 cores with 0.5 hash per cycle each.

Yes, that's what I meant, but it's better than 2 cores with 0.5 hash per cycle.

Dexter770221

legendary

Activity: 1029

Merit: 1000

Spartan-6 differs from Cyclone IV. In the former, only half of the slices have the carry-chain logic that is so important for implementing adders. In Altera products every LUT has a carry chain. That's why you can fit a fully unrolled core into a 75k Cyclone part, while you need a 150k part from Xilinx to do the same. Artix-7 will have carry chains in every LUT, but an ALM in Cyclone V will have a LUT and an adder. It's almost impossible to compare these products directly when it comes to predictions. Either way, >2MH/$ from FPGAs is coming ;)

shad

full member

Activity: 148

Merit: 100

Quote from: rjk on March 31, 2012, 08:46:23 AM

Don't forget that eldentyrell was able to squeeze 3 unrolled cores onto a 150k part

3 unrolled cores? AFAIK it's 3 cores with 0.5 hash per cycle each.

rjk

sr. member

Activity: 462

Merit: 250

1ngldh

Quote from: Dexter770221 on March 31, 2012, 06:04:53 AM

makomk achieved 27.7MH/s from a Cyclone IV 22k part. My guess is that it is a 220MHz core rolled 8 times. A fully unrolled core fits into a 75k Cyclone. That gives two cores on a 150k part, probably at 300MHz (28nm vs. 60nm), so 600MH/s may be possible...

Don't forget that eldentyrell was able to squeeze 3 unrolled cores onto a 150k part, although currently not getting a huge difference in speed (might later though, after some optimizations). Is there a 200k LUT part available? Might be able to fit 4 cores on it for 2 bitcoin hashing stages, instead of an odd 1.5 stages.

Dexter770221

legendary

Activity: 1029

Merit: 1000

makomk achieved 27.7MH/s from a Cyclone IV 22k part. My guess is that it is a 220MHz core rolled 8 times. A fully unrolled core fits into a 75k Cyclone. That gives two cores on a 150k part, probably at 300MHz (28nm vs. 60nm), so 600MH/s may be possible...
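The guess above can be worked backwards as a rough sanity check. The sketch assumes a core rolled N times produces one hash every N clocks, which is the poster's reading rather than a confirmed detail of makomk's design; the function names are hypothetical.

```python
def implied_clock_mhz(reported_mhs, roll_factor):
    """Clock a single core would need if it yields one hash every roll_factor clocks."""
    return reported_mhs * roll_factor


def projected_mhs(clock_mhz, cores):
    """Projected MH/s for fully unrolled cores (one hash per clock each)."""
    return clock_mhz * cores


# 27.7 MH/s from a core rolled 8 times implies a clock slightly above 220 MHz.
print(implied_clock_mhz(27.7, 8))

# The poster's projection: two unrolled cores at a hoped-for 300 MHz.
print(projected_mhs(300, 2))
```

The 600MH/s figure depends entirely on the 300MHz assumption. A small rolled core reaching about 220MHz says little about the Fmax of a large, fully unrolled design that has to route across the whole chip.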

Omni

newbie

Activity: 42

Merit: 0

What kind of mh/s?

Dexter770221

legendary

Activity: 1029

Merit: 1000

Quote from: mrb on March 30, 2012, 07:44:15 PM

Artix 7 should be cheaper than Cyclone V, because Xilinx claims Artix 7 matches its predecessor's performance (Spartan 6) but sells at "35% its cost and is twice more energy efficient" (can't remember where I obtained this info, I just jotted it down a while back). That said, please do keep us informed about performance numbers you may achieve...

Not exactly as you quoted.
http://www.xilinx.com/products/silicon-devices/fpga/artix-7/index.htm
"... and offers over two times the capacity, 30% higher performance, 50% lower power consumption -- and logic up to 350K logic cell density at lower price points than Spartan®-6 FPGAs."

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: Dexter770221 on March 30, 2012, 11:37:12 AM

Altera has always been a little more expensive than Xilinx, so a 200k Artix-7 may come in at the same or a slightly lower price. This is gonna be madness ;)

Artix 7 should be cheaper than Cyclone V, because Xilinx claims Artix 7 matches its predecessor's performance (Spartan 6) but sells at "35% its cost and is twice more energy efficient" (can't remember where I obtained this info, I just jotted it down a while back). That said, please do keep us informed about performance numbers you may achieve...

Dexter770221

legendary

Activity: 1029

Merit: 1000

Quote from: Gomeler on March 30, 2012, 12:00:24 PM

So, slightly higher per LUT but it should have lower power consumption? I wonder if we'll see a clockspeed bump with the shrink from 40nm(?) to 28nm. Hopefully the new FPGAs will drop the MH/$ costs for FPGAs so I can start mixing these in to the GPU farm.

A Cyclone IV with 115k LUTs is priced at $315, so this is much cheaper per LUT! The speed bump should be quite significant because CIVs are made at 60nm.
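The per-LUT comparison works out as follows. This sketch uses the two list prices mentioned in this thread ($315 for the 115k Cyclone IV, $239 for the 150k Cyclone V) and treats the marketing "k LUT" counts as directly comparable, which is an assumption given the different logic cell architectures.

```python
def usd_per_thousand_luts(price_usd, luts):
    """Price per 1,000 LUTs/LEs, using the part sizes as quoted in the thread."""
    return price_usd / (luts / 1000)


cyclone_iv = usd_per_thousand_luts(315, 115_000)
cyclone_v = usd_per_thousand_luts(239, 150_000)

print(f"Cyclone IV 115k: ${cyclone_iv:.2f} per 1k LUTs")
print(f"Cyclone V 150k:  ${cyclone_v:.2f} per 1k LUTs")
print(f"Cyclone V costs {cyclone_v / cyclone_iv:.0%} of the Cyclone IV price per LUT")
```

That is roughly $2.74 versus $1.59 per thousand LUTs at the quoted prices.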

Gomeler

hero member

Activity: 697

Merit: 500

So, slightly higher per LUT but it should have lower power consumption? I wonder if we'll see a clockspeed bump with the shrink from 40nm(?) to 28nm. Hopefully the new FPGAs will drop the MH/$ costs for FPGAs so I can start mixing these in to the GPU farm.

Dexter770221

legendary

Activity: 1029

Merit: 1000

I've just found this news in my email.
$239 for the 150k LUT part. Looks VERY promising.
http://www.buyaltera.com/scripts/partsearch.dll/multisearch?site=ALTERA&lang=EN&keywords=5CE+A7+FPGA+IC
Currently out of stock...
Altera has always been a little more expensive than Xilinx, so a 200k Artix-7 may come in at the same or a slightly lower price. This is gonna be madness ;)

makomk:
Considering your 27.5 MH/s, what is your estimate for this? 2x200MH/s?

Topic: Cyclone V now shipping! (Read 14077 times)