
Topic: An estimate of fpga performance - page 4. (Read 51502 times)

hero member
Activity: 644
Merit: 503
March 26, 2011, 06:35:45 AM
#44
If you guys are interested in my work, let me know, and I'll continue to post updates and such. Otherwise, I guess I'll just toil away in silence.

And a quick note:
The current design uses my PC to fetch work, and push it to the FPGA, as well as check for "Golden Tickets" (my funny internal name for valid nonces) and submit them when found. There's room in the pipelined design to put in a NIOS microprocessor. This could potentially use the Ethernet port on the dev kit to do all the fetching and submitting. That way it'd be totally automated, and headless.
Oh, I think there'll definitely be interest. My initial thoughts are that Mhash/W is superb, about 15 times better than a 5970. Mhash/s is still quite low, given the cost, but a "quad DE0-Nano board" version would be particularly interesting - $320 (compared with $400 or more for a second-hand 5970). Professional miners who are concerned about on-going electricity costs more than they are about fixed, up-front costs might very well be interested.
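
The "about 15 times" figure can be checked with a small Python sketch. The FPGA numbers (70 Mhash/s at 2.26 W) are the ones reported in this thread; the 5970 numbers (roughly 300 W, 550 Mhash/s stock to 600 Mhash/s overclocked) are taken from other posts in this thread and are treated as assumptions here. The function name is made up for illustration.

```python
def mhash_per_watt(mhash_per_s: float, watts: float) -> float:
    """Energy efficiency in Mhash/s per watt."""
    return mhash_per_s / watts


# Reported FPGA figures: 70 Mhash/s at 2.26 W.
fpga = mhash_per_watt(70, 2.26)

# Assumed HD 5970 figures quoted elsewhere in the thread.
gpu_stock = mhash_per_watt(550, 300)
gpu_overclocked = mhash_per_watt(600, 300)

print(f"FPGA: {fpga:.1f} Mhash/W")
print(f"vs stock 5970: {fpga / gpu_stock:.1f}x")
print(f"vs overclocked 5970: {fpga / gpu_overclocked:.1f}x")
```

With these inputs the ratio lands between roughly 15x and 17x, depending on which 5970 hash rate is used.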
hero member
Activity: 560
Merit: 517
March 26, 2011, 06:25:45 AM
#43
If you guys are interested in my work, let me know, and I'll continue to post updates and such. Otherwise, I guess I'll just toil away in silence.

And a quick note:
The current design uses my PC to fetch work, and push it to the FPGA, as well as check for "Golden Tickets" (my funny internal name for valid nonces) and submit them when found. There's room in the pipelined design to put in a NIOS microprocessor. This could potentially use the Ethernet port on the dev kit to do all the fetching and submitting. That way it'd be totally automated, and headless.
member
Activity: 112
Merit: 11
March 26, 2011, 06:16:51 AM
#42
this is relevant to my interests

don't mind me, just monitoring this thread
hero member
Activity: 560
Merit: 517
March 26, 2011, 05:49:03 AM
#41
Wow, this is such a coincidence! I was just browsing the forums tonight, and stumbled upon this thread. I finally registered an account just to post in this thread.

I've been working on an FPGA miner for the past few weeks! It's fully working*, currently running on my desk in front of me and generating some tasty shares. I'll give an overview of my work:

Current Performance
Device: Altera Cyclone 3 C120 Dev Kit
Performance: 70Mhash/s
Power: 2.26W
Efficiency: 30.9 Mhash/W


It's written in Verilog, all crafted painstakingly by hand. There are two alternative designs. One is a serial design composed of many SHA256 cores running in parallel, each core computing a hash in 64 cycles (2 cores needed for the full hash). Each full core (2 half cores) consumes about 2800 LEs. The second design (currently running in front of me) is a pipelined version with one LOOOOONNNNGGGG chain of hashing stages running in parallel. That design computes 1 full hash every clock cycle. It runs at a maximum of 70MHz right now. Actually, I haven't tried pushing it to its limit, so it may very well run much faster. I'm hoping for 100MHz.

These are my results after off-and-on work for a few weeks. I've actually put most of my efforts into the serial design, because the pipelined design takes at least an hour to synthesize each time. The serial design can currently fit 42 full cores into the C120, each running at 90MHz and computing a full hash every 64 cycles. That's about 59Mhash/s.

The latest revision of the pipelined design consumes 90,000 LEs, so it's pretty big. I'm working to cram it into <64,000LEs so I can get two of them in one C120 chip, and push their clock to 100MHz, giving me a whopping 200Mhash/s.
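
A quick sanity check of the throughput figures above, as a minimal Python sketch. It assumes an iterative core needs 64 clock cycles per full hash (as stated in the post) and that a fully unrolled pipeline completes one hash per clock. The function names are hypothetical.

```python
def serial_mhash(cores: int, clock_mhz: float, cycles_per_hash: int = 64) -> float:
    """Mhash/s for `cores` iterative cores, each finishing one hash every `cycles_per_hash` clocks."""
    return cores * clock_mhz / cycles_per_hash


def pipelined_mhash(pipelines: int, clock_mhz: float) -> float:
    """Mhash/s for fully unrolled pipelines, each producing one hash per clock."""
    return pipelines * clock_mhz


print(serial_mhash(42, 90))      # serial design: 42 full cores at 90 MHz
print(pipelined_mhash(1, 70))    # pipelined design as currently running
print(pipelined_mhash(2, 100))   # target: two pipelines at 100 MHz
```

The serial design works out to just over 59 Mhash/s, and the two-pipeline target to 200 Mhash/s, matching the numbers quoted.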

I haven't used the on-board power meter before, but if I'm reading it correctly the FPGA is currently using 2.26 Watts. That ... seems really low, but Altera's website verifies that that's actually above average for a C120, so I guess it's accurate. That's 31 Mhash/W, which is 1200% more efficient than the most efficient GPU listed on the Wiki. So efficient, it's basically free. Poor guy runs terribly hot though. I need to go put a fan on him...

The only downside is that this board in particular, the C120, costs $1000. The same design will easily fit into the DE2-115 board (from Terasic), which only costs $600. I have one of those too, so I'll test on him later. You're not likely to pay off that $600 quickly, though, so I guess it isn't economical yet. A reduced version may run in the DE0-Nano board, which is $80, but obviously it won't have the same performance (about 25%).

All my efforts are put into optimizing every last bit of the design, so we'll see how far I push the poor FPGA. It already out-performs my GTX 285 card, and at a fraction of the power cost, so I'm happy.

And I'm only getting started. Who wants to front the money to buy me a Stratix board and move this into HardCopy?

* By fully working, I really do mean it. It's happily submitting hashes to a pool. I was quite thrilled when my little baby submitted his first share.
member
Activity: 70
Merit: 10
March 25, 2011, 10:53:06 PM
#40
Say hypothetically that some mystery vendor releases a new chip capable of mining at 100x the power efficiency of existing cards, for 2x the price of a 5970. Would this mining hardware sell well? How many of you would buy such a magic box?

I understand that the difficulty would adjust to neutralize the increased power introduced by the new technology; however, that difficulty increase would also render the old technology irrelevant and would sort of force everyone to upgrade.

GPUs pretty much wiped out CPU mining last year. I wonder whether, with another step up in performance, current-generation GPUs could similarly be side-stepped completely.

Thoughts appreciated...
sr. member
Activity: 493
Merit: 250
January 01, 2011, 03:02:06 PM
#39
Quote
These are not available on OpenBSD or NetBSD.

Yes, that's true, friend! =)
newbie
Activity: 51
Merit: 0
January 01, 2011, 02:35:42 PM
#38
I don't know if there are NetBSD or OpenBSD drivers from ATI.

As I remember, OpenBSD may not have these drivers, but NetBSD should be fine: it has the most up-to-date drivers and is bleeding-edge technology.


It depends on CUDA/OpenCL support in the proprietary ATI/Nvidia drivers.

These are not available on OpenBSD or NetBSD.
sr. member
Activity: 493
Merit: 250
January 01, 2011, 12:17:57 AM
#37
I don't know if there are NetBSD or OpenBSD drivers from ATI.

As I remember, OpenBSD may not have these drivers, but NetBSD should be fine: it has the most up-to-date drivers and is bleeding-edge technology.
newbie
Activity: 32
Merit: 0
December 31, 2010, 10:22:21 PM
#36
Best OS to run bitcoin client? NetBSD? or OpenBSD?
The bitcoin client is OS independent, but the OpenCL driver for mining on an ATI Radeon runs only under Windows and Linux (for the Radeon 5970, Linux is recommended because you can't disable CrossFire under Windows). I don't know if there are NetBSD or OpenBSD drivers from ATI. You could take Debian or Ubuntu and install the driver from ATI.
sr. member
Activity: 493
Merit: 250
December 31, 2010, 09:12:01 PM
#35
Best OS to run bitcoin client? NetBSD? or OpenBSD?
newbie
Activity: 32
Merit: 0
December 30, 2010, 03:51:28 PM
#34
FPGA in my case is mainly for fun, but I won't refuse to try a CUDA/OpenCL graphics card either. I'm using about 600 W on average to keep a building frost-free at the moment.

As I read in a few threads here, the usage of GPUs isn't totally problem-free either, is it?

One HD5970 needs 300 W. Put one computer with 2 HD5970s in your building and you have 600 W. I don't know if the Windows driver supports 2 HD5970s at the same time, but Linux should. Of course you need an Internet connection in your building. You need the standard bitcoin client and m0mchil (or puddinpops) miner. http://bitcointalk.org/index.php?topic=1334.0;all
full member
Activity: 354
Merit: 103
December 30, 2010, 03:21:32 PM
#33
FPGA in my case is mainly for fun, but I won't refuse to try a CUDA/OpenCL graphics card either. I'm using about 600 W on average to keep a building frost-free at the moment.

As I read in a few threads here, the usage of GPUs isn't totally problem-free either, is it?

member
Activity: 114
Merit: 10
December 30, 2010, 12:19:37 PM
#32
I have an old Altera DE2-70 board I picked up for $300 (academic price) 1.5 years ago (Cyclone II).  Looks like the current model is a DE2-115 based on the Cyclone III FPGA.

I just did a bit of research on existing SHA-256 implementations for FPGAs, and I see that several companies sell high performance FPGA implementations (e.g. http://www.cast-inc.com/ip-cores/encryption/sha-256/index.html).  Taking the Cast implementation as an example:

"The processing of one 512-bit block is performed in 66 clock cycles and the bit-rate achieved is 7.75Mbps / MHz on the input of the SHA256 core."

Taking a clock rate of 132MHz as a reasonably conservative number for my older Cyclone II (Cast claims up to 280MHz on high performance FPGAs), this comes out to 2Mhps per block.  Cast's implementation uses around 2,531 LEs on the Cyclone.  My older DE2-70 board contains about 68,000 LEs.

Adding 10% overhead for communication/synchronization/etc, it should be possible to put 24 SHA-256 processors on my DE2-70.  That should allow up to 48Mhps peak processing rate (>80Mhps for the DE2-115 which can also be clocked faster).
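
The core-count estimate above can be reproduced with a few lines of Python. This is only a sketch of the back-of-the-envelope arithmetic, using the figures from the post (68,000 LEs, 2,531 LEs per core, 10% overhead, 132 MHz, 66 cycles per block). The function names are hypothetical.

```python
import math


def cores_that_fit(total_les: int, les_per_core: int, overhead: float = 0.10) -> int:
    """Number of SHA-256 cores that fit after reserving `overhead` of the LEs for glue logic."""
    usable_les = total_les * (1 - overhead)
    return math.floor(usable_les / les_per_core)


def core_rate_mhps(clock_mhz: float, cycles_per_block: int) -> float:
    """Millions of 512-bit blocks hashed per second by a single core."""
    return clock_mhz / cycles_per_block


n_cores = cores_that_fit(68_000, 2_531)
per_core = core_rate_mhps(132, 66)
print(n_cores, per_core, n_cores * per_core)
```

With these inputs the sketch gives 24 cores at 2 Mhps each, i.e. the 48 Mhps peak mentioned above.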

Another question:  How much communications bandwidth is needed at these speeds, and can it fit on a 100baseT channel?  Certainly not if we want the host to transfer all of the candidates to be hashed onto the FPGA (48M * 512 bits = 24.6Gbps -- well above even gigabit Ethernet speeds).  Is there another approach that can overcome this limitation?  I think so...
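
For reference, the link bandwidth needed to ship every candidate from the host can be worked out as follows. This is a rough Python sketch, not a measurement: sending a full 512-bit block per candidate at 48 million candidates per second comes to about 24.6 Gbps, and about 12.3 Gbps if only 256 bits per candidate were sent. The function name is made up for illustration.

```python
def link_gbps(candidates_per_s: float, bits_per_candidate: int) -> float:
    """Host-to-FPGA bandwidth in Gbps if every candidate is transferred in full."""
    return candidates_per_s * bits_per_candidate / 1e9


full_block = link_gbps(48e6, 512)   # one 512-bit block per candidate
half_block = link_gbps(48e6, 256)   # only 256 bits per candidate

FAST_ETHERNET_GBPS = 0.1            # 100baseT
print(full_block, half_block, full_block <= FAST_ETHERNET_GBPS)
```

Either way the requirement is two orders of magnitude beyond a 100baseT link.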

FPGAs have room for a dedicated CPU as well as a lot of logic, depending on what level of functionality you need in the CPU.  There are a lot of free and powerful CPU cores available on opencores.org, but it will be hard to beat the Nios II architecture if you are using Altera FPGAs.

A 32-bit Nios II/f CPU core is capable of 140 MIPS of performance (at 125MHz) and uses 1600 LEs on the Cyclone II.  Is this sufficient to keep 24 high-speed SHA-256 blocks from stalling?  Not even close.  In fact, it would probably not even be able to keep even one SHA-256 block from stalling.  Back to the drawing board...

It looks like a better approach would be to implement the search logic directly in gates on the FPGA, and have it fill one or more 256-bit-wide queue(s) which would be drawn on by the SHA-256 processing blocks.  A single NIOS II CPU still makes sense for collecting the results and communicating the results back to the host CPU (TCP/IP stack), as well as to load the search logic starting and ending values.
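
To see why generating candidates on the FPGA removes the bandwidth problem, here is a hedged Python sketch. It assumes (this is not stated in the post) that the host sends one 80-byte Bitcoin block header per work item and that the on-chip search logic counts through the 32-bit nonce range by itself. All names are hypothetical.

```python
HEADER_BITS = 80 * 8    # assumption: one 80-byte block header per work item
NONCE_SPACE = 2 ** 32   # assumption: the FPGA iterates the 32-bit nonce on-chip


def seconds_per_work_item(hashes_per_s: float, nonces: int = NONCE_SPACE) -> float:
    """Time the FPGA needs to exhaust one work item's nonce range."""
    return nonces / hashes_per_s


def host_link_bps(hashes_per_s: float, bits_per_item: int = HEADER_BITS) -> float:
    """Average host-to-FPGA bandwidth when only work items are transferred."""
    return bits_per_item / seconds_per_work_item(hashes_per_s)


print(seconds_per_work_item(48e6))  # seconds of work per header at 48 Mhps
print(host_link_bps(48e6))          # bits per second needed from the host
```

Under these assumptions a new work item is needed only about every 90 seconds, so the host link carries a few bits per second rather than gigabits.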

Anyway, my back-of-the-envelope calculations seem to confirm almost everything ArtForz is saying below.  It looks like the ATI 5970s are the right choice if your goal is to crunch bitcoins.

OTOH, if you want an excuse to learn how to program FPGAs, you will certainly be able to run circles around a state-of-the-art hex-core i7 CPU with a pretty modest FPGA -- but at considerable effort.

Jason

The real issue on FPGA isn't the logic ops(cheap) or the rotates(pretty much free), but the 32-bit adds.
A_out = H + s0 + s1 + maj + ch + K + W
-> at least 3 level adder tree ((H + s0) + (s1 + maj)) + ((ch + K) + W)
Carry chain delay in a single 32-bit adder on a -3 speed grade Spartan6 is ~2ns, so without ANY routing delays we're already limited to 166MHz.
Real-world you're lucky to get 80MHz out of a non-pipelined round on a -3 S6
Pipelining a round to 2 or 3 stages helps, but increases FF usage a LOT (you have to carry 256 bits of A..H, 512 bits of W[0..15] and the initial A..H for the final add around).
2-stage gives ~140MHz on a -3, 3-stage ~180MHz
= a 2-stage pipelined sha256 round is ~1k FFs, 3-stage pipelined ~1.5k FFs
XC6SLX150 has something like 160k FFs available, and the synthesis tools pretty much throw speed out the window once you go >70% FF utilization.
so realistically you MIGHT be able to fit 64 2-pipelined rounds of sha256 on a LX150, 2 clocks/bitcoinhash @ 140MHz -> 70Mh/s
or maybe with lots of luck and sacrificing a chicken to the place and route gods 48 rounds 3-stage @ 180MHz -> 68Mh/s
= 70Mh/s on a -3 speed grade XC6SLX150, 20%-30% less on a -2 speed grade.
so 9 grand for MAYBE 850Mh/s... a $500 HD5970 can get >550Mh/s stock, well >600Mh/s OCed at stock voltage even on a "bad" card.

okay, let's be REALLY generous, assume we can magically get 1.2Gh/s out of 12 150-2s and they consume NO POWER AT ALL.
So how long does it take at 600W for 2 5970s and $0.10/kWh to make up that $8k price difference? 0.6kW @ $0.10 kWh = $1.44/day ... about 15 years.
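
The clock-rate and throughput estimates in the quoted post can be reproduced with a short Python sketch. It assumes a bitcoin hash costs 128 SHA-256 rounds (two 64-round passes), which is what "2 clocks/bitcoinhash" with 64 rounds on chip implies, and it ignores routing delay just as the post does. The function names are made up for illustration.

```python
def fmax_mhz(adder_levels: int, adder_delay_ns: float) -> float:
    """Upper bound on clock rate when the critical path is a chain of adders."""
    return 1000.0 / (adder_levels * adder_delay_ns)


def bitcoin_mhash(rounds_on_chip: int, clock_mhz: float, rounds_per_hash: int = 128) -> float:
    """Mhash/s when `rounds_on_chip` pipelined rounds share the work of one bitcoin hash."""
    return clock_mhz * rounds_on_chip / rounds_per_hash


print(fmax_mhz(3, 2.0))         # 3-level adder tree, ~2 ns per 32-bit adder
print(bitcoin_mhash(64, 140))   # 64 rounds, 2-stage pipelining
print(bitcoin_mhash(48, 180))   # 48 rounds, 3-stage pipelining
```

This gives roughly 166 MHz, 70 Mh/s and 67.5 Mh/s respectively, in line with the quoted figures.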
full member
Activity: 354
Merit: 103
December 29, 2010, 05:54:41 PM
#31
Wow, those are some really impressive calculations.

$9k... too bad Xmas is over for this year :-)

Smart idea to pipeline the adders. Does it mean you spend more flip-flops but not more gates?

I was thinking of getting an old-fashioned XC3S500 for a reasonable price. At 1k-1.5k flip-flops, maybe it would be possible to fit one of the 64 pipelined SHA modules into one chip?

So, if I'm lucky I could get it running at 60-70 MHz meaning a full sha would take about 1us and that would give me 0.5 MHash/sec, right?
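
A rough check of that estimate in Python, assuming a single round module that is reused 64 times per SHA-256 pass, completes one round per clock, and needs two passes per bitcoin hash. These are simplifying assumptions for illustration, not a description of any real design.

```python
def iterative_mhash(clock_mhz: float, rounds_per_sha: int = 64, shas_per_hash: int = 2) -> float:
    """Mhash/s for one round module reused for every round of every SHA-256 pass."""
    return clock_mhz / (rounds_per_sha * shas_per_hash)


def sha_time_us(clock_mhz: float, rounds_per_sha: int = 64) -> float:
    """Microseconds for one full SHA-256 pass at one round per clock."""
    return rounds_per_sha / clock_mhz


for mhz in (60, 64, 70):
    print(mhz, sha_time_us(mhz), iterative_mhash(mhz))
```

At 64 MHz this is exactly 1 us per SHA pass and 0.5 Mhash/s, so the estimate holds across the 60-70 MHz range to within about 10%.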

It's almost as fast as my old computer, which runs at 0.7 MHash/s :-)
sr. member
Activity: 406
Merit: 257
December 28, 2010, 01:25:22 PM
#30
The real issue on FPGA isn't the logic ops(cheap) or the rotates(pretty much free), but the 32-bit adds.
A_out = H + s0 + s1 + maj + ch + K + W
-> at least 3 level adder tree ((H + s0) + (s1 + maj)) + ((ch + K) + W)
Carry chain delay in a single 32-bit adder on a -3 speed grade Spartan6 is ~2ns, so without ANY routing delays we're already limited to 166MHz.
Real-world you're lucky to get 80MHz out of a non-pipelined round on a -3 S6
Pipelining a round to 2 or 3 stages helps, but increases FF usage a LOT (you have to carry 256 bits of A..H, 512 bits of W[0..15] and the initial A..H for the final add around).
2-stage gives ~140MHz on a -3, 3-stage ~180MHz
= a 2-stage pipelined sha256 round is ~1k FFs, 3-stage pipelined ~1.5k FFs
XC6SLX150 has something like 160k FFs available, and the synthesis tools pretty much throw speed out the window once you go >70% FF utilization.
so realistically you MIGHT be able to fit 64 2-pipelined rounds of sha256 on a LX150, 2 clocks/bitcoinhash @ 140MHz -> 70Mh/s
or maybe with lots of luck and sacrificing a chicken to the place and route gods 48 rounds 3-stage @ 180MHz -> 68Mh/s
= 70Mh/s on a -3 speed grade XC6SLX150, 20%-30% less on a -2 speed grade.
so 9 grand for MAYBE 850Mh/s... a $500 HD5970 can get >550Mh/s stock, well >600Mh/s OCed at stock voltage even on a "bad" card.
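
To make the seven-operand add concrete, here is a software model of one SHA-256 round in Python, with A_out written as the three-level adder tree from the post. It is a plain reference sketch for checking the arithmetic, not HDL and not anyone's actual miner code. The constants are derived from cube and square roots of the first primes instead of being typed in, and the helper names are made up for illustration.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Round constants: first 32 fractional bits of the cube roots of the first 64 primes.
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
# Initial state: first 32 fractional bits of the square roots of the first 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def round_step(state, k, w):
    a, b, c, d, e, f, g, h = state
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    maj = (a & b) ^ (a & c) ^ (b & c)
    ch = (e & f) ^ (~e & g)
    # Seven operands, grouped as a three-level adder tree.
    a_out = (((h + s0) + (s1 + maj)) + ((ch + k) + w)) & MASK
    e_out = (d + h + s1 + ch + k + w) & MASK
    return (a_out, a, b, c, e_out, e, f, g)


def compress(state, block):
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        x, y = w[i - 15], w[i - 2]
        sig0 = rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
        sig1 = rotr(y, 17) ^ rotr(y, 19) ^ (y >> 10)
        w.append((w[i - 16] + sig0 + w[i - 7] + sig1) & MASK)
    s = tuple(state)
    for i in range(64):
        s = round_step(s, K[i], w[i])
    # Final add of the initial A..H, as mentioned in the post.
    return [(x + y) & MASK for x, y in zip(state, s)]


def sha256_one_block(msg):
    """SHA-256 of a message short enough (<= 55 bytes) to fit in one padded block."""
    assert len(msg) <= 55
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, block)).hex()


print(sha256_one_block(b"abc") == hashlib.sha256(b"abc").hexdigest())
```

The comparison against hashlib is there only to confirm the model; in hardware each `+` in `a_out` is a 32-bit carry chain, which is why the grouping matters for timing.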

okay, let's be REALLY generous, assume we can magically get 1.2Gh/s out of 12 150-2s and they consume NO POWER AT ALL.
So how long does it take at 600W for 2 5970s and $0.10/kWh to make up that $8k price difference? 0.6kW @ $0.10 kWh = $1.44/day ... about 15 years.
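
The payback arithmetic, spelled out as a small Python sketch using only the figures from the post ($8k price difference, 600 W for two 5970s, $0.10/kWh, and FPGAs that are generously assumed to draw no power). The function names are hypothetical.

```python
def daily_power_cost_usd(power_kw: float, usd_per_kwh: float) -> float:
    """Electricity cost of running `power_kw` around the clock for one day."""
    return power_kw * 24 * usd_per_kwh


def payback_years(price_gap_usd: float, power_kw: float, usd_per_kwh: float) -> float:
    """Years until the saved electricity pays back the extra hardware cost."""
    return price_gap_usd / daily_power_cost_usd(power_kw, usd_per_kwh) / 365


print(daily_power_cost_usd(0.6, 0.10))
print(payback_years(8000, 0.6, 0.10))
```

This gives $1.44 per day and a payback time of a little over 15 years.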
newbie
Activity: 7
Merit: 0
December 28, 2010, 12:53:05 PM
#29
The EFF built Deep Crack for less than $250,000. I thought they made it with custom ASIC DES chips (called Deep Crack or AWT-4500): http://en.wikipedia.org/wiki/Deep_crack.
That appears to have been 1998.  You might be able to do it for a few hundred thousand, but you would start with FPGAs and then hardwire.
newbie
Activity: 12
Merit: 0
December 28, 2010, 09:03:02 AM
#28
NVIDIA is geared towards floating point, while bitcoin's SHA256 algorithm wants integer math.

ATI GPUs are better at this.

You may be right; I don't know the instruction sets of each GPU brand/type well.

The instructions usually used by SHA-256 implementations (IMHO by all SHA-2 implementations, since they use the same scheme and only the size differs) are the bit-wise operators (AND, OR, NOT and XOR) on 32-bit words, the right-shift instruction, and the rotate right/left instructions.

A comparison of the cycles required for each of these instructions per platform (FPGA, GPU, Cell-like or other SIMD) could be useful. I don't know if someone on the forum has already made one, along with a rough estimate of the cost per technology.

On the other hand, building something for SHA-2 that can be reused for other projects relying on SHA-2 is not a waste of time/money.

If you or someone else builds something in that scope, I will be willing to invest some time and money in the project.




legendary
Activity: 1596
Merit: 1100
December 27, 2010, 07:09:23 PM
#27
NVIDIA is geared towards floating point, while bitcoin's SHA256 algorithm wants integer math.

ATI GPUs are better at this.
newbie
Activity: 12
Merit: 0
December 27, 2010, 05:35:39 PM
#26

- If this is a pure code breaking application, you are probably better off with FPGAs than GPUs, but it is very easy to gang together a few Xboxes.  FPGAs are harder to come by.

Looking at the price of the FPGA plus the design of a custom board, why not go for one (or more) Nvidia Tesla C2050/C2070 boards? (The price is around 3000 USD per board, with 448 GPU cores.)

http://www.nvidia.com/docs/IO/43395/BD-04983-001_v04.pdf
http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html

newbie
Activity: 32
Merit: 0
December 27, 2010, 03:44:55 PM
#25
mike_la_jolla checking in here to clarify some FPGA questions.

- Those of you that think you can do a custom ASIC are nuts.  The expense and effort of an ASIC would cost millions ($USD).  The Genomic search market isn't even large enough to support a custom ASIC.

The EFF built Deep Crack for less than $250,000. I thought they made it with custom ASIC DES chips (called Deep Crack or AWT-4500): http://en.wikipedia.org/wiki/Deep_crack.

How can you help? A full implementation would be great. I would give you 150BTC for a miner implementation (VHDL or Verilog) on Spartan-6. Maybe there are other users who would donate. You should write your Bitcoin address in your signature.