
Topic: Difficulty: More nodes active, or faster nodes? (Read 10933 times)

legendary
Activity: 1708
Merit: 1010
September 01, 2010, 07:36:19 PM
#21
I've been looking into this CUDA versus FPGA thing more, and it seems to be an ongoing debate even within the high-performance computing industry.  But then I found this little tidbit...

http://cadlab.cs.ucla.edu/~cong/papers/FCUDA_extAbstract_ICS09_final3.pdf

These guys figured out a way to automagically configure an FPGA with GPU-like 'stream' processors suited to the types of executions that a particular CUDA program uses.  Not only does this allow a program written in CUDA for a GPU to run on an FPGA without modifications, it also saves space on the FPGA by not implementing functions that the CUDA program in question doesn't require, potentially permitting more "streams" than would otherwise be possible.

I wonder how long it will be until some major GPU manufacturer such as Nvidia puts an FPGA on a graphics card?

Currently, however, a CUDA-capable graphics card in a modern gamer system would be the best cost/performance value, if only because one still needs a graphics card anyway.

Going out of your way to buy a dedicated graphics card such as an 'Nvidia Tesla' card, in addition to the one required for a modern system, is probably not a cost/performance advantage over an FPGA card bought for the same purpose and encoded with CUDA-capable stream processors.

YMMV
sr. member
Activity: 252
Merit: 250
A slight offtopic:

There are ready-to use (in the sense that you don't have to manufacture hardware) PCI Express FPGA accelerators.

http://www.google.com/search?q=pci+express+fpga+accelerators

Nallatech even has low-profile cards that I suspect will fit into 1U servers.

My questions are:

1) How much money? I suspect they are in the range of NVidia Tesla prices ($1000+), and the dev kit is not free.
2) How much speed? I have a string processing application doing mainly dictionary lookups, so unlike Bitcoin it certainly needs fast local memory. Is it possible to get better performance/USD than with Xeons?
3) How hard is it to program? Is it much harder than CUDA? Are C compilers for FPGA any good, or can I get 3x speedups using VHDL instead of C?
legendary
Activity: 1708
Merit: 1010

That's probably just the usual weekend bump.


There's a difference between weekdays and weekends?  That would make me think that some IT guys are turning idle CPUs into a hash farm on the weekends.
full member
Activity: 307
Merit: 102
Yow -- anyone look at the 24hr chart today?  Someone brought some serious hash power on-line.

It's now almost double yesterday.



That's probably just the usual weekend bump.
legendary
Activity: 1708
Merit: 1010
Yow -- anyone look at the 24hr chart today?  Someone brought some serious hash power on-line.

It's now almost double yesterday.



Oh, don't be so coy, Ground!  We are all proud of what you have accomplished.  How many did you get to fit on your FPGA in the end?
member
Activity: 111
Merit: 10
Yow -- anyone look at the 24hr chart today?  Someone brought some serious hash power on-line.

It's now almost double yesterday.

member
Activity: 111
Merit: 10
You're on the right track.

The I/O requirements are almost zero.  You just need the 80 initial bytes, target hash, and a starting nonce.  You really don't need to talk to the FPGA again until you receive more transactions you want to include, or another block.  For speed, you would test against the target hash in-circuit, and the only output is on 'success'.
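A minimal Python sketch of that loop, assuming the usual Bitcoin header layout (76 fixed bytes followed by a 4-byte little-endian nonce) and that the double SHA-256 result is read as a little-endian 256-bit integer and compared against the target. The function names are illustrative and the demo target is deliberately easy; in an FPGA the hashing and comparison would be logic, and only the winning nonce would be sent back.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as used for Bitcoin proof of work."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def search(header76: bytes, target: int, start_nonce: int, count: int):
    """Try `count` nonces starting at `start_nonce`; return the first winner or None.

    Nothing leaves the loop except a winning nonce, mirroring the idea that the
    hardware stays silent until it finds a hash at or below the target.
    """
    assert len(header76) == 76, "assumed: 76 fixed header bytes before the nonce"
    for i in range(count):
        nonce = (start_nonce + i) & 0xFFFFFFFF  # 32-bit nonce, wraps around
        header = header76 + struct.pack("<I", nonce)
        value = int.from_bytes(double_sha256(header), "little")
        if value <= target:  # the in-circuit success test
            return nonce
    return None
```

With an easy demo target such as `2**252` (about a 1 in 16 chance per try), `search(bytes(76), 2**252, 0, 1000)` finds a nonce almost immediately; real targets are astronomically harder.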

No need to randomize either -- any kind of value walk (increment, grey code, whatever) should suffice.
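For illustration, here is one such walk in Python: a binary-reflected Gray code. Like a plain increment, it visits every value exactly once; the difference is that consecutive values differ in exactly one bit, which is one possible reason to prefer it in logic. The 8-bit width used when trying it out is just to keep things small; a real nonce is 32 bits.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def gray_walk(bits: int):
    """Yield every `bits`-bit value exactly once, flipping one bit per step."""
    for i in range(1 << bits):
        yield gray(i)
```

The walk starts 0, 1, 3, 2, ... and still covers the whole range, so no value is skipped or repeated.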

So yes --- the BTC 64,000 question is how many unrolled hash rounds you can fit into your part(s).
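To make "hash rounds" concrete, here is a software model of SHA-256 in Python (a sketch, not a hardware design). `sha256_round` is the unit that a fully unrolled core would instantiate 64 times per compression; the constants are derived exactly from the prime roots rather than typed in, and the result is cross-checked against Python's `hashlib`.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32


def _primes(n):
    """First n primes by trial division (small n only)."""
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found if p * p <= c):
            found.append(c)
        c += 1
    return found


def _icbrt(x):
    """Largest integer r with r**3 <= x (exact, no floating point)."""
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Round constants: first 32 fractional bits of the cube roots of the first 64 primes.
K = [_icbrt(p << 96) & MASK for p in _primes(64)]
# Initial state: first 32 fractional bits of the square roots of the first 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in _primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_round(v, k, w):
    """One SHA-256 round: the unit a fully unrolled core would replicate."""
    a, b, c, d, e, f, g, h = v
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + k + w) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def compress(state, block):
    """Process one 64-byte block: message schedule plus 64 rounds."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    v = tuple(state)
    for i in range(64):  # unrolled hardware would use one stage per iteration
        v = sha256_round(v, K[i], w[i])
    return [(x + y) & MASK for x, y in zip(state, v)]


def sha256(msg):
    """Pad the message, then run compress() over each 64-byte block."""
    padded = (msg + b"\x80" + b"\x00" * ((55 - len(msg)) % 64)
              + struct.pack(">Q", 8 * len(msg)))
    state = H0
    for off in range(0, len(padded), 64):
        state = compress(state, padded[off:off + 64])
    return struct.pack(">8I", *state)


def matches_hashlib(msg):
    """Cross-check this model against Python's built-in implementation."""
    return sha256(msg) == hashlib.sha256(msg).digest()
```

An 80-byte header spans two blocks, so one double-SHA-256 attempt runs `compress` three times: 192 rounds in all.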
legendary
Activity: 1708
Merit: 1010
I've been thinking about this a bit more, and I thought of something.  If a programmer were trying to put as many sha-256 coprocessors as possible on the same FPGA chip, and I/O was his real limitation, he wouldn't need to set it up so that the processors report the entire hash upon success.  All that would be necessary is that the coprocessor report the nonce used to get that hash, and the CPU use that nonce once to get the proper result.  So fewer pins would be required for the output of each coprocessor.  Also, the CPU could supply the initial nonce for each coprocessor, so a random number generator would not be necessary, maybe?  I would love to try even just one custom-cut coprocessor on one of these FPGAs.  Properly done, I'd bet it would be sick fast.
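A sketch of both ideas in Python, with hypothetical helper names: the CPU splits the 32-bit nonce space into non-overlapping slices and hands each coprocessor a (start, count) pair, and when a coprocessor reports a nonce, the CPU recomputes the hash once to confirm it. The header layout (76 bytes plus a little-endian nonce) and the little-endian comparison against the target are assumptions about how such a coprocessor would be wired up.

```python
import hashlib
import struct

NONCE_SPACE = 1 << 32  # the header's nonce field is 32 bits


def nonce_ranges(cores, space=NONCE_SPACE):
    """Split [0, space) into `cores` contiguous, non-overlapping (start, count) slices."""
    base, extra = divmod(space, cores)
    ranges, start = [], 0
    for i in range(cores):
        count = base + (1 if i < extra else 0)  # spread any remainder over the first slices
        ranges.append((start, count))
        start += count
    return ranges


def double_sha256(data):
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def verify(header76, nonce, target):
    """Host-side check of a nonce reported by a coprocessor: hash once, compare once."""
    digest = double_sha256(header76 + struct.pack("<I", nonce))
    return int.from_bytes(digest, "little") <= target
```

With three cores, `nonce_ranges(3)` gives the first core one extra nonce and the slices meet end to end, so no two cores ever test the same value.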
lfm
full member
Activity: 196
Merit: 104
Holy crap!  Are you serious?  This is an average desktop these days?  I'm still using a single core P3 running Ubuntu.  Well, at least I was until my power supply crapped out and killed my motherboard.  Can you believe they want $200 to upgrade!?

Note you can get a new motherboard with the VIA C7 CPU included for $65 or less (I got mine from newegg online). Just add memory (you might also need a newer power supply to get the right connectors) and you'd have a nice upgrade from that old P3. It runs at 1.8 GHz and Ubuntu works fine too.

I am working to get bitcoin software working with the built-in SHA instructions of the C7, but I can't seem to get it quite right yet. Soon tho, and then it should be available to anyone.
member
Activity: 111
Merit: 10
Best thread evar.
What would you even do with VHDL code, anyway?  With your Ubuntu Pentium-III?  Yeah.

My point was that if you are willing to spend a summer learning new tools & technologies and decomposing SHA256 to map around your hardware of choice, with a master/controller to mate with it, you're probably best off buying a moderate GPU (GeForce GT240 at $90, or GTX460 $200) and diving into CUDA.  At least two people here have already done it.

An Intel i7 870 (3GHz, 4core*HT) is $265, so I'm not talking about exotic gear here.

My whole point is that 5 million hash tries a second is nothing to shake a stick at.
legendary
Activity: 1708
Merit: 1010
I'm not an FPGA expert, but I dabble.


Is that right?  So I guessed correctly, there is at least one member of the bitcoin community with access to at least one FPGA.  Are you going to post your code now that you have been outed?

Quote

I asked some folks...



And I buy Rogaine for a friend of mine...

Quote


The upshot was that a modern desktop machine has such a huge clockspeed advantage (8 cores at 3 GHz?),



Holy crap!  Are you serious?  This is an average desktop these days?  I'm still using a single core P3 running Ubuntu.  Well, at least I was until my power supply crapped out and killed my motherboard.  Can you believe they want $200 to upgrade!?

Quote

so only a massively parallel implementation would have a chance of competing. 


My root point was that those who already own these chips don't really need to "compete".  The FPGA would simply add to what their CPU(s) can already do.  They don't have to *replace* the CPU.

Quote


 There really isn't much of an I/O constraint, so it's just grinding on the nonce and testing for success.  Most of the commercial SHA256 cores focus on I/O bandwidth, for the typical application of passing a lot of data into the hash. 


So do you think that a coprocessor using local memory on an FPGA could go faster than the hardware on a VIA C7, since it wouldn't face the same I/O limitations?

Quote

 This is something much different, with self-generated inputs and a test on the output of each cycle.

So.. How many rounds can you fit in a million gates?


I'm guessing at least four, maybe as many as 10.  It's harder to guess if one is using local memory; that stuff can be wicked fast on some things.

Quote

A modern GPU and some crafty OpenCL/CUDA code seems like a better avenue for research, rapid turnaround, and scalable speeds.  You'd have both high clock speeds and parallelism at work.


Sure, for those who have a GPU and not an FPGA.  Sure, 55 MHz doesn't seem like much compared to 8 cores at 3 GHz, but a sha-256 setup in software can't process a hash in a single cycle, whereas a well-made 'solid state machine' (which, as I understand it, is what an FPGA program actually is) can process one hash in a single cycle, if the I/O isn't a limitation anymore.  Multiply that by however many cores can be programmed onto a single chip, and there is serious potential here.
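A back-of-envelope model of that argument in Python. The "one hash per cycle" figure assumes a fully pipelined design with a separate stage for every round, so it only holds once the pipeline is full and only if that many stages fit on the chip, which is exactly what is in question here. The 192-cycle figure for a non-pipelined core assumes one round per clock and 3 compressions per attempt with no reuse of the first-block state; both are assumptions, not measurements.

```python
ROUNDS_PER_COMPRESSION = 64
# Assumption: 2 compressions for the 80-byte header + 1 for the 32-byte digest.
COMPRESSIONS_PER_ATTEMPT = 3


def hashes_per_second(clock_hz, cores, cycles_per_hash):
    """Throughput of `cores` identical units, each finishing one hash every `cycles_per_hash` clocks."""
    return clock_hz * cores / cycles_per_hash


CLOCK_HZ = 55e6  # the clock figure used in the post above

# Fully pipelined: one result per clock once all 192 stages are full.
pipelined = hashes_per_second(CLOCK_HZ, 1, 1)
# Iterative: one round per clock, so 64 * 3 = 192 clocks per attempt.
iterative = hashes_per_second(CLOCK_HZ, 1, ROUNDS_PER_COMPRESSION * COMPRESSIONS_PER_ATTEMPT)
```

That gives 55 Mhash/s for one full pipeline versus roughly 0.29 Mhash/s for one iterative core; how many of either fit on a given part is the open question.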
legendary
Activity: 1708
Merit: 1010

Excuse me but you need to be more explicit. First does it need a "talented programmer" or is it "child's play"?


To a programmer talented enough to hack his GPU, hacking one or more parallel sha-256 coprocessors into one or more FPGAs would be child's play.

Quote

I understand you are excited about the possibilities but you are making grandiose claims without evidence, nor even concrete estimates of the performance you expect.

How bout some actual numbers? How many SHA256 hashes can you really do in parallel on your FPGA? Please state the actual model number of the FPGA you expect to use. What actual data rates are expected?


To be honest, I was over the top when I said that a set of four FPGAs would look like a supercomputer.  I can't really say what can be expected until I try it, but my own (admittedly limited) experience with FPGAs is that a single such chip can replicate a pretty complex shortwave receiver, tuning and all, including all of the currently popular shortwave modes, with zero aid from the master CPU.  Two are required only for the experimental modes, and they may even be better/faster these days, as it has been a number of years since I played with these things.  They are certainly cheaper.

At first glance at the complexity of the sha-256 algorithm, I would expect to be able to get at least four such coprocessors into a chip comparable to the kind that I have used in the past, each of which should have a kh/s rate somewhat less than the hardware found on a VIA C7, assuming that they are implemented in pretty much the same way.  The 'virtualization' of the solid-state circuit within an FPGA does impose a slight (yet measurable) penalty, but I don't think it would be high enough to really matter.  If all four chips, each with four coprocessors, could be successfully coded and utilized, I would expect the resulting kh/s rate to total at least 14 times what the single coprocessor on the VIA C7 can run.  And that doesn't include the additional kh/s that the CPU or the GPU could add to the mix.
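The estimate above is simple multiplication; here it is spelled out in Python. The 90% per-core efficiency is a hypothetical stand-in for "somewhat less than" the VIA C7 unit, and the absolute figure borrows the roughly 1500 khash/s C7 number quoted elsewhere in this thread.

```python
def aggregate_rate(chips, cores_per_chip, per_core_rate, efficiency):
    """Total rate if every core runs independently at `efficiency` x `per_core_rate`."""
    return chips * cores_per_chip * per_core_rate * efficiency


# Relative to one VIA C7 SHA unit (per_core_rate = 1.0), with a hypothetical 90% efficiency.
relative = aggregate_rate(4, 4, 1.0, 0.9)
# Per-core efficiency needed for "at least 14 times" with 4 chips x 4 cores.
needed_efficiency = 14 / (4 * 4)
# Same guess in hash/s, using the ~1500 khash/s C7 figure from the thread.
absolute = aggregate_rate(4, 4, 1.5e6, 0.9)
```

That comes to about 14.4 C7-equivalents (roughly 21.6 Mhash/s), and the 14x figure requires each core to reach 87.5% of the C7 unit's speed.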

Quote


Beyond that is the price of the chips, the price of the developer environment and the power requirements. Any of these can be very significant barriers to this idea.

Peace.





Sure, but I was assuming that someone already has a set of these chips for experimental ham radio hobbies, in the same way that using GPUs to crunch numbers assumes that one already has the GPU.  It doesn't make economic sense to buy these chips for this reason any more than it makes sense to buy extra graphics cards solely to generate bitcoins.  Honestly, I don't know if this is the reason, and I can't know.  But if someone has done this, it's another game changer for the bitcoin community.

Anyway, I imagine that FPGA chips will eventually become a standard thing to have in a high-end PC, and most OSes will be altered to take advantage of them on a regular basis.  Can you imagine what a game company could do if one of these were in every PS3?
member
Activity: 111
Merit: 10
I'm not an FPGA expert, but I dabble.

I asked some folks (much smarter than I) to run some back-of-envelope calculations for my preferred Xilinx Spartan-3E.
It has 1200k gates, runs at 50 MHz, and so on.  You can get into it for $150 or so.  (The Digilent Nexys2 is hard to beat.)

The upshot was that a modern desktop machine has such a huge clockspeed advantage (8 cores at 3 GHz?), so only a massively parallel implementation would have a chance of competing.  There really isn't much of an I/O constraint, so it's just grinding on the nonce and testing for success.  Most of the commercial SHA256 cores focus on I/O bandwidth, for the typical application of passing a lot of data into the hash.  This is something much different, with self-generated inputs and a test on the output of each cycle.

So.. How many rounds can you fit in a million gates?

A modern GPU and some crafty OpenCL/CUDA code seems like a better avenue for research, rapid turnaround, and scalable speeds.  You'd have both high clock speeds and parallelism at work.
lfm
full member
Activity: 196
Merit: 104
A lot of hand waving there. For some concrete numbers it quotes 53 MB/s, and since we only hash 192 bytes at a time, you might think it would do 27 Mhash/s (but it probably would be less), which I believe is actually within the range of a desktop with a couple of GPUs.

Sorry, afraid I corrected this after you quoted it. The correct calculation would be 0.27 Mhash/s.
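Where the 192 bytes come from, and the corrected figure, as a small Python check. A Bitcoin attempt is a double SHA-256: the 80-byte header pads out to 128 bytes (two 64-byte blocks), and the 32-byte first digest pads out to 64 bytes (one block). Treating MB as 10^6 bytes is an assumption about the quoted 53 MB/s.

```python
def sha256_padded_len(n):
    """Bytes SHA-256 actually processes for an n-byte message:
    message + 0x80 marker + zero fill + 8-byte length, rounded up to 64-byte blocks."""
    return (n + 9 + 63) // 64 * 64


HEADER_BYTES = 80  # Bitcoin block header
DIGEST_BYTES = 32  # output of the first SHA-256 pass

BYTES_PER_ATTEMPT = sha256_padded_len(HEADER_BYTES) + sha256_padded_len(DIGEST_BYTES)


def mhash_per_s(throughput_bytes_per_s):
    """Convert a core's raw SHA-256 throughput into Bitcoin attempts per second (millions)."""
    return throughput_bytes_per_s / BYTES_PER_ATTEMPT / 1e6
```

`BYTES_PER_ATTEMPT` comes out to 128 + 64 = 192, and `mhash_per_s(53e6)` is about 0.276, in line with the corrected figure.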

The other point is one that I didn't explicitly mention: one FPGA does not equal only one sha-256 processor.  It is possible, even likely, that more than one such processor could be programmed into a single FPGA chip.  These chips are fairly large so that they can 'virtualize' some pretty complex logic circuits, and a talented programmer could program one chip to be several sha-256 processors running in parallel.  All this, and his main CPU and GPU are still available if still more Kh/s are desired.  Any hacker with the skills to program one or more GPUs in the same system to crunch hashes is already elite, and doing multiple sha-256 cores on a single FPGA would be child's play.  And we already know that there is some elite talent within the Bitcoin community, some who desire to run it, and some who desire to break it.

Excuse me but you need to be more explicit. First does it need a "talented programmer" or is it "child's play"?

I understand you are excited about the possibilities but you are making grandiose claims without evidence, nor even concrete estimates of the performance you expect.

How bout some actual numbers? How many SHA256 hashes can you really do in parallel on your FPGA? Please state the actual model number of the FPGA you expect to use. What actual data rates are expected?

Beyond that is the price of the chips, the price of the developer environment and the power requirements. Any of these can be very significant barriers to this idea.

Peace.
founder
Activity: 364
Merit: 7423
The performance numbers posted from a VIA C7's hardware SHA-256 weren't astronomical.  Only in the 1500 khash/s range.  If you think about it, just because it's implemented in hardware doesn't mean it's crazy fast.  It still has to do all the steps.  It's only if simplifying it down to single-purpose hardware makes it small enough to fit many in parallel.  That's not necessarily easy or a given.

legendary
Activity: 1708
Merit: 1010

Quote
The successful coding of the sha-256 algorithm into an FPGA, and recoding of the bitcoin client's generation function to use one or more such FPGAs, would produce a khash per second rate that no desktop could match.  It would look like a supercomputer from our perspective.

A lot of hand waving there. For some concrete numbers it quotes 53 MB/s, and since we only hash 192 bytes at a time, you might think it would do 27 Mhash/s (but it probably would be less), which I believe is actually within the range of a desktop with a couple of GPUs.



Yes, but there are two points that you overlooked.  First, the software transceiver usually requires four of these chips (two for receive and two for transmit; one does digital signal processing and the other does digital filtering of the raw signal.  Said another way, one is the virtual mike/speaker and the other is a virtual tuner.  Not all such software radio setups need four, however).  So if a ham has four of these, all four could be programmed to this end.  The other point is one that I didn't explicitly mention: one FPGA does not equal only one sha-256 processor.  It is possible, even likely, that more than one such processor could be programmed into a single FPGA chip.  These chips are fairly large so that they can 'virtualize' some pretty complex logic circuits, and a talented programmer could program one chip to be several sha-256 processors running in parallel.  All this, and his main CPU and GPU are still available if still more Kh/s are desired.  Any hacker with the skills to program one or more GPUs in the same system to crunch hashes is already elite, and doing multiple sha-256 cores on a single FPGA would be child's play.  And we already know that there is some elite talent within the Bitcoin community, some who desire to run it, and some who desire to break it.
lfm
full member
Activity: 196
Merit: 104
I would be willing to bet that someone used one of these field-programmable gate arrays to do this...

http://www.springerlink.com/content/765kta4qr92daea8/

which is something that I, myself, obviously have considered.  Whoever is doing it is probably representing a great deal of the current hash percentage, and hogging a pretty good amount of the new bitcoins.

...

Quote
The successful coding of the sha-256 algorithm into an FPGA, and recoding of the bitcoin client's generation function to use one or more such FPGAs, would produce a khash per second rate that no desktop could match.  It would look like a supercomputer from our perspective.

A lot of hand waving there. For some concrete numbers it quotes 53 MB/s, and since we only hash 192 bytes at a time, you might think it would do 0.27 Mhash/s (but it probably would be less), which is actually within the range of a desktop.

Quote
Another possibility is that someone owned or bought one of these...

http://www.via.com.tw/en/initiatives/padlock/features.jsp

Ya, someone might! They measure out at about 1.5 Mhash/s. Many ordinary Intel or AMD CPUs can do much better than that (with a little more electric power input, tho).
newbie
Activity: 9
Merit: 0
legendary
Activity: 1708
Merit: 1010
I would be willing to bet that someone used one of these...

http://en.wikipedia.org/wiki/Field-programmable_gate_array

to do this...

http://www.springerlink.com/content/765kta4qr92daea8/


which is something that I, myself, obviously have considered.  Whoever is doing it is probably representing a great deal of the current hash percentage, and hogging a pretty good amount of the new bitcoins.  Considering that up to four of these programmable arrays are used by modern ham radios for this...

http://www.dsptools.com/Radio.htm

and this...

http://www.softrockradio.org/


The successful coding of the sha-256 algorithm into an FPGA, and recoding of the bitcoin client's generation function to use one or more such FPGAs, would produce a khash per second rate that no desktop could match.  It would look like a supercomputer from our perspective.  As a ham radio operator myself, I was aware of these devices, but I don't presently own any.  Even connected to my netbook over USB2, the khash/s rate would be sick.  The programs for these things are normally kept on the master computer's hard disk and only take a few seconds to swap out, so a ham could use his software radio whenever he wants to, then rewrite all of his FPGAs with the sha-256 algorithm before going to bed, and make money while he sleeps.

Another possibility is that someone owned or bought one of these...

http://www.via.com.tw/en/initiatives/padlock/features.jsp


or some other cryptographic coprocessor on a daughtercard.


I'm sure once Bitcoin takes off, anyone with enough of the coins to have a deep personal interest in the strength of the currency will be running clients with hardware acceleration for the sha-256 hash function.

That also makes me wonder if there are PCI daughtercards with FPGAs on them yet.  The last time that I looked into them, they were only available as external setups.
lfm
full member
Activity: 196
Merit: 104
The recent upticks in difficulty have been remarkable.  At 511.77, I've even switched off those machines where I don't pay for marginal power consumption.  My calculation is that the wear/tear from elevated temperatures and full-speed fans has more risk and cost than the BTC value.   (You can question my math.)

This is just what one should naturally expect. Some people have more tolerance for the cost of mining than others.

Quote
But my question is this: Are we seeing the rise of mega-power machines, or just an exponential increase in the number of nodes?  That is, are the top-producing machines making leaps in power (over average), or are my machines still about average, just dwarfed by the number of participating machines?  Is there any way to tell? (node counts?)

I'd guess both. There is Art with his GPUs, who claims to have about 10% of the generating power.

Then I see about 850 connections to the IRC channel. Not sure what it's been like historically.
member
Activity: 111
Merit: 10
The recent upticks in difficulty have been remarkable.  At 511.77, I've even switched off those machines where I don't pay for marginal power consumption.  My calculation is that the wear/tear from elevated temperatures and full-speed fans has more risk and cost than the BTC value.   (You can question my math.)
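One way to put numbers on that math, sketched in Python under stated assumptions: difficulty D means roughly D * 2^32 hashes per block on average, the block reward is 50 BTC (as it was at the time), and the 1 Mhash/s machine is hypothetical. Power and wear costs, the other half of the question, are left out.

```python
SECONDS_PER_DAY = 86400
BLOCK_REWARD_BTC = 50.0  # reward per block in 2010


def expected_hashes_per_block(difficulty):
    """Average number of hash attempts needed per block at this difficulty."""
    return difficulty * 2**32


def expected_days_per_block(difficulty, hashrate):
    """Mean time for one machine (hashrate in hash/s) to find a block, in days."""
    return expected_hashes_per_block(difficulty) / hashrate / SECONDS_PER_DAY


def expected_btc_per_day(difficulty, hashrate, reward=BLOCK_REWARD_BTC):
    """Long-run average coins per day; actual finds are random and lumpy."""
    return reward / expected_days_per_block(difficulty, hashrate)
```

At 511.77 and 1 Mhash/s this works out to roughly 25.4 days per block, or just under 2 BTC/day on average, which is the figure to weigh against power and wear.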


But my question is this: Are we seeing the rise of mega-power machines, or just an exponential increase in the number of nodes?  That is, are the top-producing machines making leaps in power (over average), or are my machines still about average, just dwarfed by the number of participating machines?  Is there any way to tell? (node counts?)

I totally understand that difficulty reflects the state-of-the-art in client khash improvements, but it's clear that my own computers also represent a rapidly diminishing part of the total network generation rate.  As the network gets larger, my chance of finding The Block every 10 minutes shrinks.  I'm thrilled, not bitter, but I'm curious to understand what has transpired over the past three weeks to make it so dramatic.
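A partial answer to "is there any way to tell?", sketched in Python: assuming blocks arrive near the 10-minute target, difficulty alone gives an estimate of the total network hash rate, and from that your share. It cannot say whether the total comes from many average nodes or a few fast ones; that would need node counts or per-node rates. The 5 Mhash/s figure used below is a hypothetical machine.

```python
TARGET_BLOCK_INTERVAL_S = 600.0  # Bitcoin's 10-minute design target


def network_hashrate(difficulty, interval_s=TARGET_BLOCK_INTERVAL_S):
    """Estimated total hash/s, assuming blocks arrive on schedule."""
    return difficulty * 2**32 / interval_s


def share_of_network(my_hashrate, difficulty):
    """Fraction of all blocks one machine should expect to find."""
    return my_hashrate / network_hashrate(difficulty)
```

At difficulty 511.77 this estimates about 3.66 Ghash/s for the whole network, so a 5 Mhash/s machine would hold roughly 0.14% of it.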

Cheers.