Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 42.

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: TheSeven on June 06, 2011, 10:45:38 AM

And I should probably publish the new version of my miner, it now supports multiple pools, long polling, etc.

Here's the current version of PyFPGAMiner, with a demo config file: http://dl.dropbox.com/u/23683845/pyfpgaminer-0.0.1.zip
There are many more configuration options available; you can either reconstruct them from the source code or just ask me.

Oh, and don't forget to donate if you like it.

163PG9aNBj4ZaFzAK2LsRLvRzttww7vu6u

tantive

newbie

Activity: 10

Merit: 0

damn, i missed that one...

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: tantive on June 06, 2011, 10:29:45 AM

I now have a bitfile for the Atlys board (Spartan 6 LX45) with depth:=2 and 50 MHz.

The only problem is that miner.py refuses to communicate over the serial port.
It detects the core, but when it starts "Measuring FPGA performance..." it produces a timeout: "Timed out waiting for FPGA to accept work"

@TheSeven: any idea how to debug or solve the problem? is the miner.py code working for all depths and frequencies?

You'll need to adjust the pin locations for clk_in, rx and tx in the UCF file, and adjust the clock divider for the serial port for the 50MHz frequency.
Replace "10000010001" with "0110110010" and "11000011001" with "01010001011" in uart.vhd.
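
For anyone adapting this to yet another clock: the two replacement values look like the bit period and 1.5 bit periods in clock cycles at 115200 baud. The baud rate and the role of the second constant are inferred from the numbers, not confirmed from uart.vhd, and uart_divider_bits is a made-up helper name. A minimal sketch:

```python
def uart_divider_bits(clock_hz, baud, width):
    """Return (one bit period, 1.5 bit periods) in clock cycles as binary strings."""
    full = clock_hz // baud        # clock cycles per serial bit
    one_and_half = full * 3 // 2   # assumption: used to sample mid-bit after the start edge
    return format(full, f"0{width}b"), format(one_and_half, f"0{width}b")


# 50 MHz clock, assumed 115200 baud, 11-bit constants
print(uart_divider_bits(50_000_000, 115200, 11))
```

At 50 MHz this gives 434 and 651 cycles. With an 11-bit width the first string has one more leading zero than the 10-digit value quoted above, so check the vector width in uart.vhd before pasting.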
And I should probably publish the new version of my miner, it now supports multiple pools, long polling, etc.

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: MoonBuggy on June 06, 2011, 09:29:10 AM

There's a (currently slim) possibility that I could secure some time on a few decent sized FPGA systems (although the owners are understandably wary about who they allow to play with their rather expensive equipment), but right now I'm not sure how worthwhile it would be to pursue; what kind of performance would be expected from a Convey HC-1ex (four Virtex 6 LX760s) or possibly an Xtreme Data XD-PCIE3000 with three Stratix IVs per card?

Any ballpark figures would be greatly appreciated, and there's a strong possibility that I'd need someone who knows their stuff to assist me in exchange for a share of the profits if it turns out to be plausible!

Ballpark estimate for the Convey machine would be 1-2GH/s. I'll know more after I attempt to synthesize a design. Do you know the speed grade of the FPGAs?
No idea about the Altera ones, ask fpgaminer

tantive

newbie

Activity: 10

Merit: 0

I now have a bitfile for the Atlys board (Spartan 6 LX45) with depth:=2 and 50 MHz.

The only problem is that miner.py refuses to communicate over the serial port.
It detects the core, but when it starts "Measuring FPGA performance..." it produces a timeout: "Timed out waiting for FPGA to accept work"

@TheSeven: any idea how to debug or solve the problem? is the miner.py code working for all depths and frequencies?

MoonBuggy

newbie

Activity: 12

Merit: 0

There's a (currently slim) possibility that I could secure some time on a few decent sized FPGA systems (although the owners are understandably wary about who they allow to play with their rather expensive equipment), but right now I'm not sure how worthwhile it would be to pursue; what kind of performance would be expected from a Convey HC-1ex (four Virtex 6 LX760s) or possibly an Xtreme Data XD-PCIE3000 with three Stratix IVs per card?

Any ballpark figures would be greatly appreciated, and there's a strong possibility that I'd need someone who knows their stuff to assist me in exchange for a share of the profits if it turns out to be plausible!

bitcent

newbie

Activity: 2

Merit: 0

Quote from: kokjo on June 06, 2011, 04:09:56 AM

Quote from: TheSeven on June 06, 2011, 04:01:00 AM

Quote from: kokjo on June 06, 2011, 03:44:40 AM

D) there are 2 passes of sha256, and one sha256 pass is 64 rounds. 2*64=128 rounds.
the way to calculate the hash of a block is: sha256(sha256(blockdata))

To be exact, a constant is prepended to the inner sha256 hash to pad it to the 512 data bytes needed for the outer hash.

yes. first a 1 bit, then a lot of 0 bits, and then the size (64 bit), until we reach 512, i know that. but it does not matter much how it's done, just that it is done... it is the padding i talked about.
btw, it's not 512 bytes, it's 512 bits. a byte is 8 bits.

kokjo, gentakin & TheSeven - Thanks for your replies. The explanations (and LINKS!) helped me quickly understand what is happening, and in what state the FPGAs are getting the work. (I knew it was some "midstate" but not exactly where.) I'll spell it out here so you can check my understanding, and so the next noob can get a head start.

1) The first 1/2 of DATA is *already* hashed, and sitting in MIDSTATE. Gotcha. (So toss/ignore the first 1/2 of DATA for hash searching purposes.)

2) The next four 32bit long-words are:
2 a) the last 32 bits of the Merkle root
2 b) unix time in seconds
2 c) "bits", the current target (in a compact encoding)
2 d) nonce 0x00000000, which we iterate through 2^32 combinations looking for golden tickets

3) The remaining 384bits are the SHA256 spec for padding, which states:

Code:

Pre-processing:
append the bit '1' to the message
append k bits '0', where k is the minimum number >= 0 such that the resulting message
length (in bits) is congruent to 448 (mod 512)
append length of message (before pre-processing), in bits, as 64-bit big-endian integer

( http://en.wikipedia.org/wiki/SHA-2 )
3 a) Padding starts with 0x00000080, which big-little endian converts to 0x80000000, the high-order bit is the '1' appended to the message.
3 b) followed by all zeros until the last 64bits
3 c) The last 64 bits specify the message length, 0x00000000 0x80020000, which big-little endian converts to 0x00000000 0x00000280. 0x280 = 640 bits = 80 bytes, and ALL block headers are 80 bytes. Check ... it all makes sense now. (THANKS!)
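
For the next noob, the padding in step 3 can be reproduced in a few lines. This is only a sketch: sha256_pad is a hypothetical helper, the header is an all-zero placeholder, and the bytes are in standard SHA-256 (big-endian) order, not the per-word swapped order that getwork shows.

```python
import struct


def sha256_pad(message: bytes) -> bytes:
    """Append SHA-256 padding: a 1 bit, k zero bits, then the 64-bit length."""
    bit_len = len(message) * 8
    zeros = (55 - len(message)) % 64  # zero bytes so the total is a multiple of 64
    return message + b"\x80" + b"\x00" * zeros + struct.pack(">Q", bit_len)


header = bytes(80)            # placeholder 80-byte block header
padded = sha256_pad(header)
second_chunk = padded[64:]    # the 512-bit chunk that actually gets searched

print(len(padded))                  # total length in bytes
print(second_chunk[16:20].hex())    # first padding word, right after the nonce
print(second_chunk[-8:].hex())      # 64-bit message length
```

The second chunk is 16 bytes of header (Merkle tail, time, bits, nonce) followed by 48 bytes of padding, i.e. the 384 bits from step 3.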

While trying to read the protocols and make sense of FPGAminer's code, I wrote a quick perl script to print out repeated getwork() responses in nice columns for analysis. If anyone wants, I can post it. Besides, it's kind of mesmerizing to watch.

Thanks again, guys!

kokjo

legendary

Activity: 1050

Merit: 1000

You are WRONG!

Quote from: TheSeven on June 06, 2011, 04:01:00 AM

Quote from: kokjo on June 06, 2011, 03:44:40 AM

D) there are 2 passes of sha256, and one sha256 pass is 64 rounds. 2*64=128 rounds.
the way to calculate the hash of a block is: sha256(sha256(blockdata))

To be exact, a constant is prepended to the inner sha256 hash to pad it to the 512 data bytes needed for the outer hash.

yes. first a 1 bit, then a lot of 0 bits, and then the size (64 bit), until we reach 512, i know that. but it does not matter much how it's done, just that it is done... it is the padding i talked about.
btw, it's not 512 bytes, it's 512 bits. a byte is 8 bits.

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: kokjo on June 06, 2011, 03:44:40 AM

D) there are 2 passes of sha256, and one sha256 pass is 64 rounds. 2*64=128 rounds.
the way to calculate the hash of a block is: sha256(sha256(blockdata))

To be exact, a constant is prepended to the inner sha256 hash to pad it to the 512 data bits needed for the outer hash.

gentakin

member

Activity: 98

Merit: 10

I can only answer about the getwork semantics:

* MIDSTATE is the sha256 hash after hashing the first 512-bit chunk of DATA, that is: the first half of DATA. So it is between SHA256 chunks, not in the middle of a sha256 round. The nonce is stored in the second half of the header, so the first half is constant and doesn't need to be hashed all over again.
* DATA is the block header for which a hash must be found. It does contain the unix timestamp. It also contains the current target value, so that's probably where the FPGA learns it (or it doesn't care at all and this is checked on the tcl-side). The nonce is set to 0x00000000.
* HASH1 is always the same, afaik. It's supposed to be some state buffer... or not. Not sure. ;)

When submitting the block via getwork, the original DATA needs to be adjusted to contain the valid nonce instead of 0x00000000.
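
To make the midstate idea concrete, here is a rough pure-Python sketch (not the FPGA code, and not taken from any miner). It derives the SHA-256 constants from the primes, compresses the first 64 bytes once to get a midstate, and then finishes the hash from that midstate. All function names are made up for illustration, the header is dummy data, and the bytes are in standard SHA-256 order instead of getwork's word-swapped order.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
IV = [isqrt(p << 64) & MASK for p in PRIMES[:8]]  # fractional bits of square roots
K = [icbrt(p << 96) & MASK for p in PRIMES]       # fractional bits of cube roots


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, chunk):
    """One SHA-256 compression (64 rounds) of a 64-byte chunk."""
    w = list(struct.unpack(">16L", chunk))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def finish_from_midstate(midstate, tail):
    """Hash the remaining bytes (here 16) plus padding, starting from midstate."""
    total_bits = (64 + len(tail)) * 8
    padded = tail + b"\x80" + b"\x00" * ((55 - len(tail)) % 64) + struct.pack(">Q", total_bits)
    state = midstate
    for off in range(0, len(padded), 64):
        state = compress(state, padded[off:off + 64])
    return struct.pack(">8L", *state)


header = bytes(range(80))               # dummy 80-byte header
midstate = compress(IV, header[:64])    # computed once per work unit
digest = finish_from_midstate(midstate, header[64:])
print(digest == hashlib.sha256(header).digest())
```

Only the call to finish_from_midstate depends on the nonce, which is why the first half of DATA never has to be rehashed.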

So what the FPGA probably does is:
* increment nonce for every loop and use it as hash input for the second chunk.
* take midstate as the result of the first sha-256 chunk, then apply the second sha256 round.
* as bitcoin applies sha256 twice on the block header, hash the resulting 256bit string again, taking another sha256 round.
* if the resulting hash is "valid" (h==0), store it for the TCL script.
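
In software, those steps (minus the midstate shortcut) can be sketched with hashlib. The field values below are the genesis block's, used only as a well-known test vector; block_hash and looks_like_share are hypothetical names, and checking for eight leading hex zeros is my reading of the "h==0" test, not something taken from the FPGA source.

```python
import hashlib
import struct


def block_hash(version, prev_hash_hex, merkle_root_hex, timestamp, bits, nonce):
    """Double SHA-256 of an 80-byte header, returned in the usual display order."""
    header = (
        struct.pack("<L", version)
        + bytes.fromhex(prev_hash_hex)[::-1]    # hashes are stored byte-reversed
        + bytes.fromhex(merkle_root_hex)[::-1]
        + struct.pack("<LLL", timestamp, bits, nonce)
    )
    assert len(header) == 80
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return digest[::-1].hex()


def looks_like_share(hash_hex):
    """Cheap check: top 32 bits of the hash are zero."""
    return hash_hex.startswith("00000000")


GENESIS = dict(
    version=1,
    prev_hash_hex="00" * 32,
    merkle_root_hex="4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b",
    timestamp=1231006505,
    bits=0x1D00FFFF,
    nonce=2083236893,
)

h = block_hash(**GENESIS)
print(h, looks_like_share(h))
```

A real miner would loop over the nonce and only report values for which the check passes; the full target comparison can then be done on the host side.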

You might be interested in https://en.bitcoin.it/wiki/Block_hashing_algorithm .

edit: I'm too late, oh well. :D

kokjo

legendary

Activity: 1050

Merit: 1000

You are WRONG!

lebish

newbie

Activity: 36

Merit: 0

The insides of my heart are le melting. Epix!

bitcent

newbie

Activity: 2

Merit: 0

Quote from: fpgaminer on May 19, 2011, 09:33:56 PM

Well, this is a bit earlier than I had wanted, but I will tweak and improve this as we go along.
...
Please feel free to give me feedback, suggestions, critiques, and of course to submit Pull requests.
...
June 2nd, 2011 - Flexible Unrolling Added
Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at the equivalent cost of speed, which should allow it to run on many more FPGAs.

FPGAminer - I've been following this thread for ~2 weeks now and looking at your TCL code for your miner (mine.tcl), and I am still trying to figure out *exactly* what goes into the FPGAs for hashing, and what comes out to be submitted.

It looks like the following takes place:

1) get_work() and send the following to the FPGA:
1 a) MIDSTATE - all 256 bits
1 b) DATA - *ONLY* 256bits [256-511] (DATA string characters 128-191)
1 c) HASH1, TARGET, and the remaining 75% of DATA are discarded. (?!?!)

2) wait up to 20 seconds for a result - [wait_for_golden_ticket 20]

3) upon finding a "golden ticket", submit_work to the bitcoin client containing:
3 a) original DATA string[0-151], plus
3 b) "golden ticket" nonce string replacing DATA characters[152-159], plus
3 c) original DATA string[160-255]
... in essence, the original data string with the 20th 32-bit all-zero data block replaced with the golden nonce.
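
Step 3 as I understand it, written out as a small sketch. insert_nonce is a made-up name, the DATA string is a dummy, and the nonce is assumed to already be 8 hex characters in whatever byte order the FPGA side returns:

```python
def insert_nonce(data_hex: str, nonce_hex: str) -> str:
    """Replace characters 152-159 of the getwork DATA string with the nonce."""
    if len(data_hex) != 256:
        raise ValueError("DATA must be 256 hex characters (128 bytes)")
    if len(nonce_hex) != 8:
        raise ValueError("nonce must be 8 hex characters (32 bits)")
    return data_hex[:152] + nonce_hex + data_hex[160:]


dummy_data = "0" * 256                      # placeholder, not a real work unit
submitted = insert_nonce(dummy_data, "deadbeef")
print(submitted[144:168])
```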

Side note: it appears that the 18th 32-bit block is Unix seconds - since 01/01/1970 00:00:00. Any other clues you can give about other fields?

Maybe a link to the getwork() definition of returned data?

My questions are the following:
A) When does the FPGA "learn" of the target value to beat ... or does it ever? (Hardcoded?)
B) SHA256 requires 512bit chunks of data to hash over. Is MIDSTATE really *right-in-the-middle* of a 64-round hash as opposed to just between 512bit chunks?
C) Exactly what gets hashed? It looks like the SHA256 engine is "primed" with MIDSTATE, and only gets 256bits of DATA to iterate with (ignoring the other 768bits of DATA).
D) If you only submit MIDSTATE and 256bits of DATA, how do we arrive at 128 round engines in the FPGA?

Any insight would be appreciated. Especially if an explanation points to a more in-depth description of the algorithm. (I've read every post here for ~2 weeks.) I've also refreshed my memory at http://en.wikipedia.org/wiki/SHA-2

BTW - Thanks for all of your work! GREAT JOB!

Fhtagn

newbie

Activity: 58

Merit: 0

Quote from: TheSeven on June 01, 2011, 02:21:41 AM

Quote from: Fhtagn on May 31, 2011, 02:37:32 PM

I've been looking for a good excuse to dust off my Verilog books and old Digilent Spartan 2e 200K board, maybe I can fit a serialized version on it.

I'd expect <1MH/s from that FPGA.

Thanks for the estimate; I'm glad to find a project with knowledgeable people involved.

At this point, for me, it's not about hashing speed. It's about gaining more FPGA/HDL experience. I'm already mining with a decent amount of dedicated GPUs.
I've always loved hardware design, but haven't had much time for it since college.

I'd eventually move up to more powerful devices.

I think that this project is an important one for Bitcoin. FPGAs and ASICs will provide a much more power-efficient mining infrastructure. Being open source will, hopefully, put the ability to manufacture devices into more hands. This bodes well for network security.

If this project scales to multi-chip designs and board runs, I'll do what I can to help in prototyping/testing.

Ladyada has put together a list of some board makers: http://www.ladyada.net/library/pcb/manufacturers.html

makomk

hero member

Activity: 686

Merit: 564

Quote from: TheSeven on June 04, 2011, 06:32:06 PM

I haven't managed to synthesize anything that performs decently on a Spartan 6 (it complains about a congested design that can't be routed), but ArtForz claims to have one of these running at 190MH/s.

Hmmmm. Supposedly the Spartan 6 lacks the long-distance routing fabric of the Virtex 6 chips, and both have a much higher ratio of logic to routing infrastructure than older generations. Of course, this isn't officially documented anywhere that I can find... FPGA manufacturers are annoyingly secretive.

teknohog

sr. member

Activity: 520

Merit: 253

555

Quote from: makomk on June 04, 2011, 06:21:22 PM

That reminds me - have you managed to synthesize your code for a Spartan 6? I tried it, but it bailed out early on with a cryptic message about synthesis failing and no other information I could find. Rumour has it the Spartan 6 support may be more temperamental than for earlier generations. (Not that I have an FPGA to run this on anyway!)

No, I haven't tried it as I only have a Spartan 3E 500K. I have only been looking at the specs of Spartan 6 and others, so as to find the biggest number of logic units. Thanks for the idea though, testing the synthesis in advance would help us choose the best chip.

On another note, I have been estimating how my miner performs, based on how often a solution is found. I'm getting something like 3 to 4 Mhash/s at 100 MHz, which is much better than expected, but I may just be lucky. The mining script is updated to show these estimates, though with some more work you could get actual rates from the chip.
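
The estimate works roughly like this, assuming every reported solution is a difficulty-1 share (top 32 bits zero), so one turns up every 2^32 hashes on average. The function name and the sample numbers are made up for illustration and are not taken from the mining script:

```python
from math import sqrt


def estimate_hashrate(shares_found, elapsed_seconds):
    """Return (MH/s estimate, rough relative uncertainty) from a share count."""
    mhash_per_s = shares_found * 2**32 / elapsed_seconds / 1e6
    rel_error = 1 / sqrt(shares_found)  # Poisson counting noise, 1 sigma
    return mhash_per_s, rel_error


# Example: 3 shares found in one hour of mining
rate, err = estimate_hashrate(3, 3600)
print(f"{rate:.2f} MH/s, +/- {err:.0%}")
```

With only a handful of shares the uncertainty is large, which is why a short run can easily look luckier than the chip really is.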

DonCookie

newbie

Activity: 1

Merit: 0

Has anyone been able to get it up and running on a Cyclone II (Terasic DE2)? Even though I set "CONFIG_LOOP_LOG2=5" and set "altpll_component.width_clock = 3" I still get an error:

Error: Can't elaborate user hierarchy "main_pll:pll_blk|altpll:altpll_component"

udif

newbie

Activity: 3

Merit: 0

I just uploaded a new "upload" branch on my fork of fpgaminer's code.
The code now supports another parameter, CONFIG_MERGE_LOG2.
This allows you to drop some of the registers between pipe stages.
Warning - code has not been tested yet - this is just a preview
I'm having some issues with my FPGA card, so I couldn't test it yet.
In addition, the golden nonce adjustment isn't fixed in this code yet.

For example - using existing code:

Using CONFIG_LOOP_LOG2=3 and CONFIG_MERGE_LOG2=0 creates 8 stages that take 8 clock cycles each (for each SHA).
On my EP3C25, this took ~23K LEs, 14.5K FFs, and achieved ~60MHz.
A new result is received every 8 clock cycles, or ~7.5MH/s.
This is equivalent to the old code.

Using CONFIG_LOOP_LOG2=4 and CONFIG_MERGE_LOG2=0 creates 4 stages that take 16 clock cycles each (for each SHA).
On my EP3C25, this took ~13K LEs, 8.5K FFs, and achieved ~50MHz.
A new result is received every 16 clock cycles, or ~3.1MH/s.

Using the new code:

Using CONFIG_LOOP_LOG2=3 and CONFIG_MERGE_LOG2=1 creates 4 stages that take 8 clock cycles each (for each SHA),
but each stage is equal to 2 regular SHA stages.
On my EP3C25, this took ~17K LEs, 8.5K FFs, and achieved ~40MHz.
A new result is received every 8 clock cycles, or ~5MH/s.

As you can see, the new option gives more size/speed options.
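
The arithmetic behind the three data points can be sketched as follows. The stage-count formula is only inferred from these examples and not read out of the Verilog, and pipeline_estimate is a made-up helper:

```python
def pipeline_estimate(loop_log2, merge_log2, clock_mhz):
    """Return (stages per SHA, clock cycles per result, MH/s)."""
    cycles_per_result = 1 << loop_log2           # each stage is reused this many times
    stages = 64 >> (loop_log2 + merge_log2)      # assumption: merging halves the register stages
    return stages, cycles_per_result, clock_mhz / cycles_per_result


for loop_log2, merge_log2, mhz in [(3, 0, 60), (4, 0, 50), (3, 1, 40)]:
    print(loop_log2, merge_log2, pipeline_estimate(loop_log2, merge_log2, mhz))
```

Throughput depends only on the clock and CONFIG_LOOP_LOG2; CONFIG_MERGE_LOG2 affects it indirectly through the lower clock that the longer combinational path allows.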

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: makomk on June 04, 2011, 06:21:22 PM

Quote from: teknohog on June 04, 2011, 02:56:52 PM

However, this is only about 60 % utilization, it's frustratingly close to being able to double this. (It would need about 10K vs. my 9K LUTs.) The next best Spartan3 has 1200K gates vs this 500K, so it might be able to quadruple the units.

I think you need a Spartan6 to do any serious mining, but even then you should check the number of logic units, the series has some low-end models as well.

That reminds me - have you managed to synthesize your code for a Spartan 6? I tried it, but it bailed out early on with a cryptic message about synthesis failing and no other information I could find. Rumour has it the Spartan 6 support may be more temperamental than for earlier generations. (Not that I have an FPGA to run this on anyway!)

I haven't managed to synthesize anything that performs decently on a Spartan 6 (it complains about a congested design that can't be routed), but ArtForz claims to have one of these running at 190MH/s.

makomk

hero member

Activity: 686

Merit: 564

Quote from: teknohog on June 04, 2011, 02:56:52 PM

However, this is only about 60 % utilization, it's frustratingly close to being able to double this. (It would need about 10K vs. my 9K LUTs.) The next best Spartan3 has 1200K gates vs this 500K, so it might be able to quadruple the units.

I think you need a Spartan6 to do any serious mining, but even then you should check the number of logic units, the series has some low-end models as well.

That reminds me - have you managed to synthesize your code for a Spartan 6? I tried it, but it bailed out early on with a cryptic message about synthesis failing and no other information I could find. Rumour has it the Spartan 6 support may be more temperamental than for earlier generations. (Not that I have an FPGA to run this on anyway!)

Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 42. (Read 432972 times)