
Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 27. (Read 432950 times)

hero member
Activity: 560
Merit: 517
Quote
Not sure to what use they can be put either.
I'm guessing you could pair them up to get the equivalent of a 16BWER, but I'm not sure.

I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?
hero member
Activity: 686
Merit: 564
First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3 component, 32-bit addition in 8 chained slices, with 4-bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:
Yeah, neat isn't it? That's a big part of the reason I was initially confident about Spartan 6 performance, at least until I found out about the catch...

Quote
So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest form of life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof), and 8 flip flops, but the mining algorithm has rare use for it.
Which is a tad irritating.

Quote
It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, furthest possible position from possible RAM locations. Here, check this image out, it will make you go insane, so don't stare too long:
http://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon rotting brain fleshes. After exploding the Universe, of course.
Whoops, you're right. It appears to be doing something similar on the SLX75 too; I should've spotted that sooner. (Judging from the timing report at least; I haven't looked in FPGA Editor because the Linux version is a pain in the ass to actually run on my distro. It uses some grotty ancient Motif-based wrapper library that emulates some prehistoric version of the Windows API and requires deep magic to get it to launch.)

Quote
The interesting thing is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.

The 8BWERs suffer from some slightly annoying errata. Also I think they may actually be shared with the 16BWERs and so are not actually all available, but don't quote me on that. Not sure to what use they can be put either.
hero member
Activity: 560
Merit: 517
Regarding Performance Optimizing Spartan-6 LX150:
DANGER: Long detailed post coming... sorry, I hope the information is useful though

I'm working with my LX150_makomk_speed_Test project, where I'm trying to nail down the performance bottlenecks and remove them. I'm learning FPGA Editor so I can better visualize what the router is doing, and I've read through some of the Spartan-6 UGs to get a better understanding of the architecture.

First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3 component, 32-bit addition in 8 chained slices, with 4-bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:

Code:
W:           16 slices + 0 slices = 16
tx_t1_part:  8  slices + 0 slices = 8

t1:          8  slices + 0 slices
tx_state[0]: 8  slices + 8 slices = 16
tx_state[4]: 8  slices + 8 slices = 16
The worst critical paths are only 16 slices long, with a single break in the carry chain (AFAIK). W is a 4-way addition, performed as a 3-way of the first three components, then a 2-way of that result and the remaining component. tx_state[4] is a 2-way with t1 and rx_state[3].
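Since all of these additions are mod 2^32, splitting the 4-way into a ternary add followed by a 2-way is exact — the grouping can't change the result. A quick Python sanity check of that claim (just an illustration, not project code):

```python
import random

MASK = 0xFFFFFFFF

def add4(a, b, c, d):
    """4-way 32-bit add the way the hardware does it: ternary adder, then a 2-way."""
    three_way = (a + b + c) & MASK      # one 8-slice, 4-bits-per-slice ternary add
    return (three_way + d) & MASK       # a second carry chain finishes the job

# Modular addition is associative, so the split matches a flat 4-way sum:
for _ in range(1000):
    a, b, c, d = (random.getrandbits(32) for _ in range(4))
    assert add4(a, b, c, d) == (a + b + c + d) & MASK
```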

I haven't fully analyzed the router's behavior on the 2-ways yet, but it appears to include work from other operations ... somehow. Not sure yet.

So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest form of life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof), and 8 flip flops, but the mining algorithm has rare use for it.

The next bad news is, only half of the "good" slices can be used as RAM or shift registers. That's not a terrible thing since most will be consumed as adders anyway.

And that's about all I could find that's particularly good or bad with the S6 slices. Since the good slices are all in columns, and spaced evenly, the impact of the useless slices should actually be far less severe than I thought.


For the S6's routing architecture, the quick overview basically said routing costs roughly Manhattan Distance between CLBs. I haven't dug into the details more than that at this point.



With that knowledge in hand, and some beginner's experience with FPGA Editor, I dived in and found what appears to be the largest bottleneck in the current code:

Code:
Slack (setup path):     0.264ns (requirement - (data path - clock path skew + uncertainty))
  Source:               uut2/HASHERS[41].shift_w1/Mram_m (RAM)
  Destination:          uut2/HASHERS[41].upd_w/tx_w15_30 (FF)
  Requirement:          12.500ns
  Data Path Delay:      11.716ns (Levels of Logic = 6)
  Clock Path Skew:      -0.260ns (0.780 - 1.040)
  Source Clock:         hash_clk rising at 0.000ns
  Destination Clock:    hash_clk rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: uut2/HASHERS[41].shift_w1/Mram_m to uut2/HASHERS[41].upd_w/tx_w15_30
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    RAMB16_X2Y46.DOA2    Trcko_DOA             1.850   uut2/HASHERS[41].shift_w1/Mram_m
                                                       uut2/HASHERS[41].shift_w1/Mram_m
    SLICE_X60Y126.A2     net (fanout=4)        5.845   uut2/HASHERS[41].cur_w1<2>
    SLICE_X60Y126.COUT   Topcya                0.379   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_lut<16>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
    SLICE_X60Y127.CIN    net (fanout=1)        0.003   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
    SLICE_X60Y127.BMUX   Tcinb                 0.292   uut2/HASHERS[41].shift_w0/r<27>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<23>
    SLICE_X78Y122.B3     net (fanout=1)        1.995   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd_211
    SLICE_X78Y122.BMUX   Tilo                  0.251   uut2/HASHERS[41].upd_w/tx_w15<23>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
    SLICE_X78Y122.C5     net (fanout=2)        0.383   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
    SLICE_X78Y122.COUT   Topcyc                0.295   uut2/HASHERS[41].upd_w/tx_w15<23>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_lut<0>22
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_22
    SLICE_X78Y123.CIN    net (fanout=1)        0.003   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>23
    SLICE_X78Y123.COUT   Tbyp                  0.076   uut2/HASHERS[41].upd_w/tx_w15<27>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_26
    SLICE_X78Y124.CIN    net (fanout=1)        0.003   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>27
    SLICE_X78Y124.CLK    Tcinck                0.341   uut2/HASHERS[41].upd_w/tx_w15<31>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_xor<0>_30
                                                       uut2/HASHERS[41].upd_w/tx_w15_30
    -------------------------------------------------  ---------------------------
    Total                                     11.716ns (3.484ns logic, 8.232ns route)
                                                       (29.7% logic, 70.3% route)

It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, furthest possible position from possible RAM locations. Here, check this image out, it will make you go insane, so don't stare too long:
http://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon rotting brain fleshes. After exploding the Universe, of course.

Oh, you looked at it anyway and are wondering about that little path heading downward? Yeah, it keeps going ... and going ... (into my damned soul).


And as you can read from the timing report above, routing accounts for *drum roll* 70.3%! Yay! That's 8ns of routing, and only 3.4ns of logic! Imagine if we got rid of all the routing...


I see four solutions at the moment, and will investigate as time allows:

1) Get rid of the RAMB16BWER to some extent.
2) Add an extra register to the output of shifter_32b when inferring RAM logic. Flip-flops should route close to the logic and mask RAM routing delay.
3) Add two, duplicate registers to the output of shifter_32b when inferring RAM logic.
4) Ditch the RAM inference completely and try to coax ISE into using all those flip-flops in the useless slices (which are peppered throughout the routed design at the moment).

I will try 3 first, and hope ISE does the intelligent thing. My hope is that flip-flops in the useless slices will get utilized, since they're mingled in with the useful logic and so should provide somewhat fast local routing.

The interesting thing is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.
hero member
Activity: 560
Merit: 517
Quote
If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one software will fit them all. I'll happily implement that in my python miner, so you don't need to care about the software side of things.
I completely agree, and Python is a far better choice for controller software than Tcl Tongue

I took the time to write out some preliminary specifications for the internal hardware interfaces:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-Components,-Interfaces,-and-Protocols-%5BWIP%5D

TL;DR: I abstracted away all of the hardware and implementation specifics and shielded them with a single, memory mapped component called the Comm. Being memory mapped, it's a trivial interface that protocols like I2C and SPI can wrap easily.

Relevant to the topic at hand, the specifics of this wrapping by SPI, I2C, UART, JTAG, or what have you will be documented here:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-PHY-Interfaces

And yes, I've abused the term PHY in that document and the previous specification. Because I'm a horrible person and I want to watch language burn for my own personal pleasure Tongue

Currently in that PHY specification I've only outlined the requirements. TL;DR: It must support multiple devices (addressing a single device specifically), allow reading registers, and writing registers. That's it.
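For discussion's sake, here's a toy Python sketch of what a framing meeting those three requirements might look like. The frame layout, the Device class, and the encode helper are all hypothetical — an assumption for illustration, not anything from the spec:

```python
# Hypothetical frame: [addr | R/W flag] [reg] [4 data bytes, big-endian].
# Devices on a shared bus ignore frames not addressed to them (RS-485 style).

READ, WRITE = 0x00, 0x80

def encode(device_addr, reg, write=False, value=0):
    """Frame one register access for a shared multi-drop bus."""
    assert 0 <= device_addr < 128 and 0 <= reg < 256
    flag = WRITE if write else READ
    return bytes([device_addr | flag, reg]) + value.to_bytes(4, "big")

class Device:
    """Toy FPGA-side Comm: a handful of memory-mapped 32-bit registers."""
    def __init__(self, addr):
        self.addr = addr
        self.regs = {}

    def handle(self, frame):
        if (frame[0] & 0x7F) != self.addr:
            return None                              # not for us; stay quiet
        reg, value = frame[1], int.from_bytes(frame[2:6], "big")
        if frame[0] & WRITE:
            self.regs[reg] = value
            return b""                               # write acknowledged silently
        return self.regs.get(reg, 0).to_bytes(4, "big")
```

The point is just that "address a device, read a register, write a register" needs almost no machinery, so UART, SPI, and JTAG wrappers can all carry the same frames.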

I haven't laid down any ground work for any specific protocols yet, but I listed out what the most immediate ones probably are: UART, SPI, and JTAG. Feel free to begin specifying those in this thread.

And as always, feel free to critique my work, call me stupid, and redo the whole thing Tongue
member
Activity: 89
Merit: 10

I second that.  UARTs are easily implemented in FPGAs and the obvious choice for a "universal" interface (via USB, mostly).

May I suggest that the protocol handles multiple devices connected in parallel by some form of addressing.
This way lots of FPGAs can be connected together easily on a common board and served across a single com port (similar to RS-485).

I guess this thread is as good a place as any to discuss the protocol?



hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Do you want to get rid of the virtual wire / chipscope? If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one software will fit them all. I'll happily implement that in my python miner, so you don't need to care about the software side of things.
hero member
Activity: 560
Merit: 517
August 8th, 2011 - Spartan-6 Code Added and Working, Altera Mining Script Updated
Thanks to the efforts of makomk, the Spartan-6 series of chips are now supported, and achieve the highest performance per $ of any chip. Code has been verified working on my LX150 development kit, and can achieve up to 100MH/s of performance.

The Altera Tcl Mining Script has just received a massive update. No more need to edit mine.tcl to hack in your FPGA's hardware and device names; mine.tcl will automatically detect mining FPGAs connected to the system. Pool information has been moved to a config.tcl file for easy editing. No more dependency on TclCurl, so the script should be Linux friendly now. And best of all, the console output has been cleaned up to look like poclbm.

Here is a full list of updates:

  • No more dependency on TclCurl, so less hassle on Linux.
  • Auto-detection of mining FPGAs. No more need to edit mine.tcl to put in your FPGA's information. Just run and the script should find the FPGA automatically.
  • Pool url, worker username, and worker password split into config.tcl. See config.example.tcl.
  • All debug messages removed from console output.
  • Console output is timestamped.
  • Miner information reported to console similar to poclbm's console output.
  • Console reports FPGA actual hashing rate, estimated hashing rate, and accepted/rejected shares.
  • Better error handling.
  • Tracking accepted/rejected shares.
  • Generic API introduced for communication with the FPGA, allowing better future support for different chips/firmwares.
  • General code maintainability improvements.


Screenshot of the new console output:


Screenshot of the script automatically detecting my FPGA  Cool



There are still two major missing features: 1) It doesn't find and utilize multiple connected FPGAs. 2) Long Polling.
The firmware code also needs to be updated to support a work queue and results queue. All of these are planned future updates. I will also (hopefully) merge the Xilinx and Altera mining scripts together, for ease of use and mixing both kinds of chips on the same system Tongue
hero member
Activity: 686
Merit: 564
Quote
Hmm ... that change doesn't seem like it should have any impact; ISE should be able to figure out that cur_w0 is rx_input[31:0] there unless I missed something.
It might for LOOP_LOG2=0, but it doesn't appear to for LOOP_LOG2=1. I've no idea why it doesn't; perhaps it doesn't know to make that particular optimization.

Quote
Was the timing analyzer pointing that out as being the trouble maker?
The timing analyzer was pointing to a long chain of logic, but it did indeed include an unnecessary mux between w0_fb and rx_input[31:0] feeding into the computation of the initial value of t1_part. Replacing that with a direct reference to rx_input[31:0] resulted in it reaching timing closure at 100MHz. (Again: this is all with LOOP_LOG2=1)
hero member
Activity: 560
Merit: 517
Quote
Trivial fix for this - in addition to using that branch, find the line:
Hmm ... that change doesn't seem like it should have any impact; ISE should be able to figure out that cur_w0 is rx_input[31:0] there unless I missed something. Was the timing analyzer pointing that out as being the trouble maker?
hero member
Activity: 686
Merit: 564
Finally got a set of modifications to the K_next lookup that ISE likes! It reported 100 MHz at LOOP_LOG2=1 on XC6SLX75 and I've sent a pull request to fpgaminer. Looks like it might be a bit hit and miss as to whether it meets timing though, since the next run failed with an actual period of 10.412ns rather than the required 10ns.

Edit: Might help if I actually linked to it. It's in the projects/LX150_makomk_Test dir of the xilinx-better-k-for-merge branch
Edit 2: Now it's consistently getting 10.412ns *sigh*.
Edit 3: Trivial fix for this - in addition to using that branch, find the line:
Code:
.rx_t1_part(feedback ? t1_part_fb : (rx_state[`IDX(7)] + cur_w0 + K))
Amend it to read:
Code:
.rx_t1_part(feedback ? t1_part_fb : (rx_state[`IDX(7)] + rx_input[31:0] + K))
You should now be able to achieve 100MHz (50MHash/sec) on the XC6SLX75, at least in theory. It's late and I need to sleep so this isn't committed anywhere yet.
hero member
Activity: 686
Merit: 564
Quote
On another note - the 2engine design I let running finally finished, and failed.  It ran out of placement sites - said it was short like 1k FFs and 3k LUTs to implement the design - and this was before routing, so it may not even have been able to attain 100 MHz there.  I am running it again, but it may take 2-3 days again - but this time I've enabled some more aggressive optimizations in the Xilinx ISE.
Ah. If you look at the list of LUTs it couldn't place, you'll almost certainly find that they're adders with carry chains: there aren't enough suitable sites with fast carry hardware to place all of the 32-bit-wide adders contiguously. I got the same error trying to fit a fully unrolled engine onto a XC6SLX75.
newbie
Activity: 44
Merit: 0
just got off a long day of work, brain hurts, so I haven't really gone over what you've written here so far, but skimming it definitely seems like it will help me understand everything.  When I get some free time I'll try to analyze it and work it all out, maybe even draw a single cell diagram for anyone else trying to wrap their heads around it.

On another note - the 2engine design I let running finally finished, and failed.  It ran out of placement sites - said it was short like 1k FFs and 3k LUTs to implement the design - and this was before routing, so it may not even have been able to attain 100 MHz there.  I am running it again, but it may take 2-3 days again - but this time I've enabled some more aggressive optimizations in the Xilinx ISE.

We might have to settle for having 1 fully unrolled engine, and then 1 LOOP_LOG2=1 engine running and that should hopefully get to ~150MHz pretty easily.

Has anyone else gotten a 100MHz version working?  Should I compile a bit file for someone to test it?  I hate flying blind without a target device to test on.
hero member
Activity: 686
Merit: 564
edit: update
if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:
Quote
`ifdef USE_RAM_FOR_KS
         if ( LOOP == 1) begin
            assign K = Ks_mem[ i ];
            assign K_next = Ks_mem[ i + 1 ];
         end else begin
...
I think the problem is that K and K_next are not assigned in a clock state, thus they become asynchronous combinatorial logic - and XST can't map that to a ROM?  Or maybe it's the addition of using a multiplier output as an address selector?  Something in there XST wasn't liking for me.
I'm guessing the reason you don't get the LUT RAM messages anymore is because xst is now interpreting that as a single constant value rather than as a table lookup. (Since i is a genvar, both i and i+1 are constant at synthesis time, and xst already knows Ks_mem is read-only.)

I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.
Congratulations, nice one! Wonder what exactly ISE was doing before...

It also seems like the worst case critical path related to the 100MHz clock is between Hasher[8] and Hasher[13]; it looks like the output of an adder in Hasher[8] (adding rx_state and k_next) gets registered into Hasher[13]'s shift_w1 register.
That's interesting.

Edit: Just noticed there was another page of discussion. fpgaminer's explanation is indeed correct, including for the bits I modified. (I'm afraid I'm actually responsible for a lot of the confusing bits, including the cur_ prefixes which are there to distinguish between the values of wn in this round and the variable new_w15 which holds the value of w15 used by the next round.)

It might be more useful to think of the w computation as being
Code:
w[i+16] = w[i] + s0(w[i+1]) + w[i+9] + s1(w[i+14])
because that's effectively how it's being calculated. cur_wn actually means w[i+n] and next_w15 will hold the value of w[i+16] one clock cycle later.

Quote
This is only applicable to makomk's latest revisions, where I think he chose to do a long chain loop instead of tight feedback because of how the W shift registers are implemented at LOOP_LOG2>0. He'll have to chime in here to verify, as I haven't worked all the logic out for myself.
That's one reason for the change, but it's probably not the only one; all those muxes at every stage in the pipeline aren't exactly cheap from what I can tell. (Also: don't forget the extra added clock cycle of delay in the feedback path in order to make each piece of data arrive at the input with cnt one higher than on its previous pass through the pipeline.)
hero member
Activity: 560
Merit: 517
Quote
This would (for LOOP_LOG2=1) mean that it will need to be fed a burst of work during 32 clocks, and then nothing at all during the next 32 clocks, which seems to be unnecessarily complicated to me.
It's fed "new" work every other clock cycle, confusingly enough. HASHERS[0], for example, will take new work at the first clock, old work at the second, new work at the third, old work at the fourth, and so on. It's really not a terrible way to do it, except for the long feedback chain required between the last and the first hasher, which is fine if those two instances are placed next to one another. ... and the fact that it's confusing and makes my brain squeal.

It may even be better, to be honest, because you have feedback in only one location instead of at every hasher. If you make a U shape out of the hasher placement it puts the first and last hasher next to each other for fast feedback there, and the rest can happily route forward. K_next is the only signal that changes the behavior of the hashers at each cycle, and is likely implemented as a 2:1 MUX at LOOP_LOG2=1.
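To make that scheduling concrete, here's a toy Python model of the LOOP_LOG2=1 long-chain feedback: 32 hashers in a chain, new work entering HASHERS[0] on even clocks, and the last hasher's output looping back through one extra delay register (the extra clock of delay makomk mentions) so it arrives on odd clocks. The bookkeeping is my own assumption from the description above, not the actual RTL:

```python
def bump(item):
    """Moving into the next hasher completes one more round."""
    return None if item is None else (item[0], item[1] + 1)

def simulate(n_hashers=32, n_items=4, total_rounds=64):
    pipeline = [None] * n_hashers   # slot i holds (work_id, rounds_done) in HASHERS[i]
    fb_reg = None                   # the extra clock of delay in the feedback path
    finished = {}
    injected = 0
    for t in range(6 * total_rounds):
        fb_out, fb_reg = fb_reg, pipeline[-1]
        pipeline = [None] + [bump(x) for x in pipeline[:-1]]  # advance the chain
        if fb_out is not None and fb_out[1] >= total_rounds:
            finished[fb_out[0]] = fb_out[1]   # all 64 rounds done; leaves the loop
            fb_out = None
        if t % 2 == 0 and injected < n_items:
            pipeline[0] = (injected, 1)       # fresh work enters on even clocks
            injected += 1
            assert fb_out is None             # parity keeps feedback collision-free
        elif fb_out is not None:
            pipeline[0] = (fb_out[0], fb_out[1] + 1)  # old work loops back in
    return finished
```

Running it shows every work item accumulating exactly 64 rounds, and the odd chain latency (32 hashers + 1 delay) is what keeps feedback and fresh work from ever colliding at HASHERS[0].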

Quote
Mine works in a different way. It feeds back a stage's outputs to its own inputs on every second clock, making HASHERS[0] handle rounds 0 and 1, and HASHERS[31] handle rounds 62 and 63. This way it just seems to work at half the frequency externally, no need for bursts.
That is how the normal code works. This is only applicable to makomk's latest revisions, where I think he chose to do a long chain loop instead of tight feedback because of how the W shift registers are implemented at LOOP_LOG2>0. He'll have to chime in here to verify, as I haven't worked all the logic out for myself.

Quote
Hm, this is weird. Maybe somehow related to verilog?
With my VHDL code, it tries to implement every single occurrence of two registers in a row as a shift register, leading to lots of unused flipflops and an unroutably high LUT usage. Limiting that to at least 4 registers in a row in the synthesis options improved things a lot.
Quite possibly. I am mostly just happy to finally have a routable design running on my LX150 at 100MHz  Tongue At least on Altera, Quartus couldn't figure out how to shove things into the M9Ks with vanilla code. makomk's version with explicit shifters is what got Quartus to use the M9Ks and cram the design into 75K LEs.

Perhaps the way VHDL is synthesized allows the compiler to better realize chains of registers? It seems odd, but I wouldn't put anything past these compilers Tongue
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Quote
For LOOP_LOG2>0 it is a bit more confusing. The hashers are still in a chain, but there are fewer of them (by some power of two), and the last one in the chain feeds the first one. That allows work to loop around the hasher chain a number of times to complete all 64 needed rounds. As an example, a unit of work will enter in at HASHERS[0], get worked on until HASHERS[31], and then get fed back into HASHERS[0] to get worked on another 32 times (for a total of 64 rounds). This feedback occurs on alternating clock cycles as determined by the cnt signal. When cnt==0 HASHERS[0] processes new work. On all other cnt HASHERS[0] processes old work from the last hasher in the chain.

This would (for LOOP_LOG2=1) mean that it will need to be fed a burst of work during 32 clocks, and then nothing at all during the next 32 clocks, which seems to be unnecessarily complicated to me.

Mine works in a different way. It feeds back a stage's outputs to its own inputs on every second clock, making HASHERS[0] handle rounds 0 and 1, and HASHERS[31] handle rounds 62 and 63. This way it just seems to work at half the frequency externally, no need for bursts.

Quote
However, the compilers don't like that way of handling the calculation of W. While you do indeed need to keep 16 words floating around at any given time, only 4 of them are used for each calculation in each instance. So it's a bit wasteful to hand all 16 words of W to the next module in the chain, when it only needs 4 of those words. makomk optimizes this by explicitly using shift registers to get each word of W to where it specifically belongs. This is why you see a bunch of shift registers with all sorts of different lengths. They are carrying the words of W to their various destinations, and how long the shifter needs to be often depends on if its one of the first links in the chain or not.

Hm, this is weird. Maybe somehow related to verilog?
With my VHDL code, it tries to implement every single occurrence of two registers in a row as a shift register, leading to lots of unused flipflops and an unroutably high LUT usage. Limiting that to at least 4 registers in a row in the synthesis options improved things a lot.
hero member
Activity: 560
Merit: 517
Quote
The other thing I've been told is your FPGA should never really exceed 60-70% usage pre-routing.... because a lot of resources are needed to get high-speed routing done...  And trying to pack 2 engines in there is likely nearing more like 90% usage...
Probably, but a lot of the design gets optimized in routing so it's hard to actually use less pre-routing Tongue

Quote
For example, what are the different length shift registers for?
Here is a quick overview of SHA-256, which will be helpful to explain those shift registers. There are 4 steps to the SHA-256 algorithm. The Pseudo Code on that wiki page is very helpful.

Step 1) Pre-processing. This is done for us already and is not handled by the FPGA (it's done by bitcoind when you perform a getwork request). Just ignore it; it isn't important.
Step 2) W calculation. The W array is 64 32-bit words, and is calculated from the 512-bit input data. The first 16 words are a 1-to-1 copy of the 512 bits of data. The remaining words are calculated like so: w[i] := w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2])
Step 3) Given the starting state (8 32-bit words, labelled A through H totaling 256-bits), perform 64 iterations of the round calculations. Each round takes as input its respective W value from the W array and the state of the previous round (initial state for the first round). It gives as output a new state.
Step 4) Add the output state from the last round to the starting (initial) state. This is done per 32-bit words. So A from initial state is added to A from final state, and so forth.
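If it helps, steps 2 through 4 translate almost line-for-line into software. Here's a minimal single-block Python sketch (my own illustration, not project code; the initial state and round constants come from the SHA-256 spec, and the result can be checked against hashlib):

```python
import struct

MASK = 0xFFFFFFFF
def rotr(x, n): return ((x >> n) | (x << (32 - n))) & MASK

# Step 3's starting state (A through H) and the 64 round constants, per the spec.
H0 = (0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19)
K = [0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
     0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
     0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
     0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
     0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
     0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
     0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
     0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
     0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
     0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
     0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2]

def w_schedule(block):
    """Step 2: expand a 512-bit block into the 64-word W array."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i-15], 7) ^ rotr(w[i-15], 18) ^ (w[i-15] >> 3)
        s1 = rotr(w[i-2], 17) ^ rotr(w[i-2], 19) ^ (w[i-2] >> 10)
        w.append((w[i-16] + s0 + w[i-7] + s1) & MASK)
    return w

def compress(state, block):
    """Steps 3 and 4: 64 rounds, then add the result to the starting state."""
    a, b, c, d, e, f, g, h = state
    for wi, ki in zip(w_schedule(block), K):
        S1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g & MASK)
        t1 = (h + S1 + ch + ki + wi) & MASK          # the T1 intermediary
        S0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        a, b, c, d, e, f, g, h = (t1 + S0 + maj) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return tuple((x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h)))

def sha256_one_block(msg):
    """Hash a message short enough (<= 55 bytes) to pad into a single block."""
    padded = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return b"".join(struct.pack(">I", x) for x in compress(H0, padded))
```

Checking sha256_one_block(b"abc") against hashlib is a handy way to validate testbench vectors. (Bitcoin of course double-hashes an 80-byte header, which spans two blocks plus a second hash; this sketch only covers the single-block case.)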


Step 3 is the easiest to understand and implement, because its inputs and outputs are obvious and consistent. They work exactly how you expect. Take input, do binary and arithmetic logic, and register the output. String 64 of those together in a chain and go make some tea.

Step 2 can also be easy to implement. You take 512-bits of data as input, monkey around with it, and register 512-bits as output. String a bunch of those together and bake some cupcakes. This is implemented in the non-optimized variation of the FPGA code, as seen in src/sha256_transform.v. In fact the whole of that unoptimized variation was quite a bit easier to comprehend.(1)

However, the compilers don't like that way of handling the calculation of W. While you do indeed need to keep 16 words floating around at any given time, only 4 of them are used for each calculation in each instance. So it's a bit wasteful to hand all 16 words of W to the next module in the chain, when it only needs 4 of those words. makomk optimizes this by explicitly using shift registers to get each word of W to where it specifically belongs. This is why you see a bunch of shift registers with all sorts of different lengths. They are carrying the words of W to their various destinations, and how long the shifter needs to be often depends on if its one of the first links in the chain or not.

In theory, a smart compiler could figure this out on its own; realize that a chain of registers is being formed and turn that into a shifter. But in reality that does not seem to be the case, so makomk makes it explicit. It probably also helps the compiler to see optimizations of constant data (most of the input data is constant for the first sha_transform instance).
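A quick software model of that insight: only four of the sixteen live W words (w[i-16], w[i-15], w[i-7], w[i-2]) are ever tapped, so a 16-deep shifting window is all each stage needs to carry forward. A Python illustration (mine, not a translation of the Verilog):

```python
from collections import deque

MASK = 0xFFFFFFFF
def rotr(x, n): return ((x >> n) | (x << (32 - n))) & MASK
def s0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def s1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def schedule_windowed(first16):
    """Compute the full 64-word W schedule from a 16-deep shifting window.

    Offsets 0, 1, 9, and 14 into the window are w[i-16], w[i-15], w[i-7],
    and w[i-2] -- exactly four taps out of the sixteen live words.
    """
    window = deque(first16, maxlen=16)
    out = list(first16)
    for _ in range(48):
        w = (window[0] + s0(window[1]) + window[9] + s1(window[14])) & MASK
        window.append(w)          # the oldest word falls off the far end
        out.append(w)
    return out
```

The deque plays the role of the chained shift registers: each word shuffles along until it reaches the stage that taps it, and nothing outside the window ever needs to be handed to the next module.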

Quote
So you have a variable K - want to see how hard it is to search a document for references to the letter K?
In Vim, put your cursor on K and press *. It finds every instance of K for me, and nothing extraneous. ...  Tongue Okay, sorry, I don't mean to be a smartass. I do indeed see your point, and I appreciate the feedback very much. I had the same pains as you before I started using Vim.

So would wires be labelled sig_ and registers reg_?

Quote
For example, wtf does cur_w1 mean?
Current W array index i-1 ... err, I think. Or it means current W array i - 15.  Tongue
No, I agree, those are confusing. I will take the time to clean it up and document.

Quote
or a _t1
T1 is one of the intermediary calculations in SHA-256. See the wiki pseudo code I linked.

Quote
But truthfully my understanding of the sha256 algorithm and it's pipelined version are probably a little bit lacking, and that is not helping to understand the code/flow.
I hope the quick overview I wrote above helps. Luckily SHA-256 is not too complex. I initially spent about a day or two mulling it all over in my head before I had a firm grasp over everything and how it gets implemented. If you want, check out the first committed version of sha256_transform.v: first commit. It's just two modules, and doesn't support anything fancy. Just fully unrolled and dead simple. It's probably a great place to start after you have a quick glance over the wiki pseudo code for SHA-256. From there you can look at the current src/sha256_transform.v I linked above, which includes the flexible loop unrolling parameter. And then look at the current makomk version of the code. That's exactly how the code has evolved, so it's a great way to take it step by step.


(1) Please note that the way this code handles LOOP_LOG2>0 is different from makomk's code. In this code, if LOOP_LOG2=2 then HASHERS[0] is responsible for rounds 0 through 3, as an example. i.e. the hashers feed back into themselves. In the makomk code you instead have the last hasher feed into the first. As far as I know this is necessary to get the code to fold correctly with his optimizations.
newbie
Activity: 44
Merit: 0
ok, so I'm not crazy. Yeah, pre-routing got me to 140% LUTs, and ISE has been at the map stage for something like 18 hours so far.... gonna let it run; I got a quad core, so doing parallel compiles is feasible.  The first global placement run took 4 hours for me, and it's been stuck on the 2nd global placement run now - probably around 19 hours total running time so far... what's sad is this is the Map phase... Place and Route has yet to run =(

The other thing I've been told is your FPGA should never really exceed 60-70% usage pre-routing.... because a lot of resources are needed to get high-speed routing done...  And trying to pack 2 engines in there is likely nearing more like 90% usage...

in terms of pipelining, it's not so much the Hasher blocks I don't understand, it's the signals feeding into them.... For example, what are the different length shift registers for?  What is the definition of cur_w# and why do they need different lengths; more specifically, what do the specific lengths correlate to?  The previous hasher's output?  And on a fully unrolled loop, why are shift registers even necessary?  Shouldn't each hasher's digester essentially have the "register" of the state in there?
edit: ohh wait nm, that's the message scheduler!

Maybe not a full block diagram outlining all the pipelined stages, but more of a "cell" diagram of a Hasher in terms of i.  E.g. Hasher i has connected to its input Hasher i-1's cur_w0.  Something like that might help me figure out exactly what's going on.

And I guess that loops into your question on specifying signal names.  Personally I would have some sort of prefix on every wire/reg.  In VHDL there is no distinction and the behavior ( wire or reg ) is inferred from the design - e.g. a signal assigned a value in a clocked process ==> register.  And in VHDL I usually prefix all my signal names with sig_XXXXX.  One of the problems I have with single letter variable names is that it's impossible to search the document for references to them.  So you have a variable K - want to see how hard it is to search a document for references to the letter K?  If every K was instead sig_K, it would be much easier to search the document to find references.  Basically any single letter variable name IMO is bad.

Some of the other signal names might be a mix of non-detailed naming + my inexperience with the SHA algorithm.  For example, wtf does cur_w1 mean?  I understand a _fb = feedback.  But I don't know what w1 or w0 or w14 or w9 do.  Also, I'm unsure what a _w means, or a _w1, or a _t1. Or a cur_ prefix - not exactly sure what that means either.

And although it may be easy and quick to type, the shift register definition also has 2 single letter registers, r and m - and this one isn't as bad because that stuff is internal, but imagine what a pain in the ass it is when you get a synthesis info/warning about some variable m - now I gotta search through all the source files by hand to look for a register m - because I can't just search for "m" in all the documents and get anything useful...

It might also help to organize the wire/reg definitions a little bit better.  The way it is now, definitions are strewn throughout the code.  I always prefer having my wire/reg/input/output declarations at the top of the module, like software coding.  It may also help to separate out the modules a little bit more.  The sha256_transform is so complicated already - maybe move things like the digester or shift registers out into their own source files, so that the root sha256_transform module is more of a connectivity/hierarchy module defining the structure, not the function, of the sha256 transform.

But truthfully my understanding of the SHA-256 algorithm and its pipelined version is probably a little bit lacking, and that is not helping me understand the code/flow.
hero member
Activity: 560
Merit: 517
Quote
I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.
Wonderful! I was able to route it at 90MHz successfully. Haven't tried 100MHz yet, because I figured it would require the extra Round0 partial t1 pipeline. I guess that isn't the case.

Quote
it'd be nice if there was a higher level timing diagram/pipeline diagram for this process.  I'd love to know what exactly each unit should be doing at any one time, e.g. which "nonce"/hash is block X currently working on.
At LOOP_LOG2=0 it is fairly simple. HASHERS[0] is working on nonce, HASHERS[1] is working on nonce - 1, HASHERS[2] is working on nonce - 2, ... HASHERS[63] is working on nonce - 63.

For LOOP_LOG2>0 it is a bit more confusing. The hashers are still in a chain, but there are fewer of them (by some power of two), and the last one in the chain feeds the first. That allows work to loop around the hasher chain a number of times to complete all 64 needed rounds. As an example, at LOOP_LOG2=1 a unit of work will enter at HASHERS[0], get worked on through HASHERS[31], and then get fed back into HASHERS[0] for another 32 rounds (64 rounds total). This feedback occurs on alternating clock cycles as determined by the cnt signal. When cnt==0, HASHERS[0] accepts new work; on all other values of cnt it processes old work from the last hasher in the chain.
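To make the folding concrete, here's a toy Python model of the round assignment (an illustration of the scheme as I've described it, not the actual HDL): with LOOP_LOG2 = L there are 64 >> L hashers, a unit of work makes 2**L passes through the chain, and on pass p HASHERS[i] performs round p * (64 >> L) + i.

```python
def rounds_for_hasher(i, loop_log2):
    """The SHA-256 rounds HASHERS[i] performs on one unit of work."""
    n_hashers = 64 >> loop_log2
    # one round per pass through the chain, 2**loop_log2 passes total
    return [p * n_hashers + i for p in range(1 << loop_log2)]

def full_schedule(loop_log2):
    """All rounds performed across the whole chain, sorted."""
    n_hashers = 64 >> loop_log2
    return sorted(r for i in range(n_hashers)
                  for r in rounds_for_hasher(i, loop_log2))
```

So at LOOP_LOG2=1, HASHERS[0] handles rounds 0 and 32, and full_schedule confirms the chain covers all 64 rounds exactly once for any unrolling factor.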

Looking at K_next and cnt can give you an idea of what the HASHERS are doing at any given time, although it's still a bit obfuscated because K_next is used to calculate the t1_partial for the next hasher Tongue

Quote
it really really really makes it hard to read and figure out what signals are what with the plethora of 1-letter signal/wire names....
Such as? Readability is actually important to me, so I appreciate the feedback. A lot of the signals have short names, like K and t1, but those come directly from SHA-256 terminology. The code in sha256_transform is probably the biggest area for confusion, but it does have to handle a lot, so I'm not surprised. Let me know any ways you think the names and design can be improved. I might even put together a chart like you suggested that shows graphically how the hashing chain works.

Quote
I haven't been able to reproduce the 11ns synthesis run for the XC6SLX75 since fixing the values of K_next, and I can't entirely figure out why; best I've seen since then is 14-15ns. (That's with 100 MHz as the target.) You might have better luck with a fully unrolled design, but then again perhaps not.
If you over constrain, you can make routing go haywire. That's why when I ramp up timing constraints I do it in increments. It is a huge pain in the butt to have to re-compile a number of times, but it gives a better gauge of what the design can actually handle. Usually I under constrain by a lot, see what the minimum period is, and then re-constrain to that period to push it further. Looks like magik has gotten 100MHz on an unrolled design. I've gotten 90MHz as well.

Quote
The first line says create a 32-bit register array, with (8-2+1) elements, so 7 elements, but the modulus on addr wraps around at 7 - e.g. once ( addr + 1 ) == 7, then addr becomes 0, not 7.  So we are missing the last element of the shift register.
With 7 elements, the range would be 0 to 6, so modulus 7 is correct. I had to double check it myself when I first saw it, but it is indeed correct. And it most certainly helps the design compared to what I was initially trying.

Quote
on another note, I placed 2 cores ( 4 sha256 transforms ) into the design, it said I was using 140% LUTs, but it's still trying to route it right now?  It's been running for over 12 hours though....
Well I finally tried the same thing. Code is on the public repo now. It reports 139% LUT usage after synthesis, and gets stuck in Mapping. I let it run overnight with it still stuck on Mapping (Global Placement phase).

The design with a single core reports ~70% LUT after synthesis, and ~40% after routing, so routing does quite a bit to optimize the design. So even though synthesis reported 140% I was hoping for Routing to fit the design anyway. I'm guessing either the design is far too large, or we've run up against the "useless" slices issue.

I will probably put the multicore design to the side for now and work on a speed oriented design instead, using those extra slices to enhance pipelining.
newbie
Activity: 44
Merit: 0
good point, I should just be mapping the K's directly into the hashers as constants
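For reference, the K's are fixed per round anyway: the first 32 bits of the fractional parts of the cube roots of the first 64 primes (FIPS 180-4), so a fully unrolled HASHERS[i] can take K[i] as a hard-wired literal. A quick Python sketch of generating the table (using exact integer cube roots to avoid float rounding):

```python
def first_n_primes(n):
    """Trial-division generator for the first n primes."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(n):
    """Largest x with x**3 <= n (binary search on big ints)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def sha256_k_table():
    ks = []
    for p in first_n_primes(64):
        root = icbrt(p << 96)            # floor(cbrt(p) * 2**32), exactly
        ks.append(root & 0xFFFFFFFF)     # keep the 32 fractional bits
    return ks
```

Dumping that table into a Verilog parameter/case statement would let synthesis constant-fold each hasher's K input.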

I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.

It also seems like the worst case critical path related to the 100MHz clock is between Hasher[8] and Hasher[13], looks like an output of an adder in Hasher[8], adding rx_state and k_next gets registered into Hasher[13]'s shift_w1 register.

I also don't like this chipscope in here heh, it's treating the signals you are looking at as clocks and thus routes them on BUFGs through BUFGMUXes.  It's low fanout so it doesn't matter that much, but from what I'm used to with FPGA design, you don't really want to be using non-clock signals as clocks, i.e. using these signals in edge logic, a la always @ posedge(some_sig).  I should probably see how many resources these chipscope modules are taking up as well....

it'd be nice if there was a higher level timing diagram/pipeline diagram for this process.  I'd love to know what exactly each unit should be doing at any one time, e.g. which "nonce"/hash is block X currently working on.

it's taken me a bit of time to figure out what's going where, and it really really really makes it hard to read and figure out what signals are what with the plethora of 1-letter signal/wire names....


and hrm... the 2 engine design still tells me it's 140% LUTs lol.... and this is without getting the RAM => LUT message.  BRAM/FIFO usage is 202/268.... hrm.... I wonder if this compile will also last 12+ hours
hero member
Activity: 504
Merit: 500
FPGA Mining LLC