Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 28.

makomk

hero member

Activity: 686

Merit: 564

Quote from: fpgaminer on July 22, 2011, 09:03:31 PM

No kidding. It broke my brain for a while, until I realized it was just a delay chain, so you could add to cnt to get what cnt "looks like" at each stage in the chain.

Heh. There's a reason I'd been putting off making those changes to the partial unrolling originally; it was fairly obviously beneficial, but also rather fiddly.

Quote from: fpgaminer on July 23, 2011, 12:22:34 AM

Quote

Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0].

On my last run, I think ISE reported an actual period of 15ns. I'm still getting used to the timing report in ISE, so I could be wrong. Regardless, that's with it targeting 50MHz so I'm sure it will give better results with tighter constraints. I will certainly try to patch it for the initial t1_partial; that's bound to be helpful.

I haven't been able to reproduce the 11ns synthesis run for the XC6SLX75 since fixing the values of K_next, and I can't entirely figure out why; best I've seen since then is 14-15ns. (That's with 100 MHz as the target.) You might have better luck with a fully unrolled design, but then again perhaps not.
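For reference, the periods quoted above convert to a maximum clock frequency as f = 1000 / period_ns (in MHz). A minimal sketch of that arithmetic in Python; the function names are made up for illustration and are not part of any ISE tooling:

```python
def fmax_mhz(period_ns: float) -> float:
    """Highest clock frequency (MHz) that a given minimum period (ns) allows."""
    return 1000.0 / period_ns


def meets_target(period_ns: float, target_mhz: float) -> bool:
    """True if the reported period is short enough for the target clock."""
    return fmax_mhz(period_ns) >= target_mhz


if __name__ == "__main__":
    # Periods mentioned in the thread: 11.045 ns, and roughly 14 to 15 ns.
    for period in (11.045, 14.0, 15.0):
        print(f"{period} ns -> {fmax_mhz(period):.1f} MHz, "
              f"100 MHz ok: {meets_target(period, 100.0)}, "
              f"50 MHz ok: {meets_target(period, 50.0)}")
```

So the 11.045ns run falls a little short of 100 MHz, while the 14-15ns runs still clear 50 MHz comfortably.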

Quote from: fpgaminer on July 24, 2011, 11:39:37 PM

makomk I also sent you a donation for the hard work you've done achieving 110MHz on Altera, and getting a fully unrolled core working on the LX150 chip. Many thanks to the both of you, and everyone who contributes to this project!

Thank you! Though I'm not sure the extent to which I helped with that second one... don't even have the tools to attempt such a thing.

Quote from: fpgaminer on July 24, 2011, 11:39:37 PM

However, it has been mentioned before that the Spartan-6 devices don't have fast carry chain routing on half of the slices. That may impede the ability to get two engines on an LX150.

Haven't done the math on that but it'd probably work out much the same as fitting one on the LX75: not enough carry chains for all the adders, won't fit without some trickery.

Quote from: magik on July 26, 2011, 02:38:56 PM

hrm... there has to be a better way to use those generate blocks to parse these values not as signals/wires to be used at runtime, but rather generate constant integers or look up tables/mux's to generate these...

I have some changes to do this, but they're on a computer I don't have access to this second and I don't think I pushed them to any public repos. (For some reason I appeared to be seeing a negative effect on Cyclone IV clock speeds at LOOP_LOG2=0.)

You may well find the tables aren't actually being synthesized as LUT RAM in the end anyway.

magik

newbie

Activity: 44

Merit: 0

hrm.... yeah been doing more testing... and it seems like I have high LUT usage because some of the "RAM" is being inferred as LUTs?

do you get any of these messages when you compile?

Quote

INFO:Xst:3218 - HDL ADVISOR - The RAM will be implemented on LUTs either because you have described an asynchronous read or because of currently unsupported block RAM features. If you have described an asynchronous read, making it synchronous would allow you to take advantage of available block RAM resources, for optimized device usage and improved timings. Please refer to your documentation for coding guidelines.
   -----------------------------------------------------------------------
   | ram_type           | Distributed                         |          |
   -----------------------------------------------------------------------
   | Port A                                                              |
   |     aspect ratio   | 64-word x 32-bit                    |          |
   |     weA            | connected to signal                 | high     |
   |     addrA          | connected to signal                 |          |
   |     diA            | connected to signal                 |          |
   |     doA            | connected to signal                 |          |
   -----------------------------------------------------------------------

really odd.... it's not happening to all of the sha_transform modules though... it only seems to be one.... the 2nd one with the NUM_ROUNDS set to 61 it appears

also, I see things like this when it's synthesizing:

Quote

Found 6x6-bit multiplier for signal created at line 120.
Found 6x32-bit multiplier for signal created at line 127.

line 120 is:

Quote

assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];

line 127 is:

Quote

assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];

hrm... there has to be a better way to use those generate blocks to parse these values not as signals/wires to be used at runtime, but rather generate constant integers or look up tables/mux's to generate these...
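One possible angle on the multiplier, sketched as a Python model rather than as Verilog: if NUM_ROUNDS is 64 and LOOP is a power of two, then NUM_ROUNDS/LOOP is also a power of two, so the multiply in the K address is just a shift, and because i is smaller than NUM_ROUNDS/LOOP the add can become a bitwise OR (a plain bit concatenation of cnt and i). This assumes NUM_ROUNDS=64; it does not hold for the NUM_ROUNDS=61 instance. Whether XST then infers something better is not something this sketch can show.

```python
NUM_ROUNDS = 64  # assumption: the full 64-round transform, not the 61-round one


def k_index_multiply(cnt: int, i: int, loop_log2: int) -> int:
    """Address as written in the thread: (NUM_ROUNDS/LOOP)*cnt + i."""
    loop = 1 << loop_log2
    return (NUM_ROUNDS // loop) * cnt + i


def k_index_shift(cnt: int, i: int, loop_log2: int) -> int:
    """Same address built from a shift and an OR (bit concatenation)."""
    shift = 6 - loop_log2  # log2(NUM_ROUNDS / LOOP) when NUM_ROUNDS == 64
    return (cnt << shift) | i


def k_next_index_shift(cnt: int, i: int, loop_log2: int) -> int:
    """K_next address with the multiply replaced by a shift.

    i + 1 can reach NUM_ROUNDS/LOOP, so a real add is kept here.
    """
    loop = 1 << loop_log2
    shift = 6 - loop_log2
    return (((cnt + i) & (loop - 1)) << shift) + i + 1


if __name__ == "__main__":
    for loop_log2 in range(0, 6):
        loop = 1 << loop_log2
        for cnt in range(loop):
            for i in range(NUM_ROUNDS // loop):
                assert k_index_multiply(cnt, i, loop_log2) == k_index_shift(cnt, i, loop_log2)
    print("shift/OR form matches the multiply form for LOOP_LOG2 = 0..5")
```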

edit: update
if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:

Quote

`ifdef USE_RAM_FOR_KS
         if ( LOOP == 1) begin
            assign K = Ks_mem[ i ];
            assign K_next = Ks_mem[ i + 1 ];
         end else begin
...

I think the problem is that K and K_next are not assigned inside a clocked block, so they become asynchronous combinatorial reads, and XST can't map that to a ROM? Or maybe it's the use of a multiplier output as the address selector? Something in there XST wasn't liking for me.

also, it seems the 1st round synthesizes much differently?
for the first sha block I get this:

Quote

Summary:
   inferred 10 Adder/Subtractor(s).
   inferred 551 D-type flip-flop(s).
   inferred 17 Multiplexer(s).
Unit synthesized.

for the 2nd block I get this:

Quote

Summary:
   inferred 62 RAM(s).
   inferred 2 Multiplier(s).
   inferred 63 Adder/Subtractor(s).
   inferred 295 D-type flip-flop(s).
   inferred 17 Multiplexer(s).
Unit synthesized.

why are these so different!?

first off are they sharing the RAM for the K's ? It seems only the K's for the 2nd block are generated, but Xilinx might be optimizing across the hierarchy here. But what about the # of adders/subtractors!? only 10 in the first block? how can that be? or is it that it's shifting the position of the adders from the digester to the higher module?

I also see this:

Quote

Synthesizing Unit .
Related source file is "e:/bitcoin/lx150_makomk_test/hdl/sha256_transform.v".
LENGTH = 8
WARNING:Xst:3035 - Index value(s) does not match array range for signal , simulation mismatch.

which relates to the shift register code:

Quote

      reg [31:0] m[0:(LENGTH-2)];
      always @ (posedge clk)
      begin
         addr <= (addr + 1) % (LENGTH - 1);

now when I look at that, I'm not sure if that's correct, so let's say LENGTH = 8. The first line says create a 32-bit register array with (8-2+1) elements, so 7 elements, but the addr modulus wraps around at 7, e.g. once ( addr + 1 ) == 7, addr becomes 0, not 7. So we are missing the last element of the shift register.

I think this is just an indexing problem - LENGTH = 8 means 8 elements in the shift register. so you want reg [31:0] m[0:7] or reg [31:0] m[0:(LENGTH-1)]. Then below on the addr assignment, you would want addr <= ( addr + 1 ) % ( LENGTH ). Because using a LENGTH of 8, xxx % 8 will always return a value inclusively between 0 and 7.

Not sure how this is even working with one of the shift registers effectively 1 element short....
edit: seems if I "fix" this, it breaks it heh..... I need to look into this
ok another edit update, it seems this code is correct because you also have a 32-bit register r in there that's separate from the m storage register. And that also explains the different synthesis for this module. It's using a RAM, a 32-bit register r, 3-bit register addr, 9-bit adder for next address range, as opposed to just LENGTH*32 register/FF for the other types of shift registers... not sure which one is better here
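To make that last point concrete, here is a small Python model of the delay. It assumes (the thread only shows the declaration and the addr update) that on every clock the old m[addr] is loaded into r, the new input is written to m[addr], and addr advances. Under that assumption, LENGTH-1 memory words plus the output register r give the same delay as a chain of LENGTH flip-flop stages, and "fixing" the memory to LENGTH words makes it one cycle too long.

```python
def ram_shift_register(samples, length, depth=None):
    """Model of a RAM-based shift register with an output register r.

    depth is the number of memory words; the thread's code uses length - 1.
    Returns the value of r after each clock edge.
    """
    if depth is None:
        depth = length - 1
    m = [0] * depth
    addr = 0
    r = 0
    out = []
    for x in samples:
        # Non-blocking semantics: r takes the OLD contents of m[addr].
        next_r = m[addr]
        m[addr] = x
        addr = (addr + 1) % depth
        r = next_r
        out.append(r)
    return out


def ff_chain(samples, length):
    """Reference model: a plain chain of `length` registers."""
    regs = [0] * length
    out = []
    for x in samples:
        regs = [x] + regs[:-1]
        out.append(regs[-1])
    return out


if __name__ == "__main__":
    data = list(range(1, 21))
    print(ram_shift_register(data, 8) == ff_chain(data, 8))           # same delay
    print(ram_shift_register(data, 8, depth=8) == ff_chain(data, 8))  # the "fix"
```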

on another note, I placed 2 cores ( 4 sha256 transforms ) into the design, it said I was using 140% LUTs, but it's still trying to route it right now? It's been running for over 12 hours though....

fpgaminer

hero member

Activity: 560

Merit: 517

I've just updated the public repo with the Tcl mining script I wrote for my Spartan-6 Dev Kit. I'm not really happy with it, but it does work on my machine. There's a hardcoded filepath in mine.bat that I couldn't get rid of, and some hardcoded JTAG addresses in mine.tcl that are specific to the dev kit. I had to use dumb string parsing instead of actual JSON, because Xilinx is using Tcl 8.4 instead of 8.5 for ChipScope stuff; and I had to drop TclCurl in favor of Tcl's http package. On the bright side, the http package works great and is more portable than TclCurl so I might do the same replacement on the Altera mining script. That should allow it to run on Linux.
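As an illustration of what "dumb string parsing" can look like, here is a Python stand-in (not the Tcl from mine.tcl). It assumes the pool reply carries fields such as "midstate" and "data" as quoted hex strings, as a getwork reply does; the sample reply and the helper name are made up for the example.

```python
def extract_field(body: str, key: str) -> str:
    """Pull the quoted string value of `key` out of a JSON-like reply.

    Only works for simple string values without escaped quotes, which is
    enough for hex fields. Raises KeyError if the key is not present.
    """
    marker = '"' + key + '"'
    start = body.find(marker)
    if start < 0:
        raise KeyError(key)
    colon = body.index(":", start + len(marker))
    open_quote = body.index('"', colon)
    close_quote = body.index('"', open_quote + 1)
    return body[open_quote + 1:close_quote]


if __name__ == "__main__":
    # Made-up, shortened reply; real fields are much longer hex strings.
    reply = '{"result":{"midstate":"ab12","data":"00ff","target":"ffff"},"error":null,"id":1}'
    print(extract_field(reply, "midstate"), extract_field(reply, "data"))
```

The trade-off is the obvious one: it needs no JSON library, but it breaks as soon as a value contains an escaped quote or the same key appears twice.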

This verifies that the current LX150 Xilinx design works with a live pool.

teknohog, I sent you a donation for laying the groundwork in the Xilinx Verilog project. makomk I also sent you a donation for the hard work you've done achieving 110MHz on Altera, and getting a fully unrolled core working on the LX150 chip. Many thanks to the both of you, and everyone who contributes to this project!

Quote

tried popping in two more sha cores to get 2 engines running ( fully unrolled ), ISE spit out this:

That's an awful lot of LUT usage. A single unrolled engine uses under 50% of the LUTs. Are those stats post Synthesis, or post Routing?

My guess is that by the resource usage of a single engine, two should fit on an LX150. However, it has been mentioned before that the Spartan-6 devices don't have fast carry chain routing on half of the slices. That may impede the ability to get two engines on an LX150. It might make more sense to use a single engine with extra pipelining and see if we can get it clocked up towards 200MHz. I will certainly be exploring both options.

Thank you for reporting those numbers, though. Let me know how your latest experiments go

magik

newbie

Activity: 44

Merit: 0

tried popping in two more sha cores to get 2 engines running ( fully unrolled ), ISE spit out this:

Quote

Slice Logic Utilization:
Number of Slice Registers: 92543 out of 184304 50%
Number of Slice LUTs: 121337 out of 92152 131% (*)
   Number used as Logic: 113389 out of 92152 123% (*)
   Number used as Memory: 7948 out of 21680 36%
   Number used as SRL: 7948

so looks like without a lil bit of massaging the current design uses up a bit more resources....
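A quick check of those percentages, as a Python sketch with the numbers copied from the report above. It assumes ISE truncates the percentage rather than rounding, which is what the 36% memory figure suggests (7948/21680 is about 36.7%).

```python
def utilization_percent(used: int, available: int) -> int:
    """Utilization as a whole percentage, truncated (not rounded)."""
    return 100 * used // available


# (used, available) pairs copied from the ISE report quoted above.
REPORT = {
    "Slice Registers": (92543, 184304),
    "Slice LUTs": (121337, 92152),
    "LUTs used as Logic": (113389, 92152),
    "LUTs used as Memory": (7948, 21680),
}

if __name__ == "__main__":
    for name, (used, available) in REPORT.items():
        over = max(0, used - available)
        print(f"{name}: {utilization_percent(used, available)}% (over by {over})")
```

The design would need to shed roughly 29,000 LUTs to fit.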

I'm gonna try 2 hashing engines ( 4 cores ) running at log_level2 - that should be able to fit, and then I'll see how fast I can scale up the clocking to get it routable

magik

newbie

Activity: 44

Merit: 0

great reply, thanks, I have it successfully generating a golden nonce now in simulation, awesome.

I'm going to toy around and see if I can get this running faster than 100 MHz, or rather, routing @ faster than 100 MHz, I'm liking that ISE 13 has multi-core support for the stuff like routing and simulation now!

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

The latest confirmed 50MHash/s on the lx150 - which codeset is that? the LX150_makomk directory?

Yup!
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

Quote

and thats a lx150 or lx150t?

I have the Spartan 6 LX150T Development Kit, so it's an LX150T chip. However the T makes no difference on the performance of the algorithm; it merely indicates that the chip has transceivers on it, which are irrelevant to a mining application. The mining board being developed in this thread will use the Spartan 6 LX150-3N variant, which has no transceivers, -3 speed grade (fastest), and N for no memory controller.

Quote

also, I see a testbench in there - do you have maybe a timing diagram of expected/correct outcome for those inputs?

The testbench is a bit primitive at the moment. There are no test waveforms, and it isn't fully automated in that it won't just tell you if the design passed or failed (nor why it failed). It sets up the fpgaminer_top module as the Unit Under Test, manually setting up its internal registers for the test data documented in this file. When I do a test to verify the design I load the testbench up in ModelSim, put the top-level signals of the uut on the wave viewer, and tell it to run the simulation for about 8us. After that I can check the golden_nonce register to see if it matches the correct value of 0x0e33337a.

If it doesn't match then I debug manually. Ideally a robust testbench would check every stage of the SHA-256 calculations automatically and report the failures, but it's a bit non-trivial to implement because of the parameterized pipelining and the countless variations on the code by this point.

Quote

also looks like the ucf is set up to receive a 100MHz clock and I don't see any clock dividers in the code?

See main_pll.v which is instantiated in fpgaminer_top.

Quote

hrm... so it seems you are using chipscope to communicate with the chip? interesting, I haven't seen that before - you guys discuss that somewhere in this thread? what software are you using to talk through the chipscope objects?

It's a relic of my experience in using Altera chips. I've ~~used~~ abused Altera's In-System Source and Probes feature for a long time in various designs for quick debugging. It's very convenient, because it goes over JTAG which must be connected to program the chip anyway. Much nicer than having to run yet another cable around my already tangled desk.

My Altera implementations, on the DE2-115 dev board for example, use it and the mining script is already written and working: mine.tcl

So yes, my code for the LX150 ended up using the Xilinx equivalent, ChipScope's Virtual I/O. I did my initial tests to verify that the design is working on live hardware by simply using ISE's ChipScope interface. Now that the design is verified I am writing the actual mining script in TCL, for which Xilinx provides a ChipScope Engine interface.

Most other people seem to prefer using RS232 for the communication. I'm inclined to agree after seeing the tcl interface to ChipScope.

But I don't have an RS232-USB adapter at home.

magik

newbie

Activity: 44

Merit: 0

ooh interesting stuff going on here for Spartan devices eh? I need to check some of this out in my compiler as well.

The latest confirmed 50MHash/s on the lx150 - which codeset is that? the LX150_makomk directory?

and thats a lx150 or lx150t?

also, I see a testbench in there - do you have maybe a timing diagram of expected/correct outcome for those inputs?

not too familiar with ISE 13 myself or verilog for that matter - i use mostly vhdl, but it also looks like you left a chipscope core in the project file in the github

also looks like the ucf is set up to receive a 100MHz clock and I don't see any clock dividers in the code?

edit:
hrm... so it seems you are using chipscope to communicate with the chip? interesting, I haven't seen that before - you guys discuss that somewhere in this thread? what software are you using to talk through the chipscope objects?

fpgaminer

hero member

Activity: 560

Merit: 517

I have now confirmed that with LOOP_LOG2=0, at 50MHz, the design works on live hardware and returns correct results. That means the Spartan-6 LX150 is now confirmed to perform at 50MHash/s.
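The relationship between clock, LOOP_LOG2 and hash rate implied by the numbers in this thread (25MH/s at 50MHz with LOOP_LOG2=1, 50MH/s at 50MHz with LOOP_LOG2=0) can be written down as a small Python sketch. It assumes one engine that produces one hash every 2^LOOP_LOG2 clock cycles; the function name and the `engines` parameter are only for illustration.

```python
def hashrate_mhs(clock_mhz: float, loop_log2: int, engines: int = 1) -> float:
    """Estimated hash rate in MHash/s.

    Assumes each engine finishes one hash every 2**loop_log2 clock cycles.
    """
    return engines * clock_mhz / (1 << loop_log2)


if __name__ == "__main__":
    for clock, loop_log2 in ((50, 1), (50, 0), (100, 0)):
        print(f"{clock} MHz, LOOP_LOG2={loop_log2}: "
              f"{hashrate_mhs(clock, loop_log2):.1f} MHash/s")
```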

Public repo has been updated with the code I just compiled and tested.

I want to write a mining script for it, and test it on a real pool. From there I'll ramp up the clock to see how close it will get to 100MHz

Quote

Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0].

On my last run, I think ISE reported an actual period of 15ns. I'm still getting used to the timing report in ISE, so I could be wrong. Regardless, that's with it targeting 50MHz so I'm sure it will give better results with tighter constraints. I will certainly try to patch it for the initial t1_partial; that's bound to be helpful.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

Yep, that's entirely correct. I remember it wasn't much fun to figure this out; took me several hours to get right myself.

No kidding. It broke my brain for a while, until I realized it was just a delay chain, so you could add to cnt to get what cnt "looks like" at each stage in the chain.

Quote

Edit: Fix tested in Modelsim at all working LOOP_LOG2 values (0, 1, 2 and 3) and pushed to partial-unroll-opt branch.

Wonderful, thank you for checking!

Quote

Sorry again about that bug.

No worries. Your work is greatly appreciated, and I'm very excited to get my LX150 dev kit mining

The guys over at the Modular FPGA hardware design thread will also be quite happy, since their design is based around the LX150.

I ran a LOOP_LOG2=0 compile overnight. Turns out, the compiles actually take very little time; under an hour. And yes, it completes just fine at 50MHz.

However, I've made silly mistake after silly mistake in the code, resulting in countless re-compiles. I'm hoping the compile I have going right now is the last one, and I can finally get correct results from the live hardware. I will report back with success once I've got it.

Looks like device utilization is about 50%, which is good, and XPower estimates 2.2W of consumption (FF toggle at 200, BRAM at 100%). I measure 50C on the chip's surface, ~38C with a small fan.

newMeat1

full member

Activity: 210

Merit: 100

Great work guys! Thanks

makomk

hero member

Activity: 686

Merit: 564

Quote from: fpgaminer on July 22, 2011, 04:45:16 AM

phew, finally tracked down the bug. The K and K_next wires in sha256_transform.HASHERS were not getting the right values.

Whoops, you are indeed right. I changed the non-USE_RAM_FOR_KS case but forgot to change or test the USE_RAM_FOR_KS one. Sorry!

Quote from: fpgaminer on July 22, 2011, 04:45:16 AM

K_next is a little bit more complicated, because it has to use cnt differently for each HASHERS. For example, if LOOP_LOG2=1, then K_next in HASHERS[1] needs to alternate between Ks_mem[34] and Ks_mem[2], when cnt=0 and cnt=1 respectively. In HASHERS[2] K_next alternates between Ks_mem[3] and Ks_mem[35] respectively.

It's a little weird, but it makes sense. When LOOP_LOG2=1, HASHERS[0] alternates between doing fresh work (cnt=0, Round 0) and doing old work (cnt=1, Round 32): new, old, new, old, etc. Since HASHERS[1] is directly connected to HASHERS[0] it alternates as well, but it will alternate in the opposite fashion. It gets old work, and then new work, old, new, old, new, etc.

Yep, that's entirely correct. I remember it wasn't much fun to figure this out; took me several hours to get right myself. (In fact, the whole code's a tad hairy.)

Quote from: fpgaminer on July 22, 2011, 04:45:16 AM

Off the top of my head, I think this will work in the general case:

Code:

assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];

That basically says, adjust cnt by our position in the HASHERS chain. I can try it out later and check.

That corresponds to what I was doing for the tested non-USE_RAM_FOR_KS case, so it should work. (Of course, the whole thing goes pear-shaped anyway if LOOP_LOG2 > 3 for reasons I haven't pinned down.)

Quote from: fpgaminer on July 22, 2011, 04:45:16 AM

With that fix, the code works correctly in ModelSim, and it works on live hardware.

So my LX150T dev board finally gets 25MH/s of performance. Progress! I may run a compile of LOOP_LOG2=0 overnight and see if that finishes.

I'm finally making pleasant progress with the LX150 because of your hard work, makomk, so thank you.

Yay - good news at last! Sorry again about that bug.

Edit: Fix tested in Modelsim at all working LOOP_LOG2 values (0, 1, 2 and 3) and pushed to partial-unroll-opt branch.

fpgaminer

hero member

Activity: 560

Merit: 517

phew, finally tracked down the bug. The K and K_next wires in sha256_transform.HASHERS were not getting the right values. K was easy:

Code:

assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];

K_next is a little bit more complicated, because it has to use cnt differently for each HASHERS. For example, if LOOP_LOG2=1, then K_next in HASHERS[1] needs to alternate between Ks_mem[34] and Ks_mem[2], when cnt=0 and cnt=1 respectively. In HASHERS[2] K_next alternates between Ks_mem[3] and Ks_mem[35] respectively.

It's a little weird, but it makes sense. When LOOP_LOG2=1, HASHERS[0] alternates between doing fresh work (cnt=0, Round 0) and doing old work (cnt=1, Round 32): new, old, new, old, etc. Since HASHERS[1] is directly connected to HASHERS[0] it alternates as well, but it will alternate in the opposite fashion. It gets old work, and then new work, old, new, old, new, etc.

I threw a quick hack into my code for K_next (it's on the public repo), that only works for LOOP_LOG2=1:

Code:

if(i & 1)
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*!cnt[0]+i+1];
else
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i+1];

Off the top of my head, I think this will work in the general case:

Code:

assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];

That basically says, adjust cnt by our position in the HASHERS chain. I can try it out later and check.
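For anyone who wants to check the indices by hand, here is a small Python model of the two K_next expressions above. It assumes NUM_ROUNDS=64 and only models the address arithmetic, not the hardware; the function names are made up for the example.

```python
NUM_ROUNDS = 64  # assumption: full 64-round SHA-256 transform


def k_next_index(cnt: int, i: int, loop_log2: int) -> int:
    """General form: adjust cnt by the position i in the HASHERS chain."""
    loop = 1 << loop_log2
    return (NUM_ROUNDS // loop) * ((cnt + i) & (loop - 1)) + i + 1


def k_next_index_hack(cnt: int, i: int) -> int:
    """The quick hack, valid only for LOOP_LOG2=1 (LOOP=2)."""
    loop = 2
    if i & 1:
        # Odd positions see the inverted low bit of cnt (!cnt[0]).
        return (NUM_ROUNDS // loop) * (0 if (cnt & 1) else 1) + i + 1
    return (NUM_ROUNDS // loop) * cnt + i + 1


if __name__ == "__main__":
    # Indices for HASHERS[1] and HASHERS[2] at LOOP_LOG2=1, for cnt = 0 and 1.
    for i in (1, 2):
        print(f"HASHERS[{i}]:", [k_next_index(cnt, i, 1) for cnt in (0, 1)])
```

For LOOP_LOG2=1 this gives 34 then 2 for HASHERS[1], and 3 then 35 for HASHERS[2], which matches the worked example above, and the general form agrees with the hack at every position.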

With that fix, the code works correctly in ModelSim, and it works on live hardware.

So my LX150T dev board finally gets 25MH/s of performance. Progress! I may run a compile of LOOP_LOG2=0 overnight and see if that finishes.

I'm finally making pleasant progress with the LX150 because of your hard work, makomk, so thank you.

fpgaminer

hero member

Activity: 560

Merit: 517

I got your code rolled into an LX150T project with ChipScope (JTAG) for the "virtual wires." It compiles up just fine at 50MHz and LOOP_LOG2=1

However it does not deliver correct results when run on the live chip. The results are consistent, and the chip reads 54C at the surface, so I don't think the timing is wrong, nor is it overheating. I will run the code through modelsim and see if I can track down the bug.

https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

Side Note: I tried out ISE's power analysis tool. I've never used it before, and I don't have a post P&R simulation vector to give it, so I just set some over-estimated toggling values. 120 FF toggle, 100% BRAM usage, I think. It reported something like 1.4W. Seems a bit low to me ... any idea if that seems reasonable to you?

Quote

Bigger, from what I remember. These changes get it down to about 51k LEs for LOOP_LOG2=1 on Cyclone IV, which isn't great but...

That's great progress in my book. Well done, makomk! The rolled up designs are important as well, because they can help fill out large chips.

Quote

they appear to do not-so-great things to the routing and placement...

ಠ_ಠ

makomk

hero member

Activity: 686

Merit: 564

Quote from: fpgaminer on July 21, 2011, 07:49:32 PM

It's grotesquely inefficient compared to LOOP_LOG2=0.

For the vanilla code (DE2_115_Unoptimized), LOOP_LOG2=1 is almost as big as LOOP_LOG2=0 on Altera. It's terrifying.

Bigger, from what I remember. These changes get it down to about 51k LEs for LOOP_LOG2=1 on Cyclone IV, which isn't great but...

Quote from: fpgaminer on July 21, 2011, 07:49:32 PM

The DCM confuses the heck out of me. I really should read its datasheet and get my head straight. Anyway, the code I use is on the public repo, so you can take a look at the DCM that coregen made for my project: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/LX150_Test/hdl/main_pll.v
It uses the CLKDV_DIVIDE parameter to take an input 100MHz clock and spit out a 50MHz clock.

Yeah, that makes sense. Having read the datasheet, the CLKn outputs are the same frequency as the input, CLK2X is 2x (obviously), CLKDV is input frequency/CLKDV_DIVIDE and CLKFX is input frequency*CLKFX_MULTIPLY/CLKFX_DIVIDE.
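Those relationships are easy to put into a few lines of Python for sanity-checking a DCM configuration. This is only a model of the frequency arithmetic described above; it does not check which parameter values the DCM actually accepts, and the CLKFX settings in the example are arbitrary.

```python
def dcm_output_mhz(f_in_mhz: float, clkdv_divide: float,
                   clkfx_multiply: int, clkfx_divide: int) -> dict:
    """Output frequencies (MHz) of the DCM outputs discussed above."""
    return {
        "CLK0": f_in_mhz,                                   # same as the input
        "CLK2X": 2 * f_in_mhz,                              # doubled
        "CLKDV": f_in_mhz / clkdv_divide,                   # divided
        "CLKFX": f_in_mhz * clkfx_multiply / clkfx_divide,  # synthesized
    }


if __name__ == "__main__":
    # 100 MHz in, CLKDV_DIVIDE = 2 gives the 50 MHz clock mentioned above.
    for name, freq in dcm_output_mhz(100.0, 2.0, 4, 2).items():
        print(f"{name}: {freq:.1f} MHz")
```

Which of these frequencies the design actually runs at depends on which output is wired to the logic, which would explain settings appearing "irrelevant" when a different output is used.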

Quote from: fpgaminer on July 21, 2011, 07:49:32 PM

If it fully unrolls on the LX150 and gets close to 100MHz I will be very happy. That alone will yield higher MHash/s/$ than the current Altera solution, and we can build up from there. Those DSP slices are sitting all lonely and unused

I'm not even touching the DSP slices this time around, at least not initially; they appear to do not-so-great things to the routing and placement...

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

The LOOP_LOG2>0 code isn't terribly efficient.

It's grotesquely inefficient compared to LOOP_LOG2=0.

For the vanilla code (DE2_115_Unoptimized), LOOP_LOG2=1 is almost as big as LOOP_LOG2=0 on Altera. It's terrifying.

Quote

Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.

Thank you for sharing. I'll muck with the code a bit and try it out on my LX150. I've got my fingers crossed ...

Quote

Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious.
Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.

The DCM confuses the heck out of me. I really should read its datasheet and get my head straight. Anyway, the code I use is on the public repo, so you can take a look at the DCM that coregen made for my project: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/LX150_Test/hdl/main_pll.v
It uses the CLKDV_DIVIDE parameter to take an input 100MHz clock and spit out a 50MHz clock.

Quote

Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns).

If it fully unrolls on the LX150 and gets close to 100MHz I will be very happy. That alone will yield higher MHash/s/$ than the current Altera solution, and we can build up from there. Those DSP slices are sitting all lonely and unused

makomk

hero member

Activity: 686

Merit: 564

Quote from: fpgaminer on July 21, 2011, 04:56:25 AM

I tried compiling a version of my code, LOOP_LOG2=1, and two extra pipeline registers with Register Balancing enabled. It couldn't even get past Mapping, with an area constraint error.

The LOOP_LOG2>0 code isn't terribly efficient. There are some changes to make it a bit better in the partial-unroll-opt branch on my github repo; I'll see if I can get them merged in to DE2_115_makomk_mod and send you a pull request at some point.

Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.

Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious.
Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0]. Untested 50Mhz code dump here

fpgaminer

hero member

Activity: 560

Merit: 517

Quote from: TheSeven on July 21, 2011, 03:50:26 AM

Just to let you know, I just managed to route a fully-unrolled doubly-pipelined miner at 50MHz on the LX150 overnight, so it's definitely doable. As I don't have access to an LX150 I haven't verified its correctness yet, but even if it has some bugs, I don't think fixing them would affect timing much. I might try to improve it during the weekend.

Is the code available anywhere? I'd be happy to compile and run it on my board.

I'm guessing it's a modified version of your VHDL port?

I tried compiling a version of my code, LOOP_LOG2=1, and two extra pipeline registers with Register Balancing enabled. It couldn't even get past Mapping, with an area constraint error.

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: fpgaminer on July 19, 2011, 07:48:32 PM

Just a quick progress report. I dug back into my Spartan-6 LX150, updated its code to the version that can roll up, and used a CONFIG_LOOP_LOG2 setting of 5; I just wanted to get something compiled and working.

And after waiting long enough, it did indeed churn out a valid result! So ... progress! I am going to clean up the project and get it into the public repo. From there I will crank down the LOOP_LOG2 setting as low as it will go, and begin adding extra pipelining to see if it will push further.

Just to let you know, I just managed to route a fully-unrolled doubly-pipelined miner at 50MHz on the LX150 overnight, so it's definitely doable. As I don't have access to an LX150 I haven't verified its correctness yet, but even if it has some bugs, I don't think fixing them would affect timing much. I might try to improve it during the weekend.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

Correct, I used DE2_115_makomk_mod, changed the FPGA type and the clock pin, CONFIG_LOOP_LOG2 was 0.

Did you set the correct voltage for the clock pin?
Is the clock 50MHz? If it is not 50MHz, you will have to adjust the sdc file, so Quartus knows the correct speed of the clock.

What dev board is it, by the way?

Quote

Because the virtual_wires are the same I thought that I can still use the tcl scripts from your "original" design !?!?!? Is this not a good idea?

That is correct, the tcl mining script should work fine, as long as you're using the latest version: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/scripts/mine

Quote

Do you also publish a version where CONFIG_LOOP_LOG2 can be different from 0?

The unoptimized version works with any setting of CONFIG_LOOP_LOG2. makomk's personal repository also has a version of his code that works with any setting of CONFIG_LOOP_LOG2, but as I pointed out there isn't much point to using it if CONFIG_LOOP_LOG2 isn't 0 because it basically reverts to the unoptimized version.

mlut

newbie

Activity: 6

Merit: 0

Quote from: fpgaminer on July 19, 2011, 06:25:36 PM

Quote from: mlut on July 19, 2011, 02:28:40 AM

Has anyone got the current version on github (fpgaminer + makomk modifications) running successfully in an FPGA?
It compiles without problems for an EPS4GX230 device but gives me only Stales, no Shares.
The "original" fpgaminer version runs without problems ...

Are you using projects/DE2_115_makomk_mod/ with CONFIG_LOOP_LOG2 set to something other than 0? It must be set to 0 for the version on my public repo. The version on makomk's repo will work with other settings, but it basically just reverts to the unoptimized version if you do that so there's little point in using it unless CONFIG_LOOP_LOG2 is 0.

You also mentioned in your PM that you're using an "old" mining tcl script. How old? It was updated a few weeks ago to reflect changes in the code.

Correct, I used DE2_115_makomk_mod, changed the FPGA type and the clock pin, CONFIG_LOOP_LOG2 was 0.
Do you also publish a version where CONFIG_LOOP_LOG2 can be different from 0?
Because the virtual_wires are the same I thought that I can still use the tcl scripts from your "original" design !?!?!? Is this not a good idea?
