
Topic: FPGA development board "Icarus" - DisContinued/ important announcement - page 18. (Read 207302 times)

sr. member
Activity: 407
Merit: 250
Just look at https://bitcointalksearch.org/topic/m.792660 again.
If you've never done hardware design before, this might be a bit confusing, but everything that's written in that HDL file will happen in parallel, not sequentially as it would be in most programming languages.
In the loop, the previous state and data are copied to state_buf/data_buf. In parallel, the (old) contents of state_buf/data_buf are used to calculate the next state/data value.
Because of this, it will take two clock cycles for the values from S[i-1].state to propagate (through state_buf) to state. The generate loop basically just duplicates that code 64 times, but has no effect on "execution order", if there even is such a thing in HDL.

Thanks! I've done hardware design before, but only very simple circuits in HDL. I've always been an old-school schematic/block diagram guy; I did some HDL way back, but only simple stuff, and haven't touched it since. So I'm getting back into it now. As I said, I've been writing my own bitcoin mining core, which can hopefully be synthesized for multiple boards and wrapped in whatever PC comms layer we want. But it's slow going lol...

Thanks for pointing that out, I had missed that double stage assignment. That's what I was looking for and just not seeing it. (I do know how blocking versus non blocking assignments work though lol)

Thanks for taking the time to answer that. Smiley

Interesting that so far my design doesn't have this extra stage in it. My SHA core is done, and I'm just building the testbenches for it now to validate it. But my SHA core (I believe) runs in 64 clock cycles (probably 65-66 due to initial loading logic, I'll have to double-check). This is purely unoptimized right now; for now I'm just getting a working SHA core and then building a bitcoin core, and finally I'll go back and tune/optimize. Right now I'm at about 50% utilization on an LX75, but the delay on my critical path is high, 11ns, so I'm limited to just under 100MHz, and I'm getting one SHA hash per clock. So if I can get that to 100MHz initially, and can cram 4 of these cores into an LX150, I can get 2 bitcoin hashes per clock, at 100MHz. We'll see how it validates on the testbenches though, and if I'm able to optimize it (and how well). I'm hoping to open-source this, but want to get it to a working state first (a little embarrassed to release it in its current state lol). Then hopefully the community can optimize further. I'll probably just release it and put up a donation address or something.

I know the LX75 is over-constraining and tends to screw up routing so I'm targeting it first as a "stress test", then once I get the design working on that I'll move it to the LX150 and see how it goes.
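
A quick sanity check of those numbers, as a rough sketch only. It assumes each fully pipelined SHA core delivers one SHA-256 result per clock (as stated above) and that one bitcoin hash needs two SHA-256 passes; the function names are made up for illustration.

```python
def max_clock_mhz(critical_path_ns: float) -> float:
    """Highest clock (MHz) whose period still covers the critical path."""
    return 1000.0 / critical_path_ns


def bitcoin_mhash_per_s(sha_cores: int, clock_mhz: float) -> float:
    """Two chained SHA-256 cores make one bitcoin hasher, so count core pairs."""
    return (sha_cores // 2) * clock_mhz


fmax = max_clock_mhz(11.0)            # 11 ns critical path from the post
rate = bitcoin_mhash_per_s(4, 100.0)  # 4 SHA cores at the hoped-for 100 MHz
print(f"fmax = {fmax:.1f} MHz, rate = {rate:.0f} MH/s")
```

So an 11ns critical path lands at roughly 91MHz, and four cores at 100MHz would come to about 200 MH/s if everything validates.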

Also my design is a "clean room" implementation of SHA256. I have gotten "tips" from a few people on the forums here for optimization methods, though. And I have looked over the ZTex code of course, but frankly I found it hard to read in places. So I figured re-implementing it would be a good learning experience to get my Verilog skills polished up anyway. I wrote it directly from the SHA2 spec without having the ZTex code open (it's as cleanroom as you can get these days lol).

Right now I'm running into issues with the Xilinx simulator. It's being bitchy about simulating my code (even though it synthesizes fine), which is why I haven't completed a testbench sim of it yet. Also getting a lot of warnings about unconnected nets in synth, but that's because the top level module (bitcoin hashing core) isn't done yet. Just the lower level SHA core.
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
Not sure if you meant this, but a Hardware Error is simple - it's when the value returned was no good
(and cgminer keeps track of that also)
So what I meant was each time a bad share was returned you simply step the clock down as small as possible.
As I've mentioned before - I've had zero HW errors since I started mining with my 2 Icarus.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
The ZTEX bitstream allows you to adjust the MHz value via direct USB calls.

Would that be possible with the Icarus (with a new bitstream) also?

My thoughts on getting cgminer to handle the ZTEX (which I failed to get done) were somewhat similar to a part of ZTEX's Java code.
Basically in the initial run to step up the MHz 1 or 2 at a time until hardware errors, and then forever after, step it down 1 every time it got a hardware error - with some RPC API option to force it back to the initial "stepping up" status or even to set the value from the RPC API (to step down from)

With the current interface, even if it would support that, the error rate resolution would be way too low. IMHO this is something that should be handled by the µC on the ztex board, offloading the work from the miner. The Icarus doesn't have a µC, so you have no other option than a software implementation, and that might not work too well with the rather slow serial port...
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
The ZTEX bitstream allows you to adjust the MHz value via direct USB calls.

Would that be possible with the Icarus (with a new bitstream) also?

My thoughts on getting cgminer to handle the ZTEX (which I failed to get done) were somewhat similar to a part of ZTEX's Java code.
Basically in the initial run to step up the MHz 1 or 2 at a time until hardware errors, and then forever after, step it down 1 every time it got a hardware error - with some RPC API option to force it back to the initial "stepping up" status or even to set the value from the RPC API (to step down from)
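
The stepping idea above could be sketched roughly like this. This is not cgminer or ZTEX code: `ClockTuner`, `on_result` and `restart_ramp` are hypothetical names, and the start and maximum frequencies are made-up example values.

```python
class ClockTuner:
    """Hypothetical controller: ramp up until the first hardware error, then only back off."""

    def __init__(self, start_mhz: int, max_mhz: int, step_up: int = 2, step_down: int = 1):
        self.mhz = start_mhz
        self.max_mhz = max_mhz
        self.step_up = step_up
        self.step_down = step_down
        self.ramping = True  # the initial "stepping up" phase

    def on_result(self, hardware_error: bool) -> int:
        """Feed one returned share; returns the clock (MHz) to use next."""
        if hardware_error:
            self.ramping = False  # the first error ends the ramp for good
            self.mhz -= self.step_down
        elif self.ramping:
            self.mhz = min(self.mhz + self.step_up, self.max_mhz)
        return self.mhz

    def restart_ramp(self, from_mhz=None) -> None:
        """Stand-in for the RPC option: resume ramping, optionally from a given value."""
        if from_mhz is not None:
            self.mhz = from_mhz
        self.ramping = True


tuner = ClockTuner(start_mhz=180, max_mhz=220)
history = [tuner.on_result(err) for err in (False, False, True, False, True)]
print(history)  # [182, 184, 183, 183, 182]
```

After the first hardware error the clock never rises again unless `restart_ramp` is called, which matches the "forever after, step it down" behaviour described above.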
sr. member
Activity: 273
Merit: 250
Today I've uploaded the 200MHz bitstream on 6 boards. 4 boards got high % of invalid shares! and 2 boards had normal invalid rate "1 board got 0% and another 0.1%". I have the 6 boards divided on two towers, each has 3 boards. I found out that the 2 boards that got 0 and 0.1 invalids are the top ones! "room temp: ~22C".

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower! and make sure there is enough space between each board!

I will re-test everything soon with side fans!

Roughly how much is "a high percentage"?

around 7%, 10%, 12%, and 3% in just a few minutes! I then reset those 4 boards to the 190MHz bitstream, and kept the top ones @200MHz. The top ones have been mining for more than 12 hours with the same invalid rate, almost 0%!

The first one and last one are running @200MHz

Invalid shares│Current│Average
 (K not zero) │MHash/s│MHash/s
      0 (0.0%)│ 400.03│ 393.83
      3 (0.1%)│ 379.83│ 365.15
      2 (0.1%)│ 379.88│ 380.07
      6 (0.2%)│ 380.18│ 374.18
      1 (0.0%)│ 379.79│ 376.41
      3 (0.1%)│ 399.92│ 399.31
sr. member
Activity: 273
Merit: 250
Today I've uploaded the 200MHz bitstream on 6 boards. 4 boards got high % of invalid shares! and 2 boards had normal invalid rate "1 board got 0% and another 0.1%". I have the 6 boards divided on two towers, each has 3 boards. I found out that the 2 boards that got 0 and 0.1 invalids are the top ones! "room temp: ~22C".

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower! and make sure there is enough space between each board!

I will re-test everything soon with side fans!

Roughly how much is "a high percentage"?

around 7%, 10%, 12%, and 3% in just a few minutes! I then reset those 4 boards to the 190MHz bitstream, and kept the top ones @200MHz. The top ones have been mining for more than 12 hours with the same invalid rate, almost 0%!
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Today I've uploaded the 200MHz bitstream on 6 boards. 4 boards got high % of invalid shares! and 2 boards had normal invalid rate "1 board got 0% and another 0.1%". I have the 6 boards divided on two towers, each has 3 boards. I found out that the 2 boards that got 0 and 0.1 invalids are the top ones! "room temp: ~22C".

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower! and make sure there is enough space between each board!

I will re-test everything soon with side fans!

Roughly how much is "a high percentage"?
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Yeah it could just be a terminology thing Wink lol

So what you're saying is that you see 2 clock cycles within the generate loop? I had misinterpreted that then; I thought each iteration of the loop was only one clock cycle.

Can you elaborate on which point in the loop is which clock cycle? (looking at it again I'm still only seeing one clock cycle lol, so either I'm badly mis-reading, or I'm just missing something).

Thanks!

Just look at https://bitcointalksearch.org/topic/m.792660 again.
If you've never done hardware design before, this might be a bit confusing, but everything that's written in that HDL file will happen in parallel, not sequentially as it would be in most programming languages.
In the loop, the previous state and data are copied to state_buf/data_buf. In parallel, the (old) contents of state_buf/data_buf are used to calculate the next state/data value.
Because of this, it will take two clock cycles for the values from S[i-1].state to propagate (through state_buf) to state. The generate loop basically just duplicates that code 64 times, but has no effect on "execution order", if there even is such a thing in HDL.
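
To make the two-cycle propagation concrete, here is a tiny software model of one such round. It is only a sketch: the `+ 1` is a placeholder for the real round logic, and the dictionary stands in for the two registers of a single generate iteration.

```python
def clock_edge(regs: dict, stage_input: int) -> dict:
    """One posedge: every right-hand side reads the OLD register values."""
    return {
        "state_buf": stage_input,        # state_buf <= S[i-1].state
        "state": regs["state_buf"] + 1,  # state <= f(state_buf), f is a placeholder
    }


regs = {"state_buf": 0, "state": 0}
trace = []
for cycle, stage_input in enumerate([10, 20, 30, 40], start=1):
    regs = clock_edge(regs, stage_input)
    trace.append((cycle, regs["state_buf"], regs["state"]))
print(trace)  # [(1, 10, 1), (2, 20, 11), (3, 30, 21), (4, 40, 31)]
```

The input 10 is captured into state_buf at edge 1 and only shows up in state (as 11) after edge 2, even though both assignments sit in the same always block.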
sr. member
Activity: 273
Merit: 250
Today I've uploaded the 200MHz bitstream on 6 boards. 4 boards got high % of invalid shares! and 2 boards had normal invalid rate "1 board got 0% and another 0.1%". I have the 6 boards divided on two towers, each has 3 boards. I found out that the 2 boards that got 0 and 0.1 invalids are the top ones! "room temp: ~22C".

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower! and make sure there is enough space between each board!

I will re-test everything soon with side fans!
sr. member
Activity: 407
Merit: 250
Yeah it could just be a terminology thing Wink lol

So what you're saying is that you see 2 clock cycles within the generate loop? I had misinterpreted that then; I thought each iteration of the loop was only one clock cycle.

Can you elaborate on which point in the loop is which clock cycle? (looking at it again I'm still only seeing one clock cycle lol, so either I'm badly mis-reading, or I'm just missing something).

Thanks!
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
In verilog, the generate block when you put a for loop in it, will synthesize that out into multiple blocks of logic (think of it as a fast way to instantiate chunks of logic multiple times over).

So when he's copying data from registers in S[i-1] to the current registers you're right he's moving it from the previous pipeline stage to the current pipeline stage. But that for loop instantiates the number of stages in the pipe as STAGES. (so 64 by default). That's the full 64 stage sha pipeline. Each individual block within a stage doesn't seem to be split further.

At least that's what I got out of his method by reading the code, and it's how I've built mine Wink

That entirely depends on your definition of the word "stage", which seems to match my definition of the word "round" in this case.
So in your terminology, each stage has a latency of two clock cycles.
In my terminology, a pipeline stage has a latency of one clock cycle by definition, so one "iteration" of that generate loop actually produces two (chained) pipeline stages. Then, afterwards, 64 of those stage pairs are chained together to form a single sha256 core, and two of those cores are chained together to form a full bitcoin hasher core. Then there's usually some more control logic around it which adds a couple of cycles of latency, so we'll end up somewhere around 260 clock cycles total latency.
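
The latency estimate above works out as follows. The control overhead of 4 cycles is an assumed value standing in for "a couple of cycles"; the other numbers come straight from the description.

```python
STAGES_PER_ROUND = 2    # state_buf/data_buf registers, then state/data registers
ROUNDS_PER_SHA256 = 64  # one generate iteration per round
SHA256_CORES = 2        # two chained cores for one bitcoin hash
CONTROL_OVERHEAD = 4    # assumption: stands in for "a couple of cycles"

pipeline_cycles = STAGES_PER_ROUND * ROUNDS_PER_SHA256 * SHA256_CORES
total_latency = pipeline_cycles + CONTROL_OVERHEAD
print(pipeline_cycles, total_latency)  # 256 260
```

Latency does not reduce throughput here: once the pipeline is full, it still accepts one new input per clock.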
sr. member
Activity: 273
Merit: 250
Even if you could do it without horribly bad things happening Wink I doubt you would get any benefit from it.

Reason being the limiting factor right now isn't the clock speed of the chips, but the delay caused by the critical path. If the critical path takes 5ns, your max clock speed attainable will be 200MHz. If your critical path takes 10ns, it's 100MHz, and so on... The problem is that by doubling up on rising/falling edge, you're doing work twice per clock. So if it takes 10ns, you need 20ns of clearance because of the "double work", meaning that the same 5ns critical path delay that caused a 200MHz cap before will now cause a 100MHz cap (or lower due to other inefficiencies). So you will AT BEST break even with performance, but more likely make it worse.

Thank you for your detailed answer!
sr. member
Activity: 448
Merit: 250
I would be grateful if someone with good FPGA programming experience answers this question:

Is it possible to make use of both clock edges to improve the mining speed?

For example: replacing always@(posedge CLK) by always@(posedge CLK or negedge CLK)

Zhang has told me that this would lead to a disaster! I am still wondering if it's possible to use a double-edged clock design @ lower MHz "100->133"!

The top frequency for this fully unrolled and cascaded double SHA-256 is determined by one of the following two constraints:
- Longlines from one stage to the next (as EldenTyrell has pointed out, longlines are used in the middle of both 64 stage SHA blocks, because the FPGA is simply not wide enough)
- The 32 bit wide ripple carry adder, I think there are 6 or 7 of them per stage, two of them in series.
  Thus, if, say, one of these adders took 2.3 ns (guessing), then two of them in series would take 4.6 ns and that's that.

(That said: I don't know which of these two constraints is the costliest/slowest/longest one, and I'd be glad if someone could point that out and specify both of them in nanoseconds.)

So, just because you clock the flip-flops on both clock edges and thus halve the clock, your mining is not going to get any faster.
sr. member
Activity: 407
Merit: 250
Even if you could do it without horribly bad things happening Wink I doubt you would get any benefit from it.

Reason being the limiting factor right now isn't the clock speed of the chips, but the delay caused by the critical path. If the critical path takes 5ns, your max clock speed attainable will be 200MHz. If your critical path takes 10ns, it's 100MHz, and so on... The problem is that by doubling up on rising/falling edge, you're doing work twice per clock. So if it takes 10ns, you need 20ns of clearance because of the "double work", meaning that the same 5ns critical path delay that caused a 200MHz cap before will now cause a 100MHz cap (or lower due to other inefficiencies). So you will AT BEST break even with performance, but more likely make it worse.
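
The break-even argument can be put into numbers. This is an idealized sketch that ignores setup time, clock skew and duty-cycle error (all of which would make the dual-edge case worse), and the function names are made up for illustration.

```python
def single_edge(critical_path_ns: float):
    """Registers update once per period; the period must cover the critical path."""
    clock_mhz = 1000.0 / critical_path_ns
    return clock_mhz, clock_mhz  # (clock MHz, register updates per microsecond)


def dual_edge(critical_path_ns: float):
    """Registers update on both edges; each HALF period must cover the critical path."""
    clock_mhz = 1000.0 / (2.0 * critical_path_ns)
    return clock_mhz, 2.0 * clock_mhz


print(single_edge(5.0))  # (200.0, 200.0)
print(dual_edge(5.0))    # (100.0, 200.0)
```

Either way the logic gets 200 million updates per second out of a 5ns critical path, so the best case is a tie.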
sr. member
Activity: 273
Merit: 250
I would be grateful if someone with good FPGA programming experience answers this question:

Is it possible to make use of both clock edges to improve the mining speed?

For example: replacing always@(posedge CLK) by always@(posedge CLK or negedge CLK)

Zhang has told me that this would lead to a disaster! I am still wondering if it's possible to use a double-edged clock design @ lower MHz "100->133"!
sr. member
Activity: 407
Merit: 250
In verilog, the generate block when you put a for loop in it, will synthesize that out into multiple blocks of logic (think of it as a fast way to instantiate chunks of logic multiple times over).

So when he's copying data from registers in S[i-1] to the current registers you're right he's moving it from the previous pipeline stage to the current pipeline stage. But that for loop instantiates the number of stages in the pipe as STAGES. (so 64 by default). That's the full 64 stage sha pipeline. Each individual block within a stage doesn't seem to be split further.

At least that's what I got out of his method by reading the code, and it's how I've built mine Wink
sr. member
Activity: 407
Merit: 250
Ooh! *drool*

I think I see several in that box with my name on it! lol
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Really? Since I'm trying to get my head around the code anyway, can you elaborate on this? I'm not seeing it in the code I'm looking at for sha256_pipes2.v

I see the main sha256_pipe2_base module, which seems to generate the 64 SHA stages,

Then I see pipe130 (which instantiates sha256_pipe2_base with 64 stages and does a single pass)

Then I see pipe123 (which instantiates sha256_pipe2_base with 61 stages and only seems to output a single 32bit word of hash)

Then I see pipe129 (which instantiates sha256_pipe2_base with 64 stages and does a single pass and outputs a full 256bit hash)

the top module seems to instantiate sha256_pipe130 and sha256_pipe123 (as p1 and p2)

I don't see anywhere where the sha cores are split? (but as I said before, my verilog is pretty rusty, and since I'm trying to brush up and write my own sha core, if you can help me out with what I'm misinterpreting I'd appreciate it) Wink

Thanks!

I've never really known any verilog (I like VHDL much better), but this looks like the sha256_pipe2_base module consists of two pipeline stages:

Code:
for (i = 0; i <= STAGES; i = i + 1) begin : S

    reg [511:0] data;
    reg [223:0] state;
    reg [31:0] t1_p1;
That's the first set of pipeline registers
Code:
    if(i == 0)
    begin
        [...]
    end else
    begin

        reg [511:0] data_buf;
        reg [223:0] state_buf;
        reg [31:0] data15_p1, data15_p2, data15_p3, t1;
That's the second set of pipeline registers
Code:
        always @ (posedge clk)
        begin
            data_buf <= S[i-1].data;
Just copy the input data in the first stage
Code:
            data[479:0] <= data_buf[511:32];
            data15_p1 <= `S1( S[i-1].data[`IDX(15)] ); // 3
            data15_p2 <= data15_p1; // 1
            data15_p3 <= ( ( i == 1 ) ? `S1( S[i-1].data[`IDX(14)] ) : S[i-1].data15_p2 ) + S[i-1].data[`IDX(9)] + S[i-1].data[`IDX(0)]; // 3
            data[`IDX(15)] <= `S0( data_buf[`IDX(1)] ) + data15_p3; // 4
Do the actual calculations in the second stage
Code:
            state_buf <= S[i-1].state; // 2
Just copy the input data in the first stage
Code:
            t1 <= `CH( S[i-1].state[`IDX(4)], S[i-1].state[`IDX(5)], S[i-1].state[`IDX(6)] ) + `E1( S[i-1].state[`IDX(4)] ) + S[i-1].t1_p1; // 6

            state[`IDX(0)] <= `MAJ( state_buf[`IDX(0)], state_buf[`IDX(1)], state_buf[`IDX(2)] ) + `E0( state_buf[`IDX(0)] ) + t1; // 7
            state[`IDX(1)] <= state_buf[`IDX(0)]; // 1
            state[`IDX(2)] <= state_buf[`IDX(1)]; // 1
            state[`IDX(3)] <= state_buf[`IDX(2)]; // 1
            state[`IDX(4)] <= state_buf[`IDX(3)] + t1; // 2
            state[`IDX(5)] <= state_buf[`IDX(4)]; // 1
            state[`IDX(6)] <= state_buf[`IDX(5)]; // 1
Do the actual calculations in the second stage
Code:

            t1_p1 <= state_buf[`IDX(6)] + data_buf[`IDX(1)] + Ks[`IDX((127-i) & 63)]; // 2
        end

    end
end

The synthesis software will then do some register balancing and move part of the logic from the second to the first stage in order to equalize delays between those two stages and thus achieve a higher clock rate because the individual stages' critical path delay is reduced.
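
A back-of-the-envelope illustration of why balancing helps. The delays are made-up numbers, and perfectly even balancing is an idealization, since real tools can only move whole pieces of logic across a register.

```python
def clock_period_ns(stage_delays):
    """The slowest stage sets the clock period."""
    return max(stage_delays)


def ideally_balanced(stage_delays):
    """Best case for retiming: total delay spread evenly over the same number of stages."""
    even = sum(stage_delays) / len(stage_delays)
    return [even] * len(stage_delays)


before = [1.0, 7.0]               # made-up delays: a plain copy, then all the arithmetic
after = ideally_balanced(before)  # [4.0, 4.0]
print(1000.0 / clock_period_ns(before), 1000.0 / clock_period_ns(after))
```

The total delay through the round stays the same; only the slowest single stage gets shorter, and that is what limits the clock.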
hero member
Activity: 592
Merit: 501
We will stand and fight.
 Grin

hi, i'm sorry for disappearing and not answering many mails for a few days.
i got a box of boards yesterday. i'm busy testing them.



i must finish some bulk orders before 3/12.

so please have a nice day, my friends. Grin