
Topic: FPGA development board "Icarus" - DisContinued/ important announcement - page 18. (Read 207288 times)

sr. member
Activity: 407
Merit: 250
Just look at https://bitcointalksearch.org/topic/m.792660 again.
If you've never done hardware design before, this might be a bit confusing, but everything that's written in that HDL file will happen in parallel, not sequentially as it would be in most programming languages.
In the loop, the previous state and data are copied to state_buf/data_buf. In parallel, the (old) contents of state_buf/data_buf are used to calculate the next state/data value.
Because of this, it will take two clock cycles for the values from S[i-1].state to propagate (through state_buf) to state. The generate loop basically just duplicates that code 64 times, but has no effect on "execution order", if there even is such a thing in HDL.

Thanks! I've done hardware design before, but only very simple circuits in HDL. I've always been an old-school schematic/block-diagram guy; I did some HDL way back, but only simple stuff, and haven't touched it since. So I'm getting back into it now. As I said, I've been writing my own bitcoin mining core, which can hopefully be synthesized for multiple boards and wrapped in whatever PC comms layer we want. But it's slow going lol...

Thanks for pointing that out, I had missed that double-stage assignment. That's what I was looking for and just not seeing. (I do know how blocking versus non-blocking assignments work, though lol)

Thanks for taking the time to answer that. Smiley

(Interesting that so far my design doesn't have this extra stage in it. My SHA core is done, and I'm just building the testbenches for it now to validate it. My SHA core (I believe) runs in 64 clock cycles (probably 65-66 due to initial loading logic, I'll have to doublecheck). This is purely un-optimized right now; for now I'm just getting a working SHA core, then building a bitcoin core, and finally I'll go back and tune/optimize. Right now I'm at about 50% utilization on an LX75, but the delay on my critical path is high, 11ns, so I'm limited to just under 100MHz, getting one SHA hash per clock. So if I can get that to 100MHz initially, and can cram 4 of these cores into an LX150, I can get 2 bitcoin hashes per clock at 100MHz. We'll see how it validates on the testbenches though, and whether I'm able to optimize it (and how well). I'm hoping to opensource this, but want to get it to a working state first (a little embarrassed to release it in its current state lol). Then hopefully the community can optimize further. I'll probably just release it and put up a donation address or something.)

I know the LX75 is over-constraining and tends to screw up routing so I'm targeting it first as a "stress test", then once I get the design working on that I'll move it to the LX150 and see how it goes.

Also, my design is a "clean room" implementation of SHA256. I have gotten "tips" from a few people on the forums here for optimization methods, and I have looked over the ZTex code of course, but frankly I found it hard to read in places. So I figured re-implementing it would be a good learning experience to get my Verilog skills polished up anyway. I wrote it directly from the SHA2 spec without having the ZTex code open. (It's as cleanroom as you can get these days lol.)

Right now I'm running into issues with the Xilinx simulator. It's being bitchy about simulating my code (even though it synthesizes fine), which is why I haven't completed a testbench sim of it yet. I'm also getting a lot of warnings about unconnected nets in synth, but that's because the top-level module (the bitcoin hashing core) isn't done yet, just the lower-level SHA core.
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
Not sure if you meant this, but a Hardware Error is simple - it's when the value returned was no good
(and cgminer keeps track of that also)
So what I meant was: each time a bad share is returned, you simply step the clock down by the smallest possible amount.
As I've mentioned before - I've had zero HW errors since I started mining with my 2 Icarus.
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
The ZTEX bitstream allows you to adjust the MHz value via direct USB calls.

Would that be possible with the Icarus (with a new bitstream) also?

My thoughts on getting cgminer to handle the ZTEX (which I failed to get done) were somewhat similar to a part of ZTEX's Java code.
Basically, in the initial run, step up the MHz 1 or 2 at a time until hardware errors appear, and then forever after, step it down 1 every time it gets a hardware error - with some RPC API option to force it back to the initial "stepping up" status, or even to set the value from the RPC API (to step down from).

With the current interface, even if it supported that, the error rate resolution would be way too low. IMHO this is something that should be handled by the µC on the ztex board, offloading the work from the miner. The Icarus doesn't have a µC, so you have no other option than a software implementation, and that might not work too well with the rather slow serial port...
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
The ZTEX bitstream allows you to adjust the MHz value via direct USB calls.

Would that be possible with the Icarus (with a new bitstream) also?

My thoughts on getting cgminer to handle the ZTEX (which I failed to get done) were somewhat similar to a part of ZTEX's Java code.
Basically, in the initial run, step up the MHz 1 or 2 at a time until hardware errors appear, and then forever after, step it down 1 every time it gets a hardware error - with some RPC API option to force it back to the initial "stepping up" status, or even to set the value from the RPC API (to step down from).
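The stepping policy above can be sketched roughly in Python. To be clear, the class and method names here are purely illustrative, not cgminer's actual code or API:

```python
# Illustrative sketch of the tuning policy described above.
# All names are hypothetical -- this is NOT cgminer's actual code.

class ClockTuner:
    def __init__(self, start_mhz, step_up=2, max_mhz=220):
        self.mhz = start_mhz
        self.step_up = step_up        # climb 1 or 2 MHz at a time
        self.max_mhz = max_mhz        # assumed safety ceiling
        self.stepping_up = True       # initial calibration phase

    def on_share(self, hw_error):
        """Feed in each returned share; returns the clock to use next."""
        if hw_error:
            # Any hardware error: leave calibration and drop one step.
            self.stepping_up = False
            self.mhz -= 1
        elif self.stepping_up and self.mhz < self.max_mhz:
            self.mhz = min(self.mhz + self.step_up, self.max_mhz)
        return self.mhz

    def recalibrate(self):
        # The RPC API option mentioned above: force stepping-up again.
        self.stepping_up = True
```

Since hardware errors after calibration keep nudging the clock down one step at a time, the board would settle just under its own error threshold.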
sr. member
Activity: 273
Merit: 250
Today I've uploaded the 200MHz bitstream to 6 boards. 4 boards got a high percentage of invalid shares, and 2 boards had a normal invalid rate (one board got 0% and another 0.1%). I have the 6 boards divided between two towers, each with 3 boards. I found out that the 2 boards that got 0% and 0.1% invalids are the top ones! (room temp: ~22C)

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower, and make sure there is enough space between the boards!

I will re-test everything soon with side fans!

Roughly how much is "a high percentage"?

Around 7%, 10%, 12%, and 3% in just a few minutes! I then reset those 4 boards to the 190MHz bitstream and kept the top ones @200MHz. The top ones have been mining for more than 12 hours with the same invalid rate, almost 0%!

The first one and last one are running @200MHz

Invalid shares│Current│Average
 (K not zero) │MHash/s│MHash/s
      0 (0.0%)│ 400.03│ 393.83
      3 (0.1%)│ 379.83│ 365.15
      2 (0.1%)│ 379.88│ 380.07
      6 (0.2%)│ 380.18│ 374.18
      1 (0.0%)│ 379.79│ 376.41
      3 (0.1%)│ 399.92│ 399.31
sr. member
Activity: 273
Merit: 250
Today I've uploaded the 200MHz bitstream to 6 boards. 4 boards got a high percentage of invalid shares, and 2 boards had a normal invalid rate (one board got 0% and another 0.1%). I have the 6 boards divided between two towers, each with 3 boards. I found out that the 2 boards that got 0% and 0.1% invalids are the top ones! (room temp: ~22C)

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower, and make sure there is enough space between the boards!

I will re-test everything soon with side fans!

Roughly how much is "a high percentage"?

Around 7%, 10%, 12%, and 3% in just a few minutes! I then reset those 4 boards to the 190MHz bitstream and kept the top ones @200MHz. The top ones have been mining for more than 12 hours with the same invalid rate, almost 0%!
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Today I've uploaded the 200MHz bitstream to 6 boards. 4 boards got a high percentage of invalid shares, and 2 boards had a normal invalid rate (one board got 0% and another 0.1%). I have the 6 boards divided between two towers, each with 3 boards. I found out that the 2 boards that got 0% and 0.1% invalids are the top ones! (room temp: ~22C)

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower, and make sure there is enough space between the boards!

I will re-test everything soon with side fans!

Roughly how much is "a high percentage"?
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Yeah it could just be a terminology thing Wink lol

So what you're saying is that you see 2 clock cycles within the generate loop? I had misinterpreted that, then; I thought each iteration of the loop was only one clock cycle.

Can you elaborate on which point in the loop is which clock cycle? (looking at it again I'm still only seeing one clock cycle lol, so either I'm badly mis-reading, or I'm just missing something).

Thanks!

Just look at https://bitcointalksearch.org/topic/m.792660 again.
If you've never done hardware design before, this might be a bit confusing, but everything that's written in that HDL file will happen in parallel, not sequentially as it would be in most programming languages.
In the loop, the previous state and data are copied to state_buf/data_buf. In parallel, the (old) contents of state_buf/data_buf are used to calculate the next state/data value.
Because of this, it will take two clock cycles for the values from S[i-1].state to propagate (through state_buf) to state. The generate loop basically just duplicates that code 64 times, but has no effect on "execution order", if there even is such a thing in HDL.
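For anyone following along without an HDL background, the two-cycle propagation can be modeled in a few lines of Python (registers as plain variables, one function call per clock edge - obviously not real HDL semantics, just the update order):

```python
# Python model (not HDL!) of the two-register path: on each clock
# edge, every register samples the PRE-edge value of its source --
# the same behavior as Verilog's non-blocking (<=) assignment.

def clock_edge(source, state_buf, state):
    # Both registers update "in parallel" from the old values.
    return source, state_buf      # (new state_buf, new state)

source = "S[i-1].state value"
state_buf = state = None

state_buf, state = clock_edge(source, state_buf, state)   # edge 1
# Here `state` is still None: the value has only reached state_buf.
state_buf, state = clock_edge(source, state_buf, state)   # edge 2
# Now state == source: two cycles to propagate through state_buf.
```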
sr. member
Activity: 273
Merit: 250
Today I've uploaded the 200MHz bitstream to 6 boards. 4 boards got a high percentage of invalid shares, and 2 boards had a normal invalid rate (one board got 0% and another 0.1%). I have the 6 boards divided between two towers, each with 3 boards. I found out that the 2 boards that got 0% and 0.1% invalids are the top ones! (room temp: ~22C)

If anyone is planning to upgrade to the 200 MHz bitstream without side fans, it is better not to have the boards on top of each other in a tower, and make sure there is enough space between the boards!

I will re-test everything soon with side fans!
sr. member
Activity: 407
Merit: 250
Yeah it could just be a terminology thing Wink lol

So what you're saying is that you see 2 clock cycles within the generate loop? I had misinterpreted that, then; I thought each iteration of the loop was only one clock cycle.

Can you elaborate on which point in the loop is which clock cycle? (looking at it again I'm still only seeing one clock cycle lol, so either I'm badly mis-reading, or I'm just missing something).

Thanks!
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
In Verilog, when you put a for loop in a generate block, it will be synthesized out into multiple blocks of logic (think of it as a fast way to instantiate chunks of logic multiple times over).

So when he's copying data from registers in S[i-1] to the current registers, you're right, he's moving it from the previous pipeline stage to the current pipeline stage. But that for loop instantiates STAGES copies of the stage logic (so 64 by default). That's the full 64-stage SHA pipeline. Each individual block within a stage doesn't seem to be split further.

At least that's what I got out of his method by reading the code, and it's how I've built mine Wink

That entirely depends on your definition of the word "stage", which seems to match my definition of the word "round" in this case.
So in your terminology, each stage has a latency of two clock cycles.
In my terminology, a pipeline stage has a latency of one clock cycle by definition, so one "iteration" of that generate loop actually produces two (chained) pipeline stages. Then, afterwards, 64 of those stage pairs are chained together to form a single sha256 core, and two of those cores are chained together to form a full bitcoin hasher core. Then there's usually some more control logic around it which adds a couple of cycles of latency, so we'll end up somewhere around 260 clock cycles of total latency.
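A quick back-of-the-envelope check of that figure (the exact control overhead is an assumed value here, just to land in the quoted ballpark):

```python
# Back-of-the-envelope check of the ~260-cycle latency quoted above.
rounds_per_sha   = 64   # SHA-256 rounds
stages_per_round = 2    # each generate iteration = two pipeline stages
sha_cores        = 2    # bitcoin hashes sha256(sha256(header))
control_cycles   = 4    # assumption: "a couple of cycles" of control logic

total_latency = sha_cores * rounds_per_sha * stages_per_round + control_cycles
print(total_latency)    # 260
```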
sr. member
Activity: 273
Merit: 250
Even if you could do it without horribly bad things happening Wink I doubt you would get any benefit from it.

Reason being, the limiting factor right now isn't the clock speed of the chips but the delay caused by the critical path. If the critical path takes 5ns, your max attainable clock speed will be 200MHz; if it takes 10ns, it's 100MHz, and so on... The problem is that by doubling up on rising/falling edges, you're doing work twice per clock. So if the path takes 10ns, you need 20ns of clearance because of the "double work", meaning the same 5ns critical path delay that allowed a 200MHz cap before will now cause a 100MHz cap (or lower, due to other inefficiencies). So you will AT BEST break even on performance, but more likely make it worse.

Thank you for your detailed answer!
sr. member
Activity: 448
Merit: 250
I would be grateful if someone with good FPGA programming experience answers this question:

Is it possible to make use of both clock edges to improve the mining speed?

For example: replacing always@(posedge CLK) by always@(posedge CLK or negedge CLK)

Zhang told me that this would lead to a disaster! I am still wondering if it's possible to use a double-edged clock design at a lower MHz (100->133)!

The top frequency for this fully unrolled and cascaded double SHA-256 is determined by one of the following two constraints:
- Longlines from one stage to the next (as EldenTyrell has pointed out, longlines are used in the middle of both 64 stage SHA blocks, because the FPGA is simply not wide enough)
- The 32 bit wide ripple carry adders; I think there are 6 or 7 of them per stage, two of them in series.
  Thus, if, say, one of these adders took 2.3 ns (guessing), then two of them in series would take 4.6 ns and that's that.

(That said: I don't know which of these two constraints is the costliest/slowest/longest one, and I'd be glad if someone could point that out and specify both of them in nanoseconds.)

So, just because you clock the flip-flops on both clock edges and thus halve the clock frequency, your mining is not going to get any faster.
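To put rough numbers on the adder constraint above (using the poster's guessed 2.3 ns figure, not a measured value):

```python
# Rough numbers for the adder constraint: two 32-bit ripple-carry
# adders in series bound the stage's critical path. The 2.3 ns per
# adder is the guessed figure from the post, NOT a measured value.
adder_delay_ns   = 2.3
series_adders    = 2                                  # two in series
critical_path_ns = series_adders * adder_delay_ns     # 4.6 ns
f_max_mhz        = 1000.0 / critical_path_ns          # ns period -> MHz
print(f"{f_max_mhz:.1f} MHz")   # the cap from this path alone
```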
sr. member
Activity: 407
Merit: 250
Even if you could do it without horribly bad things happening Wink I doubt you would get any benefit from it.

Reason being, the limiting factor right now isn't the clock speed of the chips but the delay caused by the critical path. If the critical path takes 5ns, your max attainable clock speed will be 200MHz; if it takes 10ns, it's 100MHz, and so on... The problem is that by doubling up on rising/falling edges, you're doing work twice per clock. So if the path takes 10ns, you need 20ns of clearance because of the "double work", meaning the same 5ns critical path delay that allowed a 200MHz cap before will now cause a 100MHz cap (or lower, due to other inefficiencies). So you will AT BEST break even on performance, but more likely make it worse.
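The timing argument above can be condensed into a couple of lines of Python (a sketch of the arithmetic, not a real timing analysis):

```python
# Double-edge clocking gives each path only half a clock period, so
# the frequency cap set by a given critical path delay also halves.
def f_max_mhz(critical_path_ns, both_edges=False):
    # With both edges active, the path needs twice the clearance
    # per full clock period ("double work").
    budget_ns = critical_path_ns * (2 if both_edges else 1)
    return 1000.0 / budget_ns

print(f_max_mhz(5.0))          # 200.0 MHz, single-edge
print(f_max_mhz(5.0, True))    # 100.0 MHz, double-edge: break even at best
```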
sr. member
Activity: 273
Merit: 250
I would be grateful if someone with good FPGA programming experience answers this question:

Is it possible to make use of both clock edges to improve the mining speed?

For example: replacing always@(posedge CLK) by always@(posedge CLK or negedge CLK)

Zhang told me that this would lead to a disaster! I am still wondering if it's possible to use a double-edged clock design at a lower MHz (100->133)!
sr. member
Activity: 407
Merit: 250
In Verilog, when you put a for loop in a generate block, it will be synthesized out into multiple blocks of logic (think of it as a fast way to instantiate chunks of logic multiple times over).

So when he's copying data from registers in S[i-1] to the current registers, you're right, he's moving it from the previous pipeline stage to the current pipeline stage. But that for loop instantiates STAGES copies of the stage logic (so 64 by default). That's the full 64-stage SHA pipeline. Each individual block within a stage doesn't seem to be split further.

At least that's what I got out of his method by reading the code, and it's how I've built mine Wink
sr. member
Activity: 407
Merit: 250
Ooh! *drool*

I think I see several in that box with my name on it! lol
hero member
Activity: 504
Merit: 500
FPGA Mining LLC
Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Really? Since I'm trying to get my head around the code anyway, can you elaborate on this? I'm not seeing it in the code I'm looking at for sha256_pipes2.v

I see the main sha256_pipe2_base module, which seems to generate the 64 SHA stages,

Then I see pipe130 (which instantiates sha256_pipe2_base with 64 stages and does a single pass)

Then I see pipe123 (which instantiates sha256_pipe2_base with 61 stages and only seems to output a single 32bit word of hash)

Then I see pipe129 (which instantiates sha256_pipe2_base with 64 stages and does a single pass and outputs a full 256bit hash)

the top module seems to instantiate sha256_pipe130 and sha256_pipe123 (as p1 and p2)

I don't see anywhere where the sha cores are split? (but as I said before, my verilog is pretty rusty, and since I'm trying to brush up and write my own sha core, if you can help me out with what I'm misinterpreting I'd appreciate it) Wink

Thanks!

I've never really known any Verilog (I like VHDL much better), but it looks like the sha256_pipe2_base module consists of two pipeline stages:

Code:
for (i = 0; i <= STAGES; i = i + 1) begin : S

    reg [511:0] data;
    reg [223:0] state;
    reg [31:0] t1_p1;
That's the first set of pipeline registers
Code:
    if(i == 0)
    begin
        [...]
    end else
    begin

        reg [511:0] data_buf;
        reg [223:0] state_buf;
        reg [31:0] data15_p1, data15_p2, data15_p3, t1;
That's the second set of pipeline registers
Code:
        always @ (posedge clk)
        begin
            data_buf <= S[i-1].data;
Just copy the input data in the first stage
Code:
            data[479:0] <= data_buf[511:32];
            data15_p1 <= `S1( S[i-1].data[`IDX(15)] );   // 3
            data15_p2 <= data15_p1;                      // 1
            data15_p3 <= ( ( i == 1 ) ? `S1( S[i-1].data[`IDX(14)] ) : S[i-1].data15_p2 ) + S[i-1].data[`IDX(9)] + S[i-1].data[`IDX(0)]; // 3
            data[`IDX(15)] <= `S0( data_buf[`IDX(1)] ) + data15_p3;   // 4
Do the actual calculations in the second stage
Code:
            state_buf <= S[i-1].state;   // 2
Just copy the input state in the first stage
Code:
            t1 <= `CH( S[i-1].state[`IDX(4)], S[i-1].state[`IDX(5)], S[i-1].state[`IDX(6)] ) + `E1( S[i-1].state[`IDX(4)] ) + S[i-1].t1_p1;   // 6

            state[`IDX(0)] <= `MAJ( state_buf[`IDX(0)], state_buf[`IDX(1)], state_buf[`IDX(2)] ) + `E0( state_buf[`IDX(0)] ) + t1;   // 7
            state[`IDX(1)] <= state_buf[`IDX(0)];        // 1
            state[`IDX(2)] <= state_buf[`IDX(1)];        // 1
            state[`IDX(3)] <= state_buf[`IDX(2)];        // 1
            state[`IDX(4)] <= state_buf[`IDX(3)] + t1;   // 2
            state[`IDX(5)] <= state_buf[`IDX(4)];        // 1
            state[`IDX(6)] <= state_buf[`IDX(5)];        // 1
Do the actual calculations in the second stage
Code:

            t1_p1 <= state_buf[`IDX(6)] + data_buf[`IDX(1)] + Ks[`IDX((127-i) & 63)];   // 2
        end

    end
end

The synthesis software will then do some register balancing and move part of the logic from the second to the first stage in order to equalize delays between those two stages and thus achieve a higher clock rate because the individual stages' critical path delay is reduced.
legendary
Activity: 1022
Merit: 1000
BitMinter
hero member
Activity: 592
Merit: 501
We will stand and fight.
 Grin

Hi, I'm sorry about disappearing and not answering many emails for a few days.
I got a box of boards yesterday; I'm busy testing them.



I must finish some bulk orders before 3/12.

so please have a nice day, my friends. Grin