FPGA development board "Icarus" - DisContinued/ important announcement - page 19.

Glasswalker

sr. member

Activity: 407

Merit: 250

Quote from: TheSeven on March 09, 2012, 04:40:39 AM

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Really? Since I'm trying to get my head around the code anyway, can you elaborate on this? I'm not seeing it in the code I'm looking at for sha256_pipes2.v

I see the main sha256_pipe2_base module, which seems to generate the 64 SHA stages,

Then I see pipe130 (which instantiates sha256_pipe2_base with 64 stages and does a single pass)

Then I see pipe123 (which instantiates sha256_pipe2_base with 61 stages and only seems to output a single 32bit word of hash)

Then I see pipe129 (which instantiates sha256_pipe2_base with 64 stages and does a single pass and outputs a full 256bit hash)

the top module seems to instantiate sha256_pipe130 and sha256_pipe123 (as p1 and p2)

I don't see anywhere where the sha cores are split? (but as I said before, my verilog is pretty rusty, and since I'm trying to brush up and write my own sha core, if you can help me out with what I'm misinterpreting I'd appreciate it) Wink

Thanks!

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: Glasswalker on March 08, 2012, 10:07:45 PM

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

allinvain

legendary

Activity: 3080

Merit: 1083

Quote from: TheSeven on March 09, 2012, 02:57:36 AM

Quote from: allinvain on March 08, 2012, 07:41:45 PM

Quote from: Energizer on March 08, 2012, 02:41:31 PM

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

I think there is a consensus amongst basically everyone but Energizer that it doesn't. Exactly 11.3 seconds is indeed the sweet spot, but the effective interval will always be a little bit longer than the one calculated at that line of code that Energizer pointed at, there's a bit of jitter due to various reasons.
While the penalty for going lower (and thus adding a bit of a safety margin) is pretty much zero, the penalty for exceeding those 11.3 seconds is huge. That's why the defaults should be fine, and you'll need to hack up the code to change that (jobinterval settings above 8 seconds in the configuration file will just be ignored).
See this post for details: https://bitcointalksearch.org/topic/m.780603

Thank you for clearing that up for me. That settles it for me, I will leave things as they are. I typically see 0.1% invalids which IMHO is _good_ .

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: Glasswalker on March 08, 2012, 10:07:45 PM

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.

Well here's the output of my code (which is of course fully unrolled) before I started messing with trying to do 2 nonce at the same time.
It also doesn't do the partial Wn calculations, but it's easy to see them.

http://pastebin.com/sxdVSJF1

That has all 3 sha256()'s in it since the first one is the midstate calculation.
Also note that the last sha256() has a lot of constants at the start (that my code determined) that may also not be in the Icarus version (I don't know)
My code worked out constants and converted them to their values.

That code will run and find shares correctly.
It's not perfect in terms of register usage or optimisation of partial calculations, bit otherwise it's pretty close to complete.

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: allinvain on March 08, 2012, 07:41:45 PM

Quote from: Energizer on March 08, 2012, 02:41:31 PM

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

I think there is a consensus amongst basically everyone but Energizer that it doesn't. Exactly 11.3 seconds is indeed the sweet spot, but the effective interval will always be a little bit longer than the one calculated at that line of code that Energizer pointed at, there's a bit of jitter due to various reasons.
While the penalty for going lower (and thus adding a bit of a safety margin) is pretty much zero, the penalty for exceeding those 11.3 seconds is huge. That's why the defaults should be fine, and you'll need to hack up the code to change that (jobinterval settings above 8 seconds in the configuration file will just be ignored).
See this post for details: https://bitcointalksearch.org/topic/m.780603

Glasswalker

sr. member

Activity: 407

Merit: 250

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.

allinvain

legendary

Activity: 3080

Merit: 1083

Quote from: Energizer on March 08, 2012, 02:41:31 PM

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

-O 2 to -O 3 did nothing.

But my point there was that the gcc compiler is VERY good at optimisation - and applying that optimisation makes a big difference in CPU land.

However, those nonce-range optimisations are simply removing code to doing it once rather than 2^32 times (assuming the controller that distributes the work to the 2 chips does the setup work)
So without them you are wasting something like 2% ... or using that approximate figure on an Icarus: 98% = 380MH/s, then 100% = around 388MH/s
All very rough but certainly worth doing - since it doesn't increase the power usage or the amount of effort for the Icarus, it simply increases the MH/s

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: kano on March 08, 2012, 05:01:10 PM

Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it - that works) and that's where I get that info from - but I know it is correct coz - as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256

If you could actually fit in doing 2 nonce at a time in one chip there are also some more partial calculations across each pair of nonce (that I started working on with my code but didn't finish due to there being no actual use in the results at the time)

I'm not sure if that would make things any better. The wall that the HDL people are currently hitting seems to be mostly routing congestion, not really logic slices yet. Spartan6 routing must be awful. And this idea doesn't really sound like it would improve on that

Quote from: kano on March 08, 2012, 05:39:16 PM

As I said in my "Edit2:" above, I did this with C.
On top of all that - using gcc -O2 over the resulting code made a massive speed difference also - something close to running at twice the speed (though that was probably the optimisation of C to assembler)
And the -O2 made doing some of the code optimisations pointless since gcc worked them out itself

Running without -O tells the compiler to literally do what you say, i.e. forbids that kind of optimization (and also writes all kinds of variables to the stack for no good reason, resulting in even more slowdown). -O1 vs. -O2 vs. -O3 vs. -Os might be more interesting than comparing with no -O option at all.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: TheSeven on March 08, 2012, 05:26:30 PM

Quote from: kano on March 08, 2012, 05:01:10 PM

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)

The synthesis tools usually do a rather good job at removing logic with constant output values. So while this may not be true for the nonce-dependent ones, most of the all time constant ones have probably already been caught automatically.

A lot of it is nonce dependent - so not doing that is a BIG waste.
Also, even the ATI OpenCL compiler sux at doing this so I wouldn't be surprised if the tool is poor at optimisation.

As I said in my "Edit2:" above, I did this with C.
On top of all that - using gcc -O2 over the resulting code made a massive speed difference also - something close to running at twice the speed (though that was probably the optimisation of C to assembler)
And the -O2 made doing some of the code optimisations pointless since gcc worked them out itself

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: kano on March 08, 2012, 05:01:10 PM

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)

The synthesis tools usually do a rather good job at removing logic with constant output values. So while this may not be true for the nonce-dependent ones, most of the all time constant ones have probably already been caught automatically.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: Glasswalker on March 08, 2012, 03:43:38 PM

Current Icarus code (at least the released stuff) is based on the ZTex code.

The ZTex code has a central core module which has a variable number of stages. The SHA-2 (SHA256) spec calls for 64 stages per hash. But the way bitcoin uses it, it only needs a full hash on one stage, and a partial hash on the other.

So the ZTex code (and therefor the Icarus code) does 64 stages on one core, and 61 stages on the other core, for a total of 125 stages. It has all of those stages fully unrolled so it takes 125 clocks (probably slightly more, haven't looked at the UART code, and controlling logic in depth yet) to fully load the pipeline, after which it runs 1 hash per clock once the pipe is loaded.
...

Ignoring the pipeline question I asked, I hope it doesn't do 64 + 61.
(well actually I should say I hope it does do this coz then there is a speed up still available)

The 2nd sha256 is actually just 60.5 - but that is probably what you meant by 61.

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)

Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it - that works) and that's where I get that info from - but I know it is correct coz - as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256

If you could actually fit in doing 2 nonce at a time in one chip there are also some more partial calculations across each pair of nonce (that I started working on with my code but didn't finish due to there being no actual use in the results at the time)

Glasswalker

sr. member

Activity: 407

Merit: 250

Current Icarus code (at least the released stuff) is based on the ZTex code.

The ZTex code has a central core module which has a variable number of stages. The SHA-2 (SHA256) spec calls for 64 stages per hash. But the way bitcoin uses it, it only needs a full hash on one stage, and a partial hash on the other.

So the ZTex code (and therefor the Icarus code) does 64 stages on one core, and 61 stages on the other core, for a total of 125 stages. It has all of those stages fully unrolled so it takes 125 clocks (probably slightly more, haven't looked at the UART code, and controlling logic in depth yet) to fully load the pipeline, after which it runs 1 hash per clock once the pipe is loaded.

The Icarus has 2 FPGAs, each running independent hashing cores, which divide the nonce space between them to split the work (but each operates essentially independent). That's based on my basic understanding of the Verilog source code for it.

Edit: At second glance this may or may not be correct... I did say "Basic" understanding lol... My verilog is rusty as hell...

It may actually be fully unrolled so that it's doing the entire hash in a single clock. (for a given SHA256 Hash) and pipe lining the bitcoin (double SHA) hash (into 2 stages).

*runs back to look at the code again*

lol

TheSeven

hero member

Activity: 504

Merit: 500

FPGA Mining LLC

Quote from: kano on March 08, 2012, 08:00:15 AM

I'm referring to how cgminer works.
It gets a timeout, checks the count of how many timeouts and if it has reached some limit it will then go through the process of starting fresh work (that cgminer already has queued ready to go)
Then it starts to write new work down on the Icarus.

I think that's how just about all miners work today (including MPBM).

Quote from: kano on March 08, 2012, 08:00:15 AM

Are you suggesting that there is some period of time AFTER it starts to write the new work to the Icarus that a valid nonce could be returned?
If so then I guess we could add that to cgminer also, but that time would need to be VERY accurate to ensure the old reader isn't taking a nonce from the new work.

Yes, there is a very small period after starting to upload a job where nonces from the previous job could still be coming in.
After that, there is a longer period (about 5.6ms) during which there will be garbage nonces, if any, because the board is working on a mixture between two jobs that won't yield any sensible result.
After that, normal operation will continue with nonce 0 of the new job.

Quote from: kano on March 08, 2012, 08:00:15 AM

Icarus hashes a pair of nonce in roughly 6 nanoseconds (11.3s ~= 380MH/s ~= 3ns per nonce if it was a single device = 6ns per pair)
... though I'd be curious to know if the hashing process is a complete cycle per pair or the pairs are stepping though a stepped cycle
i.e. is there some delay before the first nonce-check completes, and then the remaining (2^31 - 1) sequential results are closer together than this initial delay?
I've still not quite got my understanding of that inside FPGA processing clear to me yet.

The FPGAs usually use a pipelined design. I don't know what the exact pipeline depth is, but I'd assume that it's somewhere between 128 and 270 stages.
So generating a full double-hash will need N clock cycles, but N nonces are being processed in parallel. Basically there's a hardware implementation of each sha256 round, an the work bubbles through that chain, one step (sha256 round) per clock cycle. See http://en.wikipedia.org/wiki/Instruction_pipeline for the general idea, just that we're having a hundred sha256 round stages instead of those 5 processor pipeline stages described there.

Energizer

sr. member

Activity: 273

Merit: 250

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

Turbor

legendary

Activity: 1022

Merit: 1000

BitMinter

Quote from: Dhomochevsky on March 08, 2012, 07:28:58 AM

That's strange. I use RG7Miner too and so far (~18 hours of continuous mining on ABCpool) I never got anything that resembles the "sleeping mode" you describe. What pool are you on? Maybe it has something to do with them? Maybe it has something to do with your connection? Is it stable enough? Maybe the pool was under some sort of ddos attack when your boards displayed that error?

[edit] - I managed to replicate your error by turning off the board then turning it back on without stopping the miner. Every share after that was invalid until I stopped and restarted the miner. Maybe there's some sort of power interruptions on the supply line and your board turns off briefly.

No, there is no power interruption. Two other FPGAs are on the same supply. I mine on GPUMAX and BitMinter. BM is rock solid but it happens there too. Antirack had the same error. Funny thing is that Icarus has the lowest stale rate of all devices i have. So the miner can't be that bad Cheesy

It's just not stable (yet)

ngzhang

hero member

Activity: 592

Merit: 501

We will stand and fight.

is there any questions i should answer? Huh

about over heat: i test a fan stop issue for 1 day, 1% invalid, no damage.
but i'm afraid, so i won't to do a heat-sink drop issue... Grin

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: TheSeven on March 08, 2012, 03:58:38 AM

...

Quote from: kano on March 08, 2012, 02:06:26 AM

If there was a nonce coming back during the picosecond that it starts a new work (writing over the old running work) it would of course be lost.
However that would be something like only a few in 4 billion (2^32) chance of that happening.
cgminer aborts the work at a point before it completes since it doesn't know when the work will complete.

So when cgminer reaches 4 billion shares - yep you may have lost a few
i.e. cgminer may lose roughly a few shares roughly every 2,500 blocks ...

That's not entirely true. The job upload packet is 64 bytes, which is 640 bits on the serial link at 115200 baud, so that's 5.56ms.
During that time, the internal shift register of the Icarus contains invalid data, and thus there's a timeframe of 5.56ms during which the Icarus works on a garbage job, and keeps resetting the nonce over and over again. This alone causes at minimum 0.05% invalids, a bit more if you cancel work rather early.
The maximum nonce value (ignoring the MSB) that an invalid nonce caused by this could have is probably 0x0000406e, which would match the pattern that was reported above if the endianness was reversed and the MSB was ignored.
...

I'm referring to how cgminer works.
It gets a timeout, checks the count of how many timeouts and if it has reached some limit it will then go through the process of starting fresh work (that cgminer already has queued ready to go)
Then it starts to write new work down on the Icarus.

Are you suggesting that there is some period of time AFTER it starts to write the new work to the Icarus that a valid nonce could be returned?
If so then I guess we could add that to cgminer also, but that time would need to be VERY accurate to ensure the old reader isn't taking a nonce from the new work.

Icarus hashes a pair of nonce in roughly 6 nanoseconds (11.3s ~= 380MH/s ~= 3ns per nonce if it was a single device = 6ns per pair)
... though I'd be curious to know if the hashing process is a complete cycle per pair or the pairs are stepping though a stepped cycle
i.e. is there some delay before the first nonce-check completes, and then the remaining (2^31 - 1) sequential results are closer together than this initial delay?
I've still not quite got my understanding of that inside FPGA processing clear to me yet.

Dhomochevsky

sr. member

Activity: 242

Merit: 251

That's strange. I use RG7Miner too and so far (~18 hours of continuous mining on ABCpool) I never got anything that resembles the "sleeping mode" you describe. What pool are you on? Maybe it has something to do with them? Maybe it has something to do with your connection? Is it stable enough? Maybe the pool was under some sort of ddos attack when your boards displayed that error?

[edit] - I managed to replicate your error by turning off the board then turning it back on without stopping the miner. Every share after that was invalid until I stopped and restarted the miner. Maybe there's some sort of power interruptions on the supply line and your board turns off briefly.

Turbor

legendary

Activity: 1022

Merit: 1000

BitMinter

My invalids have always been around 1% with RG7Miner. There must be some little problem. Seems to be connected with that eternal sleeping mode where the board only produces invalids. I'm 99% sure that this is not hardware releated because when you restart the miner all works ok. Sadly it's unpredictable.

Topic: FPGA development board "Icarus" - DisContinued/ important announcement - page 19. (Read 207304 times)