Nanominer - Modular FPGA Mining Platform - page 3.

nelisky

legendary

Activity: 1540

Merit: 1002

Quote from: wondermine on February 12, 2012, 03:38:42 AM

Are you familiar with the term "SerDes"?

That reminds me of that old internet slang word: "modem"

Defkin

member

Activity: 80

Merit: 10

interested

wondermine

newbie

Activity: 59

Merit: 0

[deleted, argumentative]

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: wondermine on February 11, 2012, 10:51:14 AM

Quote from: Inspector 2211 on February 11, 2012, 02:37:34 AM

wondermine, I wish you the best. I really do.

However, please take a look at the SHA-256 algorithm. http://en.wikipedia.org/wiki/Sha-256
The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively.
The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it).
The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c - i.e. 7/8th of the previous round's 256 bits are used to derive it.

Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round.
The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round.

And so on.

After just 4 rounds, a single bit change in the great-great-grandfather round influences ALL bits of the current round.

Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream.

In other words, speeding up an implementation of SHA-256 cannot be done by mathematical tricks.

Rather, the operations of each round should be optimized.

There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations,
such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technology-like delay. Certainly we can do better than that?!?

Analyzing the operations for their contribution to the delay yields:
rightrotate ... instant, no delay at all
xor ... minor delay, bit by bit, probably a few dozen picoseconds
and ... minor delay, bit by bit, probably a few dozen picoseconds
not ... minor delay, bit by bit, probably a few dozen picoseconds
+ ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued.
Adding insult to injury, SHA-256 features not just binary or ternary adds, but 4-fold adds (in the t1 function) and 5-fold adds
(e := d+t1) and 6-fold adds (a := t1 + t2).

So, there you go. The biggest detriment to performance is probably the 6-fold 32 bit wide add in a := t1 + t2.
If you can speed this operation up, maybe by pre-computing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA-256 will be sped up (assuming optimal placing and routing).

I'm very familiar with the SHA-256 algorithm and understand its complexity. And I don't mean from reading Wikipedia. What you fail to capture here is that I'm not looking at this to know exact values to avoid... it's a matter of probability. Why does SHA-256 use 64 opaque rounds? Precisely so something like this does not happen. It might be a waste of time and resources if checking a value wasn't so resource-friendly, but it's not.

As far as "more standard" optimizations, I already mentioned I would do that. And I know adding is the biggest resource hog here. I'm looking into the best ways to do that, precalculation, DSP chips on custom boards (to offload logic that does not require programming and other benefits)... I may be a student but I'm not exactly lacking in mathematical or engineering understanding.

The 200MHz issue... Actually there is a reason we're down in the "70s" range of timing. Optimizing for high clockability is a huge challenge. It is a problem, also something I'm already looking into. I'm looking into quad-pumping and other technologies that major manufacturers use. I'm going to take it from your knowledge of some of this that you understand how hard it is to come by clean clocking, and then making sure that doesn't become unstable. Have you looked at commercial IP for SHA or other block ciphers? They don't run much higher than 200MHz stably, and they cost thousands of dollars and have had many engineers optimize the hell out of them.

So all that to say, yes I know what routes of action to take. I also won't say I'm looking into something unless I believe it's feasible and have done adequate research.

>to offload logic

You can't offload logic.
Not enough IO pins on an FPGA.
There are seven additions per round, 125 rounds - do you want to connect 875 external adders?
These 875 external adders would need 96 pins (64 inputs, 32 outputs) connected to the FPGA each, for a grand total of 84,000 pins...

runeks

legendary

Activity: 980

Merit: 1008

^ Just so we're clear, wouldn't a discovery like that be a major one, essentially - to a certain extent - compromising SHA-256? Wouldn't it mean that it doesn't qualify as a cryptographic hash function? And perhaps that Bitcoin (and whatever else uses SHA-256) should migrate to a new and safer hash function?

Wouldn't this be the kind of stuff someone might do as a PhD in cryptology, and you're doing it as a side-project for the mining FPGA you're developing in your spare time while studying. :=)

Nice work by the way! I'm going to send you a couple BTCs.

wondermine

newbie

Activity: 59

Merit: 0

Quote from: Inspector 2211 on February 11, 2012, 02:37:34 AM

wondermine, I wish you the best. I really do.

However, please take a look at the SHA-256 algorithm. http://en.wikipedia.org/wiki/Sha-256
The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively.
The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it).
The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c - i.e. 7/8th of the previous round's 256 bits are used to derive it.

Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round.
The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round.

And so on.

After just 4 rounds, a single bit change in the great-great-grandfather round influences ALL bits of the current round.

Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream.

In other words, speeding up an implementation of SHA-256 cannot be done by mathematical tricks.

Rather, the operations of each round should be optimized.

There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations,
such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technology-like delay. Certainly we can do better than that?!?

Analyzing the operations for their contribution to the delay yields:
rightrotate ... instant, no delay at all
xor ... minor delay, bit by bit, probably a few dozen picoseconds
and ... minor delay, bit by bit, probably a few dozen picoseconds
not ... minor delay, bit by bit, probably a few dozen picoseconds
+ ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued.
Adding insult to injury, SHA-256 features not just binary or ternary adds, but 4-fold adds (in the t1 function) and 5-fold adds
(e := d+t1) and 6-fold adds (a := t1 + t2).

So, there you go. The biggest detriment to performance is probably the 6-fold 32 bit wide add in a := t1 + t2.
If you can speed this operation up, maybe by pre-computing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA-256 will be sped up (assuming optimal placing and routing).

I'm very familiar with the SHA-256 algorithm and understand its complexity. And I don't mean from reading Wikipedia. What you fail to capture here is that I'm not looking at this to know exact values to avoid... it's a matter of probability. Why does SHA-256 use 64 opaque rounds? Precisely so something like this does not happen. It might be a waste of time and resources if checking a value wasn't so resource-friendly, but it's not.

As far as "more standard" optimizations, I already mentioned I would do that. And I know adding is the biggest resource hog here. I'm looking into the best ways to do that, precalculation, DSP chips on custom boards (to offload logic that does not require programming and other benefits)... I may be a student but I'm not exactly lacking in mathematical or engineering understanding.

The 200MHz issue... Actually there is a reason we're down in the "70s" range of timing. Optimizing for high clockability is a huge challenge. It is a problem, also something I'm already looking into. I'm looking into quad-pumping and other technologies that major manufacturers use. I'm going to take it from your knowledge of some of this that you understand how hard it is to come by clean clocking, and then making sure that doesn't become unstable. Have you looked at commercial IP for SHA or other block ciphers? They don't run much higher than 200MHz stably, and they cost thousands of dollars and have had many engineers optimize the hell out of them.

So all that to say, yes I know what routes of action to take. I also won't say I'm looking into something unless I believe it's feasible and have done adequate research.

Inaba

legendary

Activity: 1260

Merit: 1000

I'm not trying to discourage you from anything, Wondermine. Maybe you will find something that's been missed somewhere, but it seems unlikely, is all I'm saying. Maybe getting a fully working implementation and then refining it is a good course of action, but then again maybe genius needs to take a different path, and who am I to argue.

It's all beyond my abilities anyway! I may purchase a couple DE0's just to see how it goes.

Dexter770221

legendary

Activity: 1029

Merit: 1000

I'm very curious where this goes. 0.55 BTC sent. Good luck.

Inspector 2211

sr. member

Activity: 448

Merit: 250

Quote from: wondermine on February 10, 2012, 11:36:12 PM

No problem, it wasn't the most brilliant question.

Anyways, here are the stats we have cooking, no screenshots yet, this project is requiring more math than I might have liked. One of the ways I'm optimizing the mining is checking late-round values against (to be determined, but known) constants to determine whether or not they will (or are likely to) yield a win. If not, the SHA algorithm aborts early, saving resources. It's gonna be a lot of work, and that's a big way of how we will be shrinking approximately 60 MH/s (number based on more recent data) onto a Cyclone IV Nano. The work begins today.

SHA-2 hashes are unpredictable at 128 rounds, or 64 rounds, but if we have access to the data all the way through, and know what our starting and end data should look like, we can side-channel it. I'm speaking to our school's cryptanalysis expert.

The above may sound like heresy but block ciphers are weakened by attacking their implementation, and we have full access to this one. I'm going to keep working, on all fronts of optimization.

Donation-wise, there have been about 45 BTC and $50 USD/CAD donated, allowing me to buy a couple of DE0 Nanos from our friends at Terasic and paying for a bit of the countless hours I've been pouring into learning all of this. Hopefully that's something you're all happy with.

wondermine, I wish you the best. I really do.

However, please take a look at the SHA-256 algorithm. http://en.wikipedia.org/wiki/Sha-256
The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively.
The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it).
The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c - i.e. 7/8th of the previous round's 256 bits are used to derive it.

Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round.
The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round.

And so on.

After just 4 rounds, a single bit change in the great-great-grandfather round influences ALL bits of the current round.

Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream.

In other words, speeding up an implementation of SHA-256 cannot be done by mathematical tricks.

Rather, the operations of each round should be optimized.

There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations,
such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technology-like delay. Certainly we can do better than that?!?

Analyzing the operations for their contribution to the delay yields:
rightrotate ... instant, no delay at all
xor ... minor delay, bit by bit, probably a few dozen picoseconds
and ... minor delay, bit by bit, probably a few dozen picoseconds
not ... minor delay, bit by bit, probably a few dozen picoseconds
+ ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued.
Adding insult to injury, SHA-256 features not just binary or ternary adds, but 4-fold adds (in the t1 function) and 5-fold adds
(e := d+t1) and 6-fold adds (a := t1 + t2).

So, there you go. The biggest detriment to performance is probably the 6-fold 32 bit wide add in a := t1 + t2.
If you can speed this operation up, maybe by pre-computing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA-256 will be sped up (assuming optimal placing and routing).
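To make the addition counts above concrete, here is a minimal Python sketch of the SHA-256 compression function, following the pseudocode on the linked Wikipedia page. The helper names (`first_primes`, `icbrt`, `sha256_round`, `compress`, `sha256`) are illustrative choices, and the constants are derived from the primes rather than typed in. It is only a software reference model for counting operations, not a hardware design, and it says nothing about achievable clock rates.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(x):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
# First 32 fractional bits of the square roots of the first 8 primes.
IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
# First 32 fractional bits of the cube roots of the first 64 primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    # Four additions. Note that k + w does not depend on a..h, so it is one
    # candidate for the "pre-compute in the previous round" idea.
    t1 = (h + s1 + ch + k + w) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK          # one addition
    new_e = (d + t1) & MASK         # fifth addition on the path to e
    new_a = (t1 + t2) & MASK        # sixth addition on the path to a
    # b, c, d, f, g, h are plain copies of a, b, c, e, f, g.
    return [new_a, a, b, c, new_e, e, f, g]

def compress(h, block):
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    state = list(h)
    for i in range(64):
        state = sha256_round(state, K[i], w[i])
    return [(x + y) & MASK for x, y in zip(h, state)]

def sha256(data):
    padded = (data + b"\x80" + b"\x00" * ((55 - len(data)) % 64)
              + struct.pack(">Q", 8 * len(data)))
    h = IV
    for i in range(0, len(padded), 64):
        h = compress(h, padded[i:i + 64])
    return struct.pack(">8I", *h)

# Sanity check against the standard library.
assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

Counting the `+` signs in `sha256_round` gives the seven additions per round discussed in this thread: four for `t1`, one for `t2`, one for the new `e`, and one for the new `a`.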

wondermine

newbie

Activity: 59

Merit: 0

I don't expect everyone to be super gung ho on the idea, but I've been into crypto a while, and gotten pretty intimate with SHA-2. I don't mean a midstate, I mean knowing that on the 124th round a certain value for one of the 32 bit words disqualifies the hash. Quite possible. The more work I do on it, the better the miner will be. Plus, these comparative constants take advantage of the BRAM we're not using well on the Cyclone series. Some miners... maybe... use something like this, but I'm talking a pervasive mathematical analysis. MatLab is attacking it now.

It's all proportional to time I spend really, there's so much RAM to be used for constant comparison that adding checkers along the algorithm wouldn't even prevent adding cores, it would just add performance, and a lot at that. Just imagine knowing 5% of the time on round 124 to abort the calculation... you have to multiply those (logic savings or time savings) by millions of times per second. It's pretty big. It's just gonna require some expert help (check), a ****load of time (maybe?), MatLab (check), and y'alls support. I'll see if I can come up with a tangible example. Wiki side-channel attack, look around, it's not voodoo, it's just probabilistic finite field mathematics that I probably shouldn't be doing until grad school. But that hasn't stopped me before.

P.S. The midstate etc. calculations most clients use are to avoid collisions on the network, thus not getting stale packets. It's not to save clock cycles. ^^

P.P.S Also check out the SHA-2 wiki's pseudocode, if you're into that kinda thing, you'll see where this could be useful pretty fast.
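For comparison, there is one late-round check that can be verified directly from the pseudocode. It is a deterministic structural property, not the probabilistic analysis described in this post. In each round the old `e` is copied to `f`, then `g`, then `h`, so the `h` that feeds the last output word after round 64 is simply the `e` produced by round 61. The sketch below is a minimal Python illustration. The names `rounds`, `predicted_last_word` and `ABORT_UNLESS_E61_IS` are hypothetical, and it assumes the input to the second hash is the 32-byte digest of the first hash, so that it fits in a single padded block.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(x):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]   # initial hash words
K = [icbrt(p << 96) & MASK for p in PRIMES]             # round constants

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def rounds(block, n_rounds):
    """Run the first n_rounds rounds on one 64-byte block, starting from IV."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = IV
    for i in range(n_rounds):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return a, b, c, d, e, f, g, h

def predicted_last_word(first_digest):
    """Last 32-bit word of SHA-256(first_digest), using only 61 of 64 rounds."""
    block = (first_digest + b"\x80" + b"\x00" * (55 - len(first_digest))
             + struct.pack(">Q", 8 * len(first_digest)))
    e61 = rounds(block, 61)[4]
    return (IV[7] + e61) & MASK

# The last output word is zero only when e after round 61 equals this value.
ABORT_UNLESS_E61_IS = (-IV[7]) & MASK

first = hashlib.sha256(bytes(80)).digest()   # all-zero header, illustrative only
final = hashlib.sha256(first).digest()
assert predicted_last_word(first) == struct.unpack(">I", final[28:])[0]
```

Since Bitcoin reads the digest as a little-endian number, a difficulty-1 share needs those last four bytes to be zero, so a candidate whose round-61 `e` differs from `ABORT_UNLESS_E61_IS` can be dropped without computing rounds 62 to 64. That saves three rounds out of 64 in the second hash. Whether anything beyond this kind of structural shortcut exists is exactly what is being debated in this thread.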

P.P.P.S.
Donations have bought me 2 DE0 Nanos for cluster testing;
As always my time is hard to spare, so if you feel generous (even a little), the more the merrier. I seriously spend hours on this stuff, I'll log them if you like. Plus it's high level math you all don't have to do.
I know results are important so I can post up some more of how the math stuff works, maybe implement a proof-of-concept, but it will take me some time.
As far as things I need to continue, donation-wise, the 2 FPGAs are paid for, the website and hosting are paid for, and I know it's nice to donate for tangibles but at this point I have all the tools I need, I just need motivation and hopefully some money for my time. As per usual it's an investment. If this math thing sounds like a good idea, talk to me about it, I'm happy to explain etc. Given that most people mining today don't use custom miners and I've sifted through the code of all the standard ones, this mathematical edge would put anyone using it quite a bit ahead of the game.

In fact, would access to proprietary mathematical data/code/programming files be incentive for donation? The numbers and formula I develop will be available for FPGA but I also work with coders who could integrate this benefit into your existing open source miners.

Something like:
First 100 donors have access to weekly updates to how to look for bitcoin "smarter" up until official release, then it becomes open source?
Amount and degree of advantageous code could be proportional to donation... it would be something like "if round 122 maj function has xyz properties, abort" plus pseudocode or C or VHDL.

Sound interesting?

ZodiacDragon84

sr. member

Activity: 266

Merit: 250

The king and the pawn go in the same box @ endgame

Quote from: Inaba on February 11, 2012, 12:24:51 AM

Yeah, I'm not really clear on what you're saying either, Wondermine. You should be taking the midstate, since it's precomputed for you, and then iterating through the entire nonce range, looking for a hash that is under difficulty. If you're doing anything more complicated than that, it's likely you're just wasting resources unless you've come up with some magical algorithm, which would be great.

I'm all up for magic algo's

Inaba

legendary

Activity: 1260

Merit: 1000

Yeah, I'm not really clear on what you're saying either, Wondermine. You should be taking the midstate, since it's precomputed for you, and then iterating through the entire nonce range, looking for a hash that is under difficulty. If you're doing anything more complicated than that, it's likely you're just wasting resources unless you've come up with some magical algorithm, which would be great.
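The midstate loop described here can be sketched in a few lines of Python. This is a minimal illustration, assuming an 80-byte Bitcoin block header whose last 4 bytes are the little-endian nonce. The function name `scan_nonces` and the example header bytes are made up for the example, and `hashlib`'s `copy()` stands in for what a hardware miner does by storing the eight state words after the first 64-byte block.

```python
import hashlib
import struct

def double_sha256(data):
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def scan_nonces(header76, target, nonces):
    """Yield (nonce, digest) for every nonce whose double hash is <= target.

    header76 is the block header without its 4-byte nonce (hypothetical input).
    """
    # The first 64 bytes never change while the nonce is swept, so they are
    # absorbed once. This saved hash state plays the role of the midstate.
    midstate = hashlib.sha256(header76[:64])
    tail = header76[64:76]
    for nonce in nonces:
        h = midstate.copy()                       # reuse the absorbed prefix
        h.update(tail + struct.pack("<I", nonce))
        digest = hashlib.sha256(h.digest()).digest()
        # Bitcoin compares the digest as a little-endian 256-bit integer.
        if int.from_bytes(digest, "little") <= target:
            yield nonce, digest

# Illustrative run with a dummy header and a target that accepts everything.
header76 = bytes(range(76))
hits = list(scan_nonces(header76, (1 << 256) - 1, range(4)))
```

With the midstate saved, each nonce costs two compression calls (the second block of the first hash, plus the second hash) instead of three.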

rjk

sr. member

Activity: 448

Merit: 250

1ngldh

Quote from: wondermine on February 10, 2012, 11:36:12 PM

No problem, it wasn't the most brilliant question.

Anyways, here are the stats we have cooking, no screenshots yet, this project is requiring more math than I might have liked. One of the ways I'm optimizing the mining is checking late-round values against (to be determined, but known) constants to determine whether or not they will (or are likely to) yield a win. If not, the SHA algorithm aborts early, saving resources. It's gonna be a lot of work, and that's a big way of how we will be shrinking approximately 60 MH/s (number based on more recent data) onto a Cyclone IV Nano. The work begins today.

SHA-2 hashes are unpredictable at 128 rounds, or 64 rounds, but if we have access to the data all the way through, and know what our starting and end data should look like, we can side-channel it. I'm speaking to our school's cryptanalysis expert.

The above may sound like heresy but block ciphers are weakened by attacking their implementation, and we have full access to this one. I'm going to keep working, on all fronts of optimization.

Donation-wise, there have been about 45 BTC and $50 USD/CAD donated, allowing me to buy a couple of DE0 Nanos from our friends at Terasic and paying for a bit of the countless hours I've been pouring into learning all of this. Hopefully that's something you're all happy with.

I don't completely understand all of this, but do you mean something like implementing a midstate - e.g., precalculating the first part and bruteforcing only the last part instead of the whole share? I think some miners do this, but perhaps not all of them.

wondermine

newbie

Activity: 59

Merit: 0

No problem, it wasn't the most brilliant question.

Anyways, here are the stats we have cooking, no screenshots yet, this project is requiring more math than I might have liked. One of the ways I'm optimizing the mining is checking late-round values against (to be determined, but known) constants to determine whether or not they will (or are likely to) yield a win. If not, the SHA algorithm aborts early, saving resources. It's gonna be a lot of work, and that's a big way of how we will be shrinking approximately 60 MH/s (number based on more recent data) onto a Cyclone IV Nano. The work begins today.

SHA-2 hashes are unpredictable at 128 rounds, or 64 rounds, but if we have access to the data all the way through, and know what our starting and end data should look like, we can side-channel it. I'm speaking to our school's cryptanalysis expert.

The above may sound like heresy but block ciphers are weakened by attacking their implementation, and we have full access to this one. I'm going to keep working, on all fronts of optimization.

Donation-wise, there have been about 45 BTC and $50 USD/CAD donated, allowing me to buy a couple of DE0 Nanos from our friends at Terasic and paying for a bit of the countless hours I've been pouring into learning all of this. Hopefully that's something you're all happy with.

fizzisist

hero member

Activity: 720

Merit: 528

Quote from: wondermine on February 10, 2012, 03:10:03 PM

Is it more reasonable to quote an effective hashrate (i.e. calculated based on share production) or a hardcoded rate? Since the effective rate will be a fair bit higher than the hardcoded rate.

Why would this be true? Effective hashrate will vary due to luck, but with a long enough timescale, this will converge to the real hashrate. This should be the same as the hardcoded hashrate for the FPGA, or some shares are being lost somewhere. It will never be higher than the hardcoded hashrate, except in a short run of luck. For a 5% error on the hashrate, you should measure the time to submit 400 shares. At 125 MH/s, this should take 229 minutes.

Best is to quote both, where the effective hashrate is quoted with an error.
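The 400-share and 229-minute figures can be reproduced with a short calculation. This is a sketch under two assumptions that are not spelled out in the post: shares arrive as a Poisson process, so the relative error on a count of N shares is about 1/sqrt(N), and each difficulty-1 share takes 2^32 hashes on average. The function names are illustrative.

```python
import math

HASHES_PER_DIFF1_SHARE = 2 ** 32

def shares_for_relative_error(rel_error):
    """Number of shares needed so that 1/sqrt(N) <= rel_error."""
    return math.ceil(1.0 / rel_error ** 2)

def minutes_to_collect(shares, hashrate_hps, difficulty=1.0):
    """Expected time in minutes to find that many shares at a given hashrate."""
    return shares * difficulty * HASHES_PER_DIFF1_SHARE / hashrate_hps / 60.0

n = shares_for_relative_error(0.05)      # 400 shares for a 5% error
t = minutes_to_collect(n, 125e6)         # about 229 minutes at 125 MH/s
print(n, round(t))
```

Halving the error to 2.5% would take four times as many shares, and therefore four times as long.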

nelisky

legendary

Activity: 1540

Merit: 1002

hashrates are hashrates, the number of hashes calculated within a certain time frame. There's nothing magic about it, except being able to calculate them without making the system slower in return.

When you mine somewhere the pool could not handle the bandwidth of a miner that would submit ALL hashes, and frankly I doubt anyone has a fast enough link to submit them in the first place. So miners only submit shares that actually have hashes lower than target, and target may be different across pools (some just use diff==1, some don't). Then it becomes a statistical problem (with these many shares in the last X seconds, how fast is the miner, probably?)

You can calculate effective hashing speed easily, just check how fast the miner goes through the nonce space.
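Both ways of measuring mentioned above can be written down directly. This is a minimal sketch with made-up function names, assuming difficulty-1 shares correspond to 2^32 hashes on average and that the nonce is a 32-bit field, so one full sweep is 2^32 hashes.

```python
import math

NONCE_SPACE = 2 ** 32

def hashrate_from_shares(shares, seconds, difficulty=1.0):
    """Statistical estimate: returns (hashes per second, one-sigma uncertainty)."""
    if shares <= 0 or seconds <= 0:
        raise ValueError("need a positive share count and time window")
    rate = shares * difficulty * NONCE_SPACE / seconds
    return rate, rate / math.sqrt(shares)

def hashrate_from_sweep(seconds_per_sweep):
    """Direct measurement: time taken to walk the whole 32-bit nonce space."""
    return NONCE_SPACE / seconds_per_sweep

rate, sigma = hashrate_from_shares(shares=100, seconds=4294.967296)
sweep = hashrate_from_sweep(34.359738368)
print(f"{rate / 1e6:.1f} +/- {sigma / 1e6:.1f} MH/s from shares")
print(f"{sweep / 1e6:.1f} MH/s from nonce sweep")
```

The sweep measurement has no statistical noise, which is why the two numbers should agree only within the error bar of the share-based estimate.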

wondermine

newbie

Activity: 59

Merit: 0

There's something important I wanna ask about as far as rating hashrates: we say GPU hashrates are such and such a speed, but that is not something hardcoded into the core, that's how the GPU breaks down the instructions, and that's why it varies a little over time, and per card, as well as with other factors.
Is it more reasonable to quote an effective hashrate (i.e. calculated based on share production) or a hardcoded rate? Since the effective rate will be a fair bit higher than the hardcoded rate.

Let me know, I want to quote you guys the right numbers. Either way I'll show you shares/sec rates as well as hashrates.

Expect some numbers soon; what they'll be, I dunno.

abeaulieu

sr. member

Activity: 295

Merit: 250

I may have missed this already (sorry if I did), but have you opened a repository for your work yet? If you are looking for collaboration or just some pointers from the community, it might be helpful for you to do so.

I have some experience with Quartus (mainly on the DE2 boards, but not as it relates to mining) and would be interested in looking at your progress. I'm sure you might find some other EEs and CEs watching the project who might be able to make your progress more fluid.

Transisto

donator

Activity: 1731

Merit: 1008

GLBSE?? Hey, the guy merely asked for a $10 donation.

Have you read the thread?

Hint: he's not into producing custom ASICs.

cypherdoc

legendary

Activity: 1764

Merit: 1002

i lost a total of perhaps 800 BTC thru bad investments in 2 of the original companies listed on glbse; Ubitex and SIN. both those companies had scammers running those operations; Cuddlefish and Tawsix. granted, these losses were a result of my bad choices but i soon came to realize the whole concept is flawed. unregistered companies in an unregulated space issuing unlawful securities btwn unknown individuals is the Wild West. stock mkts like this are begging for trouble and are not something you want to get involved in IMO. the guy who runs it, Nefario, takes all comers and a no look no see approach which invites fraud and corruption. he will not help you if you get into trouble.

you are a student at a respected university and seemingly on to something good here. having to get involved in managing a business with complaining shareholders isn't something you want to cloud your mind with. perhaps someday in a real IPO on Wall Street if this gets real big.

if you can get enough money via donations to keep this thing going i would stick with this model. you have no obligations to anyone for anything other than what you've promised; one on one support.

Topic: Nanominer - Modular FPGA Mining Platform - page 3. (Read 18959 times)