Nanominer Announcement | Bitcointalksearch.org

kramble

sr. member

Activity: 384

Merit: 250

Quote from: lame.duck on March 21, 2013, 05:03:10 PM

40 MHash/s is probably not possible, but the 28.3 should be. Another design by makomk is/was running at 27.5 MHz; maybe you could push it a little further using the timing headroom of the chips and/or higher voltages and more cooling effort.

I agree. I was really pushing it running at that speed (it was a bit of a dare to see if I could match Makomk's results), and you have to be very careful with the power and cooling (it certainly won't work just using the USB supply, and would be foolish to try).

I'm not really that much of an expert with the Quartus software, so by tweaking the compiler settings it may be possible to better this (the fmax for this 170MHz build was around 150MHz, so it should not really have worked at all). One thing to check is the PLL multiply/divide ratios, as it's using 50MHz * 17 / 5, which may not be an optimal way to configure it. I did try running at 180MHz but got no hashes at all, and decided that was probably the best I could manage, so I called it a day at that.
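To make the PLL arithmetic concrete, here is a minimal Python sketch. The 50MHz reference, the 17/5 ratio and the roughly 150MHz fmax come from the post; the search limits `MAX_MULT` and `MAX_DIV` are made-up illustration values, not the real Cyclone IV PLL limits.

```python
from fractions import Fraction

REF_MHZ = 50             # board oscillator, as stated in the post
TARGET_MHZ = 170         # clock used for the 28.3 MHash/s run
REPORTED_FMAX_MHZ = 150  # approximate fmax reported for that build

# Hypothetical search limits, NOT taken from the device datasheet.
MAX_MULT = 64
MAX_DIV = 10

def pll_output(ref_mhz, mult, div):
    """Output frequency of an ideal multiply/divide PLL, as an exact fraction."""
    return Fraction(ref_mhz * mult, div)

# Every (mult, div) pair within the assumed limits that yields exactly 170MHz.
candidates = [
    (m, d)
    for d in range(1, MAX_DIV + 1)
    for m in range(1, MAX_MULT + 1)
    if pll_output(REF_MHZ, m, d) == TARGET_MHZ
]

# How far past the reported fmax the design was being clocked.
overclock_pct = 100 * (TARGET_MHZ - REPORTED_FMAX_MHZ) / REPORTED_FMAX_MHZ

print(candidates)
print(f"{overclock_pct:.1f}% above reported fmax")
```

Any pair reducing to 17/5 gives the same output frequency, so a different ratio changes the PLL's internal operating point rather than the core clock itself.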

EDIT I just realized this is a year old thread, and only peripherally to do with the DE0-Nano. Perhaps Ersch would like to PM wondermine to see if he actually tested it at 40MH/s (he seems to still be active on the board, just about).

lame.duck

legendary

Activity: 1270

Merit: 1000

From the readme file:

Quote

I have also run at 170MHz (SPEED_MHZ=17) using a custom hardwired 1.2 volt core supply which
gave my maximum achieved throughput of 28.3 MHash/s. Attempting to run at 180MHz gave bad hash
results, so this is the limit. Note that this draws 1.7amps which is outside the spec of the
DE0-Nano regulators, hence the custom power supply hack.

40 MHash/s is probably not possible, but the 28.3 should be. Another design by makomk is/was running at 27.5 MHz; maybe you could push it a little further using the timing headroom of the chips and/or higher voltages and more cooling effort.

phr33

full member

Activity: 226

Merit: 100

Quote from: wondermine on April 10, 2012, 02:00:06 PM

Quote from: phr33 on April 10, 2012, 01:37:50 PM

I'm still green when it comes to implementing these brute force cores, but I'm picking up.

I was browsing through your code and one thing struck me: you are using the 256 bit "data" input both for setting the internal state of the first SHA round and as the end part of what's hashed in the first round.
In the Icarus and Open-Source-FPGA-Bitcoin-Miner code they have a separate 256 bit init state and 96 bits of "data" that is appended to the nonce.

I can't quite work out what those 96 bits are. Bit 64 to 127 of the header. Reversed or not, it doesn't make much sense to me. It should at least not be the same as the init of the hash round.

Probably belongs in a SHA-256 thread, but I think what you're referring to has to do with precalculated (or -able) round values, and the initial value for those.
If there's a problem with the core.vhd, well, that's because it's a work in progress; it'll get worked out. And smaller...

I had a second look. I think I was right in the first place. There are 76 'static' bytes in the header before the nonce. The midstate is the internal state of the hash core after hashing the first 64 bytes. The remaining 12 bytes will be paired up with the 4 byte nonce and 48 bytes of padding (the header is 80 bytes, but will be padded up to a whole multiple of 64 bytes, the SHA-256 block size).

So you really do need both the 32 byte midstate and the 12 byte 'data'.
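For anyone who wants to check the midstate idea in software, here is a minimal pure-Python sketch. The compression function is standard SHA-256; `header76` is dummy data rather than a real block header, and the helper names are only for illustration. It is not the logic from core.vhd or from any of the miner projects.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
IV = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]

def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """One SHA-256 compression: 8-word state + 64-byte block -> new 8-word state."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def hash_with_midstate(midstate, tail12, nonce):
    """Double SHA-256 of an 80-byte header, given the midstate of its first 64 bytes."""
    # Second block: 12 static bytes + 4-byte nonce + 48 bytes of padding
    # (0x80 marker, 39 zero bytes, 64-bit big-endian message length in bits).
    block2 = (tail12 + struct.pack("<I", nonce)
              + b"\x80" + b"\x00" * 39 + struct.pack(">Q", 80 * 8))
    first = struct.pack(">8I", *compress(midstate, block2))
    return hashlib.sha256(first).digest()

header76 = bytes(range(76))              # dummy stand-in for the 76 static header bytes
midstate = compress(IV, header76[:64])   # computed once per job
nonce = 0x12345678
fast = hash_with_midstate(midstate, header76[64:], nonce)
slow = hashlib.sha256(hashlib.sha256(header76 + struct.pack("<I", nonce)).digest()).digest()
print(fast == slow)
```

The point of the split is that `midstate` only has to be computed once per job, while the second block is recomputed for every nonce. That is why a miner needs both the 32 byte midstate and the 12 byte 'data' as inputs.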

But as you said, that's all in the control logic, and I understand it's under development.

Nice work!

Jason

member

Activity: 114

Merit: 10

I just took a quick look over Nanominer's code to see what he's doing. He appears to be implementing SHA-256 a bit differently from the other approaches I've seen. In particular, he has not unrolled the hash, so it requires 64 clock cycles to complete each hash. This would be analogous to compiling fpgaminer's code with LOG_LOOP2=6. However, Nanominer then appears to be running 10 (configurable) of these cores in parallel with each other. With a clock rate of 200MHz, this would lead to 200*10/(64*2) or about 16 MH/s by my calculations (since bitcoin hashes require two SHA-256 hashes each).
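The estimate above can be written out as a small Python sketch. The function name and parameter names are just for illustration; the numbers are the ones quoted in this post.

```python
def hashrate_mhs(clock_mhz, cores, cycles_per_sha256=64, sha256_per_hash=2):
    """Aggregate hashrate in MH/s for fully rolled cores.

    Each core needs cycles_per_sha256 clocks per SHA-256 pass, and one
    bitcoin hash needs sha256_per_hash passes.
    """
    return clock_mhz * cores / (cycles_per_sha256 * sha256_per_hash)

# 10 cores at 200MHz, as in the post.
print(hashrate_mhs(200, 10))

# The same formula for a hypothetical 50-core build at the same clock.
print(hashrate_mhs(200, 50))
```

Since the result scales linearly with the core count, the open question is how many cores fit on the device, not how fast a single core is.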

Wondermine, I'm not sure how you are coming up with the higher numbers for hash rates. You would need to fit 50 of these on a 115K LE Cyclone IV along with the associated control circuitry in order to reach the same ballpark as can be achieved with fpgaminer's code with Makomk's modifications. Do you really expect to be able to fit significantly more than this on this FPGA?

Oh, and if you want to preserve logic so that the optimizer does not get rid of it, use the preserve_fanout_free_node option in the assignments editor on the pin(s) you want to preserve and then you should be able to see how much additional optimization the compiler is capable of.

phr33

full member

Activity: 226

Merit: 100

Quote from: lame.duck on April 10, 2012, 02:40:47 PM

Quote from: wondermine on April 10, 2012, 01:57:51 PM

P.S. This design does not "take up too many IOs", taking up too many IOs assumes you don't use some sort of serialization, which is rather absurd.

Well, maybe you could tell me what you would call producing an IP core with 288 IOs for a device that has only 167 (or so), soldered on a board with some SDRAM etc., ending up with 80 usable user IOs? Btw, you could enlighten me which microcontroller you plan to use that can write 256 bits at once. There is no bottleneck in using some sort of serialization at all. And even if there were, you could always reduce the bandwidth requirement by implementing roll-n-times in hardware.

The "control" entity is obviously not meant to be the top of the design. You would accompany it with some kind of interface. Check the Open source miner project. There you have both an RS232 interface and one through Altera's "virtual wire".
As for bandwidth, you really don't need any. You just send a bunch of bytes (256-ish) to fire off a decent sized job.
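A rough back-of-envelope in Python shows why bandwidth is a non-issue. The 115200 baud 8N1 link is purely an assumption for illustration, the 16 MH/s figure is the rough estimate mentioned elsewhere in this thread, and the job size uses the 32 byte midstate plus 12 bytes of data.

```python
JOB_BYTES = 32 + 12        # midstate + remaining static header bytes
BAUD = 115200              # assumed serial speed, not from the thread
BITS_PER_BYTE_ON_WIRE = 10 # 8 data bits + start + stop (8N1), assumed framing
HASHRATE_HPS = 16e6        # roughly 16 MH/s
NONCE_SPACE = 2 ** 32      # one job covers every 32-bit nonce

send_time_s = JOB_BYTES * BITS_PER_BYTE_ON_WIRE / BAUD
scan_time_s = NONCE_SPACE / HASHRATE_HPS

print(f"sending one job: {send_time_s * 1000:.2f} ms")
print(f"scanning all nonces: {scan_time_s:.1f} s")
print(f"link busy {100 * send_time_s / scan_time_s:.4f}% of the time")
```

Even with a few hundred bytes per job instead of 44, the link would sit idle almost all of the time under these assumptions.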

lame.duck

legendary

Activity: 1270

Merit: 1000

Quote from: wondermine on April 10, 2012, 01:57:51 PM

P.S. This design does not "take up too many IOs", taking up too many IOs assumes you don't use some sort of serialization, which is rather absurd.

Well, maybe you could tell me what you would call producing an IP core with 288 IOs for a device that has only 167 (or so), soldered on a board with some SDRAM etc., ending up with 80 usable user IOs? Btw, you could enlighten me which microcontroller you plan to use that can write 256 bits at once. There is no bottleneck in using some sort of serialization at all. And even if there were, you could always reduce the bandwidth requirement by implementing roll-n-times in hardware.

wondermine

newbie

Activity: 59

Merit: 0

Quote from: phr33 on April 10, 2012, 01:37:50 PM

I'm still green when it comes to implementing these brute force cores, but I'm picking up.

I was browsing through your code and one thing struck me: you are using the 256 bit "data" input both for setting the internal state of the first SHA round and as the end part of what's hashed in the first round.
In the Icarus and Open-Source-FPGA-Bitcoin-Miner code they have a separate 256 bit init state and 96 bits of "data" that is appended to the nonce.

I can't quite work out what those 96 bits are. Bit 64 to 127 of the header. Reversed or not, it doesn't make much sense to me. It should at least not be the same as the init of the hash round.

Probably belongs in a SHA-256 thread, but I think what you're referring to has to do with precalculated (or -able) round values, and the initial value for those.
If there's a problem with the core.vhd, well, that's because it's a work in progress; it'll get worked out. And smaller...

wondermine

newbie

Activity: 59

Merit: 0

Sorry to do the double-post but here are some numbers from the Cyclone V series FPGAs:

5CGXBC7D6F31C7 (Grade 7): ~~215.84~~ 219.93 MHz

I'm liking this device family already. As always, more to come.

P.S. This design does not "take up too many IOs", taking up too many IOs assumes you don't use some sort of serialization, which is rather absurd.

P.P.S. When compiled for the Stratix III EP3SL100F1152C2, the fmax is reported as 229.52 MHz, if you were wondering.

*These values are sans optimizations... if anyone can tell me how to make Quartus not synthesize away multiple cores, please let me know, and then I can give you some numbers that more likely reflect reality. (Although there's probably a problem with the core.vhd I need to fix... I work way too much and I have an exam later... I need to leave this alone.)

phr33

full member

Activity: 226

Merit: 100

I'm still green when it comes to implementing these brute force cores, but I'm picking up.

I was browsing through your code and one thing struck me: you are using the 256 bit "data" input both for setting the internal state of the first SHA round and as the end part of what's hashed in the first round.
In the Icarus and Open-Source-FPGA-Bitcoin-Miner code they have a separate 256 bit init state and 96 bits of "data" that is appended to the nonce.

I can't quite work out what those 96 bits are. Bit 64 to 127 of the header. Reversed or not, it doesn't make much sense to me. It should at least not be the same as the init of the hash round.

wondermine

newbie

Activity: 59

Merit: 0

Quote from: lame.duck on April 10, 2012, 05:36:52 AM

Hm, how did you get the number of Logic cells without having a qpf and qsf file?
Why do you design a control logic with 288 IOs for a chip with only 167 usable IO pins (153 usable on the DE0, including the pins for RAM etc.)?

I tried to test compile the design for a Cyclone IV with 30 kLE, but fitting failed due to lack of IO pins. The device usage was 88%, which makes it not so certain (to me) that you could squeeze 19 hashers into the 22 k device.

While checking the numbers, I could verify that a hasher stage 'miningcore' would use 1157 LEs, but this number excludes the sha256core submodule, which seems a very important part to me ;) So you should recalculate your expectations with the number of 1925 LEs per hasher.

Correction: the .zip file comes with only a quickly thrown together sdc and no qpf, just vhdl.
-Edit-
I can admit a mistake, you're right, with zero register duplication, no optimization, no resource sharing, etc, the core takes up 1925. But let me be very clear: with optimizations that size goes down significantly, I've shut off all optimization in order to preserve logic that would otherwise be synthesized away. To think that it won't use sharing with all of the XOR, AND, and + repetitions is absurd.
Apparently, however, I need to revamp my numbers, so I'll get something back to you on the core controller soon.
Also, in the design report the multiplexer restructure savings alone are 81 LEs, so 1925 -> 1844 for the current design.

In better news, I have my mining core shrunk from 768 -> 582 LEs as of today.

Quote from: mrb on April 10, 2012, 04:26:46 AM

Interesting. Assuming your design can be ported from the Cyclone IV to the Stratix III, an EP3SL200 at 213MHz with 250 of your 800-LE cores would produce 416 Mhash/s. That would account exactly for the performance of the BitForce Single (rumored to be two EP3SL200 chips = 832 Mhash/s)...

Does it port? Yes. I don't have experience with the Stratix III series, does it use the same architecture (or similar) to that of the Stratix IV? If so the logic count would be decreased (compiling this on my Stratix IV gave me better numbers than the ones I've quoted). I haven't looked into the pricing on that unit but work will continue and if Altera devices become advantageous, then we'll use them.

nedbert9

sr. member

Activity: 252

Merit: 250

Inactive

Good effort, Wondermine.

lame.duck

legendary

Activity: 1270

Merit: 1000

Hm, how did you get the number of Logic cells without having a qpf and qsf file?
Why do you design a control logic with 288 IOs for a chip with only 167 usable IO pins (153 usable on the DE0, including the pins for RAM etc.)?

I tried to test compile the design for a Cyclone IV with 30 kLE, but fitting failed due to lack of IO pins. The device usage was 88%, which makes it not so certain (to me) that you could squeeze 19 hashers into the 22 k device.

While checking the numbers, I could verify that a hasher stage 'miningcore' would use 1157 LEs, but this number excludes the sha256core submodule, which seems a very important part to me ;) So you should recalculate your expectations with the number of 1925 LEs per hasher.
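The logic budget behind that objection is easy to check with a few lines of Python. `DEVICE_LES` rounds the "22 k device" to 22,000 LEs as an approximation; the other figures are the ones quoted in this post.

```python
DEVICE_LES = 22_000      # "22 k device", rounded; the exact LE count is not given here
LES_PER_HASHER = 1925    # miningcore including the sha256core submodule
HASHERS_PLANNED = 19     # number of hashers under discussion

les_needed = HASHERS_PLANNED * LES_PER_HASHER
max_hashers = DEVICE_LES // LES_PER_HASHER

print(f"{HASHERS_PLANNED} hashers need {les_needed} LEs")
print(f"at {LES_PER_HASHER} LEs each, at most {max_hashers} fit in {DEVICE_LES} LEs")
```

This ignores control logic and any savings from optimization, so it is only a sanity check on the unoptimized numbers.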

mrb

legendary

Activity: 1512

Merit: 1028

Interesting. Assuming your design can be ported from the Cyclone IV to the Stratix III, an EP3SL200 at 213MHz with 250 of your 800-LE cores would produce 416 Mhash/s. That would account exactly for the performance of the BitForce Single (rumored to be two EP3SL200 chips = 832 Mhash/s)...

wondermine

newbie

Activity: 59

Merit: 0

I've got a preliminary Nanominer bitstream. I'd like to be clear about whether it works: the digester is functional and has been tested with fpgaminer's code. The rest of it looks fine in simulation, but the final product's *control circuitry* (i.e. state machines) may need fixing. That said, the digester, which is the heart of it all, *definitely works* at the size and performance I'm quoting here. To prove this to everyone, I'm linking a .zip with the design files so you can see what I've been working on. I only guarantee the digester (working_sha256.vhd) works 100%, but so far the rest looks good.
Also that digester is sitting in for a better, more pipelined one that is smaller and performs better, but is still in the works. I thought I'd let you all see something that actually can produce bitcoin. The *current* specs are as follows:

Note:
-Compiled with Web Edition, I need to go to school and put this through the subscription one
-The fmax *varies* with the chip
-I do not know if this is analogous to Xilinx logic consumption; if it is, things are going well for all of us.
-The code is run preserving nodes with lost fanout because I don't have decent input and Quartus wants to synthesize away the duplicate cores (which are actually working in parallel)
-I'm posting this for your interest, peace of mind, and maybe to whet your appetite, this is not to be scrutinized; it's a work in progress. Constructive ideas, go for it, but picking the hell out of my design really won't do much good
-The new core is very promising, I expect a ~15% increase in performance in the next week

So:
Control Circuit Logic Consumption: 289 LC Registers (One per chip)
Core Logic Consumption: 1844 LC Registers (Iterative, as many as you like)
Cycles per Hash: 128
fmax @ Speed Grade 6: 201.73 MHz (Cyclone IV)
Hashrate: 1.58MH/core
--edited to fix size--

So, it's not completely groundbreaking, yet, but there's a lot more where this came from. This little announcement is more to say that I'm working, and this thing is coming. I have my core less than 800 LEs (which would mean a DE0 hashrate of >40MH/s, and a significant improvement past 210MH/s on a Spartan-6), but I need to get timing logistics down, so more to come.
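As a quick cross-check of these figures, here is a minimal Python sketch. The per-core numbers are the ones listed above; the 22,000 LE device size is an approximation for the DE0-Nano's FPGA, and the sketch assumes an 800 LE core would keep the same fmax and cycles per hash, which is not established anywhere in this post.

```python
CYCLES_PER_HASH = 128
FMAX_MHZ = 201.73        # Cyclone IV, speed grade 6
CONTROL_LES = 289        # one control circuit per chip
CORE_LES_TARGET = 800    # upper bound on the smaller core mentioned above
DEVICE_LES = 22_000      # approximate device size, an assumption

per_core_mhs = FMAX_MHZ / CYCLES_PER_HASH
cores_that_fit = (DEVICE_LES - CONTROL_LES) // CORE_LES_TARGET
de0_estimate_mhs = cores_that_fit * per_core_mhs

print(f"per core: {per_core_mhs:.2f} MH/s")
print(f"cores that fit: {cores_that_fit}")
print(f"DE0 estimate: {de0_estimate_mhs:.1f} MH/s")
```

Under those assumptions the estimate lands a little above 40 MH/s, which is consistent with the figure quoted for the DE0, but it leaves no room for routing overhead or a lower achieved clock.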

In the meantime, I'll post the VHDL for you all. As always, donations are welcome, I do spend a hell of a lot of time on this and the way things are looking I'll break more records than just the DE0-Nano speed record. I haven't broken the 210MH barrier yet, but soon enough, I just need to put a little more time into it.

Edit: By the way, it's not commented, and it's got an SDC but no QPF or anything, just straight VHDL. All rights reserved me etc.

cheers!

tgmarks

donator

Activity: 490

Merit: 500

Can't wait to see news of a functioning prototype. Love the expandable modular nature with additional boards.

JWU42

legendary

Activity: 1666

Merit: 1000

Liking what I am seeing - wish there was MOAR hashes though...

rjk

sr. member

Activity: 448

Merit: 250

1ngldh

Quote from: norulezapply on March 30, 2012, 09:44:43 AM

Quote from: rjk on March 30, 2012, 07:50:44 AM

Quote from: norulezapply on March 30, 2012, 04:10:04 AM

Watching...

PS I'd rather have an "as-cheap-and-barebones-as-possible" FPGA.

No fancy screens or built in ethernet or cgminer. Just a bare FPGA board that I can connect with USB.

Ztex?

Nanominer seems like it's going to be a cheaper option than ztex is unless I'm looking at it wrong

Quite so. But if you want fewer features at a higher price, Ztex is the way to go. :D

norulezapply

hero member

Activity: 481

Merit: 502

Quote from: rjk on March 30, 2012, 07:50:44 AM

Quote from: norulezapply on March 30, 2012, 04:10:04 AM

Watching...

PS I'd rather have an "as-cheap-and-barebones-as-possible" FPGA.

No fancy screens or built in ethernet or cgminer. Just a bare FPGA board that I can connect with USB.

Ztex?

Nanominer seems like it's going to be a cheaper option than ztex is unless I'm looking at it wrong

rjk

sr. member

Activity: 448

Merit: 250

1ngldh

Quote from: norulezapply on March 30, 2012, 04:10:04 AM

Watching...

PS I'd rather have an "as-cheap-and-barebones-as-possible" FPGA.

No fancy screens or built in ethernet or cgminer. Just a bare FPGA board that I can connect with USB.

Ztex?

norulezapply

hero member

Activity: 481

Merit: 502

Watching...

PS I'd rather have an "as-cheap-and-barebones-as-possible" FPGA.

No fancy screens or built in ethernet or cgminer. Just a bare FPGA board that I can connect with USB.

Topic: Nanominer Announcement (Read 11707 times)