Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 15.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: kingcoin on March 31, 2013, 10:34:30 AM

I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz. But of course with careful placement and constraints it might be possible.

I just wanted to point out one thing: trying to achieve timing closure is a blind alley. What you should really aim for is power optimization. To my knowledge, none of the popular toolchains offers that as a goal.

With the unrolled design, the fanout of some registers is high enough to trigger combinatorial logic duplication when the tools search for closure. I haven't tried Vivado, but ISE was even doing register duplication. This is exactly what you don't want when doing an FPGA design that has to compete with an ASIC design. In the absence of pure power optimization, your next-best goal is to optimize for area.

I guess working with the two unrolled copies of SHA-256 produces such a wild mess of ~~trees~~ primitives that it is possible to lose one's bearings in the jungle of ~~vines~~ signals.

senseless

hero member

Activity: 1118

Merit: 541

Quote from: kingcoin on March 31, 2013, 03:23:01 PM

According to Altera, NRE for 90nm was in the range of $240K to $345K, which is fairly low. http://www.altera.com/products/devices/hardcopy-asics/about/migration/hrd-migration.html

The pricing isn't bad at all, but the only open source designs available for conversion do not have a very good multi-core base design. We could easily take this design into an Avalon-style 1 chip per core layout, but that seems like an awful waste of PCB space. The $20/GHash pricing before was for a single chip operating at 250MHz with 10 cores on it. It would take 28 chips to reach 68GH/s, as opposed to Avalon's 240 chips to reach that speed. The pricing I got is still a little high. It won't be effective (competitive price match) until it hits something like $10/GHash, at which point people could build their own units for less than the cost of Avalon's, BFL's, etc.

It should be possible to get a miner @ 800$ cost with 70Gh/s @ 200-400W (28nm-45nm).
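For anyone checking the numbers in this post, here is a rough sketch of the arithmetic. It assumes each core produces one hash per clock (my assumption, though it matches the 2.5-3GH/s per unit figure quoted elsewhere in the thread); the 250MHz, 10 cores, 68GH/s, 240 chips, $800 and 70GH/s figures are the ones from the post.

```python
# Figures quoted in the post above.
CLOCK_HZ = 250_000_000          # 250 MHz per chip
CORES_PER_CHIP = 10
TARGET_HPS = 68_000_000_000     # 68 GH/s
AVALON_CHIPS = 240

# Assumption: each core finishes one hash per clock cycle.
chip_hps = CLOCK_HZ * CORES_PER_CHIP

# Ceiling division: a fraction of a chip still needs a whole chip.
chips_needed = -(-TARGET_HPS // chip_hps)

avalon_chip_hps = TARGET_HPS / AVALON_CHIPS
cost_per_ghs = 800 / 70         # $800 miner at 70 GH/s

print(f"per chip: {chip_hps / 1e9:.1f} GH/s")
print(f"chips needed for 68 GH/s: {chips_needed}")
print(f"Avalon per chip: {avalon_chip_hps / 1e6:.0f} MH/s")
print(f"cost: ${cost_per_ghs:.2f} per GH/s")
```

So the $800 unit would land a little above the $10/GHash mark mentioned above.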

Maybe some sort of non-profit coop to collect funds to get the initial design conversion, mask printing and chips made? Could then just sell chips on as needed basis close to cost.

kingcoin

sr. member

Activity: 262

Merit: 250

According to Altera, NRE for 90nm was in the range of $240K to $345K, which is fairly low. http://www.altera.com/products/devices/hardcopy-asics/about/migration/hrd-migration.html

senseless

hero member

Activity: 1118

Merit: 541

Quote from: kingcoin on March 31, 2013, 10:58:29 AM

Quote from: senseless on March 31, 2013, 06:33:53 AM

This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).

...

I don't have the Hardcopy prices from Altera or Xilinx directly. But based on my investigation of some other fabless companies, it should take somewhere around 300K chips at 45nm and somewhere around 1M chips at 28nm to get competitive pricing. The pricing I was getting at 45nm was around $20/GHash at 300K units (2.5-3GH/s per unit). Keep in mind these quotes were from fabless companies that were just reselling someone else's services but did their own in-house design conversion. The fabless companies are obviously going to be a bit higher on per-unit price and NRE, as that's where they get their cash from, as opposed to going direct with Altera's or Xilinx's structured ASIC processes with no middleman.

kingcoin

sr. member

Activity: 262

Merit: 250

Quote from: senseless on March 31, 2013, 06:33:53 AM

This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).

How many devices (e.g. Stratix) do you have to make before the unit cost including the NRE is lower for Hardcopy?
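The break-even volume asked about here is just the NRE divided by the per-unit saving. A minimal sketch, using the $240K to $345K NRE range that comes up elsewhere in this thread; the two unit prices are made-up placeholders, not real Stratix or Hardcopy quotes.

```python
import math

def break_even_units(nre, fpga_unit_cost, asic_unit_cost):
    """Smallest volume n at which (nre + n * asic) < (n * fpga).

    Returns None if the structured ASIC is not cheaper per unit,
    in which case the NRE is never recovered.
    """
    saving = fpga_unit_cost - asic_unit_cost
    if saving <= 0:
        return None
    return math.floor(nre / saving) + 1

# NRE range as quoted in the thread; unit costs are HYPOTHETICAL placeholders.
for nre in (240_000, 345_000):
    n = break_even_units(nre, fpga_unit_cost=1000.0, asic_unit_cost=200.0)
    print(f"NRE ${nre}: cheaper from {n} units")
```

With real quotes plugged in, the same function answers the question directly.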

kingcoin

sr. member

Activity: 262

Merit: 250

Quote from: fpgaminer on March 30, 2013, 07:27:29 PM

Quote

But the Spartan6 fpga fabric does not run at 500MHz.

Sorry, I was talking about Kintex 7, which most certainly can. Kintex 7 has similar performance to the Virtex 6, at less cost.

I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz. But of course with careful placement and constraints it might be possible.

Ersch

newbie

Activity: 5

Merit: 0

Quote from: Ersch on March 30, 2013, 01:14:27 PM

What is the best way to run multiple devices (25 through USB)? Is it possible to run one miner per card and maybe change the address? (Like use USB1, USB2....)

One idea would be to run the miner in a VM and map one USB device per machine, but that is heavy.

senseless

hero member

Activity: 1118

Merit: 541

I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there. The code would be split into 2 segments on different clocks/plls. The first pll would be a master controller on chip that interfaces with the mining software. It will receive nonces from the code and store them in memory (as opposed to sending the nonces direct to the miners). The miners would run on their own pll (separate clock), would read nonces from memory and write any golden nonces back to a different memory segment. Each miner would provision their own pool in memory to hold N nonces and N golden nonces.

Rough flow chart:

Master controller:
Software sends nonce to master controller -> on-chip master controller saves nonce to memory under a hashing core -> on-chip master controller looks for golden nonce in separate memory area -> golden nonce sent back to software for reporting to network/pool

Hasher Cores:
Hashing core reads new nonce from its memory segment -> hashing core performs hashes on this nonce range (flipping nonce in memory) -> if golden nonce is found write to a different memory segment

Software Signals:

Reqs (Requests nonce from software)
Rest (overwrites nonces in on-chip memory; Num Core * nonce pool size per core = number of nonces to request, then overwrite in memory)
Nonc (sending nonce from software to chip)
Stat (Requesting stats on chip processing speed)
Gnon (sending golden nonce upstream to software)

Thoughts:

Nonces can be flipped in memory and then pulled to start the next hash, and so on through the nonce range. When a core detects the last nonce, it starts working on the next nonce in the memory pool. For instance, provision room in memory for 3 nonces per core; once the 4 billion results of nonce 0 are completed, the core would start working on nonce 1. Meanwhile, the master controller would see in memory that nonce 0 is finished (all 4 billion flips calculated) and overwrite that memory segment with a fresh nonce. The reset signal would only need to overwrite every existing memory segment with new nonces; it does not need to reset the cores or make any other changes, since nonces are flipped in memory.
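To make the flow above concrete, here is a small software model of one core's memory pool. It is only a sketch of the idea, not HDL: the names (`CorePool`, `master_refill`, `core_step`, `master_collect`) are made up for illustration, the nonce range is shrunk from 2^32 to 4096, and the "golden" test is a toy difficulty check.

```python
import hashlib
from dataclasses import dataclass, field

NONCE_RANGE = 1 << 12   # toy range; the real design sweeps 2**32 per item
POOL_SLOTS = 3          # "room in memory for 3 nonces per core"

@dataclass
class CorePool:
    slots: list = field(default_factory=lambda: [None] * POOL_SLOTS)
    golden: list = field(default_factory=list)   # separate result segment
    cursor: int = 0                              # slot the core works on next

def is_golden(work, nonce):
    # Double SHA-256 with a toy target: first digest byte must be zero.
    inner = hashlib.sha256(work + nonce.to_bytes(4, "little")).digest()
    return hashlib.sha256(inner).digest()[0] == 0

def master_refill(pool, work_source):
    # Master controller: overwrite every finished (empty) slot with fresh work.
    for i, slot in enumerate(pool.slots):
        if slot is None:
            pool.slots[i] = next(work_source)

def master_collect(pool):
    # Master controller: drain the golden segment for reporting upstream.
    found, pool.golden = pool.golden, []
    return found

def core_step(pool):
    # Hashing core: sweep the current slot, mark it finished, move on.
    work = pool.slots[pool.cursor]
    if work is None:
        return False
    for nonce in range(NONCE_RANGE):
        if is_golden(work, nonce):
            pool.golden.append((work, nonce))
    pool.slots[pool.cursor] = None
    pool.cursor = (pool.cursor + 1) % POOL_SLOTS
    return True

def work_items():
    n = 0
    while True:
        yield n.to_bytes(8, "big")   # stand-in for real work data
        n += 1

source = work_items()
pool = CorePool()
master_refill(pool, source)    # fill all three slots
core_step(pool)                # core finishes slot 0
master_refill(pool, source)    # master overwrites only slot 0
results = master_collect(pool)
print(len(results), "golden nonces found; core now on slot", pool.cursor)
```

The point of the model is that the master and the core only ever meet in memory, which is what would let them sit on separate clocks.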

..

The reason I came up with this sort of design is that, after playing with the code, the worst-case slack seems to be in the path that reports a golden nonce upstream. Hence you can seriously overclock the design over fmax and it works fine, other than reporting bad results upstream to the software. I'm able to push my clock rate almost up to 275MHz without the compile failing completely (with edge/corner timing errors). Putting the master controller on its own clock/PLL, separate from the hasher cores, would allow the fmax of the hashing cores to skyrocket while the controller is set at a more conservative level for software communications.

.... Hell my chip has 8 PLLs, could probably put every core on its own PLL so a slow down in one hashing core does not affect the others. (Which would probably be ideal, every "core" would have its own fmax and timing)

..

It would be nice if we could come up with a fully functioning/optimized unrolled multi-core design so anyone could take said design and produce a top level structured asic design (print their own chips). Just make sure to release under a license which requires all modifications to be reported. It's not really THAT expensive to get your own structured asic produced from design, takes awhile to complete but < 100K should be fine at > 90nm. This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).

Ersch

newbie

Activity: 5

Merit: 0

What is the best way to run multiple devices (25 through USB)? Is it possible to run one miner per card and maybe change the address? (Like use USB1, USB2....)

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

But the Spartan6 fpga fabric does not run at 500MHz.

Sorry, I was talking about Kintex 7, which most certainly can. Kintex 7 has similar performance to the Virtex 6, at less cost.

Quote

What is the best way to run multiple devices (25 through USB)?

Your best bet would be to modify the code to use a serial interface that could be chained, instead of the altsource_probe. You could do it with the current altsource_probe, but you'd need to modify this code to find multiple devices. Also, last I checked, Quartus had issues handling more than one device plugged into the same USB controller.
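One way a chained serial interface could address 25 boards is a hop counter at the front of each work frame. The sketch below is purely hypothetical (the frame layout and function names are not from the project): each device takes the frame if the counter is zero, otherwise it decrements the counter and passes the frame down the chain. The payload sizes use the 256-bit midstate and 96 bits of data mentioned elsewhere in the thread.

```python
MIDSTATE_LEN = 32   # 256-bit midstate
DATA_LEN = 12       # 96 bits of relevant data

def pack_frame(hops, midstate, data):
    """Build a HYPOTHETICAL work frame: 1 hop-count byte + midstate + data."""
    if not 0 <= hops <= 255:
        raise ValueError("hop count must fit in one byte")
    if len(midstate) != MIDSTATE_LEN or len(data) != DATA_LEN:
        raise ValueError("unexpected payload size")
    return bytes([hops]) + midstate + data

def device_receive(frame):
    """What one device in the chain does: returns (work, frame_to_forward)."""
    hops, payload = frame[0], frame[1:]
    if hops == 0:
        return payload, None                      # this frame is for me
    return None, bytes([hops - 1]) + payload      # pass it down the chain

def deliver(frame, n_devices):
    """Walk the frame down a chain; returns (device index, work) or (None, None)."""
    for index in range(n_devices):
        work, frame = device_receive(frame)
        if work is not None:
            return index, work
    return None, None                             # fell off the end of the chain

frame = pack_frame(7, bytes(MIDSTATE_LEN), bytes(DATA_LEN))
index, work = deliver(frame, 25)
print("frame consumed by device", index)
```

The host then needs only one serial port for the whole chain, at the cost of a little latency per hop.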

kingcoin

sr. member

Activity: 262

Merit: 250

Quote from: fpgaminer on March 30, 2013, 05:57:26 AM

Quote

On a single core?

Yes. DSP48E1's can run up to 500MHz

But the Spartan6 fpga fabric does not run at 500MHz. Maybe some clever interleaving might make it possible to run the fabric interface at a lower clock.

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

On a single core?

Yes. DSP48E1's can run up to 500MHz, so if you unroll all the calculations, replace all the additions with DSP48E1's, register everything, and throw in misc. logic for the non-linear calculations you can get about 500MH/s, depending on the speed grade. It all fits into a Kintex-7 160 (the 160 has a higher DSP48 density than the 325). Probably some room left over for a normal hashing core, though I'm not sure. The DSP design requires quite a lot of registers.
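A quick sanity check on the 500MH/s figure, assuming a fully unrolled and fully registered pipeline accepts one nonce per clock (that is the usual property of such a design, but it is my assumption here, not something measured). The function name is made up for illustration.

```python
NONCE_SPACE = 2 ** 32   # 32-bit nonce field

def sweep_time(clock_hz, cores=1):
    """Seconds to try every nonce, assuming one hash per clock per core."""
    hashes_per_second = clock_hz * cores
    return NONCE_SPACE / hashes_per_second

# 500 MHz DSP48E1 pipeline, single core: 500 MH/s.
seconds = sweep_time(500_000_000)
print(f"full nonce sweep at 500 MH/s: {seconds:.2f} s")
```

That is roughly how often such a core would need fresh work.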

EDIT: And yes, I implemented the design, so it's feasible. It was never fully debugged though, because at the time the Xilinx simulator couldn't handle Kintex's DSP48E1's very well.

kingcoin

sr. member

Activity: 262

Merit: 250

Quote from: fpgaminer on March 30, 2013, 04:24:56 AM

It's also possible to implement a miner using only DSP48s and misc. logic, achieving about 500MH/s. I haven't released any code for that yet.

On a single core?

fpgaminer

hero member

Activity: 560

Merit: 517

Quote

Not possible. Max per channel is 256 bits.

Quick note, kramble; the max is actually 511-bits.

Quote

Does the X6500 use JTAG for communication or does it use some more effective protocol?

It uses JTAG. There is an FTDI chip on there that allows bit-banging pins over USB, and so compatible software bit-bangs JTAG to talk to the FPGA. The entire protocol sitting on top of JTAG is described in jtag_comm.v.

Quote

I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be.

I would recommend starting with the X6000_ztex_comm4 project. That's the same code that generated the bitstreams on the fpgamining.com website. You'll want to remove the jtag communication related code and replace it with serial communication, or something else. You can then multi-core that and exchange some resources for DSP48s as iidx mentioned.

It's also possible to implement a miner using only DSP48s and misc. logic, achieving about 500MH/s. I haven't released any code for that yet.

iidx

newbie

Activity: 35

Merit: 0

Quote from: kintex_wibble on March 29, 2013, 01:30:26 PM

I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be.
Multicore is key and a pointer to a working open source software/fpga combo for a serial interface would be hugely appreciated but any sensible starting point would be fine - I expect to do some work!
I'm poking about with the verilog_xilinx port at the moment.

I started with the verilog_xilinx port back in 2011 to put on a handful of ML605s (V6 240s). I also have some K7 325s and K7 480s at work, but only did build tests for those because I didn't have permanent access to those boards.

I would suggest starting with verilog_xilinx or one of the Ztex ports. I used PCIe for mine, so I modified the interfaces to take in 32 bit words. Unfortunately that means I don't have a starting place for you to use if you are going to try and use the serial port.

However, I would not try to fit more than 3 instances of the fully unrolled verilog_xilinx version into that 325 without changing some of the adders into DSP48s. On the V6 240 I can fit 3 instances if I use most of the DSP48s to replace some of the adders in the design. Sadly, the K7 325 doesn't have that many more adders. I don't think I was successful getting 4 instances of the verilog_xilinx port to fit.

Technically you could actually just make several instances of the entire design... and just use multiple serial ports to talk to it.

kramble

sr. member

Activity: 384

Merit: 250

Quote from: Ltcfaucet on March 29, 2013, 03:50:10 PM

What are the 384 bits for when it sends the new nonce and buffer?

Is that wasted space or used for something?

It's the fixed part of the input to the second round of the sha256 transform (I explained it a few posts ago, probably not too well though).

And yes, there is wasted bandwidth in the altsource_probe comms, but it doesn't matter as this has no effect on the hash rate, since it only does a getwork every few seconds compared to a hash rate of millions per second.

I guess you'll need to read up on the bitcoin hashing algorithm, but I'm not really the guy to explain it (no expert me, so I'm not going to make a fool of myself trying).
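For readers following along, here is my understanding of where 384 comes from, as a sketch based on standard SHA-256 padding of the 80-byte block header rather than on the project's source. The first 64 bytes are covered by the midstate; the second 64-byte block holds the last 16 header bytes (96 bits of data plus the 32-bit nonce), and the remaining 48 bytes are padding that never changes. The function name is made up for illustration.

```python
HEADER_LEN = 80   # Bitcoin block header, in bytes
BLOCK_LEN = 64    # SHA-256 processes 64-byte blocks

def second_block(header):
    """Return the second 64-byte SHA-256 block for an 80-byte header."""
    if len(header) != HEADER_LEN:
        raise ValueError("expected an 80-byte header")
    tail = header[BLOCK_LEN:]                 # 16 bytes: 12 data + 4 nonce
    # Standard SHA-256 padding: 0x80, zeros, then the 64-bit message
    # length in bits (80 * 8 = 640), big-endian.
    padding = b"\x80" + b"\x00" * 39 + (HEADER_LEN * 8).to_bytes(8, "big")
    return tail + padding

block = second_block(bytes(HEADER_LEN))       # dummy all-zero header
fixed_bits = (len(block) - 16) * 8
print("second block:", len(block) * 8, "bits, of which fixed:", fixed_bits)
```

So 96 + 32 + 384 = 512 bits, and only the first 128 of them ever vary.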

All the best
Mark

Ltcfaucet

full member

Activity: 126

Merit: 100

Quote from: kramble on March 29, 2013, 04:50:54 AM

Quote from: Ltcfaucet on March 28, 2013, 11:52:06 PM

So what if I double altsource_probe width?

Not possible. Max per channel is 256 bits. Anyway why would you want to? Only 96 bits are relevant [EDIT] (for data, of course midstate uses all 256). Besides which, the X6500 uses a Xilinx device. Altsource_probe is for Altera devices.

Quote from: Ltcfaucet on March 28, 2013, 11:52:06 PM

X6500 blow up?

You should not be using this software with the X6500 as it is for general purpose FPGA development boards and the X6500 is a purpose built bitcoin miner (I'm not saying it can't be used on a X6500, but I don't see why you would want to as it already has its own fully optimized bitstream).

[EDIT] Actually it would appear that the X6500 bitstream is based on the Open Source FPGA Bitcoin Miner, see http://fpgamining.com/documentation/firmware. I really should not comment on things I know nothing about.

You should be able to get help on the X6500 thread https://bitcointalksearch.org/topic/x6500-custom-fpga-miner-40058

Regards
Mark

What are the 384 bits for when it sends the new nonce and buffer?

Is that wasted space or used for something?

kintex_wibble

newbie

Activity: 8

Merit: 0

I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be.
Multicore is key and a pointer to a working open source software/fpga combo for a serial interface would be hugely appreciated but any sensible starting point would be fine - I expect to do some work!
I'm poking about with the verilog_xilinx port at the moment.

kingcoin

sr. member

Activity: 262

Merit: 250

Quote from: Ltcfaucet on March 28, 2013, 11:52:06 PM

So what if I double altsource_probe width?
X6500 blow up?

I don't know anything about the X6500 either. But if the RTL code is the same, then I would assume that you simply shuffle lots of unused data over JTAG.

Does the X6500 use JTAG for communication or does it use some more effective protocol?

I noticed on their web site that they stopped making these boards, does anybody know why?

EDIT: Also how much did these boards cost when they were available for sale?

kramble

sr. member

Activity: 384

Merit: 250

Quote from: Ltcfaucet on March 28, 2013, 11:52:06 PM

So what if I double altsource_probe width?

Not possible. Max per channel is 256 bits. Anyway why would you want to? Only 96 bits are relevant [EDIT] (for data, of course midstate uses all 256). Besides which, the X6500 uses a Xilinx device. Altsource_probe is for Altera devices.

Quote from: Ltcfaucet on March 28, 2013, 11:52:06 PM

X6500 blow up?

You should not be using this software with the X6500 as it is for general purpose FPGA development boards and the X6500 is a purpose built bitcoin miner (I'm not saying it can't be used on a X6500, but I don't see why you would want to as it already has its own fully optimized bitstream).

[EDIT] Actually it would appear that the X6500 bitstream is based on the Open Source FPGA Bitcoin Miner, see http://fpgamining.com/documentation/firmware. I really should not comment on things I know nothing about.

You should be able to get help on the X6500 thread https://bitcointalksearch.org/topic/x6500-custom-fpga-miner-40058

Regards
Mark

Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013) - page 15. (Read 432972 times)