
Topic: DIY FPGA Mining rig for any algorithm with fast ROI - page 70. (Read 99494 times)

hero member
Activity: 1118
Merit: 541
Could someone familiar with any of those technologies make another post with relevant calculations?

Each 9P has a number of 32.75Gb/s GTY transceiver I/Os, up to 120 on the A2577 package. Though the 5P in the B2104 package would probably be more ideal for this application, connecting one 5P to one HMC x4 link...

Each HMC v1 x4 link allows (up to) 15Gb/s per pin with 64 pins in use (480Gb/s, i.e. 60GB/s, in each direction, full duplex). There is support for an x8 link in the HMC v1 spec; not sure if any memory was ever made for x8 links, though.
Each HMC v2 x4 link allows (up to) 30Gb/s per pin with 64 pins in use (960Gb/s, i.e. 120GB/s, in each direction, full duplex).

The other nice thing about HMC is that its latency is closer to DDR4 than GDDR5. When using the HMC optimally, latency can be as low as 1ns, with about 20ns in the worst case. Using a bus width of 128 bytes (as in CNv7) you can achieve 90% bus efficiency, and with the logic layer some interesting things can be done. I doubt you'll get anywhere near 90% with the 1024-bit bus width of HBM2.
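
As a quick sanity check on those figures, here's a minimal Python sketch (the 90% is the bus-efficiency claim above, not a datasheet number):

Code:
# HMC link bandwidth from the per-pin rates quoted above.
def hmc_link_gbytes(gbps_per_pin, pins=64, efficiency=1.0):
    """Bandwidth in GB/s per direction over `pins` full-duplex pins."""
    per_direction_gbps = gbps_per_pin * pins / 2  # half the pins each way
    return per_direction_gbps / 8 * efficiency

print(hmc_link_gbytes(15))                  # HMC v1: 60.0 GB/s
print(hmc_link_gbytes(30))                  # HMC v2: 120.0 GB/s
print(hmc_link_gbytes(30, efficiency=0.9))  # ~108 GB/s at 90% efficiency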

Either way, not bad for a $500 proto-sample part (but not amazing either).


Edit:

Btw, I did some reading on algebraic logic minimization last night, along with a couple of other techniques. This is already done, automatically, during synthesis (but can be turned off). Seeing the process, yes, it's something that could be added to simplify logic circuits. HOWEVER, Vivado already does it! I'm starting to question the OP and whether this bittware account is even really BittWare. I might have to put my foot in my mouth in 18 days, but the more I look at it, the more I think it's not possible. An elaborate scam?
member
Activity: 144
Merit: 10

Do you really think that they have in that box a single chip dissipating 1kW with 2 USB, 1 Ethernet and 1 HDMI port?
 

To be fair, I've viewed that website in the past and I did not see their ASIC chip specifications. At least to me, their specifications were for the unit as a whole.
legendary
Activity: 2128
Merit: 1073
I get really skeptical at claims like “520MH ETH”.  Where is the 4TB/s of memory bandwidth coming from?

32 GB of installed memory means ~130 GBytes/second per GB. If you go with the highest currently available pin rate (GDDR6), that's 18Gbps per I/O pin. 130GB/s is about 1 Tbit/s, so you need ~57 pins per GB of memory; let's account for overhead and say 64. That means two 32-pin chips per GB, i.e. 4Gb chips: 64 chips and over 2048 I/O lines. Not impossible, but definitely a lot of chips/expense.
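
A minimal sketch of that arithmetic (the 64x128-byte access pattern is the usual Ethash figure; everything else comes from the post):

Code:
# Memory bandwidth needed for the claimed 520 MH/s of Ethash.
BYTES_PER_HASH = 64 * 128                # ~64 DAG reads of 128 bytes each
claimed_hs = 520e6                        # claimed hashes per second
total_gbs = claimed_hs * BYTES_PER_HASH / 1e9   # ~4.3 TB/s in total
per_gb = total_gbs / 32                   # per GB of the 32GB of memory
pins_per_gb = per_gb * 8 / 18             # at 18 Gbps per GDDR6 pin
print(f"{total_gbs:.0f} GB/s total, {per_gb:.0f} GB/s per GB, "
      f"~{pins_per_gb:.0f} pins per GB")  # ~4260, ~133, ~59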
I think you made the mistake of taking their marketing materials literally. Their marketing people wrote seriously confused and non-physical stuff like "approx 900W to 1kW per hour". It also confuses single-ASIC-chip parameters with overall-system parameters.

Do you really think that they have in that box a single chip dissipating 1kW with 2 USB, 1 Ethernet and 1 HDMI port?

To me it is quite obvious that they must be proposing a system containing many much smaller chips. Why couldn't this be a system of 64 chips with 0.5GB of eDRAM in each chip? Reconfigurable cryptographic processors have been researched and produced for about 20 years now.

I wouldn't rush to judgment based on a single marketing blurb. Have you ever played "telephone" as a kid? That is the standard game played in meetings between R&D and marketing.

Edit:

Before I succeeded in posting the above, I received a warning that there was a new post from somebody, directed at GPUhoarder, saying something to the effect of "redo the calculations with HMC or HBM". I think that post has disappeared due to a forum bug or possibly self-censorship.

Could someone familiar with any of those technologies make another post with relevant calculations?

Thanks.

https://en.wikipedia.org/wiki/Hybrid_Memory_Cube
https://en.wikipedia.org/wiki/High_Bandwidth_Memory
member
Activity: 154
Merit: 37

I get really skeptical at claims like “520MH ETH”.  Where is the 4TB/s of memory bandwidth coming from?

32 GB of installed memory means ~130 GBytes/second per GB. If you go with the highest currently available pin rate (GDDR6), that's 18Gbps per I/O pin. 130GB/s is about 1 Tbit/s, so you need ~57 pins per GB of memory; let's account for overhead and say 64. That means two 32-pin chips per GB, i.e. 4Gb chips: 64 chips and over 2048 I/O lines. Not impossible, but definitely a lot of chips/expense.

hero member
Activity: 1118
Merit: 541
Lastly, I am not seeing anything in Amazon's ToS that forbids mining. senseless says he does (or has done) it, yet the OP and others have said Amazon will ban you for it. I'll take that bet and report back if I can ever get or build a working AFI (bitstream).

As long as you don't fire up a bunch of CPU miners on shared CPU resources, you're fine. Some nodes (including F1) have dedicated resources you can mine with. I did ask before I started CPU mining on the F1 nodes: no problem, it's dedicated. CPU mining on my FPGA nodes brought in an extra few K/mo.

I highly recommend Altera/Intel Quartus over Vivado for “learning”, despite using Xilinx for all my “professional” projects.

I second this; Quartus was always more intuitive for me. Not a fan of Vivado at all. Plus, Quartus has that placement engine that lets your CPU sit there trying various placement combinations until you get the best one. The best you get out of Vivado without third-party tools is a couple of passes, like the OP said.



member
Activity: 154
Merit: 37
Thank you for the tip on the $99 FPGA training kit. I was considering something like this myself and this is actually very useful. Can you recommend any sources of information on how mining is achieved on an FPGA?
Start with something simple like SHA256D used in Bitcoin:

https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner

Or something from the official Xilinx marketing publication:

http://issuu.com/xcelljournal/docs/xcell_journal_issue_84/16

Have fun!

Edit: I forgot to include the standard explanation of the difference between CPU programming and FPGA programming.

CPUs have a von Neumann architecture with linear memory, e.g. addresses from 0 to 4294967295. Compiling and running your first 10-line program will take seconds on a typical usable computer with 4GB of RAM.

FPGAs have a completely generic two-dimensional architecture with literally a kajillion constraints. Most of those constraints are kept secret by the FPGA vendor. The XCVU9P FPGA discussed in this thread has about 2,104 pads; just describing the pads that are actually connected to useful signals on the VCU1525 board takes 25 pages in the manual. And those are only the constraints that are visible and not secret.

When compiling your first 10-line FPGA program, the Xilinx toolchain still has to process those constraints even if you use less than 1% of 1% of the whole VU9P chip. On the other hand, the small XC7A35T chip on the training board has far fewer internal constraints that need to be read by the toolchain.

You will find that your small 10-line FPGA program has compiled completely for a small device like the 7A35T while, for the large device, the Xilinx toolchain is still decrypting the secret VU9P constraints file. Amazon recommends running their FPGA development kit for the VU9P on a computer with 32GB of RAM to get sensible performance for nontrivial programs.

Just be aware of the above.

Edit2: Trying to learn logic design on a high-end device may seriously test your patience. Start with a low-end device.

Edit3: Grammar fixes.


I highly recommend Altera/Intel Quartus over Vivado for “learning”, despite using Xilinx for all my “professional” projects. Quartus requires less patience and is easier to fool around with. Get a Max10 dev kit; it's a super easy way to test a lot of basic designs, and you can fit one pipeline for most small things.

SHA256 is actually, imho, harder than some of the SHA3 competition candidates. Keccak is pretty great as a first run: it takes about 2,500 LUTs to implement a non-pipelined single-round version. Getting enough UART comms running to meaningfully get bits to hash on/off the chip takes more effort.
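
For anyone curious what that single round actually computes, here is a plain software model of one Keccak-f[1600] round (the standard theta/rho/pi/chi/iota steps; a generic reference model, not the RTL discussed here) - handy for generating test vectors before simulation:

Code:
# One round of Keccak-f[1600] as a plain software reference model.
# A[x][y] is a 5x5 grid of 64-bit lanes; rc is the round constant.
MASK = 0xFFFFFFFFFFFFFFFF

RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
      0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
      0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
      0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
      0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
      0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
      0x8000000000008080, 0x0000000080000001, 0x8000000080008008]

ROT = [[0, 36, 3, 41, 18], [1, 44, 10, 45, 2], [62, 6, 43, 15, 61],
       [28, 55, 25, 21, 56], [27, 20, 39, 8, 14]]  # rho offsets, [x][y]

def rol64(v, n):
    return ((v << n) | (v >> (64 - n))) & MASK

def keccak_round(A, rc):
    # theta: column parity mixed into every lane
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rol64(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho (lane rotations) and pi (lane permutation)
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rol64(A[x][y], ROT[x][y])
    # chi: the only nonlinear step
    A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota: inject the round constant
    A[0][0] ^= rc
    return A

A fully pipelined core just instantiates 24 of these stages back to back, which is where the LUT count multiplies.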

I validated my Keccak RTL on a Max10 using Quartus simply because it could be done so easily and relatively quickly. Adapting and filling out all the pipelines on the VCU118 (UltraScale+ Virtex VU9P) takes hours to synthesize, place, and route for every minor change, even on a very well-equipped development box. The Artix-7 200T version takes about 35 minutes a shot on my MacBook Pro (in a VM). Patience is the name of the game - or set the web-dev and app skills aside and actually spend some serious time planning and doing pencil-and-paper design validation and review, then use logic simulation and ultimately timing simulation before ever putting anything on a chip.

Right now I'm reading a couple of books, watching YouTube videos and using the information given here. For others reading 'DIY FPGA' in the title and expecting to do it themselves: Intel is giving away 'FPGA for Dummies' if you sign up for their spam at https://plan.seek.intel.com/PSG_WW_NC_LPCD_FR_2018_FPGAforDummiesbook.

I was hoping to use the AWS F1 instance with their FPGA dev AMI, which has Xilinx SDx 2017.4 and a free license for F1 FPGA development, but it sounds like that would be jumping in at the deep end.

I saw a Keccak FPGA project for Nexys 4 Artix-7 boards; it says that it uses an external frequency generator at 100MHz and achieves a performance of 100MH/s.
 https://github.com/0x2fed/FPGA-Keccak-Miner
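
That throughput is consistent with a fully pipelined core, which retires one hash per clock once the pipeline fills (assuming the project is in fact fully pipelined):

Code:
# Sanity check on the quoted numbers.
clock_hz = 100e6             # external 100MHz clock on the Nexys 4
hashes_per_clock = 1         # one result per cycle from a full pipeline
print(f"{clock_hz * hashes_per_clock / 1e6:.0f} MH/s")  # -> 100 MH/s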

Lastly, I am not seeing anything in Amazon's ToS that forbids mining. senseless says he does (or has done) it, yet the OP and others have said Amazon will ban you for it. I'll take that bet and report back if I can ever get or build a working AFI (bitstream).

I took a look at the 0x2fed project for fun and to compare it to mine. There are some errors in it that can trip you up on a different chip, and some bugs in the XDC temp-monitoring interface that'll fail on newer IP. I did run it at 200MHz successfully on an A100T-2, but it got pretty hot without active cooling. The comm mechanism seemed odd to me, but I think it was based on the old open-source FPGA Bitcoin miner.

Keep in mind the bitstream is only half the battle - you also need a host program with serial comms and all the midstate etc. calculations, plus the stratum connection. That code doesn't document it, but if you want to play with it: it expects a 32-bit nonce, followed by the second 32-bit word of the target (the upper word is assumed 0), followed by the 76-byte midstate. Have fun!
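
A host-side packing routine matching that description might look like the following (the field order is from the description above; the little-endian choice and the names are my assumptions):

Code:
# Hypothetical host-side packing of the work payload described above.
import struct

def pack_work(nonce: int, target_word: int, midstate: bytes) -> bytes:
    assert len(midstate) == 76, "midstate must be 76 bytes"
    # 32-bit nonce, then the second 32-bit word of the target
    # (upper word assumed 0), then the 76-byte midstate.
    return struct.pack("<II", nonce, target_word) + midstate

# e.g. pack_work(0, 0x0000FFFF, b"\x00" * 76) -> 84-byte frame for the UART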

jr. member
Activity: 33
Merit: 1
Thank you for the tip on the $99 FPGA training kit. I was considering something like this myself and this is actually very useful. Can you recommend any sources of information on how mining is achieved on an FPGA?
Start with something simple like SHA256D used in Bitcoin:

https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner

Or something from the official Xilinx marketing publication:

http://issuu.com/xcelljournal/docs/xcell_journal_issue_84/16

Have fun!

Edit: I forgot to include the standard explanation of the difference between CPU programming and FPGA programming.

CPUs have a von Neumann architecture with linear memory, e.g. addresses from 0 to 4294967295. Compiling and running your first 10-line program will take seconds on a typical usable computer with 4GB of RAM.

FPGAs have a completely generic two-dimensional architecture with literally a kajillion constraints. Most of those constraints are kept secret by the FPGA vendor. The XCVU9P FPGA discussed in this thread has about 2,104 pads; just describing the pads that are actually connected to useful signals on the VCU1525 board takes 25 pages in the manual. And those are only the constraints that are visible and not secret.

When compiling your first 10-line FPGA program, the Xilinx toolchain still has to process those constraints even if you use less than 1% of 1% of the whole VU9P chip. On the other hand, the small XC7A35T chip on the training board has far fewer internal constraints that need to be read by the toolchain.

You will find that your small 10-line FPGA program has compiled completely for a small device like the 7A35T while, for the large device, the Xilinx toolchain is still decrypting the secret VU9P constraints file. Amazon recommends running their FPGA development kit for the VU9P on a computer with 32GB of RAM to get sensible performance for nontrivial programs.

Just be aware of the above.

Edit2: Trying to learn logic design on a high-end device may seriously test your patience. Start with a low-end device.

Edit3: Grammar fixes.


I highly recommend Altera/Intel Quartus over Vivado for “learning”, despite using Xilinx for all my “professional” projects. Quartus requires less patience and is easier to fool around with. Get a Max10 dev kit; it's a super easy way to test a lot of basic designs, and you can fit one pipeline for most small things.

SHA256 is actually, imho, harder than some of the SHA3 competition candidates. Keccak is pretty great as a first run: it takes about 2,500 LUTs to implement a non-pipelined single-round version. Getting enough UART comms running to meaningfully get bits to hash on/off the chip takes more effort.

I validated my Keccak RTL on a Max10 using Quartus simply because it could be done so easily and relatively quickly. Adapting and filling out all the pipelines on the VCU118 (UltraScale+ Virtex VU9P) takes hours to synthesize, place, and route for every minor change, even on a very well-equipped development box. The Artix-7 200T version takes about 35 minutes a shot on my MacBook Pro (in a VM). Patience is the name of the game - or set the web-dev and app skills aside and actually spend some serious time planning and doing pencil-and-paper design validation and review, then use logic simulation and ultimately timing simulation before ever putting anything on a chip.

Right now I'm reading a couple of books, watching YouTube videos and using the information given here. For others reading 'DIY FPGA' in the title and expecting to do it themselves: Intel is giving away 'FPGA for Dummies' if you sign up for their spam at https://plan.seek.intel.com/PSG_WW_NC_LPCD_FR_2018_FPGAforDummiesbook.

I was hoping to use the AWS F1 instance with their FPGA dev AMI, which has Xilinx SDx 2017.4 and a free license for F1 FPGA development, but it sounds like that would be jumping in at the deep end.

I saw a Keccak FPGA project for Nexys 4 Artix-7 boards; it says that it uses an external frequency generator at 100MHz and achieves a performance of 100MH/s.
 https://github.com/0x2fed/FPGA-Keccak-Miner

Lastly, I am not seeing anything in Amazon's ToS that forbids mining. senseless says he does (or has done) it, yet the OP and others have said Amazon will ban you for it. I'll take that bet and report back if I can ever get or build a working AFI (bitstream).
newbie
Activity: 21
Merit: 0
Well, so much for distributed wealth. Once this goes mainstream, it'll be a top-3% scenario with everyone else fighting for crumbs.
member
Activity: 154
Merit: 37
Thank you for the tip on the $99 FPGA training kit. I was considering something like this myself and this is actually very useful. Can you recommend any sources of information on how mining is achieved on an FPGA?
Start with something simple like SHA256D used in Bitcoin:

https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner

Or something from the official Xilinx marketing publication:

http://issuu.com/xcelljournal/docs/xcell_journal_issue_84/16

Have fun!

Edit: I forgot to include the standard explanation of the difference between CPU programming and FPGA programming.

CPUs have a von Neumann architecture with linear memory, e.g. addresses from 0 to 4294967295. Compiling and running your first 10-line program will take seconds on a typical usable computer with 4GB of RAM.

FPGAs have a completely generic two-dimensional architecture with literally a kajillion constraints. Most of those constraints are kept secret by the FPGA vendor. The XCVU9P FPGA discussed in this thread has about 2,104 pads; just describing the pads that are actually connected to useful signals on the VCU1525 board takes 25 pages in the manual. And those are only the constraints that are visible and not secret.

When compiling your first 10-line FPGA program, the Xilinx toolchain still has to process those constraints even if you use less than 1% of 1% of the whole VU9P chip. On the other hand, the small XC7A35T chip on the training board has far fewer internal constraints that need to be read by the toolchain.

You will find that your small 10-line FPGA program has compiled completely for a small device like the 7A35T while, for the large device, the Xilinx toolchain is still decrypting the secret VU9P constraints file. Amazon recommends running their FPGA development kit for the VU9P on a computer with 32GB of RAM to get sensible performance for nontrivial programs.

Just be aware of the above.

Edit2: Trying to learn logic design on a high-end device may seriously test your patience. Start with a low-end device.

Edit3: Grammar fixes.


I highly recommend Altera/Intel Quartus over Vivado for “learning”, despite using Xilinx for all my “professional” projects. Quartus requires less patience and is easier to fool around with. Get a Max10 dev kit; it's a super easy way to test a lot of basic designs, and you can fit one pipeline for most small things.

SHA256 is actually, imho, harder than some of the SHA3 competition candidates. Keccak is pretty great as a first run: it takes about 2,500 LUTs to implement a non-pipelined single-round version. Getting enough UART comms running to meaningfully get bits to hash on/off the chip takes more effort.

I validated my Keccak RTL on a Max10 using Quartus simply because it could be done so easily and relatively quickly. Adapting and filling out all the pipelines on the VCU118 (UltraScale+ Virtex VU9P) takes hours to synthesize, place, and route for every minor change, even on a very well-equipped development box. The Artix-7 200T version takes about 35 minutes a shot on my MacBook Pro (in a VM). Patience is the name of the game - or set the web-dev and app skills aside and actually spend some serious time planning and doing pencil-and-paper design validation and review, then use logic simulation and ultimately timing simulation before ever putting anything on a chip.
jr. member
Activity: 322
Merit: 1
I discovered Bitcore yesterday. It seems to be a pure GPU coin, hence my question:

Is it possible to mine Bitcore (BTX) coins with this FPGA miner? The algo is Timetravel10.

Yes, it's basically Lyra2REv2 with only a single round of CubeHash and no memory. I'd guess maybe 900MH/s-1.2GH/s.

Edit: No, sorry, it's Nist5 plus BMW, Luffa and CubeHash, with a randomized order to the hashes. Yeah, maybe 900MH/s with some intelligent buffering. It also depends on how long the chain is, and I'm not quite sure I understand that part.


Thanks for your answer. That fact makes the FPGA miner even more interesting.

If only it weren't for the high price. A hard decision.
newbie
Activity: 115
Merit: 0
So are the bitstreams and other pieces still being built with the originally discussed 4% dev fee or not? I'm interested in the project, and 4% certainly seems reasonable. Most developers charge a 1-2% dev fee, but I can see work on the FPGA requiring a bit more specialization.

I'd be interested in running an 8-FPGA rig, ideally in a riserless configuration. Would that still require USB connections, or could everything communicate via PCIe?

Thanx!
jr. member
Activity: 33
Merit: 1
Thank you for the tip on the $99 FPGA training kit. I was considering something like this myself and this is actually very useful. Can you recommend any sources of information on how mining is achieved on an FPGA?
Start with something simple like SHA256D used in Bitcoin:

https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner

Or something from the official Xilinx marketing publication:

http://issuu.com/xcelljournal/docs/xcell_journal_issue_84/16

Have fun!

Edit: I forgot to include the standard explanation of the difference between CPU programming and FPGA programming.

CPUs have a von Neumann architecture with linear memory, e.g. addresses from 0 to 4294967295. Compiling and running your first 10-line program will take seconds on a typical usable computer with 4GB of RAM.

FPGAs have a completely generic two-dimensional architecture with literally a kajillion constraints. Most of those constraints are kept secret by the FPGA vendor. The XCVU9P FPGA discussed in this thread has about 2,104 pads; just describing the pads that are actually connected to useful signals on the VCU1525 board takes 25 pages in the manual. And those are only the constraints that are visible and not secret.

When compiling your first 10-line FPGA program, the Xilinx toolchain still has to process those constraints even if you use less than 1% of 1% of the whole VU9P chip. On the other hand, the small XC7A35T chip on the training board has far fewer internal constraints that need to be read by the toolchain.

You will find that your small 10-line FPGA program has compiled completely for a small device like the 7A35T while, for the large device, the Xilinx toolchain is still decrypting the secret VU9P constraints file. Amazon recommends running their FPGA development kit for the VU9P on a computer with 32GB of RAM to get sensible performance for nontrivial programs.

Just be aware of the above.

Edit2: Trying to learn logic design on a high-end device may seriously test your patience. Start with a low-end device.

Edit3: Grammar fixes.


Excellent information, thank you. I was just about to give in and get off your lawn, and you go and hand me a fistful of cookies.
sr. member
Activity: 362
Merit: 250
They already developed it a long time ago, but the biggest problem they are facing is how to monetize it. I read somewhere that with FPGAs it would be really easy to bypass a Claymore-style fee; it is really easy to copy the design, something like that.

Only an unprotected design is easy to copy. A protected one will not work even after copying everything.

Yes, and the effort to protect the design takes time. If I had to guess, they are using encrypted bitstreams that include the pool C software running on a MicroBlaze soft-core processor.
jr. member
Activity: 59
Merit: 1
They already developed it a long time ago, but the biggest problem they are facing is how to monetize it. I read somewhere that with FPGAs it would be really easy to bypass a Claymore-style fee; it is really easy to copy the design, something like that.

Only an unprotected design is easy to copy. A protected one will not work even after copying everything.
jr. member
Activity: 252
Merit: 8
The way I read this thread is that there are some credible devs looking for a business model that makes their work sustainable and profitable, more profitable than simply mining with their own secret sauce on their own expensive hardware.

There is the 'mining with dev fee' model and the hardware-provider model. Both require popular acceptance and adoption of FPGA mining to work. Early adopters take a lot of risk because of the high cost of getting started with FPGA mining equipment and the lack of proof of acceptable ROI.

Chicken and eggs?

It'd be sweet if we had a coin whose proof-of-work algorithm was FPGA- and GPU-friendly but ASIC-resistant!

They already developed it a long time ago, but the biggest problem they are facing is how to monetize it. I read somewhere that with FPGAs it would be really easy to bypass a Claymore-style fee; it is really easy to copy the design, something like that.
jr. member
Activity: 59
Merit: 1
The way I read this thread is that there are some credible devs looking for a business model that makes their work sustainable and profitable, more profitable than simply mining with their own secret sauce on their own expensive hardware.

There is the 'mining with dev fee' model and the hardware-provider model. Both require popular acceptance and adoption of FPGA mining to work. Early adopters take a lot of risk because of the high cost of getting started with FPGA mining equipment and the lack of proof of acceptable ROI.

Chicken and eggs?

It'd be sweet if we had a coin whose proof-of-work algorithm was FPGA- and GPU-friendly but ASIC-resistant!


I have a dev-fee model: it is prepaid, node-locked, time-limited and unavoidable, and it works as an online or offline activation key. I am skeptical, though, about wide distribution of this solution without a sufficiently large number of algorithms for all of you.

The "ASIC-resistant" algorithm probably should involve:
- a terabyte disk as the primary storage of some "DAG" file,
- a PCI Express NVMe drive as secondary storage of that "DAG" file (like a cache),
- the full bandwidth of a PCI Express x8/x16 slot,
- a PCI Express GPU or FPGA.

This setup may also be usable for some real-life tasks.
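
As a purely hypothetical illustration of such a scheme (every name, size, and the mixing function below are invented for the example, not a spec):

Code:
# Hypothetical inner loop of a disk-bound "DAG" proof of work.
import hashlib, struct

DAG_PATH = "dag.bin"   # terabyte-scale file on the primary disk (assumed)
ITEM = 4096            # bytes fetched per lookup (assumed)
ROUNDS = 64            # dependent lookups per attempt (assumed)

def hash_attempt(header: bytes, nonce: int, dag_size: int) -> bytes:
    mix = hashlib.sha3_256(header + struct.pack("<Q", nonce)).digest()
    with open(DAG_PATH, "rb") as dag:
        for _ in range(ROUNDS):
            # Each index depends on the previous read, forcing serial,
            # random accesses across the whole terabyte file.
            idx = int.from_bytes(mix[:8], "little") % (dag_size // ITEM)
            dag.seek(idx * ITEM)
            mix = hashlib.sha3_256(mix + dag.read(ITEM)).digest()
    return mix

The serial data dependency is what would keep the bottleneck on the disk/NVMe/PCIe path rather than on the compute device.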
legendary
Activity: 4354
Merit: 3614
what is this "brake pedal" you speak of?

[...]

For the cryptocurrency mining addicts, the road to recovery may be through actually investing $99 in an FPGA training kit like the Digilent Arty A7 and actually losing their logic design virginity.


[...]

Thank you for the tip on the $99 FPGA training kit. I was considering something like this myself and this is actually very useful. Can you recommend any sources of information on how mining can be achieved on an FPGA?

I'm interested in that kit too; I had no idea training-level FPGA kits were that cheap. Thanks.

Amazon has that kit, and some beginner-level books are listed in the "frequently bought together" section. I figure it will take me about a year to make one of the onboard LEDs blink. Then it's world domination shortly after that.
newbie
Activity: 1
Merit: 0
How do you think they design ASICs? With FPGAs, of course. FPGAs never left the scene; in mining they were mostly used for ASIC R&D.

http://www.dinigroup.com/web/index.php
