
Topic: DIY FPGA Mining rig for any algorithm with fast ROI - page 63. (Read 99472 times)

copper member
Activity: 166
Merit: 84
I had to watch a really boring "entertainment" program and used that time to edit the above file into a working C++ program. Echo was never used. Due to the peculiar permutation order, Blake and Bmw are always fixed at positions 0 and 1 respectively; only the remaining 8 positions change. So, using the terminology from whitefire990's post above, timetravel10 requires only 9 reconfigurable blocks, assuming that only Groestl requires a double block.

This is very interesting. I did some math and there is a decent chance Bitcore would fit in a VU9P (and it certainly fits in the more expensive VU13P).
legendary
Activity: 2128
Merit: 1073
I had to watch a really boring "entertainment" program and used that time to edit the above file into a working C++ program. Echo was never used. Due to the peculiar permutation order, Blake and Bmw are always fixed at positions 0 and 1 respectively; only the remaining 8 positions change. So, using the terminology from whitefire990's post above, timetravel10 requires only 9 reconfigurable blocks, assuming that only Groestl requires a double block.
Code:
#include <algorithm>   // std::next_permutation
#include <cstdint>     // uint32_t
#include <iostream>    // std::cout

#define HASH_FUNC_BASE_TIMESTAMP 1492973331 // BitCore: Genesis Timestamp
#define HASH_FUNC_COUNT 10                  // BitCore: HASH_FUNC_COUNT of 11
#define HASH_FUNC_COUNT_PERMUTATIONS 40320  // 40320 = 8!, not HASH_FUNC_COUNT!

int main()
{
    // We want to permute algorithms. To get started we initialize an
    // array with a sorted sequence of unique integers, where every
    // integer represents its own algorithm.
    uint32_t permutation[HASH_FUNC_COUNT];
    for (uint32_t i = 0; i < HASH_FUNC_COUNT; i++) {
        permutation[i] = i;
    }

    static const char *names[] = {
        "blake", "bmw", "groestl", "skein", "jh", "keccak",
        "luffa", "cubehash", "shavite", "simd", "echo"
    };

    // Walk through the permutations the chain can actually select.
    // std::next_permutation advances in lexicographic order, so the
    // first 8! = 40320 steps never move positions 0 and 1: blake and
    // bmw stay fixed. Echo (index 10) never appears at all, because
    // only HASH_FUNC_COUNT = 10 values are ever placed in the array.
    for (uint32_t step = 0; step < HASH_FUNC_COUNT_PERMUTATIONS; step++) {
        for (uint32_t i = 0; i < HASH_FUNC_COUNT; i++) {
            std::cout << names[permutation[i]] << " ";
        }
        std::cout << std::endl;
        std::next_permutation(permutation, permutation + HASH_FUNC_COUNT);
    }
    return 0;
}
legendary
Activity: 2128
Merit: 1073
0->10: echo is 10, and the loop breaks on 11.
We must be looking at different code, then.
Code:
#define HASH_FUNC_COUNT 10 // BitCore: HASH_FUNC_COUNT of 11
    uint32_t permutation[HASH_FUNC_COUNT];
    for (uint32_t i = 0; i < HASH_FUNC_COUNT; i++) {
        permutation[i] = i;
    }
    // Compute the next permutation
    ...
    for (uint32_t i = 0; i < HASH_FUNC_COUNT; i++) {
        switch (permutation[i]) {
            // cases 0 to 9 here
            case 10:
                // Echo is here, but isn't ever executed
                break;
        }
    }
The code (and the comments on the marketing websites) show that there were initially 11! permutations, but that has since been reduced to 10!.
hero member
Activity: 1118
Merit: 541
Timetravel10 fits in a single VU13P. You partition the FPGA into 16 blocks and store about 14 partial bitstreams for each block. Then you do a dynamic partial reconfiguration from DDR4 to build the pipeline at the start of each block based on the current algorithm sequence, yielding one hash per clock (i.e. 500MH/s @ 500MHz). You need 16 blocks because some functions like Groestl and Echo require 2 blocks. The FPGA can reconfigure itself in 0.25 seconds. The problem with Timetravel and X16R/X16S is the long time it takes to load the DDR4 bitstream table via USB. And you lose it if there is a power outage, so you must reprogram the DDR4 on each FPGA. This is where utilizing the PCI bus would be an advantage.
My naïve reading of the above-linked code is that Echo is coded, but never used. Am I wrong?


0->10: echo is 10, and the loop breaks on 11.


legendary
Activity: 2128
Merit: 1073
Timetravel10 fits in a single VU13P. You partition the FPGA into 16 blocks and store about 14 partial bitstreams for each block. Then you do a dynamic partial reconfiguration from DDR4 to build the pipeline at the start of each block based on the current algorithm sequence, yielding one hash per clock (i.e. 500MH/s @ 500MHz). You need 16 blocks because some functions like Groestl and Echo require 2 blocks. The FPGA can reconfigure itself in 0.25 seconds. The problem with Timetravel and X16R/X16S is the long time it takes to load the DDR4 bitstream table via USB. And you lose it if there is a power outage, so you must reprogram the DDR4 on each FPGA. This is where utilizing the PCI bus would be an advantage.
My naïve reading of the above-linked code is that Echo is coded, but never used. Am I wrong?

 
member
Activity: 154
Merit: 37
Timetravel10 fits in a single VU13P. You partition the FPGA into 16 blocks and store about 14 partial bitstreams for each block. Then you do a dynamic partial reconfiguration from DDR4 to build the pipeline at the start of each block based on the current algorithm sequence, yielding one hash per clock (i.e. 500MH/s @ 500MHz). You need 16 blocks because some functions like Groestl and Echo require 2 blocks. The FPGA can reconfigure itself in 0.25 seconds. The problem with Timetravel and X16R/X16S is the long time it takes to load the DDR4 bitstream table via USB. And you lose it if there is a power outage, so you must reprogram the DDR4 on each FPGA. This is where utilizing the PCI bus would be an advantage.



Yep, this is the big reason chain hashing doesn't stop FPGAs - partial reconfiguration, the overhead of which can be almost entirely latency-hidden.
copper member
Activity: 166
Merit: 84
Timetravel10 fits in a single VU13P. You partition the FPGA into 16 blocks and store about 14 partial bitstreams for each block. Then you do a dynamic partial reconfiguration from DDR4 to build the pipeline at the start of each block based on the current algorithm sequence, yielding one hash per clock (i.e. 500MH/s @ 500MHz). You need 16 blocks because some functions like Groestl and Echo require 2 blocks. The FPGA can reconfigure itself in 0.25 seconds. The problem with Timetravel and X16R/X16S is the long time it takes to load the DDR4 bitstream table via USB. And you lose it if there is a power outage, so you must reprogram the DDR4 on each FPGA. This is where utilizing the PCI bus would be an advantage.

hero member
Activity: 1118
Merit: 541
Ran through timetravel10 today; looks like with 8 FPGAs (one dedicated to each algo) you might be able to get up into 1-10 GH/s. Bitcore definitely needs to do something. A small FPGA cluster could 51% them pretty easily.
You've made some interesting optimizations.

My naïve reading is:

a) they have 11 algorithms coded
b) only first 10 are used
c) the whole hash is a nesting of always 10 sub-hashes
d) chosen without repetition
e) which gives 10! possibilities
f) the choice of permutation is keyed from the block height
g) not sequentially, but skipping up to 8! permutations

So my naïve implementation (one card dedicated to each sub-hash) would require 10 FPGA cards.

What is your secret ingredient?

Edit: Link to the source code: https://github.com/LIMXTEC/BitCore/blob/master/src/crypto/hashblock.h

I was looking at timetravel, not timetravel10. My bad. Hashrate would be the same, but yes, 10 cards would be required.


legendary
Activity: 2128
Merit: 1073
Ran through timetravel10 today; looks like with 8 FPGAs (one dedicated to each algo) you might be able to get up into 1-10 GH/s. Bitcore definitely needs to do something. A small FPGA cluster could 51% them pretty easily.
You've made some interesting optimizations.

My naïve reading is:

a) they have 11 algorithms coded
b) only first 10 are used
c) the whole hash is a nesting of always 10 sub-hashes
d) chosen without repetition
e) which gives 10! possibilities
f) the choice of permutation is keyed from the block height
g) not sequentially, but skipping up to 8! permutations

So my naïve implementation (one card dedicated to each sub-hash) would require 10 FPGA cards.

What is your secret ingredient?

Edit: Link to the source code: https://github.com/LIMXTEC/BitCore/blob/master/src/crypto/hashblock.h
hero member
Activity: 1118
Merit: 541

Ran through timetravel10 today; looks like with 8 FPGAs (one dedicated to each algo) you might be able to get up into 1-10 GH/s. Bitcore definitely needs to do something. A small FPGA cluster could 51% them pretty easily.


newbie
Activity: 8
Merit: 0
If someone knows other sources for FPGA mining in general, send me a PM. I'm currently starting to develop HPC applications on FPGAs; slowly but surely it has raised my interest in FPGA mining, even though development boards are really expensive and hard to get outside the USA.
member
Activity: 154
Merit: 37
Found this product:
"BittWare’s XUPSVH is an UltraScale+ VU33P/35P FPGA-based PCIe card.  The UltraScale+ FPGA helps these demanding applications avoid I/O bottlenecks with integrated High Bandwidth Memory (HBM2) tiles on the FPGA that support up to 8 GBytes of memory at 460 GBytes/sec."
Each FPGA device requires a unique bitstream. Think about it.


Sorry, what is a "unique bitstream"? You mean you need to do unique coding for each FPGA model to mine the same coin? Shocked

Yeah, my take is that it's like a BIOS for the FPGA card that tells it exactly what to do, so each one is unique to every FPGA.

Kind of like saying GPUs are a sledgehammer: a more basic set of instructions can be sent to them to take a swing at anything.
With an FPGA it would be more like programming a laser to etch out exactly what you want, resulting in a more precise operation.

This is somewhat accurate. The bitstream is literally the blueprint for the exact circuit you want the FPGA to currently be wired as. Every model of FPGA is like a unique building - it needs its own tailored electrical blueprint, even though the blueprints for two similarly sized datacenters might look very similar. Maybe in one the main power feed comes in on the south wall and the rows run north to south, and in the other the power comes in on the east wall and the rows, spaced differently, run west to east.

With FPGAs you can easily rewire the whole building (but not change where the fixed resources are); with an ASIC you're starting from flat, level ground, and once the building and wires are in, they can never be changed.

With a GPU you can't change the wires: all the machines and manufacturing lines are already installed and set in stone, each good only for what it is good for; you can only tell the machines what order of operations to execute.
member
Activity: 154
Merit: 37
The Phi algo change will test the theory that they can adapt the FPGA software in a matter of hours or days.
Also, I don't get why the engineering samples of these boards are cheaper while the mass-production units carry a higher price. Shouldn't it be the other way around?

There are not many new customers for FPGAs. Vendors are heavily incentivized to offer "cheap" dev kits to get a company to see the value in the chip and build a product around it. First they get you hooked, then...
hero member
Activity: 1008
Merit: 1000
Found this product:
"BittWare’s XUPSVH is an UltraScale+ VU33P/35P FPGA-based PCIe card.  The UltraScale+ FPGA helps these demanding applications avoid I/O bottlenecks with integrated High Bandwidth Memory (HBM2) tiles on the FPGA that support up to 8 GBytes of memory at 460 GBytes/sec."
Each FPGA device requires a unique bitstream. Think about it.


Sorry, what is a "unique bitstream"? You mean you need to do unique coding for each FPGA model to mine the same coin? Shocked

Yeah, my take is that it's like a BIOS for the FPGA card that tells it exactly what to do, so each one is unique to every FPGA.

Kind of like saying GPUs are a sledgehammer: a more basic set of instructions can be sent to them to take a swing at anything.
With an FPGA it would be more like programming a laser to etch out exactly what you want, resulting in a more precise operation.
newbie
Activity: 7
Merit: 0
Found this product:
"BittWare’s XUPSVH is an UltraScale+ VU33P/35P FPGA-based PCIe card.  The UltraScale+ FPGA helps these demanding applications avoid I/O bottlenecks with integrated High Bandwidth Memory (HBM2) tiles on the FPGA that support up to 8 GBytes of memory at 460 GBytes/sec."
Each FPGA device requires a unique bitstream. Think about it.


Sorry, what is a "unique bitstream"? You mean you need to do unique coding for each FPGA model to mine the same coin? Shocked
jr. member
Activity: 154
Merit: 1
I know the discussions here are about having FPGAs at home, running and configured.

But has anybody already looked into the possibilities of Amazon EC2 F1 instances? They also provide FPGAs in their datacenters (one instance consists of 8 pieces of 16 nm Xilinx UltraScale+ FPGAs).

Since you can build images and redeploy them on any other FPGA, I was wondering whether that might be of interest somehow? I just wanted to bring it up as another opportunity.
I would definitely try it out - if I understand correctly, it might be possible to run "off-shore" rather than at home while still having dedicated hardware. I would therefore need to understand whether all the work being done here (especially the bitstreams/firmware) can also be used and replicated in the Amazon cloud/datacenter.

Let me know your thoughts.

Cheers,



I believe Amazon has banned using their FPGAs for mining.
newbie
Activity: 17
Merit: 0
I know the discussions here are about having FPGAs at home, running and configured.

But has anybody already looked into the possibilities of Amazon EC2 F1 instances? They also provide FPGAs in their datacenters (one instance consists of 8 pieces of 16 nm Xilinx UltraScale+ FPGAs).

Since you can build images and redeploy them on any other FPGA, I was wondering whether that might be of interest somehow? I just wanted to bring it up as another opportunity.
I would definitely try it out - if I understand correctly, it might be possible to run "off-shore" rather than at home while still having dedicated hardware. I would therefore need to understand whether all the work being done here (especially the bitstreams/firmware) can also be used and replicated in the Amazon cloud/datacenter.

Let me know your thoughts.

Cheers,

member
Activity: 277
Merit: 23
Mining LUX with FPGAs goes back at least 6 months; this forum is not the only source.
jr. member
Activity: 266
Merit: 2
You are wrong. Smiley

Well, that makes me sad. Mining Lux with an FPGA was the only reason I was following this thread!

I'm not alone in being hyped to mine Lux with an FPGA Smiley
But as I just went for a Ti rig instead, I'll be happy if they succeed in making it resistant :p Just hope no one bought in to mine Lux if they do succeed; others' pain is not my gain.

Is Lux just changing their algo because of these FPGA threads? Because that's a lot of work to defend against something that might not even happen, and if it does, it won't be at such a big scale anyway. And I don't think anyone's producing an ASIC for it anytime soon. Head scratcher.
sr. member
Activity: 512
Merit: 260
The Phi algo change will test the theory that they can adapt the FPGA software in a matter of hours or days.
Also, I don't get why the engineering samples of these boards are cheaper while the mass-production units carry a higher price. Shouldn't it be the other way around?

How on earth can one make a prediction of time if we don't know the amount of effort required? We don't know what the change will be, and we don't know how much time is even available to dedicate to the change.

All we know is that FPGAs can be better at doing a specific job, but they might also be worse. There is an ASIC out for ETH that is less power efficient than 2-year-old GPUs.