Author Topic: 40 BTC+ Challenge for GPU Optimized Hashing Algorithm [1 BTC paid thus far] (Read 5215 times)

legendary
Activity: 2618
Merit: 1006
There are 32 bits of IPv4 addresses out there and 128(!!!) bits of IPv6 addresses; NAT and proxies are also real problems there. Maybe IMEIs or IMSIs from mobile phones might work, but those aren't too unique either. Anything like hardware IDs is potentially fakeable (e.g. MAC addresses) and also likely tied to a certain manufacturer.

You could require PoW on a certain private key (publishing nonces that are more and more difficult) to prove that you are more invested than others in that key, but there is still a chance that someone might find a way to do this faster/better and then hash for several private keys instead of just one. Processing time is a bit too non-uniformly spread to be a good measurement for distribution, something Satoshi found out the hard way. There are no really better ways though, so at least props to him for trying!
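
A minimal sketch of the "PoW on a private key" idea, assuming a toy hash and a leading-zero-bits difficulty rule (both placeholders, not any real coin's scheme): you publish nonces that meet ever-higher difficulty against your own public key, so anyone can verify how much work is bound to that one key.

Code:
// Toy sketch of "PoW bound to one key": find nonces n such that
// toy_hash(pubkey || n) has at least `bits` leading zero bits.
// Publishing nonces for increasing `bits` demonstrates increasing work
// tied to that single key.  toy_hash() is a placeholder, not a real PoW hash.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

static uint64_t toy_hash(const std::string& s) {
    return std::hash<std::string>{}(s);          // stand-in for e.g. SHA-256
}

static bool meets_difficulty(uint64_t h, int bits) {
    return (h >> (64 - bits)) == 0;              // top `bits` bits are zero (bits in 1..63)
}

int main() {
    const std::string pubkey = "02a1b2c3...";    // hypothetical public key
    for (int bits = 8; bits <= 24; bits += 8) {  // increasing difficulty targets
        uint64_t nonce = 0;
        while (!meets_difficulty(toy_hash(pubkey + std::to_string(nonce)), bits))
            ++nonce;
        std::cout << "difficulty " << bits << " bits -> nonce " << nonce << "\n";
    }
}

The weakness noted above still applies: nothing stops someone from running the same loop over many keys at once.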
legendary
Activity: 1008
Merit: 1000
Quote
The issue would then lie in ensuring that one person is really only one person and not a sock puppet... so far (besides maybe DNA hashing?) I can't think of anything that guarantees that only one single person is really only one single person in a trustless distributed way. If you allow trust, then you might trust government IDs for example or have a central service that hands out mining certificates.

Perhaps something would be possible using IP addresses? I don't have the answer, but it seems like a more concrete (and less power hungry) solution to the problem.
legendary
Activity: 2618
Merit: 1006
The issue would then lie in ensuring that one person is really only one person and not a sock puppet... so far (besides maybe DNA hashing?) I can't think of anything that guarantees that only one single person is really only one single person in a trustless distributed way. If you allow trust, then you might trust government IDs for example or have a central service that hands out mining certificates.
legendary
Activity: 1008
Merit: 1000
This is OT, but instead of trying to side-step the problem of hardware efficiency bringing abnormal mining returns (which will always be the case eventually, as technology is constantly adapting), have you considered something like limiting the total amount of possible work done per miner per day (or some other unit of time)? This amount could increase in the same way the mining difficulty does in Bitcoin.
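
A rough sketch of that cap, with all names and the retarget rule made up for illustration since the post only floats the concept; it also leans on the same one-miner-one-identity problem discussed elsewhere in the thread.

Code:
// Hypothetical "cap work per miner per period" bookkeeping.  The cap is
// periodically raised, loosely analogous to a difficulty retarget.
#include <cstdint>
#include <map>
#include <string>

struct WorkCap {
    uint64_t max_hashes_per_period = 1000000;        // current per-miner cap
    std::map<std::string, uint64_t> used;            // miner id -> hashes this period

    bool allow(const std::string& miner, uint64_t hashes) {
        uint64_t& u = used[miner];
        if (u + hashes > max_hashes_per_period) return false;   // over the cap
        u += hashes;
        return true;
    }

    // Called once per period, loosely analogous to a difficulty adjustment.
    void retarget(double growth) {
        max_hashes_per_period = static_cast<uint64_t>(max_hashes_per_period * growth);
        used.clear();
    }
};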

Cheers, Paul.
hero member
Activity: 770
Merit: 566
fractally
Tip sent, and algorithm updated...
hero member
Activity: 770
Merit: 566
fractally
When a GPU 'suppresses' execution of conditionals 'per stream', it stalls the streams that are executing a different data path.

SIMD branching allows branching and looping inside a program, but because all processors must execute identical instructions, divergent branching can result in reduced performance. For example, imagine a fragment program that decides the output value of each fragment by branching using conditions based on random input numbers. The fragments will randomly take different branches, which can cause processors that are running threads for fragments that do not take the branch to stall until other processors are finished running threads for fragments that do take the branch. The end result is that many fragments will take as long as both branches together, plus the overhead of the branch instruction. SIMD branching is very useful in cases where the branch conditions are fairly "spatially" coherent, but lots of incoherent branching can be expensive. NVIDIA GeForce FX GPUs support SIMD branching in their vertex processors, and NVIDIA GeForce 6 Series GPUs support it in their fragment processors.

Condition codes (predication) are used in older architectures to emulate true branching. If-then statements compiled to these architectures must evaluate both taken and not taken branch instructions on all fragments. The branch condition is evaluated and a condition code is set. The instructions in each part of the branch must check the value of the condition code before writing their results to registers. As a result, only instructions in taken branches write their output. Thus, in these architectures all branches cost as much as both parts of the branch, plus the cost of evaluating the branch condition. Branching should be used sparingly on such architectures. NVIDIA GeForce FX Series GPUs use condition-code branch emulation in their fragment processors.
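
To make the predication point concrete, here is a plain C++ illustration (not real GPU or shader code) of the condition-code pattern described above: every element evaluates both sides of the branch and a mask selects which result is kept, so the per-element cost is roughly the cost of both paths plus the condition test.

Code:
// Every "thread" computes both the taken and not-taken paths; a condition
// mask then selects which result survives.  This mirrors why divergent
// branches cost roughly the sum of both paths on predicated hardware.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> predicated(const std::vector<uint32_t>& in) {
    std::vector<uint32_t> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        uint32_t cond      = (in[i] & 1u) ? 0xFFFFFFFFu : 0u;   // "condition code" mask
        uint32_t taken     = in[i] * 3u + 1u;                   // work on the taken path
        uint32_t not_taken = in[i] / 2u;                        // work on the not-taken path
        out[i] = (taken & cond) | (not_taken & ~cond);          // only one result is written
    }
    return out;
}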
full member
Activity: 144
Merit: 100
Here is the description I was in the process of writing up while awaiting your reply:

Quote
The first part of your algorithm only looks back in the buffer by 32 or 256 bytes, so you only need a window of approximately 256 bytes to implement that part.

It seems you are counting on the second part of your algorithm which calls CityHashCrc128 on the entire 128MB buffer to require a full 128MB of memory.  But CityHashCrc128 can be computed incrementally.  Looking at the code for CityHashCrc128, it calls CityHashCrc256, which calls CityHashCrc256Long, which (aside from a bit of initialization code) reads and processes the input data completely sequentially.  Therefore you can compute the first and second parts of your algorithm at the same time, and only a buffer of about 256 bytes is required.  You'll have to execute the first part sequentially rather than in 8 parallel threads.  It should achieve extremely good parallelization on GPUs.  Perhaps the whole thing can even fit in the registers with no memory usage at all.

I'm not sure why you say it is "inherently single threaded and thus not faster on a GPU".  I see no problem running as many computations in parallel as an AMD GPU has "stream processing units" -- e.g. 3200 on a 5970.  The conditionals don't create much of a problem -- GPUs can implement conditionals by suppressing execution of instructions per stream processing unit, so the slowdown will be no worse than if you enter the conditional 100% of the time.  It should still be way faster than a CPU.
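
A sketch of the memory argument in the description above, with fill_block() and StreamingHash as hypothetical stand-ins rather than the real proof_of_work.cpp or CityHash code: if generating a block only reads the previous 256 bytes, and the final hash consumes its input strictly front to back, both phases can be interleaved over a few hundred bytes of state instead of a 128MB buffer.

Code:
// If the next 256-byte block depends only on the previous 256 bytes, and the
// final hash is sequential, the two phases can share a tiny working set.
// fill_block() and StreamingHash are placeholders, not the actual algorithm.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr size_t WINDOW = 256;                   // look-back distance of the fill step
constexpr size_t TOTAL  = 128u * 1024 * 1024;    // nominal 128 MB buffer size

struct StreamingHash {                           // incremental stand-in for CityHashCrc128
    uint64_t state = 0;
    void update(const uint8_t* p, size_t n) {
        for (size_t i = 0; i < n; ++i) state = state * 1099511628211ull ^ p[i];
    }
    uint64_t digest() const { return state; }
};

// Hypothetical fill step: derive the next WINDOW bytes from the previous WINDOW bytes.
static void fill_block(const std::array<uint8_t, WINDOW>& prev,
                       std::array<uint8_t, WINDOW>& next) {
    for (size_t i = 0; i < WINDOW; ++i)
        next[i] = static_cast<uint8_t>(prev[i] * 31u + i);
}

uint64_t small_memory_pow(const std::array<uint8_t, WINDOW>& seed) {
    std::array<uint8_t, WINDOW> prev = seed, next{};
    StreamingHash h;
    h.update(prev.data(), WINDOW);               // hash each block as soon as it exists
    for (size_t produced = WINDOW; produced < TOTAL; produced += WINDOW) {
        fill_block(prev, next);                  // phase 1: generate the next block
        h.update(next.data(), WINDOW);           // phase 2: feed it straight to the hash
        prev = next;                             // only the last WINDOW bytes are kept
    }
    return h.digest();
}

In other words, the full 128MB never needs to exist at once, which is the crux of the claim.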

Here is an address you can send to: [removed]
hero member
Activity: 770
Merit: 566
fractally
You can incrementally calculate the state as you progress, because the first byte does not depend upon the last byte, and thus there is nothing that forces the user to store more than a very small amount.

I would count this as a significant weakness that should be addressed (and I will be updating the algorithm to xor the last bytes with the first bytes), though I suspect that it would still be inherently single-threaded and thus not faster on a GPU, and therefore not qualify for the bounty.   I will tip you 1 BTC for your post even though you didn't actually tell me how to do it.
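
A minimal sketch of the kind of fix described ("xor the last bytes with the first bytes"), purely illustrative and not the actual updated proof_of_work.cpp: once the tail is folded back into the head before the final hash, the first bytes fed to the hash depend on the last bytes generated, so hashing cannot begin until the whole buffer exists and the small-window shortcut no longer applies.

Code:
// XOR the tail of the buffer over the head before the final hash.  The head
// of the hash input now depends on the last bytes generated, defeating the
// ~256-byte-window trick described in the previous post.  Illustrative only.
#include <cstddef>
#include <cstdint>
#include <vector>

void fold_tail_into_head(std::vector<uint8_t>& buf, size_t n) {
    if (buf.size() < 2 * n) return;              // need distinct head and tail regions
    for (size_t i = 0; i < n; ++i)
        buf[i] ^= buf[buf.size() - n + i];       // last n bytes folded into first n bytes
}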

Send me an address for your tip.   
full member
Activity: 144
Merit: 100
Will you give me the bounty if I explain why your algorithm doesn't require anywhere near the 128MB that you think it does?
hero member
Activity: 770
Merit: 566
fractally
I have created a new proof-of-work for use in BitShares that I hope is entirely GPU-proof and ASIC-resistant while being as fast as possible to verify.    I am working to secure an additional 60 BTC to back this challenge.  

The algorithm that I am using is posted on GitHub here:

https://github.com/bytemaster/bitshares/blob/master/src/proof_of_work.cpp

The bounty can be won if you can get any GPU to perform 25% faster than the most optimized CPU code I can produce running on the fastest CPU I can find.   The test will be based upon the total number of unique hashes that can be performed in 10 minutes.

This proof-of-work is computationally difficult even for a single hash, but it must be so to prevent optimizations to the required memory footprint.

The maximum level of parallelism achievable per GB of RAM is 8, and the highest-end GPUs now have 4 GB of RAM, which means they could in theory support 32 parallel executions of this proof-of-work.

On GPUs you only tend to get 1 instruction per 4 clock cycles in a single thread context. Modern superscalar CPUs can get more than 1 instruction per clock cycle, and CityHash is specifically optimized to take advantage of this. In addition to getting more done per cycle, CPUs have close to 4x the clock frequency.

Based upon these characteristics alone, I estimate that a CPU can execute the serial portions of this algorithm at least 16x faster than a GPU, which means that an 8-core CPU should easily compete with a 128-core GPU. Fortunately, a 128-core GPU would require 16 GB of RAM. Note also that most GPUs have fewer than 128 'real' cores that are able to handle conditionals.

Furthermore, GPUs are not well suited to branching, and code must be optimized to avoid branches as much as possible.

Lastly, this algorithm takes advantage of a hardware instruction that is unlikely to be included in GPUs (Intel CRC32). The lack of this hardware instruction alone is likely to give the CPU an order of magnitude advantage over the GPUs. That said, the algorithm does not depend solely on the performance of this instruction.

I am speculating that the number of CPU cores will grow at about the same rate as available RAM in GPUs, and therefore GPUs will never gain the upper hand over CPUs for this hash.
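
For reference, the Intel CRC32 instruction mentioned above is exposed on SSE4.2 CPUs through a compiler intrinsic; a small example of the primitive (not the actual proof_of_work.cpp code) looks like this:

Code:
// Hardware CRC32-C via the SSE4.2 _mm_crc32_u64 intrinsic.  Illustration of
// the primitive only, not the real algorithm.  Build with -msse4.2 on GCC/Clang.
#include <nmmintrin.h>   // SSE4.2 intrinsics (_mm_crc32_u64)
#include <cstddef>
#include <cstdint>

uint64_t crc32c_accumulate(const uint64_t* words, size_t count) {
    uint64_t crc = 0;
    for (size_t i = 0; i < count; ++i)
        crc = _mm_crc32_u64(crc, words[i]);      // one hardware CRC32-C step per 8-byte word
    return crc;
}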

I am also interested in any feedback on how to improve the algorithm further and make it run even faster on CPUs without compromising GPU difficulty.  It currently takes about 0.75 seconds for a single hash on a 2.3 GHz Core i7 MacBook Pro.  

*edit*   If you intend to tackle this challenge and actually write the code to prove it, please contact me so that we can have a meeting of the minds regarding what is required and lock in the algorithm.