Momentum Proof-of-Work Bounty - 30 BTC [PAID]

tromp

legendary

Activity: 990

Merit: 1108

Quote from: p2p on January 29, 2016, 07:36:33 AM

What's the status of Momentum POW? Was it used in Bitshares? Is it used in any coin?

It is broken beyond repair according to

https://eprint.iacr.org/2015/946.pdf

Their Proposition 9 states a sublinear time-memory tradeoff.
For instance, collisions can be found in memory O(sqrt(N)) and time O(N^(5/4)),
instead of memory and time O(N).

This Proposition is based on the golden collision search technique, described in Section 4.2 of

http://people.scs.carleton.ca/~paulv/papers/JoC97.pdf

p2p

newbie

Activity: 32

Merit: 0

What's the status of Momentum POW? Was it used in Bitshares? Is it used in any coin?

BootstrapCoinDev

full member

Activity: 154

Merit: 100

Quote from: gigawatt on November 07, 2013, 06:56:28 PM

I've managed to make a huge step forward in proving Momentum being not nearly as good for proof-of-work as intended.

The magic lies in using a Bloom filter to store the intermediate hashes.
As a result, instead of using 12 bytes per hash/nonce in a Semi-Ordered Map (which results in ~750 MB of memory), the required memory is (-1 * (2^26 * ln(0.01)) / (ln(2)^2)), or about ~76 MB.

This number can be reduced arbitrarily if we're willing to have a false positive rate greater than 1%. For example, if we allowed up to a 50% chance of having a false positive, the memory requirement drops to ~11 MB.

Here's a overview of how the algorithm works:

Make a "main" bloom filter of size 2^26 with a false positive rate 1%: ~76 MB
Make a tiny "clash" bloom filter of size 2^16 and false positive rate 2^-26: ~0.7 MB
Make a vector of pairs< hash, nonce > to store candidate birthday collisions.
For each nonce in the search space, check if its hash exists in the "main" bloom filter. If it is, add it's entry to the "clash" bloom filter.
The "main" bloom is no longer required and can be discarded.
For each nonce in the search check if its hash exists in the "clash" bloom filter. If it does, add < hash, nonce > to a candidate list for investigation.
Sort the list of candidates by hash.
For each pair in the candidate list, see if the previous element has the same hash. If it does, add it to the output list. This step removes false positives by comparing the actual hash instead of the bloom filter's idea of a hash.
Return the output list as normal.

For your testing pleasure, I also built a working proof of concept.
(Most of the magic is right here. The bloom filter is a modified version of "bloom.cpp" called "bigbloom.cpp")

Unmodified source: https://i.imgur.com/k2cNrmd.png

Source using bloom filters: https://i.imgur.com/w8Enf9e.png

In exchange for lowering the memory requirements by a factor of ~10, the algorithm runs at about 1/4th speed, mainly due to the doubling calls to SHA512 and the hash calls within the bloom filters. The overall result is a net efficiency gain of approximately 2.5
The reduction in memory requirement means that if we could fit N instances of Momentum in GPU memory, we can instead fit 10*N instances. If we up the false positive rate in exchange for more time spent sorting, we can achieve ratios of up to 70*N.

Given that bloom filters, SHA512, and sorting data are all parallel/GPU friendly, we can conclude that Momentum as a proof-of-work isn't nearly as GPU-hard as initially intended.

Yes, vey interesting concept. With so many problems with folks pushing unfair bots into this, right now ... it looks pretty bad. There are so many coins right now that are easy & inexpensive to setup and start running fast, making money & doing more lifting up this whole thing. With the money you make and helping others know what can be done. As long as you feel comfortable with it, then go for it!

tromp

legendary

Activity: 990

Merit: 1108

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: smolen on November 12, 2013, 11:36:59 PM

Quote from: gigawatt on November 07, 2013, 06:56:28 PM

The magic lies in using a Bloom filter to store the intermediate hashes.

For sake of history

Bloom filters were first mentioned in this thread in mysteriously disappeared message by iruu, it was right before your post.

Quote

Algorithm:
We have pw parallel workers and one m-bit bloom filter, initialized to zero.

This bounty has been award to gigawatt for his excellent work. Great progress all!

smolen

hero member

Activity: 524

Merit: 500

Quote from: gigawatt on November 07, 2013, 06:56:28 PM

The magic lies in using a Bloom filter to store the intermediate hashes.

For sake of history

Bloom filters were first mentioned in this thread in mysteriously disappeared message by iruu, it was right before your post.

Quote

Algorithm:
We have pw parallel workers and one m-bit bloom filter, initialized to zero.

gigawatt

full member

Activity: 168

Merit: 100

Quote from: bytemaster on November 07, 2013, 07:51:22 PM

All further discussion in this thread should be moved here: http://bitsharestalk.org/index.php?topic=22.0

I do not have time to follow two threads or repeat myself... :-)

gigawatt: join us over at bitsharestalk, I will pay you a 0.5 BTC tip for your contribution.

I scooted my conversation. My address is in my signature. Cheesy

bytemaster

hero member

Activity: 770

Merit: 568

fractally

All further discussion in this thread should be moved here: http://bitsharestalk.org/index.php?topic=22.0

I do not have time to follow two threads or repeat myself... :-)

gigawatt: join us over at bitsharestalk, I will pay you a 0.5 BTC tip for your contribution.

Etlase2

hero member

Activity: 798

Merit: 1000

Quote from: bytemaster on November 07, 2013, 07:34:41 PM

I wonder what would happen if we used NESTED momentum proof of work?

Change the nonce-space of the outer proof of work to a pair of 16 bit numbers that result in a X-bit collision?

Now you have a more memory intensive inner hash that is still quick to validate, but would significantly complicate GPU miners.

AnonyMint suggested this very thing but with scrypt at one point. However, I think it would have made verifying even more difficult. He would know better. He would probably also be a reasonable judge of this corollary to that idea.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

I wonder what would happen if we used NESTED momentum proof of work?

Change the nonce-space of the outer proof of work to a pair of 16 bit numbers that result in a X-bit collision?

Now you have a more memory intensive inner hash that is still quick to validate, but would significantly complicate GPU miners.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

I suppose the performance of your algorithm would also suffer if instead of SHA512(X), SCRYPT(X) was used because the cost of doing this step twice would be much higher and less GPU friendly.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

gigawatt,
   Thank you for providing the first innovative algorithm for reducing the memory requirements. Let me attempt to post mitigating factors to your algorithm.

   From a CPU miner perspective, your reduction in memory comes at the expense of performance so does break the algorithmic complexity of the algorithm.

   From a GPU perspective you have to populate a bloom filter with 2^26 results... based upon my understanding of how bloom filters operate this would require updating a common data-structure from every thread and the resulting memory race conditions could create false negatives. If you have to do this step sequentially, then you might as well use a CPU with memory.

   So do you have any solid algorithms that can populate a bloom filter with 2^26 results in parallel?

gigawatt

full member

Activity: 168

Merit: 100

I've managed to make a huge step forward in proving Momentum being not nearly as good for proof-of-work as intended.

The magic lies in using a Bloom filter to store the intermediate hashes.
As a result, instead of using 12 bytes per hash/nonce in a Semi-Ordered Map (which results in ~750 MB of memory), the required memory is (-1 * (2^26 * ln(0.01)) / (ln(2)^2)), or about ~76 MB.

This number can be reduced arbitrarily if we're willing to have a false positive rate greater than 1%. For example, if we allowed up to a 50% chance of having a false positive, the memory requirement drops to ~11 MB.

Here's a overview of how the algorithm works:

Make a "main" bloom filter of size 2^26 with a false positive rate 1%: ~76 MB
Make a tiny "clash" bloom filter of size 2^16 and false positive rate 2^-26: ~0.7 MB
Make a vector of pairs< hash, nonce > to store candidate birthday collisions.
For each nonce in the search space, check if its hash exists in the "main" bloom filter. If it is, add it's entry to the "clash" bloom filter.
The "main" bloom is no longer required and can be discarded.
For each nonce in the search check if its hash exists in the "clash" bloom filter. If it does, add < hash, nonce > to a candidate list for investigation.
Sort the list of candidates by hash.
For each pair in the candidate list, see if the previous element has the same hash. If it does, add it to the output list. This step removes false positives by comparing the actual hash instead of the bloom filter's idea of a hash.
Return the output list as normal.

For your testing pleasure, I also built a working proof of concept.
(Most of the magic is right here. The bloom filter is a modified version of "bloom.cpp" called "bigbloom.cpp")

Unmodified source: https://i.imgur.com/k2cNrmd.png

Source using bloom filters: https://i.imgur.com/w8Enf9e.png

In exchange for lowering the memory requirements by a factor of ~10, the algorithm runs at about 1/4th speed, mainly due to the doubling calls to SHA512 and the hash calls within the bloom filters. The overall result is a net efficiency gain of approximately 2.5
The reduction in memory requirement means that if we could fit N instances of Momentum in GPU memory, we can instead fit 10*N instances. If we up the false positive rate in exchange for more time spent sorting, we can achieve ratios of up to 70*N.

Given that bloom filters, SHA512, and sorting data are all parallel/GPU friendly, we can conclude that Momentum as a proof-of-work isn't nearly as GPU-hard as initially intended.

smolen

hero member

Activity: 524

Merit: 500

Re-posting here, just in case

Quote from: smolen on November 06, 2013, 12:36:28 PM

- hash nonce space into birthdays space in parallel, get array of (nonce, birthday)
- sort it with bitonic
- find dublicates
Bonus: don't store birthdays at all, calculate them ad hoc during the sort
This is not an attack on the algorithm itself, it's just a way to fit existing algorithm into GPU

bytemaster

hero member

Activity: 770

Merit: 568

fractally

I have updated the terms of the bounty... http://bitsharestalk.org/index.php?topic=22.0

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Further discussion on the Momentum Proof-of-Work should be moved here: http://bitsharestalk.org/index.php?topic=21.msg24#msg24

bytemaster

hero member

Activity: 770

Merit: 568

fractally

The memory access pattern is 'random' so CPU cache does not help improve the locality of memory access.

azwccc

sr. member

Activity: 243

Merit: 250

What exactly is the "memory bus speed"? Is the RAM speed like PC12800 or the QPI become the bottleneck?

Will larger CPU cache size be able to help a little bit in the performance?

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: FreeTrade on October 19, 2013, 06:44:26 AM

This assumes ruling out the trivial A=B, of course.

Yeah, I think we're on top of that one Wink

bytemaster

hero member

Activity: 770

Merit: 568

fractally

note I added to my last reply.. I certainly don't want you wishing you didn't reply.

webr3

full member

Activity: 168

Merit: 100

Quote from: bytemaster on October 26, 2013, 09:57:52 PM

Yep... there are many checks I left out of the verify method

1) I need to check that nonce < MAX
2) check that a != b

Aye, you've got me wishing I'd just kept my mouth shut and waited to see if it was in the live code, could have mined a bounty out if not Wink

Regardless, it's a good PoW system and will be considering using it, or a variation of it with a customized salsa20 in the future.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Yep... there are many checks I left out of the verify method

1) I need to check that nonce < MAX
2) check that a != b

Your feedback is useful from an open source development point of view and of course I cannot ever prove that these checks were on my TODO list. And I am glad you brought it up because we should never assume others didn't miss something. Look at what recently happened to Namecoin!

Anyway my plan is to offer smaller bounties for implementation bugs once I have completed my implementation. I will send you a tip for pointing it out because after thinking about it your feedback did provide non-zero value.

Thanks for looking at things.

webr3

full member

Activity: 168

Merit: 100

Quote from: bytemaster on October 26, 2013, 09:45:00 PM

Webr3, thank you for the feedback. It would certainly add clarity that such a thing must be checked. Other things that we are checking that are not referenced in the white paper is that A and B are both below some MAX NONCE value. I will update the white paper with additional clarity and enhancements regarding the ideal parameters for various goals.

Yes, do remember to update that verify function too, sometimes it's the most obvious things that catch you out further down the line Wink

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Webr3, thank you for the feedback. It would certainly add clarity that such a thing must be checked. Other things that we are checking that are not referenced in the white paper is that A and B are both below some MAX NONCE value. I will update the white paper with additional clarity and enhancements regarding the ideal parameters for various goals.

webr3

full member

Activity: 168

Merit: 100

so currently you have:

1) Given a block of data D, calculate H = Hash(D).
2) Find nonce A and nonce B such that BirthdayHash(A +H) == BirthdayHash( B+H)
3) If Hash(H + A + B) < TargetDifficulty then a result has been found, otherwise keep
searching.

Step 2 never needs to be run, BirthdayHash is never executed.

All you do is increment A and run Hash(H + A + A) until a result has been found that is under the TargetDifficulty. As soon as you've found this, then B == A and of course BirthdayHash(A+H) == BirthdayHash(B+H).

To fix this, all you need to do is create a rule such that A and B must be different.

Rather than just assuming it and not hard coding it or building it in to the algo that A != B. Seems quite a critical thing to miss from a white paper.

You'll also note that you didn't check for this in your own validation function

Quote from: bytemaster on October 21, 2013, 11:57:39 PM

Code:

bool momentum_verify( pow_seed_type head, uint32_t a, uint32_t b )
{
uint32_t ia = (a / 8) * 8;
fc::sha512::encoder enca;
enca.write( (char*)&ia, sizeof(ia) );
enca.write( (char*)&head, sizeof(head) );
auto ar = enca.result();

uint32_t ib = (b / 8) * 8;
fc::sha512::encoder encb;
encb.write( (char*)&ib, sizeof(ib) );
encb.write( (char*)&head, sizeof(head) );
auto br = encb.result();

return (ar._hash[a%8]>>14) == (br._hash[b%8]>>14);
}

Very surprised to see it missing given your comment to me that:

Quote from: bytemaster on October 27, 2013, 02:35:33 AM

What you have just stipulated was implied by using variables A and B and simply pointing out that A cannot be equal to B is IMPLIED and also OBVIOUS. It does not change the algorithm or the idea. So obvious in fact that I suspect everyone else reading it also assumed that requirement.

Sorry, you don't get 30 BTC because I am doing NOTHING DIFFERENT and your feedback provided 0 value.

Hopefully it is of some value, and you correct the code and add it in to the paper, else of course the whole approach is pointless is A can equal B Wink

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: FreeTrade on October 20, 2013, 12:20:42 PM

Quote from: bytemaster on October 20, 2013, 12:17:05 PM

The Birthday Hash will never find a 256 bit collision, so you would truncate it to 64 bit of which perhaps 34-38 bits will be used. Saturating the PCI bus is not best approach here.

Or you might use the hash to describe 8 32 bit birthdays, or 7 36 bit birthdays . . . modifying it as you wanted to modify the difficulty. Not a problem.

Hmmm - I'm reconsidering this. While the 256 bit hash is cryptologically secure . . . can we say the same thing about just the first 32 bits of it? Could an attacker shape the data to generate a hash where he controlled the first 32 bits of it?

dna_gym

member

Activity: 71

Merit: 10

I think that Momentum also has Time-Memory-Trade-Off (TMTO).
There are some birthday attack algorithms, one particular example is Floyd's cycle finding algorithm, which can be used to find hash collisions using memory of O(1).

Fortunately, because the nonce space is limited to 2^26 bits, and the collision difficulty is 50 bits, the "memoryless" attacker is forced to try about 16 million (2^(50-26)) times harder to find one 50-bit-collision than he was given full nonce size of 50 bits.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: smolen on October 22, 2013, 05:10:49 AM

Quote from: bytemaster on October 21, 2013, 11:57:39 PM

use the existing sha256 difficulty from bitcoin to adjust the difficulty.

Could existing sha256 ASICs be used to pre-compute nonce filed in block header?

No. Finding the collision depends upon the block header and everything is based on sha512 for finding collisions.

smolen

hero member

Activity: 524

Merit: 500

Quote from: bytemaster on October 21, 2013, 11:57:39 PM

use the existing sha256 difficulty from bitcoin to adjust the difficulty.

Could existing sha256 ASICs be used to pre-compute nonce filed in block header?

bytemaster

hero member

Activity: 770

Merit: 568

fractally

THe 5 BTC bounty from this thread has been supplanted by: https://bitcointalksearch.org/topic/protoshares-bounty-30-btc-start-mining-bitshares-in-november-315973

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Code:

#define MAX_NONCE (1<<26)

std::vector< std::pair > momentum_search( pow_seed_type head )
{
std::unordered_map found;
found.reserve( MAX_NONCE );
std::vector< std::pair > results;

for( uint32_t i = 0; i < MAX_NONCE; )
{
fc::sha512::encoder enc;
enc.write( (char*)&i, sizeof(i) );
enc.write( (char*)&head, sizeof(head) );

auto result = enc.result();

for( uint32_t x = 0; x < 8; ++x )
{
uint64_t birthday = result._hash[x] >> 14;
uint32_t nonce = i+x;
auto itr = found.find( birthday );
if( itr != found.end() )
{
results.push_back( std::make_pair( itr->second, nonce ) );
}
else
{
found[birthday] = nonce;
}
}
i += 8;
}
return results;
}

bool momentum_verify( pow_seed_type head, uint32_t a, uint32_t b )
{
uint32_t ia = (a / 8) * 8;
fc::sha512::encoder enca;
enca.write( (char*)&ia, sizeof(ia) );
enca.write( (char*)&head, sizeof(head) );
auto ar = enca.result();

uint32_t ib = (b / 8) * 8;
fc::sha512::encoder encb;
encb.write( (char*)&ib, sizeof(ib) );
encb.write( (char*)&head, sizeof(head) );
auto br = encb.result();

return (ar._hash[a%8]>>14) == (br._hash[b%8]>>14);
}

I have put together this code which is not optimized in the least, but I estimate once fully optimized will require 2^26 * 12 = 768 MB of RAM minimum and will run in one or two seconds on a Core i7. The nonce space is limited to 2^26 items and the collision difficulty is 50 bits which on average finds 2-3 pairs per run. This would require memory bandwidth of 512 MB/sec sustained. Assuming you built an ASIC for SHA512 it would be at most 100x faster given the highest memory bus speeds available.

To integrate this with a new blockchain you simply add 2 nonces to the block header, call momentum_verify and then use the existing sha256 difficulty from bitcoin to adjust the difficulty.

Etlase2

hero member

Activity: 798

Merit: 1000

Quote from: bytemaster on October 20, 2013, 10:08:41 PM

We have solved this problem by limiting the nonce search space so that the maximum amount of useful memory for gaining non-linear performance increases is around 4 GB. Assuming you have exhausted all possible nonces you would have to clear the memory and start over. Now it is a race to see how fast you can fill 4 GB of memory. While an ASIC could certainly do it quickly, memory bus speeds would still be the limiting factor.

Doesn't this mean that the PoW loses its "momentum" property then, though? I think it's probably for the better anyway, as relying on an unusual mechanic of the PoW algorithm to reduce some gameable property of bitShares is probably a short-sighted idea. Or am I mistaken about the momentum?

bytemaster

hero member

Activity: 770

Merit: 568

fractally

The question becomes what is the fastest possible cryptographically secure hash that can be performed by a CPU? It doesn't have to be the most secure of cryptographic hash functions... I am thinking that there could probably be some combination of cryptographic hash function extended by a non cryptographically secure hashing algorithm such as city.

For example: You could generate a large number of birthdays per nonce like so:

P = sha512( H + nonce )
CITY256( P + 1 ) + CITY256(P+2) + ...

If you generated 16KB of nonces per sha512()... you would get a massive speed up and I suspect suffer little if any from the potential of attacks on CITY.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 21, 2013, 09:09:52 AM

But lets see if we can keep things non-linear. Suppose your BirthdayHash has 32 bits, but your nonce search space only has 16 bits. You will run out of nonces long before you are able to generate enough birthdays to get very high on the q(n) curve. I would guess you could tune the parameters in the following way:

Limit the nonce search space such that the expected number of birthday matches is 1. If this were the actual birthday problem, we would set the nonce nonce search space to ~50 and number of days to 365. 99% of the time you would find 1 match, though sometimes you might find 2 or 0.

Once all nonces of been searched, you have to start over again.

Okay, yes, by limiting the nonce search space, and keeping the memory bus full, I don't think adding hashing power will be helpful. Thanks for the explanation.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 21, 2013, 05:37:14 AM

Quote from: Nathaniel B on October 20, 2013, 12:41:55 PM

The problem with your p(n) and q(n) curves are that they measure the probability of at least one match. What we really want to measure is the expected number of matches since due to a difficulty setting having more than one match is important. This is what gives a relationship of number of hashes/second * amount of RAM. Even if we only increase the hashes/second by a factor of two we still get twice as many matches giving linear scaling in exactly the same way as just using whatever birthdayhash algorithm you choose by itself.

Actually I've given it some more thought, and now I'm finding this very convincing.

Let's say 365 possible birthdays and a CPU can generate 10 per second, and process a maximum of 10 per second into memory.

Add a GPU that can generate 100 per second, reduce the search space by 90%, discard 90 of the birthdays that aren't in the smaller search space, and send the 10 to the CPU to process into a search space of 36 days.

So the q(n) might only be half as effective as the p(n), but hashing can be added *infinity and we're getting linear scaling.

We need to prevent easy reduction of the search space.

You want hash-speed to be linear once the target memory has been consumed. As the number of items in memory grows the performance of the algorithm approaches the linear speed of the birthday hash alone; however, it does not eliminate the need for the memory to be present. Putting the momentum property-aside, having a linear hash function that requires multiple GB of RAM to perform at peak performance means that the cost of going parallel with the hash function is an order of magnitude higher.

But lets see if we can keep things non-linear. Suppose your BirthdayHash has 32 bits, but your nonce search space only has 16 bits. You will run out of nonces long before you are able to generate enough birthdays to get very high on the q(n) curve. I would guess you could tune the parameters in the following way:

Limit the nonce search space such that the expected number of birthday matches is 1. If this were the actual birthday problem, we would set the nonce nonce search space to ~50 and number of days to 365. 99% of the time you would find 1 match, though sometimes you might find 2 or 0.

Once all nonces of been searched, you have to start over again.

Using these parameters you would require 50x as much memory to find the target as to verify the target. If you were to grow both the nonce search space and the bits of the BirthdayHash you would maintain the same principle while requiring any amount of RAM.

The performance of the hash function would now be a matter of how quickly one could fill the RAM. This is of course linear with respect to processing power. The goal is to require as much RAM as possible per unit of processing power. All we are really achieving with this hash function is to increase the number and type of transistors necessary.

Some goals:
1) get the fastest possible BirthdayHash algorithm because this will generate data via a CPU at a rate that nears the memory throughput. Any ASIC that could perform the HASH faster would be stuck waiting on the memory bus.
Once the memory bus becomes the bottleneck in performance we have won.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: Nathaniel B on October 20, 2013, 12:41:55 PM

The problem with your p(n) and q(n) curves are that they measure the probability of at least one match. What we really want to measure is the expected number of matches since due to a difficulty setting having more than one match is important. This is what gives a relationship of number of hashes/second * amount of RAM. Even if we only increase the hashes/second by a factor of two we still get twice as many matches giving linear scaling in exactly the same way as just using whatever birthdayhash algorithm you choose by itself.

Actually I've given it some more thought, and now I'm finding this very convincing.

Let's say 365 possible birthdays and a CPU can generate 10 per second, and process a maximum of 10 per second into memory.

Add a GPU that can generate 100 per second, reduce the search space by 90%, discard 90 of the birthdays that aren't in the smaller search space, and send the 10 to the CPU to process into a search space of 36 days.

So the q(n) might only be half as effective as the p(n), but hashing can be added *infinity and we're getting linear scaling.

We need to prevent easy reduction of the search space.

td services

sr. member

Activity: 448

Merit: 250

black swan hunter

Quote from: FreeTrade on October 20, 2013, 01:59:52 PM

So I think the fundamental idea is sound, I don't make arguments against it, and I don't see convincing ones here. We do differ in the details.

My current thoughts for my preferred implementation are as follows -

1. Require between 256MB - 1GB memory
->To allow for use on very low end equiment

2. Limit A and B such that all possible matches are exhausted in less than 1 second - H would need to change every second and could include new transactions.
->So that faster computers are not disproportionately advantaged over slower ones.

3. Change
Hash(H + A + B) < TargetDifficulty
to
Hash(H + A + B + MemoryHash(A) + MemoryHash(B)) < TargetDifficulty
->To prevent GPUs evaluating this step before adding B to memory.

I agree with this configuration. It encourages the use of low power, low cost, fanless boards for always on use which are also useful for personal clouds, self hosting, and community networks. I have had a hard time finding low cost ARM boards with more than .5 gb per core. The 4 gb configuration favors desktops and laptops which use a lot more power and are prone to cooling problems over time due to dust accumulation from always on use. Failed power supplies from overheating with dust clogged vents are by far the most common hardware failure I've seen in 12 years of servicing computers as a business.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: BitshireHashaway on October 20, 2013, 07:53:47 PM

Quote from: Nathaniel B on October 18, 2013, 10:03:44 PM

Once the RAM is full everything is a read so it'll be perfectly parallel with some negligible constant time overhead.

Improvements against scrypt for the birthdayhash will help a bit, but like you said any tricks generally translate fine into ASICs.

For your example, since the mac pro has 16 times more RAM and hashing power it will be exactly 16^2 or 256 times better. Well worth the 50x cost. Since this is quadratic it will just get worse when you get into 128GB servers.

I like the elegance of having the RAM be the main resource for hashing, but I think we have a ways to go before making it practical. I'll keep thinking about it.

Nathaniel

The problem with memory is that setups will then turn to server motherboards, some of which can hold an insane amount of RAM. http://www.newegg.com/Product/Product.aspx?Item=N82E16813182346 This motherboard has 24 Ram Slots for $490.00

For best value, we can go with 16gb memory sticks at $150.00 each also available on newegg. http://www.newegg.com/Product/Product.aspx?Item=N82E16820226412

With 24 of these, you would have 384gb of Ram. Granted, your cost would be ~$6000 when you finally put the system together. However, with 384gb of ram, this system is 144 times as good as your Mac Pro for around 1.2x the cost. People would just have systems like this lying around.

Also, that doesn't even count one of Intel's Server Motherboards which can hold theoretically up to 1.5TB of ram (48 slots at 32gb each). However, at current prices your looking at 30 grand just in ram for that.

However, that being said, I would support a memory based coin. I kind of like the idea of having my house covered with servers each having hundreds of gb of ram versus the idea of my house covered with Asics

.

We have solved this problem by limiting the nonce search space so that the maximum amount of useful memory for gaining non-linear performance increases is around 4 GB. Assuming you have exhausted all possible nonces you would have to clear the memory and start over. Now it is a race to see how fast you can fill 4 GB of memory. While an ASIC could certainly do it quickly, memory bus speeds would still be the limiting factor.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Thanks for the reference, I will gladly give credit where it is due.

Sergio_Demian_Lerner

hero member

Activity: 555

Merit: 654

bytemaster: Your idea of using hash collisions is good, but is not new!

I proposed it on my paper MAVEPAY, April 16, 2012 (and even then I thought I wasn't something new).

Here is the part of the paper where the Colission POW (as I called it) is proposed, available in http://bitslog.files.wordpress.com/2012/04/mavepay1.pdf

The Collision POW:
The collision POW is based on the difficulty of finding partial preimage collisions of two messages. The packet broadcast has the following properties:
•packet=< command, nonce1, nonce2, bn >
•tmp=Hash(command||bn).
•pow1=Hash(tmp||nonce1).
•pow2=Hash(tmp||nonce2).
•pow1 and pow2 must be prefixed by some predefined number of zero bits and share some number of following bits (the remaining bits can differ).
The related attack is the partial collision attack.

Birthday attacks require large amounts of memory and so preclude the use of GPUs and other massively parallel computer architectures. An attacker is therefore less capable of using these technologies to mount a command flooding attack.

Could you please add a reference in your paper to the Collision Pow of MAVEPAY ?

Best regards, Sergio.

BitshireHashaway

full member

Activity: 182

Merit: 100

Quote from: Nathaniel B on October 18, 2013, 10:03:44 PM

Once the RAM is full everything is a read so it'll be perfectly parallel with some negligible constant time overhead.

Improvements against scrypt for the birthdayhash will help a bit, but like you said any tricks generally translate fine into ASICs.

For your example, since the mac pro has 16 times more RAM and hashing power it will be exactly 16^2 or 256 times better. Well worth the 50x cost. Since this is quadratic it will just get worse when you get into 128GB servers.

I like the elegance of having the RAM be the main resource for hashing, but I think we have a ways to go before making it practical. I'll keep thinking about it.

Nathaniel

The problem with memory is that setups will then turn to server motherboards, some of which can hold an insane amount of RAM. http://www.newegg.com/Product/Product.aspx?Item=N82E16813182346 This motherboard has 24 Ram Slots for $490.00

For best value, we can go with 16gb memory sticks at $150.00 each also available on newegg. http://www.newegg.com/Product/Product.aspx?Item=N82E16820226412

With 24 of these, you would have 384gb of Ram. Granted, your cost would be ~$6000 when you finally put the system together. However, with 384gb of ram, this system is 144 times as good as your Mac Pro for around 1.2x the cost. People would just have systems like this lying around.

Also, that doesn't even count one of Intel's Server Motherboards which can hold theoretically up to 1.5TB of ram (48 slots at 32gb each). However, at current prices your looking at 30 grand just in ram for that.

However, that being said, I would support a memory based coin. I kind of like the idea of having my house covered with servers each having hundreds of gb of ram versus the idea of my house covered with Asics

.

smolen

hero member

Activity: 524

Merit: 500

Quote from: bytemaster on October 20, 2013, 09:51:10 AM

Quote from: smolen on October 20, 2013, 09:49:10 AM

Quote from: bytemaster on October 20, 2013, 09:22:04 AM

The GPU can produce blocks of 100K nonce/hash pairs which a CPU will then push into the hash table at 100K / sec and the result would be significant GPU advantage.

Just a quick thought, you'd better to oversaturate PCIe 3.0 x16 bandwidth to be sure. But that will require really fast hash function...

This is why I hobble the GPU with AES and unpredictable run time. It destroys the massive parallelism by forcing most of the threads to idle.

Sorry for some confusion, I'll clarify. Your idea is to make GPU useless by using GPU-hard hash function; all hashes can be uploaded from GPU to the host, but they are produced very slow, to give GPU miner no (economic) advantage. Mine is to make hash function so fast that the produced results cannot be uploaded to the host in reasonable time. So GPU will have ability to produce hashes, but not enough memory to find collision, and it would be cheaper to calculate hashes on CPU than to upload them from GPU.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 20, 2013, 02:21:06 PM

Due to moores law, the memory requirements should be targeted 1-2 years in the future and then have some means to adjust over time.

I'm thinking its better to set the requirements low for a single process and more capable computers can just run multiple processes . . . it'll scale for powerful computers (linearly with cost of hardware and energy hopefully) but it won't scale for GPUs as they'll quickly run out of both memory and memory bus access.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 20, 2013, 02:38:53 PM

Quote from: bytemaster on October 20, 2013, 02:17:14 PM

This is relatively good, except that I want the momentum property to prevent gaming of the BitShares market.

That's a feature for you. It's a bug for me.

Quote from: bytemaster on October 20, 2013, 02:17:14 PM

I think specifying time to exhaust all nonces is pointless because it depends upon implementation and hardware. I would suggest using 32 bit nonces and letting the time be what it will.

For an unlimited search on a single H, faster computers have a disproportionate advantage because they climb the probability graph much quicker. Limiting the time allowed on a single H, slower computers spend the same proportion of time in the higher part of the probability graph as faster computers.

Quote from: bytemaster on October 20, 2013, 02:17:14 PM

I would make sure that the BirthdayHash(X) method used AES instructions which are hardware optimized on x86 and ARM and thus leave as little room as possible for ASIC enhancement.

Don't know much about that - pretty sure I wouldn't be using SHA-256 though.

I think we are in basic agreement... I didn't say unlimited search, I said 32 bit limited search... though I suppose you could tie the bit limit to the memory limit so would recommend 29 bits minimum. This would allow you to fill 4 GB of memory worth of unique nonces. Within a year or two every cell phone will have 4 GB of memory and 64 bit quad-core processors with graphics units able to fill it.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 20, 2013, 02:17:14 PM

This is relatively good, except that I want the momentum property to prevent gaming of the BitShares market.

That's a feature for you. It's a bug for me.

Quote from: bytemaster on October 20, 2013, 02:17:14 PM

I think specifying time to exhaust all nonces is pointless because it depends upon implementation and hardware. I would suggest using 32 bit nonces and letting the time be what it will.

For an unlimited search on a single H, faster computers have a disproportionate advantage because they climb the probability graph much quicker. Limiting the time allowed on a single H, slower computers spend the same proportion of time in the higher part of the probability graph as faster computers.

Quote from: bytemaster on October 20, 2013, 02:17:14 PM

I would make sure that the BirthdayHash(X) method used AES instructions which are hardware optimized on x86 and ARM and thus leave as little room as possible for ASIC enhancement.

Don't know much about that - pretty sure I wouldn't be using SHA-256 though.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

By targeting the memory and CPU abilities of the common computer as the means of reaching maximum performance you have eliminated ASIC and super computers. Due to moores law, the memory requirements should be targeted 1-2 years in the future and then have some means to adjust over time.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 20, 2013, 01:59:52 PM

So I think the fundamental idea is sound, I don't make arguments against it, and I don't see convincing ones here. We do differ in the details.

My current thoughts for my preferred implementation are as follows -

1. Require between 256MB - 1GB memory
->To allow for use on very low end equiment

2. Limit A and B such that all possible matches are exhausted in less than 1 second - H would need to change every second and could include new transactions.
->So that faster computers are not disproportionately advantaged over slower ones.

3. Change
Hash(H + A + B) < TargetDifficulty
to
Hash(H + A + B + MemoryHash(A) + MemoryHash(B)) < TargetDifficulty
->To prevent GPUs evaluating this step before adding B to memory.

This is relatively good, except that I want the momentum property to prevent gaming of the BitShares market.

By limiting the nonce search space you eliminate super computers from dominating (good) while still making the cost of an ASIC dominated by RAM (good).

I think specifying time to exhaust all nonces is pointless because it depends upon implementation and hardware. I would suggest using 32 bit nonces and letting the time be what it will.

I would make sure that the BirthdayHash(X) method used AES instructions which are hardware optimized on x86 and ARM and thus leave as little room as possible for ASIC enhancement.

FreeTrade

legendary

Activity: 1470

Merit: 1030

So I think the fundamental idea is sound, I don't make arguments against it, and I don't see convincing ones here. We do differ in the details.

My current thoughts for my preferred implementation are as follows -

1. Require between 256MB - 1GB memory
->To allow for use on very low end equiment

2. Limit A and B such that all possible matches are exhausted in less than 1 second - H would need to change every second and could include new transactions.
->So that faster computers are not disproportionately advantaged over slower ones.

3. Change
Hash(H + A + B) < TargetDifficulty
to
Hash(H + A + B + MemoryHash(A) + MemoryHash(B)) < TargetDifficulty
->To prevent GPUs evaluating this step before adding B to memory.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote

As the number of stored birthdays approach the number of days per year the efficiency approaches that of the Birthday Hash alone which if it were SHA256 would be little different than Bitcoin.

I am actually going to back off on this a bit, it would still be much better than bitcoin because checking the RAM is a point of centralization that would significantly complicate the degree to which the algorithm could be made parallel.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote

The problem with your p(n) and q(n) curves are that they measure the probability of at least one match. What we really want to measure is the expected number of matches since due to a difficulty setting having more than one match is important. This is what gives a relationship of number of hashes/second * amount of RAM. Even if we only increase the hashes/second by a factor of two we still get twice as many matches giving linear scaling in exactly the same way as just using whatever birthdayhash algorithm you choose by itself.

This is a very insightful comment. I would like to address that these curves show the efficiency of the algorithm as memory increases. As the number of stored birthdays approach the number of days per year the efficiency approaches that of the Birthday Hash alone which if it were SHA256 would be little different than Bitcoin.

What this also implies is that if you use 32 bit birthday hash with SHA256() it would be nearly identical to Bitcoin except that it would require a constant 4 GB of memory to be filled before reaching full speed and you would have extra overhead associated with accessing that memory. Because 4 GB of memory is cheap, this would not be enough to stop ASICs or effect any real change aside from some slight momentum required to get up to speed after changing the block.

If you use a 36 bit birthday hash then it would require 256GB constant memory to get up to speed.
If you use a 256 bit birthday hash then you may never find a solution.

Conclusion:
ASIC or not, the momentum POW does make it more difficult to game markets based on block chains.
The number of bits in the birthday hash may want to be one factor in adjusting difficulty.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 20, 2013, 12:49:17 PM

Yes, that is true. All it means is that the ideal mining rig is now outfitted with MAX ram and the highest end GPU that is able to fill it.

It would significantly increase the memory requirements for ASIC designs, that is certainly true. Ok, I think you might be on to something here. The limit on performance will now be based upon how much RAM you can put on a given machine. I would like to point out that once you run out of RAM you still gain performance by adding more processing power, it just grows linearly instead of non-linerally.

So there's always going to be an ideal mining rig but the best we can hope for is that its cost scales linearly with its performance. The memory bus is the bottleneck that is hardest to expand - that's why we want to keep it full as much as possible.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 20, 2013, 12:35:25 PM

Quote from: bytemaster on October 20, 2013, 12:26:24 PM

I suppose if your goal was to accelerate birthday generation then this would be a very GPU friendly POW.

Oh I disagree.

With traditional proof-of-work systems likes SHA256 or Scrypt it was possible to gain
performance through parallelism alone. However, regardless of how efficiently an ASIC
can run the individual birthday hash steps in parallel, your use of memory must scale to
store the result of every parallel run or you will lose an algorithmic advantage that
cannot be overcome by increasing levels of parallelism

Yes, that is true. All it means is that the ideal mining rig is now outfitted with MAX ram and the highest end GPU that is able to fill it.

It would significantly increase the memory requirements for ASIC designs, that is certainly true. Ok, I think you might be on to something here. The limit on performance will now be based upon how much RAM you can put on a given machine. I would like to point out that once you run out of RAM you still gain performance by adding more processing power, it just grows linearly instead of non-linerally.

Nathaniel B

newbie

Activity: 19

Merit: 0

Quote from: bytemaster on October 19, 2013, 10:38:34 AM

Quote from: PassTheAmmo on October 19, 2013, 10:33:59 AM

Quote from: bytemaster on October 19, 2013, 10:18:48 AM

I had considered the potential for such a design. DHT will not perform well enough, you would need 100BT ethernet or perhaps gigabit before the hash table could be filled quickly enough. From what I can tell a SUPER COMPUTER with a ton of RAM and high-speed interconnects would give you the best performance. Here is the difference, such a super computer could be built from commodity parts.

Well, if the hash is so fast you could just collaborate by splitting the birthday space in N ranges and each computer just throws away any result that isn't in its range. This trades hashing speed for memory, so works with rather than against your pow.

What range do you suggest for N and how fast is the hash?
This should be possible to test.

Any time you throw away hashes because they fall outside of your memory range you move the computational complexity from the p(n) curve to the q(n) curve in my white paper. You save on memory but the cost is far greater. Memory and CPU power must always be paired proportionally or you will have idle memory or waisted CPU cycles.

The problem with your p(n) and q(n) curves are that they measure the probability of at least one match. What we really want to measure is the expected number of matches since due to a difficulty setting having more than one match is important. This is what gives a relationship of number of hashes/second * amount of RAM. Even if we only increase the hashes/second by a factor of two we still get twice as many matches giving linear scaling in exactly the same way as just using whatever birthdayhash algorithm you choose by itself.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 20, 2013, 12:26:24 PM

I suppose if your goal was to accelerate birthday generation then this would be a very GPU friendly POW.

Oh I disagree.

With traditional proof-of-work systems likes SHA256 or Scrypt it was possible to gain
performance through parallelism alone. However, regardless of how efficiently an ASIC
can run the individual birthday hash steps in parallel, your use of memory must scale to
store the result of every parallel run or you will lose an algorithmic advantage that
cannot be overcome by increasing levels of parallelism

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 20, 2013, 12:20:42 PM

Quote from: bytemaster on October 20, 2013, 12:17:05 PM

The Birthday Hash will never find a 256 bit collision, so you would truncate it to 64 bit of which perhaps 34-38 bits will be used. Saturating the PCI bus is not best approach here.

Or you might use the hash to describe 8 32 bit birthdays, or 7 36 bit birthdays . . . modifying it as you wanted to modify the difficulty. Not a problem.

I suppose if your goal was to accelerate birthday generation then this would be a very GPU friendly POW.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 20, 2013, 12:17:05 PM

The Birthday Hash will never find a 256 bit collision, so you would truncate it to 64 bit of which perhaps 34-38 bits will be used. Saturating the PCI bus is not best approach here.

Or you might use the hash to describe 8 32 bit birthdays, or 7 36 bit birthdays . . . modifying it as you wanted to modify the difficulty. Not a problem.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 20, 2013, 11:54:36 AM

Quote from: smolen on October 20, 2013, 09:49:10 AM

Just a quick thought, you'd better to oversaturate PCIe 3.0 x16 bandwidth to be sure. But that will require really fast hash function...

So . . . some numbers . . . let's say SHA-256 because a lot of numbers are readily available - that's a 256 bit hash result, 32 bytes. If we wanted to a gb/sec from that, we want 1024x1024x1024/32 hashes . . . . or 32 M/h per second. There are faster hashes, and actually 0.5/gig per second would probably more than suffice.

The Birthday Hash will never find a 256 bit collision, so you would truncate it to 64 bit of which perhaps 34-38 bits will be used. Saturating the PCI bus is not best approach here.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: smolen on October 20, 2013, 09:49:10 AM

Just a quick thought, you'd better to oversaturate PCIe 3.0 x16 bandwidth to be sure. But that will require really fast hash function...

So . . . some numbers . . . let's say SHA-256 because a lot of numbers are readily available - that's a 256 bit hash result, 32 bytes. If we wanted to a gb/sec from that, we want 1024x1024x1024/32 hashes . . . . or 32 M/h per second. There are faster hashes, and actually 0.5/gig per second would probably more than suffice.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: smolen on October 20, 2013, 09:49:10 AM

Quote from: bytemaster on October 20, 2013, 09:22:04 AM

The GPU can produce blocks of 100K nonce/hash pairs which a CPU will then push into the hash table at 100K / sec and the result would be significant GPU advantage.

Just a quick thought, you'd better to oversaturate PCIe 3.0 x16 bandwidth to be sure. But that will require really fast hash function...

This is why I hobble the GPU with AES and unpredictable run time. It destroys the massive parallelism by forcing most of the threads to idle.

smolen

hero member

Activity: 524

Merit: 500

Quote from: bytemaster on October 20, 2013, 09:22:04 AM

The GPU can produce blocks of 100K nonce/hash pairs which a CPU will then push into the hash table at 100K / sec and the result would be significant GPU advantage.

Just a quick thought, you'd better to oversaturate PCIe 3.0 x16 bandwidth to be sure. But that will require really fast hash function...

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 20, 2013, 08:09:44 AM

Quote from: bytemaster on October 19, 2013, 01:42:06 PM

My aim was to create the fastest hash function I could that met one key criteria... unpredictable run length. The unpredictable run length means an unpredictable number of iterations of some fast algorithm.

Ok, and why is that important to you. I'm probably thinking in narrower terms (just for an altcoin) so I don't see the virtue in this - can you explain more please.

Quote from: bytemaster on October 19, 2013, 01:42:06 PM

The other criteria was to use just enough memory to foil GPU parallelization by consuming a lot of GPU cache. The point of Scrypt was to make parallel operation difficult and its primary weakness is that validation time suffers as you scale it.

It sounds like you're employing two methods to make the coin GPU resistant - contention for access to the memory/hashtable and requiring GPU cache. I think this actually doubles the 'attack' space . . . suggest focusing on one, and I recommend the first one (y'now, your new brilliant idea), as the second has already seen a lot of effort.

Unpredictable run length foils GPU threads by having many of the data-parallel threads idle waiting on the longer-running threads.

Fortunately I defined the white paper in generic terms so we can use what ever hash method you like. I will point out though that there are means of avoiding synchronization on the hash table through pipelining. The GPU can produce blocks of 100K nonce/hash pairs which a CPU will then push into the hash table at 100K / sec and the result would be significant GPU advantage.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 19, 2013, 01:42:06 PM

My aim was to create the fastest hash function I could that met one key criteria... unpredictable run length. The unpredictable run length means an unpredictable number of iterations of some fast algorithm.

Ok, and why is that important to you. I'm probably thinking in narrower terms (just for an altcoin) so I don't see the virtue in this - can you explain more please.

Quote from: bytemaster on October 19, 2013, 01:42:06 PM

The other criteria was to use just enough memory to foil GPU parallelization by consuming a lot of GPU cache. The point of Scrypt was to make parallel operation difficult and its primary weakness is that validation time suffers as you scale it.

It sounds like you're employing two methods to make the coin GPU resistant - contention for access to the memory/hashtable and requiring GPU cache. I think this actually doubles the 'attack' space . . . suggest focusing on one, and I recommend the first one (y'now, your new brilliant idea), as the second has already seen a lot of effort.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: prophetx on October 19, 2013, 09:53:29 PM

Quote from: bytemaster on October 19, 2013, 01:42:06 PM

Quote from: FreeTrade on October 19, 2013, 01:22:22 PM

Quote from: bytemaster on October 19, 2013, 01:07:11 PM

Good luck filling 1 GB with cryptographically secure data in 1 second. In debug mode I cannot even memset 1 GB in 1 second.

Okay, we better inform the author of the white paper who wrote,

>For example, simply populating 1 Gigabyte of memory with crypto-graphically secure pseudorandom data can take a second to perform.

Anyway, the exact speed is not important - the point is still the same, the slower the hash, the more room for paralellization.

LOL, what was the author who wrote that thinking. My aim was to create the fastest hash function I could that met one key criteria... unpredictable run length. The unpredictable run length means an unpredictable number of iterations of some fast algorithm. The other criteria was to use just enough memory to foil GPU parallelization by consuming a lot of GPU cache. The point of Scrypt was to make parallel operation difficult and its primary weakness is that validation time suffers as you scale it.

You may wish to reconsider foiling efforts that create the most efficient market outcomes and use of resources.

Imagine a world where all transactions and communications are conducted through p2p networks. This world needs a massive amount of computing power, networks which are costlier to operate lose out to networks which can be operated cost efficiently. Also keep in mind that GPU farms keep out botnets.

Efficiency and decentralization... are generally at odds with each other until we can come up with an alternative to proof of work. Ripple and their consensus approach seems like a promising alternative.

prophetx

legendary

Activity: 1666

Merit: 1010

he who has the gold makes the rules

Quote from: bytemaster on October 19, 2013, 01:42:06 PM

Quote from: FreeTrade on October 19, 2013, 01:22:22 PM

Quote from: bytemaster on October 19, 2013, 01:07:11 PM

Good luck filling 1 GB with cryptographically secure data in 1 second. In debug mode I cannot even memset 1 GB in 1 second.

Okay, we better inform the author of the white paper who wrote,

>For example, simply populating 1 Gigabyte of memory with crypto-graphically secure pseudorandom data can take a second to perform.

Anyway, the exact speed is not important - the point is still the same, the slower the hash, the more room for paralellization.

LOL, what was the author who wrote that thinking. My aim was to create the fastest hash function I could that met one key criteria... unpredictable run length. The unpredictable run length means an unpredictable number of iterations of some fast algorithm. The other criteria was to use just enough memory to foil GPU parallelization by consuming a lot of GPU cache. The point of Scrypt was to make parallel operation difficult and its primary weakness is that validation time suffers as you scale it.

You may wish to reconsider foiling efforts that create the most efficient market outcomes and use of resources.

Imagine a world where all transactions and communications are conducted through p2p networks. This world needs a massive amount of computing power, networks which are costlier to operate lose out to networks which can be operated cost efficiently. Also keep in mind that GPU farms keep out botnets.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 19, 2013, 01:22:22 PM

Quote from: bytemaster on October 19, 2013, 01:07:11 PM

Good luck filling 1 GB with cryptographically secure data in 1 second. In debug mode I cannot even memset 1 GB in 1 second.

Okay, we better inform the author of the white paper who wrote,

>For example, simply populating 1 Gigabyte of memory with crypto-graphically secure pseudorandom data can take a second to perform.

Anyway, the exact speed is not important - the point is still the same, the slower the hash, the more room for paralellization.

LOL, what was the author who wrote that thinking. My aim was to create the fastest hash function I could that met one key criteria... unpredictable run length. The unpredictable run length means an unpredictable number of iterations of some fast algorithm. The other criteria was to use just enough memory to foil GPU parallelization by consuming a lot of GPU cache. The point of Scrypt was to make parallel operation difficult and its primary weakness is that validation time suffers as you scale it.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 19, 2013, 01:07:11 PM

Good luck filling 1 GB with cryptographically secure data in 1 second. In debug mode I cannot even memset 1 GB in 1 second.

Okay, we better inform the author of the white paper who wrote,

>For example, simply populating 1 Gigabyte of memory with crypto-graphically secure pseudorandom data can take a second to perform.

Anyway, the exact speed is not important - the point is still the same, the slower the hash, the more room for paralellization.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 19, 2013, 01:02:32 PM

Quote from: bytemaster on October 19, 2013, 12:43:23 PM

about 2 GB in 5 minutes...

Hmm - I think you're going to want something that uses 1GB in 1 second, run 300 times. If you're not keeping the memory bus or hashtable running at maximum capacity, you can just add extra cores until it is up to full capacity. That surely is what we're trying to avoid.

Good luck filling 1 GB with cryptographically secure data in 1 second. In debug mode I cannot even memset 1 GB in 1 second.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 19, 2013, 12:43:23 PM

about 2 GB in 5 minutes...

Hmm - I think you're going to want something that uses 1GB in 1 second, run 300 times. If you're not keeping the memory bus or hashtable running at maximum capacity, you can just add extra cores until it is up to full capacity. That surely is what we're trying to avoid.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 19, 2013, 12:39:34 PM

Quote from: bytemaster on October 19, 2013, 12:19:59 PM

I added some more instrumentation, the speed is closer to 50 k/hash per second for the Birthday Hash algorithm.

Ok, so 50,000 x 60 x 5 x 36 (bits) - 540000000 bits - So writing 64MB to memory over the course of 5 minutes . . . . not so taxing. I think we'd want to be maxing out the memory bus by creating hash collision candidates such that we're writing in the region of 1GB/Sec to memory.

In my test benchmark, running on a single CPU core I utilize about 2 GB in 5 minutes... though I am not using the most efficient storage possible. For starters, I am using a hash table with 64bit (birthday) 64 bit value (nonce) so that is already about 4x as much memory as you calculated. Now I suspect that some of that 2 GB is from the hash table not being 'full'. There is obviously a Memory/CPU/synchronization tradeoff that can be made with the hash table.

Going to 8 cores on an i7 will probably use at least 8GB of RAM which is consumer grade right now.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 19, 2013, 12:19:59 PM

I added some more instrumentation, the speed is closer to 50 k/hash per second for the Birthday Hash algorithm.

Ok, so 50,000 x 60 x 5 x 36 (bits) - 540000000 bits - So writing 64MB to memory over the course of 5 minutes . . . . not so taxing. I think we'd want to be maxing out the memory bus by creating hash collision candidates such that we're writing in the region of 1GB/Sec to memory.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

I added some more instrumentation, the speed is closer to 50 k/hash per second for the Birthday Hash algorithm.

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: bytemaster on October 19, 2013, 10:27:01 AM

The "+" operator is concatenation rather than addition so the math you are doing is not valid.

Ah - of course! Thank you.

Quote from: bytemaster on October 19, 2013, 10:27:01 AM

BirthdayHash is only memory intensive enough to limit the cache available to the GPU. Scrypt-like hash functions have some benefit at fighting ASIC designs as long as their validation time does not exceed 1ms. By combining Scrypt-like functions with Momentum you get the best of all currently known techniques.

So if we assume 1ms to generate a BirthdayHash, this gives us 1000x60x5 hashes for a single core during the target time for a block - 300,000 hashes - very little memory required. The CPU and Memory will be idling, waiting for more completed hashes from the GPU - you want it the other way round, a super fast hash to fill up the memory fast.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: PassTheAmmo on October 19, 2013, 11:11:54 AM

Quote from: bytemaster on October 19, 2013, 11:00:10 AM

My Scrypt-like hashing function runs on a single core at about 1 k/hash with relatively few optimizations.

Ok. At that speed nodes in a DHT would only need a few kb/s of internet bandwidth to reach the full potential of the DHT. How much gain that would give would depend on the difficulty of finding a match. I think it would be cool if you made it even slower (if only by repeating it enough times) to make it suitable for one huge DHT. I don't know if it's possible, but for other reasons than the ones you brought up. Anyway, good luck, I think that you could have a great proof of work function with some modifications. I'm outta here!

It all comes down to a tradeoff with validation time. DHT like KAD would require much more bandwidth than the 'raw data' being published.

PassTheAmmo

newbie

Activity: 23

Merit: 0

Quote from: bytemaster on October 19, 2013, 11:00:10 AM

My Scrypt-like hashing function runs on a single core at about 1 k/hash with relatively few optimizations.

Ok. At that speed nodes in a DHT would only need a few kb/s of internet bandwidth to reach the full potential of the DHT. How much gain that would give would depend on the difficulty of finding a match. I think it would be cool if you made it even slower (if only by repeating it enough times) to make it suitable for one huge DHT. I don't know if it's possible, but for other reasons than the ones you brought up. Anyway, good luck, I think that you could have a great proof of work function with some modifications. I'm outta here!

bytemaster

hero member

Activity: 770

Merit: 568

fractally

FYI... I do not consider the algorithm broken if software enhancements can improve performance. I fully expect that with all of the smart minds out there that the algorithmic performance on commodity hardware will improve by several orders of magnitude.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: PassTheAmmo on October 19, 2013, 10:51:30 AM

Quote from: bytemaster on October 19, 2013, 10:38:34 AM

Any time you throw away hashes because they fall outside of your memory range you move the computational complexity from the p(n) curve to the q(n) curve in my white paper. You save on memory but the cost is far greater. Memory and CPU power must always be paired proportionally or you will have idle memory or waisted CPU cycles.

I don't wanna pollute your thread with many small replies. If you answer what hashing-power we're looking at in mhashes/s I could make a more thorough reply with numbers. Also, is 32 bit output for birthdayhash right? That would only take storing 77000 entries for 50% chance of finding a collision.

My Scrypt-like hashing function runs on a single core at about 1 k/hash with relatively few optimizations.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: PassTheAmmo on October 19, 2013, 10:51:30 AM

Quote from: bytemaster on October 19, 2013, 10:38:34 AM

Any time you throw away hashes because they fall outside of your memory range you move the computational complexity from the p(n) curve to the q(n) curve in my white paper. You save on memory but the cost is far greater. Memory and CPU power must always be paired proportionally or you will have idle memory or waisted CPU cycles.

I don't wanna pollute your thread with many small replies. If you answer what hashing-power we're looking at in mhashes/s I could make a more thorough reply with numbers. Also, is 32 bit output for birthdayhash right? That would only take storing 77000 entries for 50% chance of finding a collision.

There is a trade off between latency and rate of finding a collision. I am actually using something closer to 36 bits. Beyond 36 bits and the time it takes for a Core i7 to find a single collision would be too long. Remember the collision rate is only the rate at which lottery tickets are found. Difficulty is the rate at which blocks could be found. So this algorithm could be tuned to require no more than 4 GB of RAM after which it produces a collision for every single BirthdayHash... however, there are still benefits for storing and comparing all matches and not just single collisions because every potential match is a lottery ticket. Therefore the number of bits in the collision is almost irrelevant beyond a certain point.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Is there a way that we can quantify the cost vs gain associated with developing specialized solutions to SHA, Scrypt, and Momentum?

SHA is merely a game of transistors per chip and has no limits associated with a memory bus. The speed of light is not a significant factor as the chips only have to 'communicate' matching results to the outside world.

Scrypt is effectively the same as SHA with the exception that your density per ASIC is reduced. Once again there are no speed-of-light limitations because only matching results need to be communicated outside the chip itself.

The proof of the nature of these algorithms is the existence of a mining pool which effortlessly aggregates any number of computational units at a global scale to produce what is effectively one giant computer.

Attempt to build a mining pool super computer for Momentum and you will quickly discover that bandwidth and the speed-of-light make it so expensive as to be infeasible. So while there may be a theoretical quadratic return in algorithm performance for increasing the amount of RAM you have, the speed of light and density limitations kick in and limit performance beyond a certain point.

PassTheAmmo

newbie

Activity: 23

Merit: 0

Quote from: bytemaster on October 19, 2013, 10:38:34 AM

Any time you throw away hashes because they fall outside of your memory range you move the computational complexity from the p(n) curve to the q(n) curve in my white paper. You save on memory but the cost is far greater. Memory and CPU power must always be paired proportionally or you will have idle memory or waisted CPU cycles.

I don't wanna pollute your thread with many small replies. If you answer what hashing-power we're looking at in mhashes/s I could make a more thorough reply with numbers. Also, is 32 bit output for birthdayhash right? That would only take storing 77000 entries for 50% chance of finding a collision.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: PassTheAmmo on October 19, 2013, 10:33:59 AM

Quote from: bytemaster on October 19, 2013, 10:18:48 AM

I had considered the potential for such a design. DHT will not perform well enough, you would need 100BT ethernet or perhaps gigabit before the hash table could be filled quickly enough. From what I can tell a SUPER COMPUTER with a ton of RAM and high-speed interconnects would give you the best performance. Here is the difference, such a super computer could be built from commodity parts.

Well, if the hash is so fast you could just collaborate by splitting the birthday space in N ranges and each computer just throws away any result that isn't in its range. This trades hashing speed for memory, so works with rather than against your pow.

What range do you suggest for N and how fast is the hash?
This should be possible to test.

Any time you throw away hashes because they fall outside of your memory range you move the computational complexity from the p(n) curve to the q(n) curve in my white paper. You save on memory but the cost is far greater. Memory and CPU power must always be paired proportionally or you will have idle memory or waisted CPU cycles.

PassTheAmmo

newbie

Activity: 23

Merit: 0

Quote from: bytemaster on October 19, 2013, 10:18:48 AM

I had considered the potential for such a design. DHT will not perform well enough, you would need 100BT ethernet or perhaps gigabit before the hash table could be filled quickly enough. From what I can tell a SUPER COMPUTER with a ton of RAM and high-speed interconnects would give you the best performance. Here is the difference, such a super computer could be built from commodity parts.

Well, if the hash is so fast you could just collaborate by splitting the birthday space in N ranges and each computer just throws away any result that isn't in its range. This trades hashing speed for memory, so works with rather than against your pow.

What range do you suggest for N and how fast is the hash?
This should be possible to test.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: PassTheAmmo on October 19, 2013, 10:06:19 AM

I agree with FreeTrade, why use a memory intensive algorithm for BirthdayHash if you're gonna fill up your ram with hashes anyway?

But either way, here is why it is worse than scrypt:
Your proposed algorithm (momentum) scales super-linearly with the number of computers that can be dedicated to solving the proof-of-work. Set up N computers storing generated hashes in a distributed hash table between them. They now have N times the memory capacity of a single computer and will fill up their memory just as fast. This could scale to pretty large N's, you don't need low latency, you could accumulate hashes on each computer and send them in batches.

And this is bad of course (would lead to centralization).

Suggested modification:
Embrace this and encourage every single miner to be part of the same DHT (Not that simple, but you get the picture), share reward between the computer generating the hash and the one that had it stored. Could work on the current P2P system already in place.

I agree that if there were a way to efficiently merge the memory of all computers on the internet this would be great. Unfortunately, the memory bandwidth would be limited by the internet and thus not pratical at all. However, I suspect that every office environment could achieve this at a slightly smaller scale.

What is the cost per unit of computational power / memory as the amount of memory / power approaches infinity? I suspect that it goes non-linear for larger and larger sizes and thus the limits on Super Computers today. Eventually speed-of-light becomes your problem.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: FreeTrade on October 19, 2013, 06:44:26 AM

>Assuming a cryptographically secure hashing function Hash(x) and a Sequental-
>Memory-Hard hashing function BirthdayHash(x) such as scrypt the algorithm for this
>proof of work can be defined as follows:
>Given a block of data D, calculate H = Hash(D).
>Find nonce A and nonce B such that BirthdayHash(A +H) == BirthdayHash( B+H)
>If Hash(H + A + B) < TargetDifficulty then a result has been found, otherwise keep
>searching

Okay - just wanted to mention two issues, more to check whether I've correctly understood than anything else.

1. My first instinct would be to choose a fixed number, say F and build a rainbow table for values of x where

BirthdayHash(x) = F

Then in order to solve
BirthdayHash(A +H) == BirthdayHash( B+H)

setting A and B to values such that A=x1-H and B=x2-H

This assumes ruling out the trivial A=B, of course.

This is trivial to correct of course, with a salt or half a dozen other ways, but I just wanted to confirm my understanding is correct.

2. I'm confused as to why we'd want a memory intensive hash for BirthdayHash(x) - it seems the harder this hash is, the more opportunity for parallelization. As noted by Nathaniel, the ideal setup would be GPU working on the hashes, with the CPU looking for collisions. In order to create the bottleneck on the CPU/Memory, we'd want to flood it with potential matches so it was maxed out, and to do that we'd want to make the hash as simple as possible.

Again - have I understood that correctly, or am I missing something?

BirthdayHash(x) where X = 256 bit (or 512 bit) number and produces a 32 bit number would be hard to build a rainbow table for. Such a rainbow table would probably use far more memory.

The "+" operator is concatenation rather than addition so the math you are doing is not valid.

BirthdayHash is only memory intensive enough to limit the cache available to the GPU. Scrypt-like hash functions have some benefit at fighting ASIC designs as long as their validation time does not exceed 1ms. By combining Scrypt-like functions with Momentum you get the best of all currently known techniques.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: PassTheAmmo on October 19, 2013, 10:06:19 AM

I agree with FreeTrade, why use a memory intensive algorithm for BirthdayHash if you're gonna fill up your ram with hashes anyway?

But either way, here is why it is worse than scrypt:
Your proposed algorithm (momentum) scales super-linearly with the number of computers that can be dedicated to solving the proof-of-work. Set up N computers storing generated hashes in a distributed hash table between them. They now have N times the memory capacity of a single computer and will fill up their memory just as fast. This could scale to pretty large N's, you don't need low latency, you could accumulate hashes on each computer and send them in batches.

And this is bad of course (would lead to centralization).

Suggested modification:
Embrace this and encourage every single miner to be part of the same DHT (Not that simple, but you get the picture), share reward between the computer generating the hash and the one that had it stored. Could work on the current P2P system already in place.

I had considered the potential for such a design. DHT will not perform well enough, you would need 100BT ethernet or perhaps gigabit before the hash table could be filled quickly enough. From what I can tell a SUPER COMPUTER with a ton of RAM and high-speed interconnects would give you the best performance. Here is the difference, such a super computer could be built from commodity parts.

So the question becomes how would such a super computer compete with say a corporate office that works together via the high-speed local network to mine at night? The other question becomes, if you are going to build a SUPER COMPUTER to perform this mining computation then you probably have a dual-use machine that could be used for all kinds of research. This is far better than Scrypt or SHA256.

PassTheAmmo

newbie

Activity: 23

Merit: 0

I agree with FreeTrade, why use a memory intensive algorithm for BirthdayHash if you're gonna fill up your ram with hashes anyway?

But either way, here is why it is worse than scrypt:
Your proposed algorithm (momentum) scales super-linearly with the number of computers that can be dedicated to solving the proof-of-work. Set up N computers storing generated hashes in a distributed hash table between them. They now have N times the memory capacity of a single computer and will fill up their memory just as fast. This could scale to pretty large N's, you don't need low latency, you could accumulate hashes on each computer and send them in batches.

And this is bad of course (would lead to centralization).

Suggested modification:
Embrace this and encourage every single miner to be part of the same DHT (Not that simple, but you get the picture), share reward between the computer generating the hash and the one that had it stored. Could work on the current P2P system already in place.

bg002h

donator

Activity: 1466

Merit: 1048

I outlived my lifetime membership:)

Quote from: prophetx on October 19, 2013, 07:56:06 AM

You should change it kor stick to your original offer. Also "convince me" is very subjective, sounds like you ant someone to do work for free for you.

You must not know Dan. He's a real programmer (darn good too...really darn good) with a real reputation.

prophetx

legendary

Activity: 1666

Merit: 1010

he who has the gold makes the rules

FreeTrade

legendary

Activity: 1470

Merit: 1030

>Assuming a cryptographically secure hashing function Hash(x) and a Sequental-
>Memory-Hard hashing function BirthdayHash(x) such as scrypt the algorithm for this
>proof of work can be defined as follows:
>Given a block of data D, calculate H = Hash(D).
>Find nonce A and nonce B such that BirthdayHash(A +H) == BirthdayHash( B+H)
>If Hash(H + A + B) < TargetDifficulty then a result has been found, otherwise keep
>searching

Okay - just wanted to mention two issues, more to check whether I've correctly understood than anything else.

1. My first instinct would be to choose a fixed number, say F and build a rainbow table for values of x where

BirthdayHash(x) = F

Then in order to solve
BirthdayHash(A +H) == BirthdayHash( B+H)

setting A and B to values such that A=x1-H and B=x2-H

This assumes ruling out the trivial A=B, of course.

This is trivial to correct of course, with a salt or half a dozen other ways, but I just wanted to confirm my understanding is correct.

2. I'm confused as to why we'd want a memory intensive hash for BirthdayHash(x) - it seems the harder this hash is, the more opportunity for parallelization. As noted by Nathaniel, the ideal setup would be GPU working on the hashes, with the CPU looking for collisions. In order to create the bottleneck on the CPU/Memory, we'd want to flood it with potential matches so it was maxed out, and to do that we'd want to make the hash as simple as possible.

Again - have I understood that correctly, or am I missing something?

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Quote from: Nathaniel B on October 18, 2013, 11:24:31 PM

But you can still gain linear increases in matches with just increasing the hashrate so at the very least you can just have an ASIC or GPU that hashes and sends the results to a CPU that manages the memory and checks for matches. GPUs will still be crazy good for it since they could hash 100s of times faster than the CPU and with 2-3GB of memory on board, the CPU would need too much memory to pull even. I think a GPU/CPU hybrid would be best in class until ASICs with the GPU doing mostly hashing and then sending the results to system RAM and the CPU which checks for matches.

With this linear increase just by increasing the hashrate the Momentum ASICs will be at least as good as the scrypt and probably much better since they could just extend an off the self system with lots of RAM and use a CPU to manage the hashtable and match checking part. Design might be a bit harder, but that's just a short term concern.

The reason it is GPU resistant has to do with bus speeds and the unpredictable runtime of my scrypt-like hash function. I did not discuss this in the white paper which was meant to be more generic. I also use hardware accelerated AES to give the CPU an edge over GPU.

Nathaniel B

newbie

Activity: 19

Merit: 0

But you can still gain linear increases in matches with just increasing the hashrate so at the very least you can just have an ASIC or GPU that hashes and sends the results to a CPU that manages the memory and checks for matches. GPUs will still be crazy good for it since they could hash 100s of times faster than the CPU and with 2-3GB of memory on board, the CPU would need too much memory to pull even. I think a GPU/CPU hybrid would be best in class until ASICs with the GPU doing mostly hashing and then sending the results to system RAM and the CPU which checks for matches.

With this linear increase just by increasing the hashrate the Momentum ASICs will be at least as good as the scrypt and probably much better since they could just extend an off the self system with lots of RAM and use a CPU to manage the hashtable and match checking part. Design might be a bit harder, but that's just a short term concern.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

I have considered the potential of using hardware optimizations and believe that they could be built and with some of the advanced content addressable memory could be very quick.

There are TWO things that are necessary to be optimized via an ASIC.

1) An ASIC that is able to handle Scrypt-like hash functions.
2) A large amount of transistors dedicated to memory.

Optimizing one without the other is pointless.

Both of these things are possible to build and the result would be a mining beast. However, memory uses electricity and the scrypt-like hash functions are much harder to create ASICs for.

So I contend that the economic costs would make the threshold at which investing in an ASIC makes sense to be much higher than what would occur with either Bitcoin or Scrypt.

So the real question becomes what is the computational advantage per additional transistor in Sha256 vs Scrypt vs Momentum and what is the cost of deploying it.

One thing is clear, this thing is likely GPU proof and a much harder ASIC to develop than Scrypt.

Nathaniel B

newbie

Activity: 19

Merit: 0

Once the RAM is full everything is a read so it'll be perfectly parallel with some negligible constant time overhead.

Improvements against scrypt for the birthdayhash will help a bit, but like you said any tricks generally translate fine into ASICs.

For your example, since the mac pro has 16 times more RAM and hashing power it will be exactly 16^2 or 256 times better. Well worth the 50x cost. Since this is quadratic it will just get worse when you get into 128GB servers.

I like the elegance of having the RAM be the main resource for hashing, but I think we have a ways to go before making it practical. I'll keep thinking about it.

Nathaniel

bytemaster

hero member

Activity: 770

Merit: 568

fractally

Nathaniel,
Thank you for your insightful response. Here is what I already know to be true about the algorithm.

1) The faster you can fill the RAM to the max amount, the more collisions per second you can find.
2) Once RAM is full, the performance is equal to SCRYPT with the exception that checking the centralized store is a synchronization point. If a way can be found to require sequential access to the stored birthdays then that would be the holy grail because you couldn't fill the RAM in parallel.
3) In my particular implementation I do not use SCRYPT, but an algorithm based upon the same sequential-memory-hard design. The twist is that I use AES (hardware accelerated on CPU) + 'unpredictable execution time' to trip up GPUs and other parallel architectures. ASIC could probably design around it. I left the white paper more abstract, but my particular implementation is better than Scrypt even at he Birthday Hash level.

The most compelling argument you make is that there is a non-linear gain for building larger systems. Ie: doubling the RAM and doubling the CPU to match would result in more than 2x performance gain. Though I would argue that the cost of building larger systems is also non-linear. Ie: to double the RAM and CPU of a high-end system would cost more than 2x the base system.

Lets compare the range of mining machines:

1) Mac Pro with 12 cores and 32 GB of RAM
2) 1.6 Ghz 4 Core Arm A9 with 2 GB of RAM

I would estimate that the ratio of RAM to CPU is the same in both cases, but the Mac Pro would generate significantly more collisions per second because it has the RAM + ability to fill it in a timely manner. The question becomes which is a better value: 50 ARM systems at $100 each or 1 Mac Pro at $5000+?

Nathaniel B

newbie

Activity: 19

Merit: 0

bytemaster,

First, I think the idea is quite interesting and novel. Let me summarize what you are proposing to make sure we’re on the same page. Then, I will raise my concerns with the current proposal (partly in hopes of collecting a bounty and partly because it’s a really cool idea that could be the grounds for some interesting new cryptocurrencies).

The question is whether the Momentum or scrypt give worse performance increases for specialized hardware design over typical commodity hardware already in the hands of users.

Paraphrase of your section 3 in the paper with some notes:
Take data D (transactions, previous block hash, etc), calculate H = Hash(D).
Find nonce A and nonce B such that BirthdayHash(A+H) == BirthdayHash(B+H) (in practice a nonce C, D, etc could also be required to match)
If Hash(H+A+B) < TargetDifficulty then you win.

My argument against Momentum compared to scrypt:
The way you describe the algorithm and present it in the paper is all in terms of finding pairs of birthdays that match from a large set and the probability of finding a match. While this is the intuition behind the idea of how you can require lots of RAM while still being able to verify a correct answer quickly, it does not show the real goal of a miner. A miner does not want to find one match, but as many matches as possible. This is because a miner wants as many chances at finding a hash under the difficulty as possible. When looking at number of matches, we should talk in terms of expected value.

Note that since Hash() is supposed to be trivially fast and is called so much less than other functions I just assume we have an oracle that gives us a win or loss answer for each match for clarity even though technically each match hash to be Hash(H + A + B) to determine if it’s a winner. For the rest of this explanation I’ll just talk in terms of expected matches.

Let B = the number of ‘birthdays’ (hashspace of BirthdayHash).
Let S = the number of BirthdayHash(nonce+H) stored in RAM.
Let M = maximum number that S can be given a particular system’s RAM.

If we find a new BirthdayHash(A + H), the expected number of matches ( BirthdayHash(A+H)==BirthdayHash(B+H) ) = S / B

Now we calculate the total number of expected matches during a time frame (average block creation time for instance). we take the number of BirthdayHash() we compute in that time (N). S will increase from 0 to N unless M
N * Mean(S) / B = Total expected matches
If M>=N:
Mean(S) = (1 + 2 + 3 + … + N)/N = (N+1)/2
If NMean(S) = ((1 + 2 + 3 + … + M) + M(N-M) )/N = M, if N >> M

If M>=N: Then the system scales the same as scrypt just with the amount of hashes so I make an scrypt ASIC until N>M. At this point they would be the same basically.

If N > M but not by much: Keep deploying scrypt ASICs even though there is a minor hit once RAM is full it doesn’t hurt too bad yet. If the goal is to land in this small sliver, I think that will be very hard to predict where it will be ahead of time and will likely vary drastically between CPU/GPU/ASIC so even hitting it right on for one may miss the others.

If N >> M:
This is when the lack of RAM kills straight scrypt ASICs as a possibility so I assume this is the scenario that Momentum is designed and would be tweaked to get into.
Now the expected value is: N * M / B = Total expected matches

This means that two variables affect the total amount of mining: N the number of BirthdayHash() computed per time and M which is basically the amount of RAM. The fundamental problem I see here is that we now have quadratic scaling with increased resources rather than linear.
For example:
You have a standard system that performs N BirthdayHash() and has M RAM which then expects N*M/B matches. Someone who buys a system that costs five times as much now has 5N*5M/B matches or 25 times as much output while only paying 5 times as much.

With off the shelf hardware there will be a natural limit on RAM perhaps such that the cost increases dramatically as people have to start buying high end servers, but custom ASICs can use standard commodity RAM and while no doubt they would face some scaling and bus issues, even if ASIC makers could achieve a few multiples of the amount of RAM normal users have it would multiply by the amount of hashing power. At the same time, even if RAM isn’t increased past commodity systems, if ASIC makers can use just as much RAM as commodity systems, they can still add the same multiplier as a straight scrypt ASIC would as just increasing the number of hashes per second still scales the number of matches linearly. For this reason I believe that Momentum is worse than scrypt in terms of customized hardware in the long run. In the short term, such hardware would no doubt be harder to design than just scrypt ASICs.

Another interesting piece of the algorithm to look at is the number of nonces that have to have matching birthdays. Scaling this up to 3 or 4+ shouldn’t affect the fundamental hashes per second * RAM scaling (I don’t have time to do the math but each should be so reliant on the pair matching as to be almost identically) although it will affect the storage strategy and how you use the RAM. In fact once the number of birthdays all matching is greater than 2, pools could use this to gain advantage. Consider a pool that in addition to hashing on the same block also share any partial matches (when BirthdayHash(A+H) ==BirthdayHash(B+H) and you need C and D matching as well). Each user of the pool can then improve their odds by saving partial matches and junking some single results they had. In this scenario it makes since to join the largest pool who gives you better partial matches so your expected matches also go up increasing your contribution. This is a dangerous incentive as even with BTC we have seen some large pools without any particular incentive beyond lower variance.

All that said, it is an interesting idea and I’ll keep thinking about ways to utilize/improve it. It’s also quite possible that I’ve misunderstood your algorithm so let me know what you think. These are the kinds of ideas that we need to share and discuss in order to build new innovative cryptocurrencies and improve upon existing ones.

Let me know if anything is unclear. I'm not sure how this board does math notation.

Thanks,

Nathaniel

FreeTrade

legendary

Activity: 1470

Merit: 1030

+1 - This is a very interesting development for CPU coins.

I'll need a while to digest the white paper, but I'm liking the sound of it.

Your parameters sound similar to (the now dead) MemoryCoin - so I wonder if we have similar ideas in mind.
http://memorycoin.org/manifesto/

Some people had expressed doubts that the MemoryCoin hash was GPU resistant so very open to a new algorithm.

It so happens I'm just back from holiday and considering whether to try to restart MemoryCoin or try something new. Not particularly interested in the bounty, but if you wanted to put that towards services and promotion, we might have something.

bytemaster

hero member

Activity: 770

Merit: 568

fractally

In the linked white paper I have documented a new proof-of-work system that can be used for BitMessage, altcoins, or other distributed autonomous corporations. This new proof of work can require gigabytes of memory to solve, but almost no memory to verify and as a result the ideal ASIC is memory. If one were to create a Scrypt ASIC capable of 1 Giga-hash / second then one would require 10 terabytes of RAM to most efficiently find one solution at the target difficulty for the momentum proof-of-work every 10 minutes on average.

I have provided a proof of concept implementation here: https://github.com/InvictusInnovations/BitShares/blob/master/src/momentum.cpp

This 30 BTC bounty is for two different goals.

One goal is to produce an implementation of this proof of work system that negates the algorithmic advantage given to the use of memory and allows much faster solutions. If you are able to convince me this proof-of-work is no better than Scrypt then you will win the 30 btc bounty. The objective proof that someone has convinced me will be whether or not I use momentum for BitShares. I have far more on the line for choosing a bad proof of work than not paying this out.

~~Edit: It has been pointed out to me that this could be implemented in 1 day by a motivated individual, so the bounty for creating the following alt-coin has been reduced to 5 BTC~~
THe 5 BTC bounty from this thread has been supplanted by: https://bitcointalksearch.org/topic/protoshares-bounty-30-btc-start-mining-bitshares-in-november-315973

The other way to win the bounty is to create a fork of the latest Bitcoin codebase that leverages the proof of work implementation in my repository. I will be more than willing to help with any such implementation. The condition on this forked code base is that the resultant 'alt-coin' must not be launched or have any pre-mining and I get to set the release date. It must have 5 minute block intervals, a maximum money supply of 2 million, a 4 year mining period until the full money supply has been mined with a halving of the mining reward every 3 months.

You can read the full white paper for the proof-of-work here: http://invictus-innovations.com/s/MomentumProofOfWork.pdf

This bounty expires November 15th, 2013.

~~Only one 30 BTC bounty will be paid... either to the creator of the alt-coin or to the person who breaks the proof-of-work.~~
~~With the reduction in alt-coin generation bounty, I will pay both bounties if both are won.~~

Topic: Momentum Proof-of-Work Bounty - 30 BTC [PAID] (Read 14110 times)