I didn't see this until just now.
3. Use the last 32 bits % 2^14 as the solution. If the solution == 1968, the block is solved.
How is the difficulty adjusted if the solution is a fixed value rather than a variable target that the hash output must be below?
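For illustration, here is a minimal sketch of the distinction being asked about. The fixed-value check is taken from step 3 as quoted above; the threshold check is the conventional hash-below-target approach and is not something proposed in this thread:

```python
# Minimal sketch of the two kinds of success checks (illustrative only).
# With a fixed solution value, the per-block success probability is pinned
# at 1/2**14 and cannot be tuned; with a target threshold, lowering the
# target makes solutions rarer, which is how difficulty is usually adjusted.

def fixed_value_check(last_32_bits: int) -> bool:
    # As in the quoted step 3: always a 1-in-2**14 chance per block.
    return last_32_bits % 2**14 == 1968

def threshold_check(last_32_bits: int, target: int) -> bool:
    # Conventional variable difficulty: success probability is target/2**14.
    return last_32_bits % 2**14 < target
```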
Ok, here's the latest modification to the PoW.
1. Generate 1 GB of pseudorandom data using SHA512
2. For each 64 KB block, repeat 50 times:
2.1 Use the last 32 bits as a pointer to another 64 KB block
2.2 XOR the two 64 KB blocks together
2.3 AES-CBC encrypt the result using the last 256 bits as the key
3. Use the last 32 bits % 2^14 as the solution. If the solution == 1968, the block is solved.
Expect 1 solution per set.
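A rough, unoptimized sketch of those steps, under some assumptions the description leaves open (the data set is expanded from a seed with SHA512 in counter mode, the 32-bit pointer is taken modulo the number of 64 KB blocks, byte order is little-endian, AES-CBC uses a zero IV, and the third-party pycryptodome package provides AES; the data-set size is scaled down so the demo runs quickly):

```python
# Sketch of the PoW steps above; several details are assumptions, not part of
# the original description (counter-mode SHA512 expansion, pointer taken mod
# the block count, little-endian byte order, zero IV for AES-CBC).
import hashlib
from Crypto.Cipher import AES  # pip install pycryptodome

BLOCK = 64 * 1024               # 64 KB working block
DATA_SIZE = 16 * 1024 * 1024    # demo size; the proposal calls for 1 GB
ROUNDS = 50
SOLUTION = 1968

def generate_data(seed: bytes, size: int) -> bytes:
    """Step 1: expand a seed into `size` bytes of SHA512-based pseudorandom data."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha512(seed + counter.to_bytes(8, "little")).digest()
        counter += 1
    return bytes(out[:size])

def mix_block(data: bytes, i: int) -> bytes:
    """Step 2: 50 rounds of pointer lookup, XOR, and AES-CBC over one 64 KB block."""
    n_blocks = len(data) // BLOCK
    block = data[i * BLOCK:(i + 1) * BLOCK]
    for _ in range(ROUNDS):
        # 2.1 Use the last 32 bits as a pointer to another 64 KB block
        ptr = int.from_bytes(block[-4:], "little") % n_blocks
        other = data[ptr * BLOCK:(ptr + 1) * BLOCK]
        # 2.2 XOR the two 64 KB blocks together
        mixed = int.from_bytes(block, "little") ^ int.from_bytes(other, "little")
        block = mixed.to_bytes(BLOCK, "little")
        # 2.3 AES-CBC encrypt the result using the last 256 bits as the key
        key = block[-32:]
        block = AES.new(key, AES.MODE_CBC, iv=b"\x00" * 16).encrypt(block)
    return block

def solved(block: bytes) -> bool:
    """Step 3: last 32 bits mod 2^14 must equal the fixed solution value."""
    return int.from_bytes(block[-4:], "little") % 2**14 == SOLUTION

if __name__ == "__main__":
    data = generate_data(b"example block header", DATA_SIZE)
    hits = sum(solved(mix_block(data, i)) for i in range(DATA_SIZE // BLOCK))
    # With the full 1 GB (16384 blocks), roughly one solution per set is expected.
    print("solutions in this set:", hits)
```

Note that each round touches the current 64 KB block plus the pointed-to 64 KB block, i.e. about 128 KB of working set per round, which is the footprint the L2-cache point below refers to.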
This will offer a good level of GPU resistance, for the following reasons -
1. Complexity - requires both SHA512 hashing and AES-CBC encryption
2. CPU instruction sets - many CPUs have dedicated AES instructions (AES-NI); GPUs don't
3. SHA512 - more efficient with 64-bit operations, but most GPUs are 32-bit
4. Repeated AES encryption and XORing will keep the L2 cache busy; GPUs will be forced to use slower memory for massive parallelization.
I think more efficient GPU miners will be possible, but they should be delayed and not offer performance gains of more than 2X or 3X.
Even Haswell has roughly 10X fewer FLOPS than top-of-the-line GPUs. Memory latency will be masked down to nearly 0 if the GPU can run enough parallel copies. Your algorithm is also going to be hobbled relative to the GPU by the CPU's roughly 10X slower memory bandwidth (around 20 GB/s) when writing out the initial 1 GB.
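As a back-of-envelope check on the bandwidth point (the 20 GB/s and 200 GB/s figures are illustrative assumptions; only the roughly 10X ratio matters):

```python
# Rough arithmetic behind the bandwidth argument; the absolute figures are
# assumptions, only the ~10X CPU-vs-GPU ratio is the point being made.
GB = 1e9
cpu_bw = 20 * GB    # assumed desktop DRAM write bandwidth, bytes/s
gpu_bw = 200 * GB   # assumed high-end GPU memory bandwidth, bytes/s
data_set = 1 * GB   # the initial 1 GB pseudorandom data set

print(f"CPU writes the 1 GB set in ~{data_set / cpu_bw * 1e3:.0f} ms")  # ~50 ms
print(f"GPU writes the 1 GB set in ~{data_set / gpu_bw * 1e3:.0f} ms")  # ~5 ms
```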
The use of dedicated AES instructions on CPUs might help a little, but I doubt it's enough to stop the GPU from being 10X faster, and ASICs could implement AES to run faster still.
The latency on L2 is still several cycles, while the latency on the GPU will asymptotically go to 0, although you do have that 1 GB requirement to limit the number of copies the GPU can run, i.e. 6 parallel copies on a 6 GB GPU (which might still be sufficient to eliminate most latency).
You might get the expected result of only a 3X GPU speedup, but I am not confident of that.
How does this algorithm validate faster and with less memory? I think I know, but I don't want to say. I want to know what you came up with.
On a rough guesstimate (note I am quite sleepy at the moment), I can't see that you've accomplished anything Litecoin did not already, except the extra DRAM requirement, i.e. it is probably still 10X faster on GPU and vulnerable to ASICs (running with DRAM). Or did I miss something?
Note Litecoin ASICs are apparently in development now.