MemoryCoin 2.0 Proof Of Work - page 3.

reorder

sr. member

Activity: 462

Merit: 250

Quote from: FreeTrade on December 19, 2013, 09:11:46 AM

Quote from: reorder on December 18, 2013, 05:17:39 PM

And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called by multiple threads. For a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, a single on-chip AES module shared by hyperthreads with serialized access. Maybe it has some implications for MemoryCoin as well.

Hmm - seeing hashing improvements linear with number of cores, so think those NI must be part of each core.

Well, here is what I get on two E5-2620 Xeons (6 cores, 12 threads each):

Code:

[root@xxx ~]# openssl speed aes-256-cbc -multi 12
...
aes-256 cbc 382580.34k 517842.22k 521875.46k 525670.06k 527021.40k
[root@xxx ~]# openssl speed aes-256-cbc -multi 24
...
aes-256 cbc 588586.78k 611764.04k 617288.53k 618816.17k 619241.47k

Not linear at all. Hashing also does not scale linearly with the number of threads.
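To put a number on that: assuming the last column of each `openssl speed` row is the largest block size and the figures are in thousands of bytes per second, going from 12 to 24 processes raises throughput by only about 1.17x, where linear scaling would give 2x. Note the box has 12 physical cores, so the second 12 processes land on hyperthreads. A minimal C++ sketch of that arithmetic (the function name is made up for illustration):

```cpp
#include <cassert>

// Measured speedup divided by the ideal (linear) speedup when going from
// workers1 to workers2 parallel workers. 1.0 means perfectly linear scaling.
double scaling_efficiency(double throughput1, int workers1,
                          double throughput2, int workers2) {
    assert(workers1 > 0 && workers2 > 0 && throughput1 > 0.0);
    double actual = throughput2 / throughput1;
    double ideal = static_cast<double>(workers2) / workers1;
    return actual / ideal;
}
```

With the figures above, `scaling_efficiency(527021.40, 12, 619241.47, 24)` comes out at roughly 0.59, i.e. the extra 12 processes add far less than a full core's worth of throughput each.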

FreeTrade

legendary

Activity: 1470

Merit: 1030

Quote from: reorder on December 19, 2013, 09:07:45 AM

A bit of (not so) bad news: by coalescing main RAM access I have sped it up by ~40%, and opened some more vectors for minor optimizations. For now it runs at 5.86 hpm on the 7870.

Okay, thanks. Worth noting our CPU algorithm hasn't been optimised or tuned at all, so we may have some room to catch up. What are you using for the AES encryption?

FreeTrade

Quote from: reorder on December 18, 2013, 05:17:39 PM

And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called by multiple threads. For a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, a single on-chip AES module shared by hyperthreads with serialized access. Maybe it has some implications for MemoryCoin as well.

Hmm - seeing hashing improvements linear with number of cores, so think those NI must be part of each core.

reorder

A bit of (not so) bad news: by coalescing main RAM access I have sped it up by ~40%, and opened some more vectors for minor optimizations. For now it runs at 5.86 hpm on the 7870.

reorder

And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called by multiple threads. For a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, a single on-chip AES module shared by hyperthreads with serialized access. Maybe it has some implications for MemoryCoin as well.

reorder

Quote from: FreeTrade on December 18, 2013, 04:05:46 PM

Quote from: reorder on December 18, 2013, 01:31:28 PM

Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 gigahertz and just above 10 hpm on a 280X. Even with some handcrafted prefetch it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

Thanks so much for sharing. It's really good news for the coin - any plans for your GPU miner?

For now I plan to make an Nvidia build and try some Amazon mining, but no further plans yet.

FreeTrade

Quote from: markj113 on December 18, 2013, 01:48:52 PM

Quote from: NineLives on December 15, 2013, 02:24:04 PM

Is PTS crew behind this?

yes

also no.

FreeTrade

Quote from: reorder on December 18, 2013, 01:31:28 PM

Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 gigahertz and just above 10 hpm on a 280X. Even with some handcrafted prefetch it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

Thanks so much for sharing. It's really good news for the coin - any plans for your GPU miner?

markj113

legendary

Activity: 2254

Merit: 1043

Quote from: NineLives on December 15, 2013, 02:24:04 PM

Is PTS crew behind this?

yes

Sharky444

hero member

Activity: 724

Merit: 500

Quote from: reorder on December 18, 2013, 01:31:28 PM

Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 gigahertz and just above 10 hpm on a 280X. Even with some handcrafted prefetch it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

Good news! This is what we've hoped for!

reorder

Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 gigahertz and just above 10 hpm on a 280X. Even with some handcrafted prefetch it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

AnonyMint

hero member

Activity: 518

Merit: 521

Quote from: Sharky444 on December 15, 2013, 01:11:40 PM

Quote from: AnonyMint on December 15, 2013, 11:04:57 AM

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

P.S. I am guessing 1968 is the year you were born. ;D

There cannot be a home-CPU-only algo forever, no matter what you do, since ASICs are CPUs too

I assure you it can be done.

Quote from: FreeTrade on December 15, 2013, 01:28:43 PM

Quote from: AnonyMint on December 15, 2013, 11:04:57 AM

So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.

I don't see how L2 is even being employed in your algorithm. You are reading 64KB chunks of data from 1 GB, so you are not even in L3. So it appears you are compute bound on AES.

So there's some XORing going on before the AES, hoping the L2 cache gives a bit of an advantage there too.

I don't think so, because it appears you are reading both source chunks from the 1 GB and writing back to the 1 GB. I see you copying to separate buffers, and I don't understand why you do that instead of XORing directly from their original memory locations.
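For illustration only, XORing two chunks straight out of the big buffer could look something like the C++ sketch below. The function name, the byte-offset parameters and the use of `std::vector` are assumptions made up for this example, not what momentum.cpp actually does:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR two chunks read directly from the shared buffer into `out`,
// without first copying either source chunk into a scratch buffer.
// Offsets `a` and `b` and the length `len` are in bytes.
void xor_chunks(const std::vector<std::uint8_t>& mem, std::size_t a,
                std::size_t b, std::size_t len,
                std::vector<std::uint8_t>& out) {
    assert(a + len <= mem.size() && b + len <= mem.size());
    out.resize(len);
    for (std::size_t i = 0; i < len; ++i) {
        out[i] = static_cast<std::uint8_t>(mem[a + i] ^ mem[b + i]);
    }
}
```

Whether this beats the copy-then-XOR version would have to be measured; the copies might exist for alignment or for the AES call's input format.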

Quote from: FreeTrade on December 15, 2013, 01:28:43 PM

Quote from: AnonyMint on December 15, 2013, 11:04:57 AM

I don't know what it will cost to put a fast memory bandwidth interface together with an ASIC. It should be orders-of-magnitude faster and lower power than the CPU.

The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone will figure out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.

Also perhaps someone can put an ASIC (or FPGA) or several Intel Core i5 on a PCIe card, since there is no sequential memory bound in this algorithm.

Note there is a way to make an scrypt-like hash sequential, CPU-only, and fast validating. That was my major breakthrough recently.

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

I'll take that as high praise! Appreciate all your analysis. ASIC design and production is not my area of expertise, but I'm skeptical that an ASIC can be designed and manufactured at a rate that makes mining a viable business against the vast multitude of CPU owners with zero capital costs.

Well, IMHO yes, you've apparently done better than Litecoin. The ASIC can definitely be done if your coin has a high enough market cap, and its power-efficiency advantage looks to be several orders of magnitude, so it should wipe out the CPUs, but by then you will be rich anyway. ;)

Your near-term threat is botnets. They could 51% attack your coin.

NineLives

sr. member

Activity: 252

Merit: 250

Is PTS crew behind this?

FreeTrade

Quote from: AnonyMint on December 15, 2013, 11:04:57 AM

So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.

I don't see how L2 is even being employed in your algorithm. You are reading 64KB chunks of data from 1 GB, so you are not even in L3. So it appears you are compute bound on AES.

So there's some XORing going on before the AES, hoping the L2 cache gives a bit of an advantage there too.

Quote from: AnonyMint on December 15, 2013, 11:04:57 AM

I don't know what it will cost to put a fast memory bandwidth interface together with an ASIC. It should be orders-of-magnitude faster and lower power than the CPU.

The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone will figure out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.

Also perhaps someone can put an ASIC (or FPGA) or several Intel Core i5 on a PCIe card, since there is no sequential memory bound in this algorithm.

Note there is a way to make an scrypt-like hash sequential, CPU-only, and fast validating. That was my major breakthrough recently.

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

I'll take that as high praise! Appreciate all your analysis. ASIC design and production is not my area of expertise, but I'm skeptical that an ASIC can be designed and manufactured at a rate that makes mining a viable business against the vast multitude of CPU owners with zero capital costs.

Sharky444

Quote from: AnonyMint on December 15, 2013, 11:04:57 AM

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

P.S. I am guessing 1968 is the year you were born. ;D

There cannot be a home-CPU-only algo forever, no matter what you do, since ASICs are CPUs too, although not programmable. But this algo is far more difficult to implement cheaply than SHA256 or scrypt, so it buys a lot of time for home-CPU users (and botnets, but since it uses 1 GB of RAM and 90% CPU time, many users whose PCs have been taken over by a botnet should notice that something is wrong).

AnonyMint

Okay so now I understand you are sharing the same 1 GB for all threads and starting their walk from one of the 16,384 chunks in the 1GB. Chunk size is 64 KB.

So the GPU and ASIC will only need 1 GB for up to 16,384 (1<<14) threads. This was one of the criticisms about massive parallelization I made against Momentum for ProtoShares.
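The 16,384 figure is just 1 GB divided by 64 KB. A small C++ sketch of that arithmetic, with constant and function names invented for this example and binary units (2^30 and 2^10) assumed:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kBufferBytes = 1ULL << 30;   // 1 GB shared buffer
constexpr std::uint64_t kChunkBytes  = 64ULL << 10;  // 64 KB per chunk
constexpr std::uint64_t kChunkCount  = kBufferBytes / kChunkBytes;

// Compile-time check of the figure quoted in the post.
static_assert(kChunkCount == (1ULL << 14), "1 GB / 64 KB should be 16,384");

// Byte offset of chunk `index` inside the shared buffer.
std::uint64_t chunk_offset(std::uint64_t index) {
    assert(index < kChunkCount);
    return index * kChunkBytes;
}
```

So one 1 GB buffer offers at most 1<<14 distinct chunk-aligned starting points, however many threads share it.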

So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.

I don't see how L2 is even being employed in your algorithm. You are reading 64KB chunks of data from 1 GB, so you are not even in L3. So it appears you are compute bound on AES.

The main memory bandwidth is about 20 GB/s on a desktop-grade Intel Core CPU, so at 3 GB/s your algorithm is clearly AES compute bound, not memory bound.

I don't know what it will cost to put a fast memory bandwidth interface together with an ASIC. It should be orders-of-magnitude faster and lower power than the CPU.

The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone will figure out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.

Also perhaps someone can put an ASIC (or FPGA) or several Intel Core i5 on a PCIe card, since there is no sequential memory bound in this algorithm.

Note there is a way to make an scrypt-like hash sequential, CPU-only, and fast validating. That was my major breakthrough recently.

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

P.S. I am guessing 1968 is the year you were born. ;D

FreeTrade

Quote from: AnonyMint on December 15, 2013, 10:39:32 AM

Does that mean each thread searches a different section of the pseudo-random 1GB? But wouldn't that mean the result found could vary depending on the number of threads? Since the first value of 1968 found is taken as the solution.

I am thinking that is a design bug. Or perhaps I just don't understand the algorithm employed yet.

P.S. you typo-ed pseudo as 'psuedo' in the code.

There are 16,000 different starting points - each thread takes a section to search. But there are 50 steps from each starting point, and each step can range over the whole 1GB, so every thread needs to have random access to the whole 1GB.

Every 1968 found is a solution, and can create a different SHA256 result. On average there should be 1 per 1GB data, but there might be 0 or 2 or more.
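As a rough illustration of why every thread needs the whole 1GB, here is a C++ sketch of a 50-step walk from one starting point. The step function is a HYPOTHETICAL stand-in (a simple 64-bit LCG); the real one is only described in this thread as involving XOR and AES, and all names here are invented:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real step function: derives the next
// chunk index from the current one. Any index in [0, chunk_count) can
// come out, which is the property that matters for this sketch.
std::uint32_t next_chunk(std::uint32_t current, std::uint32_t chunk_count) {
    std::uint64_t x = current * 6364136223846793005ULL + 1442695040888963407ULL;
    return static_cast<std::uint32_t>((x >> 33) % chunk_count);
}

// Start at `start` and take `steps` steps; returns every chunk index
// visited, including the starting one.
std::vector<std::uint32_t> walk(std::uint32_t start, std::uint32_t chunk_count,
                                int steps) {
    assert(chunk_count > 0 && start < chunk_count && steps >= 0);
    std::vector<std::uint32_t> visited{start};
    for (int i = 0; i < steps; ++i) {
        visited.push_back(next_chunk(visited.back(), chunk_count));
    }
    return visited;
}
```

A thread that owns starting points 0..N still ends up touching chunks far outside that range after the first step, so the starting points can be partitioned but the buffer cannot.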

AnonyMint

I don't understand this.

https://github.com/memorycoin/memorycoin/blob/psforkinit/src/momentum.cpp#L71

Code:

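// Number of starting points assigned to each thread (integer division).
int searchNumber=comparisonSize/totalThreads;
// First starting point assigned to this thread.
int startLoc=threadNumber*searchNumber;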

Does that mean each thread searches a different section of the pseudo-random 1GB? But wouldn't that mean the result found could vary depending on the number of threads? Since the first value of 1968 found is taken as the solution.

I am thinking that is a design bug. Or perhaps I just don't understand the algorithm employed yet.

P.S. you typo-ed pseudo as 'psuedo' in the code.
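To see what those two lines do for different thread counts, here is a small self-contained C++ sketch. It assumes each thread covers exactly `searchNumber` entries starting at `startLoc`, which the snippet suggests but which is not confirmed here, and the `Range` struct and function names are made up:

```cpp
#include <cassert>

struct Range {
    int start;  // first starting point owned by the thread
    int count;  // number of starting points owned by the thread
};

// Same arithmetic as the two quoted lines, wrapped in a function.
Range thread_range(int comparisonSize, int totalThreads, int threadNumber) {
    assert(totalThreads > 0 && threadNumber >= 0 && threadNumber < totalThreads);
    int searchNumber = comparisonSize / totalThreads;
    int startLoc = threadNumber * searchNumber;
    return {startLoc, searchNumber};
}

// Trailing starting points that no thread is assigned, because integer
// division drops the remainder.
int uncovered(int comparisonSize, int totalThreads) {
    assert(totalThreads > 0);
    return comparisonSize - (comparisonSize / totalThreads) * totalThreads;
}
```

Under that assumption, with 16,384 starting points and 12 threads each thread gets 1,365 of them and the last 4 are left out, while 8 threads cover all of them. So the set of points searched can differ slightly with thread count, but only by that remainder.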

AnonyMint

I see 150 Kgates and 2 milliWatts per GBps (not Gbps) of throughput with ASICs:

http://www.martes-itea.org/public/papers/Hamalainen-Design_and_Implementation_2.pdf#page=6

ASICs have fast caches.

That should be very inexpensive to produce and obliterate the AES-NI both on performance and hashes per watt. Perhaps the memory bandwidth of the cache becomes the limiting factor.

I don't know about the DRAM memory controller.
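For a feel of the scale, a back-of-the-envelope C++ sketch, assuming the 2 mW per GBps figure scales linearly with throughput (an assumption, not something the paper is quoted as saying here) and covering the AES core only, not DRAM or its controller:

```cpp
#include <cassert>

// Estimated AES core power in milliwatts for a given throughput in GB/s.
// The default of 2.0 mW per GB/s is the figure quoted in the post;
// linear scaling is assumed for illustration.
double aes_asic_power_mw(double gbytes_per_second,
                         double mw_per_gbyte_per_second = 2.0) {
    assert(gbytes_per_second >= 0.0 && mw_per_gbyte_per_second >= 0.0);
    return gbytes_per_second * mw_per_gbyte_per_second;
}
```

At the 3 GB/s AES rate mentioned elsewhere in this thread that is about 6 mW, and even at the 20 GB/s main-memory bandwidth figure it is only about 40 mW.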

AnonyMint

3X would be a significant improvement over Litecoin's 15X w.r.t. GPUs and an order of magnitude better than the 30X for Bitcoin:

https://bitsharestalk.org/index.php?topic=22.msg2663#msg2663

However, both Litecoin and Bitcoin will be ASIC-dominated, so it is irrelevant except that both got their start by being CPU-mined, then GPU-mined.

It appears MemoryCoin will be CPU then ASIC and mostly skip the GPU stage (due to the GPU being less energy efficient than Intel's AES-NI even though slightly faster). This conclusion hinges on gkrypt being fully optimized for GPUs.
