Topic: MemoryCoin 2.0 Proof Of Work - page 3. (Read 21430 times)

sr. member
Activity: 462
Merit: 250
December 19, 2013, 09:24:37 AM
#82
And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called by multiple threads. For a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, that a single on-chip AES module is shared by the hyperthreads with serialized access. Maybe it has some implications for MemoryCoin as well.

Hmm - I'm seeing hashing improvements that are linear with the number of cores, so I think the AES-NI units must be part of each core.

Well, here is what I get on two E5-2620 Xeons (6 cores, 12 threads each):
Code:
[root@xxx ~]# openssl speed aes-256-cbc -multi 12
...
aes-256 cbc     382580.34k   517842.22k   521875.46k   525670.06k   527021.40k
[root@xxx ~]# openssl speed aes-256-cbc -multi 24
...
aes-256 cbc     588586.78k   611764.04k   617288.53k   618816.17k   619241.47k

Not linear at all. Hashing also does not scale linearly with the number of threads.
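
To put a number on "not linear", here is a minimal sketch that compares the last column of the two runs above. The helper names `speedup` and `scaling_efficiency` are illustrative only and are not part of openssl; the two constants are copied from the output quoted in this post.

```cpp
// Sketch only: helper names are illustrative, not part of openssl.

// Aggregate throughput gain when going from a base run to a larger run.
constexpr double speedup(double base_kbytes, double scaled_kbytes) {
    return scaled_kbytes / base_kbytes;
}

// Fraction of ideal linear scaling actually achieved (1.0 = perfectly linear).
constexpr double scaling_efficiency(double base_kbytes, int base_procs,
                                    double scaled_kbytes, int scaled_procs) {
    return speedup(base_kbytes, scaled_kbytes) /
           (static_cast<double>(scaled_procs) / static_cast<double>(base_procs));
}

// Last column of the two `openssl speed aes-256-cbc` runs quoted above.
constexpr double kMulti12 = 527021.40;  // -multi 12
constexpr double kMulti24 = 619241.47;  // -multi 24
```

With these inputs, `speedup(kMulti12, kMulti24)` is about 1.17 instead of the 2.0 that linear scaling would give, so `scaling_efficiency(kMulti12, 12, kMulti24, 24)` comes out at roughly 59%.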
legendary
Activity: 1470
Merit: 1030
December 19, 2013, 09:14:33 AM
#81
A bit of (not so) bad news: by coalescing main RAM access I have sped it up by ~40%, and opened up some more avenues for minor optimizations. For now it runs at 5.86 hpm on a 7870.

Okay, thanks. Worth noting our CPU algorithm hasn't been optimised or tuned at all, so we may have some room to catch up. What are you using for the AES encryption?
legendary
Activity: 1470
Merit: 1030
December 19, 2013, 09:11:46 AM
#80
And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called by multiple threads. For a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, that a single on-chip AES module is shared by the hyperthreads with serialized access. Maybe it has some implications for MemoryCoin as well.

Hmm - I'm seeing hashing improvements that are linear with the number of cores, so I think the AES-NI units must be part of each core.
sr. member
Activity: 462
Merit: 250
December 19, 2013, 09:07:45 AM
#79
A bit of (not so) bad news: by coalescing main RAM access I have sped it up by ~40%, and opened up some more avenues for minor optimizations. For now it runs at 5.86 hpm on a 7870.
sr. member
Activity: 462
Merit: 250
December 18, 2013, 05:17:39 PM
#78
And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called by multiple threads. For a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, that a single on-chip AES module is shared by the hyperthreads with serialized access. Maybe it has some implications for MemoryCoin as well.
sr. member
Activity: 462
Merit: 250
December 18, 2013, 04:54:21 PM
#77
Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 GHz Edition and just above 10 hpm on a 280X. Even with some handcrafted prefetching it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

Thanks so much for sharing. It's really good news for the coin - any plans for your GPU miner?

For now I plan to make an Nvidia build and try some mining on Amazon, but I have no further plans yet.
legendary
Activity: 1470
Merit: 1030
December 18, 2013, 04:05:46 PM
#75
Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 GHz Edition and just above 10 hpm on a 280X. Even with some handcrafted prefetching it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

Thanks so much for sharing. It's really good news for the coin - any plans for your GPU miner?
legendary
Activity: 2254
Merit: 1043
December 18, 2013, 01:48:52 PM
#74
hero member
Activity: 724
Merit: 500
December 18, 2013, 01:35:34 PM
#73
Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 GHz Edition and just above 10 hpm on a 280X. Even with some handcrafted prefetching it is still heavily RAM latency-bound. I believe there is not much room for optimization left.

Good news! This is what we've hoped for!
sr. member
Activity: 462
Merit: 250
December 18, 2013, 01:31:28 PM
#72
Guys, I have implemented a GPU miner for the coin and have some numbers to share. So far it yields 4 hpm on a 7870 GHz Edition and just above 10 hpm on a 280X. Even with some handcrafted prefetching it is still heavily RAM latency-bound. I believe there is not much room for optimization left.
hero member
Activity: 518
Merit: 521
December 16, 2013, 01:02:26 AM
#71
It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

P.S. I am guessing 1968 is the year you were born. :D

There cannot be a home-CPU only algo for ever, no matter what you do, since ASICs are CPUs too

I assure you it can be done.

So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.

I don't see how L2 is even being employed in your algorithm. You are reading 64KB chunks of data from 1 GB, so you are not even in L3. So it appears you are compute bound on AES.

So there's some XORing going on before the AES, hoping the L2 cache gives a bit of an advantage there too.

I don't think so, because it appears you are reading both source chunks from the 1 GB and writing back to the 1 GB. I see you copying to separate buffers, and I don't understand why you do that instead of XORing directly from their original memory locations.

I don't know what it will cost to put a fast memory bandwidth interface together with an ASIC. It should be orders-of-magnitude faster and lower power than the CPU.

The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone will figure out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.

Also perhaps someone can put an ASIC (or FPGA) or several Intel Core i5 on a PCIe card, since there is no sequential memory bound in this algorithm.

Note there is a way to make an scrypt-like hash sequential, CPU-only, and fast validating. That was my major breakthrough recently.

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

I'll take that as high praise! Appreciate all your analysis. ASIC design and production is not my area of expertise, but I'm skeptical that an ASIC can be designed and manufactured at a rate that makes mining a viable business against the vast multitude of CPU owners with zero capital costs.

Well, IMHO yes, you've apparently done better than Litecoin. The ASIC can definitely be done if your coin has a high enough market cap, and its power-efficiency advantage looks to be several orders of magnitude, so it should wipe out the CPUs. But by then you will be rich anyway. ;)

Your near-term threat is botnets. They could 51% attack your coin.
sr. member
Activity: 252
Merit: 250
December 15, 2013, 02:24:04 PM
#70
Is PTS crew behind this?
legendary
Activity: 1470
Merit: 1030
December 15, 2013, 01:28:43 PM
#69
So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.

I don't see how L2 is even being employed in your algorithm. You are reading 64KB chunks of data from 1 GB, so you are not even in L3. So it appears you are compute bound on AES.

So there's some XORing going on before the AES, hoping the L2 cache gives a bit of an advantage there too.

I don't know what it will cost to put a fast memory bandwidth interface together with an ASIC. It should be orders-of-magnitude faster and lower power than the CPU.

The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone will figure out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.

Also perhaps someone can put an ASIC (or FPGA) or several Intel Core i5 on a PCIe card, since there is no sequential memory bound in this algorithm.

Note there is a way to make an scrypt-like hash sequential, CPU-only, and fast validating. That was my major breakthrough recently.

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

I'll take that as high praise! Appreciate all your analysis. ASIC design and production is not my area of expertise, but I'm skeptical that an ASIC can be designed and manufactured at a rate that makes mining a viable business against the vast multitude of CPU owners with zero capital costs.
hero member
Activity: 724
Merit: 500
December 15, 2013, 01:11:40 PM
#68
It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

P.S. I am guessing 1968 is the year you were born. :D

There cannot be a home-CPU-only algo forever, no matter what you do, since ASICs are CPUs too, although not programmable. But this algo is way more difficult to implement cheaply compared to SHA256 or scrypt. So it buys a lot of time for home-CPU users (and botnets, but since it uses 1 GB of RAM and 90% CPU time, many users whose PCs have been taken over by a botnet should notice that something is wrong).
hero member
Activity: 518
Merit: 521
December 15, 2013, 11:04:57 AM
#67
Okay so now I understand you are sharing the same 1 GB for all threads and starting their walk from one of the 16,384 chunks in the 1GB. Chunk size is 64 KB.

So the GPU and ASIC will only need 1 GB for up to 16,384 (1<<14) threads. This was one of the criticisms about massive parallelization I made against Momentum for ProtoShares.

So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.

I don't see how L2 is even being employed in your algorithm. You are reading 64KB chunks of data from 1 GB, so you are not even in L3. So it appears you are compute bound on AES.

The main memory bandwidth on the CPU is 20 GB/s on desktop grade Intel Core. So clearly your algorithm is AES compute bound at 3 GB/s.
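
The arithmetic behind those figures can be sketched as follows. This assumes "1 GB" and "64 KB" are meant as powers of two (which is what the 1<<14 above implies) and takes the 20 GB/s and 3 GB/s numbers straight from this post; all identifiers are hypothetical and not taken from the MemoryCoin source.

```cpp
#include <cstdint>

// Sizes as described in this post, read as binary units (assumption).
constexpr std::uint64_t kScratchpadBytes = 1ULL << 30;  // 1 GB shared by all threads
constexpr std::uint64_t kChunkBytes      = 64ULL << 10; // 64 KB per chunk

// Number of chunks, i.e. the number of possible starting points.
constexpr std::uint64_t chunk_count(std::uint64_t total_bytes,
                                    std::uint64_t chunk_bytes) {
    return total_bytes / chunk_bytes;
}

// How many times more data the memory bus could deliver than AES consumes.
// A value well above 1 suggests the loop is compute bound, not memory bound.
constexpr double bandwidth_headroom(double memory_gb_per_s, double aes_gb_per_s) {
    return memory_gb_per_s / aes_gb_per_s;
}
```

`chunk_count(kScratchpadBytes, kChunkBytes)` gives 16,384, and `bandwidth_headroom(20.0, 3.0)` is a little under 7, which is the gap that leads to the "AES compute bound" conclusion.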

I don't know what it will cost to put a fast memory bandwidth interface together with an ASIC. It should be orders-of-magnitude faster and lower power than the CPU.

The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone will figure out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.

Also perhaps someone can put an ASIC (or FPGA) or several Intel Core i5 on a PCIe card, since there is no sequential memory bound in this algorithm.

Note there is a way to make an scrypt-like hash sequential, CPU-only, and fast validating. That was my major breakthrough recently.

It appears you bought some time against GPUs and ASICs, but as far as I see you don't have a CPU-only coin forever into the future.

P.S. I am guessing 1968 is the year you were born. :D
legendary
Activity: 1470
Merit: 1030
December 15, 2013, 10:45:25 AM
#66
Does that mean each thread searches a different section of the pseudo-random 1GB? But wouldn't that mean the result found could vary depending on the number of threads? Since the first value of 1968 found is taken as the solution.

I am thinking that is a design bug. Or perhaps I just don't understand the algorithm employed yet.


P.S. you typo-ed pseudo as 'psuedo' in the code.

There are 16,000 different starting points - each thread takes a section to search. But there are 50 steps from each starting point, and each step can range over the whole 1GB, so every thread needs to have random access to the whole 1GB.

Every 1968 found is a solution, and can create a different SHA256 result. On average there should be 1 per 1GB data, but there might be 0 or 2 or more.
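
A minimal sketch of how the starting points could be split between threads, modelled on the `searchNumber` / `startLoc` lines from momentum.cpp quoted in #65. It assumes `comparisonSize` is the total number of starting points and that each thread walks the half-open range beginning at its `startLoc`. `Range`, `thread_range` and `covered` are hypothetical names, and how the real miner treats a remainder is not visible from those two lines.

```cpp
// Half-open range [start, end) of starting points searched by one thread.
struct Range {
    int start;
    int end;
};

// Mirrors the quoted lines: searchNumber = comparisonSize / totalThreads,
// startLoc = threadNumber * searchNumber. The end bound is an assumption.
constexpr Range thread_range(int comparisonSize, int totalThreads, int threadNumber) {
    return Range{ threadNumber * (comparisonSize / totalThreads),
                  (threadNumber + 1) * (comparisonSize / totalThreads) };
}

// Number of starting points covered by all threads together in this sketch.
// Integer division can leave a few points unassigned when the thread count
// does not divide comparisonSize evenly.
constexpr int covered(int comparisonSize, int totalThreads) {
    return (comparisonSize / totalThreads) * totalThreads;
}
```

With 16,384 starting points and 4 threads, thread 1 gets [4096, 8192) and all points are covered; with 3 threads the sketch covers 16,383 of them. Since every 1968 found counts as a solution, the thread count changes which thread finds a given solution, not whether that solution is valid.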
hero member
Activity: 518
Merit: 521
December 15, 2013, 10:39:32 AM
#65
I don't understand this.

https://github.com/memorycoin/memorycoin/blob/psforkinit/src/momentum.cpp#L71

Code:
                int searchNumber=comparisonSize/totalThreads;
                int startLoc=threadNumber*searchNumber;

Does that mean each thread searches a different section of the pseudo-random 1GB? But wouldn't that mean the result found could vary depending on the number of threads? Since the first value of 1968 found is taken as the solution.

I am thinking that is a design bug. Or perhaps I just don't understand the algorithm employed yet.


P.S. you typo-ed pseudo as 'psuedo' in the code.
hero member
Activity: 518
Merit: 521
December 15, 2013, 09:35:46 AM
#64
I see 150 Kgates and 2 milliWatts per GBps (not Gbps) of throughput with ASICs:

http://www.martes-itea.org/public/papers/Hamalainen-Design_and_Implementation_2.pdf#page=6

ASICs have fast caches.

That should be very inexpensive to produce and obliterate the AES-NI both on performance and hashes per watt. Perhaps the memory bandwidth of the cache becomes the limiting factor.

I don't know about the DRAM memory controller.
hero member
Activity: 518
Merit: 521
December 15, 2013, 09:01:44 AM
#63
3X would be a significant improvement over Litecoin's 15X w.r.t. GPUs and an order of magnitude better than the 30X for Bitcoin:

https://bitsharestalk.org/index.php?topic=22.msg2663#msg2663

However, both Litecoin and Bitcoin will be ASIC-dominated, so it is irrelevant, except that both got their start by being CPU-mined, then GPU-mined.

It appears MemoryCoin will be CPU, then ASIC, and will mostly skip the GPU stage (due to the GPU being less energy efficient than Intel's AES-NI even though it is slightly faster). This conclusion hinges on gkrypt being fully optimized for GPUs.