[XPM] [ANN] Primecoin High Performance | HP14 released! - page 51.

1l1l11ll1l

legendary

Activity: 1274

Merit: 1000

Quote from: Trillium on August 06, 2013, 04:51:18 AM

Quote from: 1l1l11ll1l on August 05, 2013, 07:03:42 PM

Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
"chainspermin" : 29,
   "chainsperday" : 1.67533939,
   "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
"chainspermin" : 12,
   "chainsperday" : .71721642,
   "primespersec" : 7039,

From PassMark and the opteron wiki page:
http://www.cpubenchmark.net/cpu_list.php
http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors

Dual CPU, 12-core opteron 6164HE's PassMark CPU result: 5351/ea, 5351*2 = 10702 ||| Cache arrangement; L2: 12x 512 KB L3: 2x 6 MB
[Dual CPU] AMD Opteron 6274 PassMark CPU result: 10809 (inclusive of both) ||| Cache arrangement; L2: 8x 2MB L3: 2x 8 MB

If I had to guess, the dual cpu, 16 core setup (6274's) is slower because it shares one unit of L2 cache between two cores. The HE's have dedicated L2 for every core.

Despite the disappointing(?) performance, those are still all nice systems and I would mine on them any day.

Looking at the specs on AMD's site, it shows the L2 cache of the 6274 at 1MBx16

http://products.amd.com/en-us/OpteronCPUDetail.aspx?id=760&f1=AMD+Opteron%E2%84%A2+6200+Series+Processor&f2=&f3=Yes&f4=&f5=&f6=G34&f7=B2&f8=32nm&f9=&f10=6400&f11=&
http://products.amd.com/en-us/OpteronCPUDetail.aspx?id=649

If that were the case then the only thing left would be the L1?

wibtc

newbie

Activity: 32

Merit: 0

Quote from: Trillium on August 06, 2013, 09:54:25 AM

Quote from: wibtc on August 06, 2013, 08:51:35 AM

Is an AMD CPU better for mining Primecoin?

Quote from: 1l1l11ll1l on August 05, 2013, 07:03:42 PM

Also, in case anyone is curious

///
All systems running 64-bit HP9

No. That's not what he is saying/asking. 1l1l11ll1l has several relatively nice servers mining, and was wondering why the 24-core server seems to outperform the one with a total of 32 cores. Refer to my response above for one possible reason.

An AMD CPU is faster if you compare one Opteron 6274 or Opteron 6164HE with one Intel CPU such as the i7-2600k or Xeon L5520...

roy7

sr. member

Activity: 434

Merit: 250

Quote from: Trillium on August 06, 2013, 09:54:25 AM

No. That's not what he is saying/asking. 1l1l11ll1l has several relatively nice servers mining, and was wondering why the 24-core server seems to outperform the one with a total of 32 cores. Refer to my response above for one possible reason.

I think the faster cpu also has larger L1 cache. Don't know if primecoin mining sits in L1 cache much though vs L2.

Edit: By faster I meant in chainsperday, not clock speed. It surprised me that fewer cores and a lower clock would do more work. The only thing jumping out at me was the difference in L1 size.

Trillium

hero member

Activity: 546

Merit: 500

Quote from: wibtc on August 06, 2013, 08:51:35 AM

Is an AMD CPU better for mining Primecoin?

Quote from: 1l1l11ll1l on August 05, 2013, 07:03:42 PM

Also, in case anyone is curious

///
All systems running 64-bit HP9

No. That's not what he is saying/asking. 1l1l11ll1l has several relatively nice servers mining, and was wondering why the 24-core server seems to outperform the one with a total of 32 cores. Refer to my response above for one possible reason.

wibtc

newbie

Activity: 32

Merit: 0

Is an AMD CPU better for mining Primecoin?

Quote from: 1l1l11ll1l on August 05, 2013, 07:03:42 PM

Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
"chainspermin" : 29,
   "chainsperday" : 1.67533939,
   "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
"chainspermin" : 12,
   "chainsperday" : .71721642,
   "primespersec" : 7039,

4-core i7-2600k 3.4GHz:
"chainspermin" : 8,
"chainsperday" : 0.57826364,
"primespersec" : 3170,

8-core L5520 2.26GHz:
"chainspermin" : 8,
   "chainsperday" : 0.73978522,
   "primespersec" : 3628,

8-core L5420 2.5GHz:
"chainspermin" : 14,
   "chainsperday" : 0.96020906,
   "primespersec" : 3490,

8-core X5355 2.66GHz:
"chainspermin" : 15,
   "chainsperday" : 1.00721642,
   "primespersec" : 3670,

4-core Xeon 5160 3.0GHz:
"chainspermin" : 7,
   "chainsperday" : 0.50713449,
   "primespersec" : 1859,

4-core Xeon 5130 2.0GHz
"chainspermin" : 6,
   "chainsperday" : 0.34404084,
   "primespersec" : 1267,

Core 2 Duo 6300 1.86GHz:
"chainspermin" : 3,
   "chainsperday" : 0.15991434,
   "primespersec" : 587,

All systems running 64-bit HP9

Trillium

hero member

Activity: 546

Merit: 500

Quote from: 1l1l11ll1l on August 05, 2013, 07:03:42 PM

Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
"chainspermin" : 29,
   "chainsperday" : 1.67533939,
   "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
"chainspermin" : 12,
   "chainsperday" : .71721642,
   "primespersec" : 7039,

From PassMark and the opteron wiki page:
http://www.cpubenchmark.net/cpu_list.php
http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors

Dual CPU, 12-core opteron 6164HE's PassMark CPU result: 5351/ea, 5351*2 = 10702 ||| Cache arrangement; L2: 12x 512 KB L3: 2x 6 MB
[Dual CPU] AMD Opteron 6274 PassMark CPU result: 10809 (inclusive of both) ||| Cache arrangement; L2: 8x 2MB L3: 2x 8 MB

If I had to guess, the dual cpu, 16 core setup (6274's) is slower because it shares one unit of L2 cache between two cores. The HE's have dedicated L2 for every core.

Despite the disappointing(?) performance, those are still all nice systems and I would mine on them any day.

paulthetafy

hero member

Activity: 820

Merit: 1000

Quote from: mikaelh on August 06, 2013, 03:40:09 AM

Quote from: Dsfyu on August 05, 2013, 08:41:14 PM

Quote from: mikaelh on August 05, 2013, 05:56:34 PM

Quote from: ivanlabrie on August 05, 2013, 04:20:19 PM

I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

I see something similar to this on my end right now but not that bad - On my 3930k I can set genproclimit to 6 and I get 2517 pps and 1.2 cpd and with genproclimit set to 12 I get 2900 pps and 1.4 cpd. Something seems wrong with this right now. I also set genproclimit to 1 and I'm getting about 450 pps/ 0.23 cpd. If the performance scaled linearly I would be getting ~5kpps/2.7 or 2.8 cpd. Yes, I know that I should never expect anything like this but it seems like the performance scales linearly up until hyperthreading is involved and then it steeply drops off.

Edit: I just tried a few values between 6 and 12 and I'm getting at most a 100 pps increase in performance from one to another, and in some cases no significant increase whatsoever (going from 9 to 10 increased from 2784 to 2832).

That seems pretty much normal to me. The idea behind hyper threading is that the CPU core switches threads when one thread is blocked waiting for memory. My code is nearly always hitting the L1 or L2 caches which keeps the CPU core busy at all times even with one thread. So in theory you only need enough threads to keep all the physical cores busy.

Thanks for the clarification, mikael. To summarise then, you will see very little benefit from hyperthreaded / virtual CPUs, as most time will be spent utilizing the physical cores only. VPS miners TAKE NOTE!

mikaelh

sr. member

Activity: 301

Merit: 250

Quote from: Dsfyu on August 05, 2013, 08:41:14 PM

Quote from: mikaelh on August 05, 2013, 05:56:34 PM

Quote from: ivanlabrie on August 05, 2013, 04:20:19 PM

I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

I see something similar to this on my end right now but not that bad - On my 3930k I can set genproclimit to 6 and I get 2517 pps and 1.2 cpd and with genproclimit set to 12 I get 2900 pps and 1.4 cpd. Something seems wrong with this right now. I also set genproclimit to 1 and I'm getting about 450 pps/ 0.23 cpd. If the performance scaled linearly I would be getting ~5kpps/2.7 or 2.8 cpd. Yes, I know that I should never expect anything like this but it seems like the performance scales linearly up until hyperthreading is involved and then it steeply drops off.

Edit: I just tried a few values between 6 and 12 and I'm getting at most a 100 pps increase in performance from one to another, and in some cases no significant increase whatsoever (going from 9 to 10 increased from 2784 to 2832).

That seems pretty much normal to me. The idea behind hyper threading is that the CPU core switches threads when one thread is blocked waiting for memory. My code is nearly always hitting the L1 or L2 caches which keeps the CPU core busy at all times even with one thread. So in theory you only need enough threads to keep all the physical cores busy.

laughingbear

hero member

Activity: 622

Merit: 500

Quote from: hasle2 on August 05, 2013, 11:12:04 PM

I'm curious as to whether mikaelh has considered or is working on a gpu miner.

Reading is hard

hasle2

full member

Activity: 122

Merit: 100

I'm curious as to whether mikaelh has considered or is working on a gpu miner.

Dsfyu

member

Activity: 75

Merit: 10

Quote from: mikaelh on August 05, 2013, 05:56:34 PM

Quote from: ivanlabrie on August 05, 2013, 04:20:19 PM

I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

I see something similar to this on my end right now but not that bad - On my 3930k I can set genproclimit to 6 and I get 2517 pps and 1.2 cpd and with genproclimit set to 12 I get 2900 pps and 1.4 cpd. Something seems wrong with this right now. I also set genproclimit to 1 and I'm getting about 450 pps/ 0.23 cpd. If the performance scaled linearly I would be getting ~5kpps/2.7 or 2.8 cpd. Yes, I know that I should never expect anything like this but it seems like the performance scales linearly up until hyperthreading is involved and then it steeply drops off.

Edit: I just tried a few values between 6 and 12 and I'm getting at most a 100 pps increase in performance from one to another, and in some cases no significant increase whatsoever (going from 9 to 10 increased from 2784 to 2832).
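
For anyone wanting to redo the arithmetic above, here is a minimal Python sketch. It takes the single-thread figures quoted in the post (about 450 pps and 0.23 cpd with genproclimit set to 1) and compares an ideal linear projection with the measured numbers at 6 and 12 threads. The function name and the dictionary layout are made up for illustration; nothing here comes from the miner itself.

```python
# Measured figures from the post above (3930k).
# Keys are genproclimit values; values are (primes per second, chains per day).
measured = {
    1: (450, 0.23),
    6: (2517, 1.2),
    12: (2900, 1.4),
}


def scaling_efficiency(measured, baseline_threads=1):
    """Return {threads: (ideal_pps, actual_pps, efficiency)} relative to the baseline run."""
    base_pps, _ = measured[baseline_threads]
    per_thread = base_pps / baseline_threads  # pps one thread delivers on its own
    report = {}
    for threads, (pps, _cpd) in sorted(measured.items()):
        ideal = per_thread * threads  # what perfectly linear scaling would give
        report[threads] = (ideal, pps, pps / ideal)
    return report


if __name__ == "__main__":
    for threads, (ideal, actual, eff) in scaling_efficiency(measured).items():
        print(f"{threads:>2} threads: ideal {ideal:.0f} pps, "
              f"actual {actual} pps, efficiency {eff:.0%}")
```

With these inputs the 6-thread run lands at roughly 93% of the linear projection and the 12-thread run at roughly 54%, which fits the observation that scaling holds up until hyperthreading gets involved.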

1l1l11ll1l

legendary

Activity: 1274

Merit: 1000

Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
   "chainspermin" : 29,
   "chainsperday" : 1.67533939,
   "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
   "chainspermin" : 12,
   "chainsperday" : 0.71721642,
   "primespersec" : 7039,

4-core i7-2600k 3.4GHz:
   "chainspermin" : 8,
   "chainsperday" : 0.57826364,
   "primespersec" : 3170,

8-core L5520 2.26GHz:
   "chainspermin" : 8,
   "chainsperday" : 0.73978522,
   "primespersec" : 3628,

8-core L5420 2.5GHz:
   "chainspermin" : 14,
   "chainsperday" : 0.96020906,
   "primespersec" : 3490,

8-core X5355 2.66GHz:
   "chainspermin" : 15,
   "chainsperday" : 1.00721642,
   "primespersec" : 3670,

4-core Xeon 5160 3.0GHz:
   "chainspermin" : 7,
   "chainsperday" : 0.50713449,
   "primespersec" : 1859,

4-core Xeon 5130 2.0GHz:
   "chainspermin" : 6,
   "chainsperday" : 0.34404084,
   "primespersec" : 1267,

Core 2 Duo 6300 1.86GHz:
   "chainspermin" : 3,
   "chainsperday" : 0.15991434,
   "primespersec" : 587,

All systems running 64-bit HP9
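
To make the list above easier to compare, here is a rough Python sketch that divides each system's primespersec and chainsperday by its core count. The numbers are copied from the list; the tuple layout and function name are just for illustration, and the core count of 2 for the Core 2 Duo is filled in because the post does not state it.

```python
# (label, cores, primespersec, chainsperday) copied from the list above.
# The Core 2 Duo core count (2) is an addition; the post does not list it.
systems = [
    ("Opteron 6164HE 1.7GHz", 24, 8389, 1.67533939),
    ("Opteron 6274 2.2GHz", 32, 7039, 0.71721642),
    ("i7-2600k 3.4GHz", 4, 3170, 0.57826364),
    ("L5520 2.26GHz", 8, 3628, 0.73978522),
    ("L5420 2.5GHz", 8, 3490, 0.96020906),
    ("X5355 2.66GHz", 8, 3670, 1.00721642),
    ("Xeon 5160 3.0GHz", 4, 1859, 0.50713449),
    ("Xeon 5130 2.0GHz", 4, 1267, 0.34404084),
    ("Core 2 Duo 6300 1.86GHz", 2, 587, 0.15991434),
]


def per_core(systems):
    """Return (label, pps_per_core, cpd_per_core) rows, highest pps per core first."""
    rows = [(label, pps / cores, cpd / cores) for label, cores, pps, cpd in systems]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for label, pps_core, cpd_core in per_core(systems):
        print(f"{label:<26} {pps_core:7.1f} pps/core  {cpd_core:.4f} cpd/core")
```

On these numbers the 6274 box does about 220 primes per second per core against roughly 350 for the 6164HE, so the gap shows up per core and not only in the totals. It says nothing about why, and a single reading per machine is a small sample.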

1l1l11ll1l

legendary

Activity: 1274

Merit: 1000

Quote from: mikaelh on August 05, 2013, 06:06:42 PM

Quote from: paulthetafy on August 05, 2013, 03:15:11 PM

Quote from: 1l1l11ll1l on August 05, 2013, 02:49:50 PM

So with HP9 on a 24-core 1.7GHz AMD system I was getting 8300 PPS. I just upgraded to a 32-core 2.2GHz setup and I'm getting 7100 PPS. Anyone have experience with 32-core systems? What setting might I need to adjust?

You appear to have uncovered an issue with the miner with high thread counts - I get almost no performance gain when using more than 16 threads on a 32 core system.

mikael / sunny any thoughts where this bottleneck might be?

Well, as far as I know there are 2 bottlenecks when it comes to scaling out:

1) Block generation. Only 1 thread at a time can be generating new blocks. This was already mostly fixed by Sunny.

2) Memory allocation. The default malloc implementation uses mutexes internally, which reduces performance when multiple threads are trying to allocate memory. This shouldn't be an issue with my client because I have reduced the number of memory allocations needed.

So as far as the code is concerned there shouldn't really be any bottlenecks. If the caches on the CPU are completely inadequate, some performance issues would start appearing. But as far as I know, most server CPUs have pretty big caches.

And of course if you have a VPS, remember that you may be sharing the CPU time with other people's instances.

So looking at the specs, the 6100-series Opteron has 128 KB of L1 cache per core, while the 6200 and 6300 series have 48 KB per core. That's the only difference I can see. I tried running on fewer cores on the 16-core 2.2GHz 6276, but the performance was probably half that of the 12-core 1.7GHz 6164 Opteron (chains per minute and chains per day). The PPS is 8300 with the 12-core and 7100 with the faster 16-core.

mikaelh

sr. member

Activity: 301

Merit: 250

Quote from: paulthetafy on August 05, 2013, 03:15:11 PM

Quote from: 1l1l11ll1l on August 05, 2013, 02:49:50 PM

So with HP9 on a 24-core 1.7GHz AMD system I was getting 8300 PPS. I just upgraded to a 32-core 2.2GHz setup and I'm getting 7100 PPS. Anyone have experience with 32-core systems? What setting might I need to adjust?

You appear to have uncovered an issue with the miner with high thread counts - I get almost no performance gain when using more than 16 threads on a 32 core system.

mikael / sunny any thoughts where this bottleneck might be?

Well, as far as I know there are 2 bottlenecks when it comes to scaling out:

1) Block generation. Only 1 thread at a time can be generating new blocks. This was already mostly fixed by Sunny.

2) Memory allocation. The default malloc implementation uses mutexes internally, which reduces performance when multiple threads are trying to allocate memory. This shouldn't be an issue with my client because I have reduced the number of memory allocations needed.

So as far as the code is concerned there shouldn't really be any bottlenecks. If the caches on the CPU are completely inadequate, some performance issues would start appearing. But as far as I know, most server CPUs have pretty big caches.

And of course if you have a VPS, remember that you may be sharing the CPU time with other people's instances.

mikaelh

sr. member

Activity: 301

Merit: 250

Quote from: ivanlabrie on August 05, 2013, 04:20:19 PM

I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

matt4054

legendary

Activity: 1946

Merit: 1035

Quote from: elebit on August 05, 2013, 05:44:49 PM

Quote from: matt4054 on August 05, 2013, 05:34:33 PM

But again, I may be wrong on this, anyone who dev'd on the bitcoind wallet could give a more authoritative answer.

Or you could just check for yourself: getinfo returns the size of the current pool of unused keys.

Thanks for pointing this out.

The keypoolsize seems to remain constantly at 101 with default settings on all of my instances using the same wallet.dat
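
If you want to keep an eye on this across several nodes, something like the following Python sketch could do it. It assumes you already have the JSON text that getinfo prints (how you collect it is up to you) and only reads the keypoolsize field mentioned above. The function name, the threshold of 50, and the trimmed sample string are all made up for illustration.

```python
import json


def keypool_status(getinfo_output, warn_below=50):
    """Parse getinfo JSON text and return (keypoolsize, is_low).

    warn_below is an arbitrary illustrative threshold, not a recommended value.
    """
    info = json.loads(getinfo_output)
    size = int(info["keypoolsize"])
    return size, size < warn_below


# Trimmed, hypothetical sample: real getinfo output has many more fields.
sample = '{"keypoolsize": 101}'

if __name__ == "__main__":
    size, low = keypool_status(sample)
    print(f"keypoolsize={size} low={low}")
```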

elebit

sr. member

Activity: 441

Merit: 250

Quote from: matt4054 on August 05, 2013, 05:34:33 PM

But again, I may be wrong on this, anyone who dev'd on the bitcoind wallet could give a more authoritative answer.

Or you could just check for yourself: getinfo returns the size of the current pool of unused keys.

matt4054

legendary

Activity: 1946

Merit: 1035

Quote from: B.T.Coin on August 05, 2013, 05:23:33 PM

Quote from: matt4054 on August 05, 2013, 05:07:53 PM

it means that for any slave that forks before master, newly generated coins will be lost when the forked slave wallet gets overwritten.

That's why my idea was to clone the master to the slaves every day, so they can never reach a 100 block difference.

My understanding of the process is that primecoind will not update wallet.dat for every new generated address, but only once the pool is exhausted. So replicating master to slave on a regular basis would not help IMO, and will not address the issue of a slave that exhausted its pool before master and updated wallet.dat before master did. Generation after the fork on the slave (overwritten wallet.dat) would be lost.

But again, I may be wrong on this, anyone who dev'd on the bitcoind wallet could give a more authoritative answer.

mikaelh

sr. member

Activity: 301

Merit: 250

Quote from: B.T.Coin on August 05, 2013, 04:52:47 PM

What if I run the same wallet on all computers, but run a script once a day that copies the wallet.dat from my central PC to all the others, overwriting the wallet.dat on each machine (which was a clone of the original anyway)? This way the wallets can't drift apart after 100 blocks, since they get replaced every day.
Would this work, or could I lose coins this way? Obviously the mining program will be closed and restarted when the wallet gets replaced.

I have to say this sounds potentially dangerous. If you are overwriting wallet files, then you risk losing the private keys of addresses that may be holding coins.

I think the best solution is to make one big wallet with thousands of keys. First you need to back up any old wallet files from all your nodes. Then you run the client once with the parameter -keypool=10000, which will generate a big wallet file. Then you can distribute that new file to your mining nodes. Eventually you may need to make a new wallet file if the keys get exhausted, but that probably won't happen any time soon. Many people are using this solution and it's known to work.
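
For the backup step, here is a minimal Python sketch of one way to do it safely. It copies a wallet file into a backup folder under a timestamped name and refuses to overwrite anything that is already there. The function name, the label argument, and the paths in the usage comment are hypothetical; adjust them to wherever your nodes actually keep wallet.dat, and stop the client before copying.

```python
import shutil
import time
from pathlib import Path


def backup_wallet(wallet_path, backup_dir, label):
    """Copy a wallet file into backup_dir under a timestamped name.

    Never overwrites an existing backup; raises instead.
    """
    src = Path(wallet_path)
    if not src.is_file():
        raise FileNotFoundError(src)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{label}-{stamp}-{src.name}"
    if dest.exists():
        raise FileExistsError(dest)
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest


# Hypothetical usage, one call per node before its wallet.dat is replaced:
#   backup_wallet("/path/to/node1/wallet.dat", "/path/to/backups", "node1")
```

Keeping every old wallet under its own name means no private key is lost even if a node had already handed out keys the new big wallet does not contain.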

B.T.Coin

sr. member

Activity: 332

Merit: 250

Quote from: matt4054 on August 05, 2013, 05:07:53 PM

it means that for any slave that forks before master, newly generated coins will be lost when the forked slave wallet gets overwritten.

That's why my idea was to clone the master to the slaves every day, so they can never reach a 100 block difference.
