Topic: [XPM] [ANN] Primecoin High Performance | HP14 released! - page 51. (Read 397657 times)

legendary
Activity: 1274
Merit: 1000
Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
    "chainspermin" : 29,
    "chainsperday" : 1.67533939,
    "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
    "chainspermin" : 12,
    "chainsperday" : 0.71721642,
    "primespersec" : 7039,


From PassMark and the opteron wiki page:
http://www.cpubenchmark.net/cpu_list.php
http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors


Dual CPU, 12-core Opteron 6164HE: PassMark CPU result 5351 each, 5351*2 = 10702. Cache arrangement: L2 12x 512 KB, L3 2x 6 MB.
Dual CPU, AMD Opteron 6274: PassMark CPU result 10809 (inclusive of both). Cache arrangement: L2 8x 2 MB, L3 2x 8 MB.

If I had to guess, the dual-CPU, 16-core setup (6274s) is slower because it shares one unit of L2 cache between two cores. The HEs have dedicated L2 for every core.

Despite the disappointing(?) performance, those are still all nice systems and I would mine on them any day.

Looking at the specs, AMD's site lists the L2 cache of the 6274 as 16x 1 MB.

http://products.amd.com/en-us/OpteronCPUDetail.aspx?id=760&f1=AMD+Opteron%E2%84%A2+6200+Series+Processor&f2=&f3=Yes&f4=&f5=&f6=G34&f7=B2&f8=32nm&f9=&f10=6400&f11=&
http://products.amd.com/en-us/OpteronCPUDetail.aspx?id=649

If that were the case then the only thing left would be the L1?

newbie
Activity: 32
Merit: 0
AMD CPU is better for mining Primecoin?

Also, in case anyone is curious

///
All systems running 64-bit HP9


No. That's not what he is saying/asking. 1l1l11ll1l has several relatively nice servers mining, and was wondering why the 24-core server seems to outperform those with a total of 32 cores. Refer to my response above for one possibility why.

An AMD CPU is faster if you compare one Opteron 6274 or Opteron 6164HE with one Intel CPU such as an i7-2600k or a Xeon L5520.
sr. member
Activity: 434
Merit: 250
No. That's not what he is saying/asking. 1l1l11ll1l has several relatively nice servers mining, and was wondering why the 24-core server seems to outperform those with a total of 32 cores. Refer to my response above for one possibility why.

I think the faster cpu also has larger L1 cache. Don't know if primecoin mining sits in L1 cache much though vs L2.

Edit: By faster I meant in chainsperday, not clock speed. It surprised me that fewer cores and a lower clock would do more work. The only thing jumping out at me was the difference in L1 size.
hero member
Activity: 546
Merit: 500
AMD CPU is better for mining Primecoin?

Also, in case anyone is curious

///
All systems running 64-bit HP9


No. That's not what he is saying/asking. 1l1l11ll1l has several relatively nice servers mining, and was wondering why the 24-core server seems to outperform those with a total of 32 cores. Refer to my response above for one possibility why.
newbie
Activity: 32
Merit: 0
AMD CPU is better for mining Primecoin?

Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
    "chainspermin" : 29,
    "chainsperday" : 1.67533939,
    "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
    "chainspermin" : 12,
    "chainsperday" : 0.71721642,
    "primespersec" : 7039,

4-core i7-2600k 3.4GHz:
    "chainspermin" : 8,
    "chainsperday" : 0.57826364,
    "primespersec" : 3170,

8-core L5520 2.26GHz:
    "chainspermin" : 8,
    "chainsperday" : 0.73978522,
    "primespersec" : 3628,

8-core L5420 2.5GHz:
    "chainspermin" : 14,
    "chainsperday" : 0.96020906,
    "primespersec" : 3490,

8-core X5355 2.66GHz:
    "chainspermin" : 15,
    "chainsperday" : 1.00721642,
    "primespersec" : 3670,

4-core Xeon 5160 3.0GHz:
    "chainspermin" : 7,
    "chainsperday" : 0.50713449,
    "primespersec" : 1859,

4-core Xeon 5130 2.0GHz:
    "chainspermin" : 6,
    "chainsperday" : 0.34404084,
    "primespersec" : 1267,

Core 2 Duo 6300 1.86GHz:
    "chainspermin" : 3,
    "chainsperday" : 0.15991434,
    "primespersec" : 587,



All systems running 64-bit HP9

hero member
Activity: 546
Merit: 500
Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
    "chainspermin" : 29,
    "chainsperday" : 1.67533939,
    "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
    "chainspermin" : 12,
    "chainsperday" : 0.71721642,
    "primespersec" : 7039,


From PassMark and the opteron wiki page:
http://www.cpubenchmark.net/cpu_list.php
http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors


Dual CPU, 12-core Opteron 6164HE: PassMark CPU result 5351 each, 5351*2 = 10702. Cache arrangement: L2 12x 512 KB, L3 2x 6 MB.
Dual CPU, AMD Opteron 6274: PassMark CPU result 10809 (inclusive of both). Cache arrangement: L2 8x 2 MB, L3 2x 8 MB.

If I had to guess, the dual-CPU, 16-core setup (6274s) is slower because it shares one unit of L2 cache between two cores. The HEs have dedicated L2 for every core.

Despite the disappointing(?) performance, those are still all nice systems and I would mine on them any day.
hero member
Activity: 820
Merit: 1000
I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

I see something similar to this on my end right now but not that bad - On my 3930k I can set genproclimit to 6 and I get 2517 pps and 1.2 cpd and with genproclimit set to 12 I get 2900 pps and 1.4 cpd. Something seems wrong with this right now. I also set genproclimit to 1 and I'm getting about 450 pps/ 0.23 cpd. If the performance scaled linearly I would be getting ~5kpps/2.7 or 2.8 cpd. Yes, I know that I should never expect anything like this but it seems like the performance scales linearly up until hyperthreading is involved and then it steeply drops off.

Edit: I just tried a few values between 6 and 12 and I'm getting at most a 100 pps increase in performance from one to another, and in some cases no significant increase whatsoever (going from 9 to 10 increased from 2784 to 2832).

That seems pretty much normal to me. The idea behind hyper threading is that the CPU core switches threads when one thread is blocked waiting for memory. My code is nearly always hitting the L1 or L2 caches which keeps the CPU core busy at all times even with one thread. So in theory you only need enough threads to keep all the physical cores busy.

Thanks for the clarification mikael.  To summarise then, you will see very little benefit from hyperthreaded / virtual CPUs, as most time will be spent utilizing the physical cores only.  VPS miners TAKE NOTE!
sr. member
Activity: 301
Merit: 250
I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

I see something similar to this on my end right now but not that bad - On my 3930k I can set genproclimit to 6 and I get 2517 pps and 1.2 cpd and with genproclimit set to 12 I get 2900 pps and 1.4 cpd. Something seems wrong with this right now. I also set genproclimit to 1 and I'm getting about 450 pps/ 0.23 cpd. If the performance scaled linearly I would be getting ~5kpps/2.7 or 2.8 cpd. Yes, I know that I should never expect anything like this but it seems like the performance scales linearly up until hyperthreading is involved and then it steeply drops off.

Edit: I just tried a few values between 6 and 12 and I'm getting at most a 100 pps increase in performance from one to another, and in some cases no significant increase whatsoever (going from 9 to 10 increased from 2784 to 2832).

That seems pretty much normal to me. The idea behind hyper threading is that the CPU core switches threads when one thread is blocked waiting for memory. My code is nearly always hitting the L1 or L2 caches which keeps the CPU core busy at all times even with one thread. So in theory you only need enough threads to keep all the physical cores busy.
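As a rough illustration of "enough threads to keep all the physical cores busy", here is a small Python sketch that turns a logical CPU count into a candidate genproclimit value. The function name and the default of 2 hardware threads per core are assumptions made for the example, not something taken from the miner; on a CPU without hyper threading you would pass threads_per_core=1.

```python
import os


def suggested_genproclimit(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Return one mining thread per physical core.

    Assumes every core exposes the same number of hardware threads
    (threads_per_core), which is a simplification.
    """
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    return max(1, logical_cpus // threads_per_core)


# The 3930k discussed in this thread shows 12 logical CPUs.
print(suggested_genproclimit(12))

# Same calculation for the machine running this script.
print(suggested_genproclimit(os.cpu_count() or 1))
```

This is only a starting point; the measurements posted in this thread show a small gain from going past the physical core count, so it is still worth testing a few values.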
hero member
Activity: 622
Merit: 500
I'm curious as to whether mikaelh has considered or is working on a gpu miner.


Reading is hard
full member
Activity: 122
Merit: 100
I'm curious as to whether mikaelh has considered or is working on a gpu miner.
member
Activity: 75
Merit: 10
I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.

I see something similar to this on my end right now but not that bad - On my 3930k I can set genproclimit to 6 and I get 2517 pps and 1.2 cpd and with genproclimit set to 12 I get 2900 pps and 1.4 cpd. Something seems wrong with this right now. I also set genproclimit to 1 and I'm getting about 450 pps/ 0.23 cpd. If the performance scaled linearly I would be getting ~5kpps/2.7 or 2.8 cpd. Yes, I know that I should never expect anything like this but it seems like the performance scales linearly up until hyperthreading is involved and then it steeply drops off.

Edit: I just tried a few values between 6 and 12 and I'm getting at most a 100 pps increase in performance from one to another, and in some cases no significant increase whatsoever (going from 9 to 10 increased from 2784 to 2832).
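The scaling numbers above can be checked with a few lines of Python. This is just a sketch of the arithmetic: it treats the 1-thread result as the baseline and reports each measurement as a fraction of perfectly linear scaling. The function and variable names are made up for the example.

```python
# pps figures quoted in the post above (3930k, genproclimit 1, 6 and 12).
measured_pps = {1: 450, 6: 2517, 12: 2900}


def scaling_efficiency(pps_by_threads, baseline_threads=1):
    """Return {threads: measured / ideal}, where ideal scales linearly from the baseline."""
    per_thread = pps_by_threads[baseline_threads] / baseline_threads
    return {n: pps / (per_thread * n) for n, pps in sorted(pps_by_threads.items())}


efficiency = scaling_efficiency(measured_pps)
for n, e in efficiency.items():
    print(f"{n:2d} threads: {measured_pps[n]:5d} pps, {e:.0%} of linear")
```

With the posted figures, 6 threads land at roughly 93% of linear and 12 threads at roughly 54%, which matches the description of near-linear scaling up to the physical core count followed by a steep drop.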
legendary
Activity: 1274
Merit: 1000
Also, in case anyone is curious

24-core Opteron 6164HE 1.7GHz:
    "chainspermin" : 29,
    "chainsperday" : 1.67533939,
    "primespersec" : 8389,

32-core Opteron 6274 2.2GHz:
    "chainspermin" : 12,
    "chainsperday" : 0.71721642,
    "primespersec" : 7039,

4-core i7-2600k 3.4GHz:
    "chainspermin" : 8,
    "chainsperday" : 0.57826364,
    "primespersec" : 3170,

8-core L5520 2.26GHz:
    "chainspermin" : 8,
    "chainsperday" : 0.73978522,
    "primespersec" : 3628,

8-core L5420 2.5GHz:
    "chainspermin" : 14,
    "chainsperday" : 0.96020906,
    "primespersec" : 3490,

8-core X5355 2.66GHz:
    "chainspermin" : 15,
    "chainsperday" : 1.00721642,
    "primespersec" : 3670,

4-core Xeon 5160 3.0GHz:
    "chainspermin" : 7,
    "chainsperday" : 0.50713449,
    "primespersec" : 1859,

4-core Xeon 5130 2.0GHz:
    "chainspermin" : 6,
    "chainsperday" : 0.34404084,
    "primespersec" : 1267,

Core 2 Duo 6300 1.86GHz:
    "chainspermin" : 3,
    "chainsperday" : 0.15991434,
    "primespersec" : 587,



All systems running 64-bit HP9
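To compare these machines per core rather than per box, the figures can be normalized by core count. The Python sketch below does only that. The core counts come from the labels in the post; the Core 2 Duo is assumed to have 2 cores, which the post does not state, and all names in the code are made up for the example.

```python
# (label, cores, chainsperday, primespersec), copied from the HP9 results above.
# The Core 2 Duo core count (2) is an assumption; the post does not state it.
RESULTS = [
    ("Opteron 6164HE", 24, 1.67533939, 8389),
    ("Opteron 6274", 32, 0.71721642, 7039),
    ("i7-2600k", 4, 0.57826364, 3170),
    ("L5520", 8, 0.73978522, 3628),
    ("L5420", 8, 0.96020906, 3490),
    ("X5355", 8, 1.00721642, 3670),
    ("Xeon 5160", 4, 0.50713449, 1859),
    ("Xeon 5130", 4, 0.34404084, 1267),
    ("Core 2 Duo 6300", 2, 0.15991434, 587),
]


def per_core(results):
    """Return (label, chainsperday per core, primespersec per core), best chainsperday first."""
    rows = [(label, cpd / cores, pps / cores) for label, cores, cpd, pps in results]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = per_core(RESULTS)
for label, cpd_core, pps_core in ranked:
    print(f"{label:<16} {cpd_core:.4f} cpd/core  {pps_core:7.1f} pps/core")
```

On these numbers the per-core ranking and the whole-machine ranking differ, so which CPU looks "better" depends on whether you compare cores or whole boxes.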
legendary
Activity: 1274
Merit: 1000
So with HP9 on a 24-core 1.7GHz AMD system I was getting 8300 PPS. I just upgraded to a 32-core 2.2GHz setup and I'm getting 7100 PPS. Does anyone have experience with 32-core systems? What setting might I need to adjust?
You appear to have uncovered an issue with the miner with high thread counts - I get almost no performance gain when using more than 16 threads on a 32 core system.  

mikael / sunny any thoughts where this bottleneck might be?

Well, as far as I know there are 2 bottlenecks when it comes to scaling out:

1) Block generation. Only 1 thread at a time can be generating new blocks. This was already mostly fixed by Sunny.

2) Memory allocation. The default malloc implementation uses mutexes internally, which reduces performance when multiple threads are trying to allocate memory. This shouldn't be an issue with my client because I have reduced the number of memory allocations needed.

So as far as the code is concerned there shouldn't really be any bottlenecks. If the caches on the CPU are completely inadequate, some performance issues would start appearing. But as far as I know, most server CPUs have pretty big caches.

And of course if you have a VPS, remember that you may be sharing the CPU time with other people's instances.

So looking at the specs, the 6100-series Opteron has 128 KB of L1 cache per core, while the 6200 and 6300 series have 48 KB per core. That's the only difference I can see. I tried running on fewer cores on the 16-core 2.2GHz 6276, but the performance was probably half that of the 12-core 1.7GHz 6164 Opteron (chains per minute and chains per day). The PPS is 8300 with the 12-core and 7100 with the faster 16-core.


sr. member
Activity: 301
Merit: 250
So with HP9 on a 24-core 1.7GHz AMD system I was getting 8300 PPS. I just upgraded to a 32-core 2.2GHz setup and I'm getting 7100 PPS. Does anyone have experience with 32-core systems? What setting might I need to adjust?
You appear to have uncovered an issue with the miner with high thread counts - I get almost no performance gain when using more than 16 threads on a 32 core system. 

mikael / sunny any thoughts where this bottleneck might be?

Well, as far as I know there are 2 bottlenecks when it comes to scaling out:

1) Block generation. Only 1 thread at a time can be generating new blocks. This was already mostly fixed by Sunny.

2) Memory allocation. The default malloc implementation uses mutexes internally, which reduces performance when multiple threads are trying to allocate memory. This shouldn't be an issue with my client because I have reduced the number of memory allocations needed.

So as far as the code is concerned there shouldn't really be any bottlenecks. If the caches on the CPU are completely inadequate, some performance issues would start appearing. But as far as I know, most server CPUs have pretty big caches.

And of course if you have a VPS, remember that you may be sharing the CPU time with other people's instances.
sr. member
Activity: 301
Merit: 250
I found hyper threading adds no perf increase on my end...so I run 4 threads on a sandy bridge i7 and it's faster.

Are you running Windows? If so, which version? Hyper threading performance depends on the CPU scheduler and lots of other things. The CPU scheduler in Windows isn't that great in my experience but I haven't witnessed it actually being detrimental.
legendary
Activity: 1946
Merit: 1035
But again, I may be wrong on this, anyone who dev'd on the bitcoind wallet could give a more authoritative answer.

Or you could just check for yourself: getinfo returns the size of the current pool of unused keys.

Thanks for pointing this out.

The keypoolsize seems to remain constantly at 101 with default settings on all of my instances using the same wallet.dat
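For anyone who wants to watch this value from a script, here is a minimal Python sketch that reads keypoolsize out of the JSON text printed by getinfo. The only field it relies on is keypoolsize, as mentioned above; the function names, the sample string, and the threshold of 20 are hypothetical and chosen only for illustration.

```python
import json


def keypool_size(getinfo_output: str) -> int:
    """Extract keypoolsize from the JSON text printed by getinfo."""
    return int(json.loads(getinfo_output)["keypoolsize"])


def is_running_low(getinfo_output: str, threshold: int = 20) -> bool:
    """True when fewer than `threshold` unused keys remain (threshold is an arbitrary choice)."""
    return keypool_size(getinfo_output) < threshold


# Hypothetical sample, trimmed to the one field this sketch uses.
sample = '{"keypoolsize": 101}'
print(keypool_size(sample), is_running_low(sample))
```

In practice the input string would be whatever your client prints for getinfo; how you capture that output depends on your setup and is left out here.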
sr. member
Activity: 441
Merit: 250
But again, I may be wrong on this, anyone who dev'd on the bitcoind wallet could give a more authoritative answer.

Or you could just check for yourself: getinfo returns the size of the current pool of unused keys.
legendary
Activity: 1946
Merit: 1035
it means that for any slave that forks before master, newly generated coins will be lost when the forked slave wallet gets overwritten.

That's why my idea was to clone the master to the slaves every day, so they can never reach a 100 block difference.

My understanding of the process is that primecoind will not update wallet.dat for every new generated address, but only once the pool is exhausted. So replicating master to slave on a regular basis would not help IMO, and will not address the issue of a slave that exhausted its pool before master and updated wallet.dat before master did. Generation after the fork on the slave (overwritten wallet.dat) would be lost.

But again, I may be wrong on this, anyone who dev'd on the bitcoind wallet could give a more authoritative answer.
sr. member
Activity: 301
Merit: 250
What if I run the same wallet on all computers, but run a script once a day that will copy the wallet.dat from my central PC to all others, and overwrite the wallet.dat that is on that machine, which was a clone of the original anyway. This way the wallets can't drift apart after 100 blocks since they get updated/replaced every day.
Would this work or could I lose coins this way? Obviously the mining program will be closed and restarted when the wallet gets replaced/renewed.

I have to say this sounds potentially dangerous. If you are overwriting wallet files, then you risk losing the private keys of addresses that may be holding coins.

I think the best solution is to make one big wallet with thousands of keys. First you need to back up any old wallet files from all your nodes. Then you run the client once with the parameter -keypool=10000, which will generate a big wallet file. Then you can distribute that new file to your mining nodes. Eventually you may need to make a new wallet file if the keys get exhausted, but that probably won't happen any time soon. Many people are using this solution and it's known to work.
sr. member
Activity: 332
Merit: 250
it means that for any slave that forks before master, newly generated coins will be lost when the forked slave wallet gets overwritten.

That's why my idea was to clone the master to the slaves every day, so they can never reach a 100 block difference.