I know this is a pretty old thread, but it comes up in several related Google searches, so I figure it needs an answer. My server with 4x 6378, running 40 threads for Monero mining, averages 2199.6 H/s when idle. That's on Ubuntu 16.04; it only does about 1800 H/s on Windows Server 2016 with the same config file.
I didn't bother with any tweaks (other than huge pages and a ulimit increase). I just run xmr-stak-cpu configured with a full power thread on every odd core, plus a low power thread on the 2nd, 10th, 18th, etc.
Do you think you could share your full config file?
Sorry, I don't actually use this forum and didn't see this.
Literally the only thing I had done was the CPU setup:
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
{ "low_power_mode" : true, "no_prefetch" : true, "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 7 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 9 },
{ "low_power_mode" : true, "no_prefetch" : true, "affine_to_cpu" : 10 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 15 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 17 },
{ "low_power_mode" : true, "no_prefetch" : true, "affine_to_cpu" : 18 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 19 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 21 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 23 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 25 },
{ "low_power_mode" : true, "no_prefetch" : true, "affine_to_cpu" : 26 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 27 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 29 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 31 },
I just continued that pattern through all cores: a full power thread on every other core, plus one low power thread on the third and eleventh core of each CPU (affine_to_cpu 2 and 10 on the first CPU, 18 and 26 on the second, and so on). That assignment was done because of the way the chips themselves are built. I don't remember exactly what the reason was, only that it resulted in lower heat.
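If you don't want to type all 40 entries by hand, here is a minimal sketch of a script that prints them. It is not part of xmr-stak-cpu. The function name and constants are made up for illustration, and it assumes 4 CPUs with 16 cores each, numbered 0 to 63 in order (which is what gives 40 threads with this pattern). Check your own core numbering before using it.

```python
import json

# Assumptions for illustration (not taken from xmr-stak-cpu itself):
NUM_CPUS = 4                 # four sockets
CORES_PER_CPU = 16           # assumed 16 cores per socket, numbered consecutively
LOW_POWER_OFFSETS = (2, 10)  # third and eleventh core of each CPU, counting from 0


def build_thread_config(num_cpus=NUM_CPUS, cores_per_cpu=CORES_PER_CPU):
    """Return the list of thread entries following the pattern in the post."""
    threads = []
    for core in range(num_cpus * cores_per_cpu):
        offset = core % cores_per_cpu  # position of this core within its CPU
        if core % 2 == 1:
            # Full power thread on every odd core.
            low_power = False
        elif offset in LOW_POWER_OFFSETS:
            # One low power thread on the third and eleventh core of each CPU.
            low_power = True
        else:
            continue  # remaining even cores are left free
        threads.append({
            "low_power_mode": low_power,
            "no_prefetch": True,
            "affine_to_cpu": core,
        })
    return threads


if __name__ == "__main__":
    for entry in build_thread_config():
        # One entry per line with a trailing comma, like the lines pasted above.
        print(json.dumps(entry) + ",")
```

The whitespace in the output differs slightly from the lines I pasted, but the entries are the same. Paste them into the thread list of your config file.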
I couldn't use XMRig, even though it was quite a bit faster in single threads, because it refused to assign threads that way. It did "whatever it felt like" no matter what assignment rules I fed it, resulting in a much slower overall rate (assignment seemed to be broken with more than 16 threads; not sure if they've fixed it yet).
Running more than 40 threads was slower or the same speed, due to the shared resources in the CPU design (2 cores share some things, and AES seems to be one of those shared functions). So this was the best hashrate I saw while still leaving the server perfectly usable for basically everything else. I can't really offer anything other than that, though, because I only ran it for a couple of days as part of testing the server before putting it to work someplace else.
On Linux, compiling xmr-stak with -march=native added a couple hundred H/s, and -funroll-loops added about 100 H/s, if you care enough to figure out where to manually assign that stuff. [insert funroll-all-loops Gentoo joke here]. Playing with compile flags, I think I capped out at about 2650 H/s on the system, which is slower than it should have been, so I think the Dell board was limiting it somehow (Geekbench results were slightly lower than expected too). HP systems will probably do better.