
Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 43. (Read 444040 times)

newbie
Activity: 7
Merit: 0
nothing major changed from skylake to kaby lake and to coffee lake

Cache changed in coffee lake
Sir, where did you find this info? The cache block diagram looks exactly like the one from Sandy Bridge times; everything seems to be the same as in Kaby Lake.

BTW, cygwin fails to compile:
/usr/lib/gcc/x86_64-pc-cygwin/6.4.0/../../../../x86_64-pc-cygwin/bin/ld: cannot find -lpthreadGC2
pthreads is installed and pthreadGC2.dll exists inside cygwin64\usr\x86_64-w64-mingw32\sys-root\mingw\bin; I tried copying it to cygwin64\usr\x86_64-pc-cygwin\bin, with no result :(
Looks like tomorrow will be a linux day, because previously some default ubuntu gcc compiled it OK :)
sr. member
Activity: 338
Merit: 250
Fantastic miner!

Added to Sniffdog default sse2
full member
Activity: 187
Merit: 100
Cryptocurrency enthusiast
nothing major changed from skylake to kaby lake and to coffee lake

Cache changed in coffee lake, and it's very important, because even CPUs like the AMD FX-8350 mine very well: poor per-core performance, but large amounts of cache, that's it. Xeons v1 do the same.

Wanna upgrade my work machine. I do like i7-7800x/7820x with avx-512, but I think I'd go for ryzen...
newbie
Activity: 7
Merit: 0
nothing major changed from skylake to kaby lake and to coffee lake
What's more, as far as I can see in the gcc 7.2 manual ( https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/x86-Options.html#x86-Options ), skylake looks like the most preferable -march value, since Coffee Lake seems to lack AVX-512 support
member
Activity: 473
Merit: 18
few benchmarks on yescryptr16

3.7.7 "v1" (gcc 4.8.3)
avx - ~930 h/s
sse2 - ~ 870 h/s

3.7.7 v2 (gcc 5.3.1)
avx - ~950 h/s
sse2 - ~970 h/s


3.7.7 4ward (gcc 6.2.1)
avx - ~970 h/s
sse2 - ~960 h/s

additional algos that show better performance in sse2 than avx (although the difference is very small):
yescrypt, polytimos and lbry


This shows gcc-5.3.1 might be the issue. Is that on a Coffeelake? If not, it rules out a Coffeelake
issue and looks purely like a compiler version issue.

i5 7600k, kaby lake

nothing major changed from skylake to kaby lake and to coffee lake, and skylake support was added in gcc 6, so I'm guessing it does a better job optimizing the code
legendary
Activity: 1470
Merit: 1114
few benchmarks on yescryptr16

3.7.7 "v1" (gcc 4.8.3)
avx - ~930 h/s
sse2 - ~ 870 h/s

3.7.7 v2 (gcc 5.3.1)
avx - ~950 h/s
sse2 - ~970 h/s


3.7.7 4ward (gcc 6.2.1)
avx - ~970 h/s
sse2 - ~960 h/s

additional algos that show better performance in sse2 than avx (although the difference is very small):
yescrypt, polytimos and lbry


This shows gcc-5.3.1 might be the issue. Is that on a Coffeelake? If not, it rules out a Coffeelake
issue and looks purely like a compiler version issue.
newbie
Activity: 7
Merit: 0

There is a tool which can show L3 cache usage https://www.cpuid.com/softwares/perfmonitor-2.html, it is old as heck, but still works on some (!) configurations. It could work on i7-6700 and help with optimizations. I will also try to use https://github.com/opcm/pcm which supports Intel's cache monitoring technology.

5820k:
with 12 threads L2 hit is 49%, L3 hit is 6%
with 6 threads L2 hit is 54%, L3 hit is 11% but, only 3 cores under load

12 threads: stalled cycles 57% (wtf???), branch hit 99% (I don't know what that is), 1.2-1.3 instructions per cycle (I don't think it matters as long as it is a "medium" value)

Tomorrow I will compare with the 8700k
legendary
Activity: 1470
Merit: 1114
Quote
If Coffeelake has a design quirk that can be worked around in software
such a workaround would probably have a negative effect on other models. If it's a coffee lake issue it needs a Coffeelake fix.
I suppose that exposing the scrypt algo parameters on the command line could help a lot: if yescrypt, like regular scrypt, can be calculated with different algo presets (precache amount, link split size and so on), then Coffee Lake may benefit from splitting that fits its cache better.
Anyway, tomorrow I'll try to recompile the miner with different presets in scrypt.c
Quote
Again I speculate but maybe the
compiler isn't yet tweaked for Coffeelake. What version did you compile with?
I currently use the precompiled windows versions on both the 5820 and the 8700. Tomorrow I'll try the latest gcc with the skylake opt flag, but I suppose the compiler won't change anything inside the asm instructions, so the opt flag won't help, at least not much.
Quote
You stated lower performance with fewer threads.
That is mostly a windows problem: with 6 threads it uses only 3 physical cores and 3 HT cores. This is clearly seen with cputemp: after starting 6 threads, 3 cores start to generate heat (70-75C on the busy ones, 45C on the spare ones) and show 100% load, while with 12 threads all cores are hot and busy.
Under ubuntu the difference is within the margin of error
Quote
I suggest you try other algos with 6 and 12 threads to get a more complete profile. If some algos are affected more than others
it may reveal a pattern.
yep, i'll try

If 6 threads aren't balanced, use custom cpu affinity (--cpu-affinity 0x555). It hasn't been an issue on Intel before, but if you
say only 3 cores are heating up then maybe the mapping has changed.
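
For anyone unsure how the affinity values in this thread map to cores, here is a minimal sketch in Python that decodes a mask into logical CPU numbers. It assumes the usual convention that bit N of the mask stands for logical CPU N; the function name cpus_in_mask is made up for illustration and is not part of cpuminer-opt. Whether the even-numbered logical CPUs land on separate physical cores depends on how the OS enumerates hyperthreads, so treat that as an assumption to check on your own machine.

```python
def cpus_in_mask(mask):
    """Return the logical CPU numbers selected by an affinity bitmask.

    Assumption: bit N of the mask corresponds to logical CPU N.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# 0x555 is binary 0101 0101 0101: every second logical CPU out of 12.
print(cpus_in_mask(0x555))
# Small decimal masks decode the same way.
print(cpus_in_mask(3))
print(cpus_in_mask(15))
```

So 0x555 selects logical CPUs 0, 2, 4, 6, 8 and 10, while a mask of 3 selects CPUs 0 and 1 and a mask of 15 selects CPUs 0 to 3.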

There is no ASM, but there is hardcoded SSE2 code, and none of it differs between SSE2 and AVX. The only differences between the SSE2 and
AVX builds are generated by the compiler.
member
Activity: 473
Merit: 18
few benchmarks on yescryptr16

3.7.7 "v1" (gcc 4.8.3)
avx - ~930 h/s
sse2 - ~ 870 h/s

3.7.7 v2 (gcc 5.3.1)
avx - ~950 h/s
sse2 - ~970 h/s


3.7.7 4ward (gcc 6.2.1)
avx - ~970 h/s
sse2 - ~960 h/s

additional algos that show better performance in sse2 than avx (although the difference is very small):
yescrypt, polytimos and lbry
newbie
Activity: 7
Merit: 0
Quote
If Coffeelake has a design quirk that can be worked around in software
such a workaround would probably have a negative effect on other models. If it's a coffee lake issue it needs a Coffeelake fix.
I suppose that exposing the scrypt algo parameters on the command line could help a lot: if yescrypt, like regular scrypt, can be calculated with different algo presets (precache amount, link split size and so on), then Coffee Lake may benefit from splitting that fits its cache better.
Anyway, tomorrow I'll try to recompile the miner with different presets in scrypt.c
Quote
Again I speculate but maybe the
compiler isn't yet tweaked for Coffeelake. What version did you compile with?
I currently use the precompiled windows versions on both the 5820 and the 8700. Tomorrow I'll try the latest gcc with the skylake opt flag, but I suppose the compiler won't change anything inside the asm instructions, so the opt flag won't help, at least not much.
Quote
You stated lower performance with fewer threads.
That is mostly a windows problem: with 6 threads it uses only 3 physical cores and 3 HT cores. This is clearly seen with cputemp: after starting 6 threads, 3 cores start to generate heat (70-75C on the busy ones, 45C on the spare ones) and show 100% load, while with 12 threads all cores are hot and busy.
Under ubuntu the difference is within the margin of error
Quote
I suggest you try other algos with 6 and 12 threads to get a more complete profile. If some algos are affected more than others
it may reveal a pattern.
yep, i'll try
full member
Activity: 187
Merit: 100
Cryptocurrency enthusiast
Not only coffee lake has bottlenecks...

lyra2z330, Core i5-7600 (non-k) locked at 3.9GHz @ all cores, 16Gb DDR4-2400 dual channel

2 threads w/o affinity @ AVX2 build => ~830 h/s, resulting in ~50% load on each of the 4 cores
2 threads --cpu-affinity 3 @ AVX2 build => ~865 h/s (this is interesting), resulting in ~100% load on cores 0 and 1
4 threads w/o affinity @ AVX2 build => ~790 h/s (that's crap)
4 threads --cpu-affinity 15 @ AVX2 build => ~792 h/s (that's crap, too)

I guess these CPUs need 4-channel RAM to perform at full speed :(

There is a tool which can show L3 cache usage https://www.cpuid.com/softwares/perfmonitor-2.html, it is old as heck, but still works on some (!) configurations. It could work on i7-6700 and help with optimizations. I will also try to use https://github.com/opcm/pcm which supports Intel's cache monitoring technology.
legendary
Activity: 1470
Merit: 1114
Looks like there is some performance problem with the yescryptr16 implementation on Coffee Lake CPUs
Intel i5 4440 (stock 3.3GHz) @4 threads generates ~600h/s
Intel i7 5820k (no overclock, 3.6GHz) @12 threads generates ~1200h/s in pool and up to 1400 in solo mining (6 threads generate a little less) with both cpuminer-opt versions (3.7.6 and 3.7.7v2) under windows64 and under ubuntu64
Intel i7 8700k (stock 4.3GHz) @12 threads generates only 950h/s (both pool and solo). Overclocking to 5GHz (50x100) with a cache overclock to 4.6GHz (stock is 4.2) gives no gain, and performance usually even degrades (power limit disabled, core temperatures are ~75C, so no throttling is involved). 1-2-3-4-5-6 threads give lower results, and overclocking the bus to 130MHz (along with the RAM) changes nothing: the maximum is about 950h/s
Even funnier: at stock frequency, switching from AVX to SSE2 gives a performance boost from 950 to 1000-1050h/s

I understand that the 8700k lacks quad-channel RAM and has a little less L3 cache (12 vs 15MB) compared to the 5820k, but the bottleneck is obviously something else. The RAM overclock gives no result, so dual channel is not the problem (we would see a performance boost when overclocking the bus and RAM). Cache is also not the problem: 20% of the cache is gone, but we gain a >30% frequency bonus when overclocked, so the smaller cache runs at a higher speed together with the CPU cores, and we can fit less but process it more often and in less time. Also, compared to the 4440, if cache were the bottleneck, we have twice as much (12 vs 6MB); bearing in mind the much higher clock and the optimized pipeline, if only cache mattered we should see a 2x gain over the 4440.

I hope for a fix :)

That's quite a first post, you did your homework.

Your results are concerning but not a software issue. If Coffeelake has a design quirk that can be worked around in software
such a workaround would probably have a negative effect on other models. If it's a coffee lake issue it needs a Coffeelake fix.

On the technical side it's difficult to compare the 8700K with either of the two other CPUs you tested. I have a 6700K @ 4 GHz
and it gets 780 H/s. With a projected linear increase the 8700K should get around 1170. Clearly the 8700k has a problem.
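
The projection above can be reproduced with a few lines of Python. This is only a sketch of one reading of it: it assumes the 1170 figure comes from scaling the 6700K result by core count alone (6 cores vs 4), with no adjustment for clock speed. The function name and its parameters are hypothetical, chosen for illustration.

```python
def projected_hashrate(ref_rate, ref_cores, target_cores):
    """Linear projection: hash rate scales with core count only.

    Assumption: clock speed and memory effects are ignored.
    """
    return ref_rate * target_cores / ref_cores


# 6700K: 4 cores, 780 H/s. 8700K: 6 cores.
projected = projected_hashrate(780, ref_cores=4, target_cores=6)
print(projected)

# Share of the projection that the reported 950 H/s actually reaches.
print(round(950 / projected, 2))
```

Under that assumption the reported 950 H/s is roughly four fifths of the projected figure, which is the gap being discussed.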

I haven't done a deep dive into the architecture to see if there is a design change that could have an effect. As a new CPU
it could still have a few issues that need to be ironed out.

I noticed in my brief test that reducing the thread count by half on my 6700K had no effect on total hash rate. This tells me the bottleneck
is memory access (cache or main). You stated lower performance with fewer threads. That may be a clue. If your CPU is not
I/O bound when mining an I/O bound algo there may be a problem on the compute side.

Your observation that SSE2 build is faster than AVX is very interesting and deserves more testing. There is no AVX specific code in
yescrypt so there should be no difference in hash speed. Yescrypt is also very self contained, ie it doesn't use any libraries. The only effect
the AVX flag would have is on the compiler. It may compile code differently but there are no big gains between SSE2 and AVX, they are
both mostly limited to 128 bit vectors. It's only with AVX2 that there is a quantum leap to 256 bit vectors. Again I speculate but maybe the
compiler isn't yet tweaked for Coffeelake. What version did you compile with?

I suggest you try other algos with 6 and 12 threads to get a more complete profile. If some algos are affected more than others
it may reveal a pattern.

It would also be interesting to see if other Coffeelake owners see the same issues.
newbie
Activity: 7
Merit: 0
Basic setup is OK, synthetic benchmarks show nice performance
In Coffee Lake, as far as I can see from the specs, the main difference is the added eDRAM L4 cache, which is used as GPU VRAM (I currently use this embedded GPU, but it looks like there will be no benefit from adding an external video board because the CPU seems to be unable to use this L4 cache for anything other than the GPU, or am I wrong?). Other changes (compared to the 5820k):
L1: both have 32KB per core, both are 4-way accessed
L2: both have 256KB per core, but the 8700k has only 4-way access while the 5820 can use 8-way - not sure if it is important for scrypt
L3: less cache (12 vs 15MB) and fewer access ways (16 vs 20), but the cache frequency is higher (at least 4.2GHz without overclock, while the 5820 operates at about 3GHz)
newbie
Activity: 64
Merit: 0
Lol, I have an i7-4790k and thought I was better off using avx; now I tried sse and it's faster... I would have loved to know that before.
member
Activity: 473
Merit: 18
Looks like there is some performance problem with the yescryptr16 implementation on Coffee Lake CPUs
Intel i5 4440 (stock 3.3GHz) @4 threads generates ~600h/s
Intel i7 5820k (no overclock, 3.6GHz) @12 threads generates ~1200h/s in pool and up to 1400 in solo mining (6 threads generate a little less) with both cpuminer-opt versions (3.7.6 and 3.7.7v2) under windows64 and under ubuntu64
Intel i7 8700k (stock 4.3GHz) @12 threads generates only 950h/s (both pool and solo). Overclocking to 5GHz (50x100) with a cache overclock to 4.6GHz (stock is 4.2) gives no gain, and performance usually even degrades (power limit disabled, core temperatures are ~75C, so no throttling is involved). 1-2-3-4-5-6 threads give lower results, and overclocking the bus to 130MHz (along with the RAM) changes nothing: the maximum is about 950h/s
Even funnier: at stock frequency, switching from AVX to SSE2 gives a performance boost from 950 to 1000-1050h/s

I understand that the 8700k lacks quad-channel RAM and has a little less L3 cache (12 vs 15MB) compared to the 5820k, but the bottleneck is obviously something else. The RAM overclock gives no result, so dual channel is not the problem (we would see a performance boost when overclocking the bus and RAM). Cache is also not the problem: 20% of the cache is gone, but we gain a >30% frequency bonus when overclocked, so the smaller cache runs at a higher speed together with the CPU cores, and we can fit less but process it more often and in less time. Also, compared to the 4440, if cache were the bottleneck, we have twice as much (12 vs 6MB); bearing in mind the much higher clock and the optimized pipeline, if only cache mattered we should see a 2x gain over the 4440.

I hope for a fix :)

I think it's definitely not normal, you should be getting more. And SSE2 really is faster than AVX.
On an i5 7600k @4.5GHz I get ~920 H/s on AVX and ~950 on SSE2 (!)
newbie
Activity: 7
Merit: 0
Looks like there is some performance problem with the yescryptr16 implementation on Coffee Lake CPUs
Intel i5 4440 (stock 3.3GHz) @4 threads generates ~600h/s
Intel i7 5820k (no overclock, 3.6GHz) @12 threads generates ~1200h/s in pool and up to 1400 in solo mining (6 threads generate a little less) with both cpuminer-opt versions (3.7.6 and 3.7.7v2) under windows64 and under ubuntu64
Intel i7 8700k (stock 4.3GHz) @12 threads generates only 950h/s (both pool and solo). Overclocking to 5GHz (50x100) with a cache overclock to 4.6GHz (stock is 4.2) gives no gain, and performance usually even degrades (power limit disabled, core temperatures are ~75C, so no throttling is involved). 1-2-3-4-5-6 threads give lower results, and overclocking the bus to 130MHz (along with the RAM) changes nothing: the maximum is about 950h/s
Even funnier: at stock frequency, switching from AVX to SSE2 gives a performance boost from 950 to 1000-1050h/s

I understand that the 8700k lacks quad-channel RAM and has a little less L3 cache (12 vs 15MB) compared to the 5820k, but the bottleneck is obviously something else. The RAM overclock gives no result, so dual channel is not the problem (we would see a performance boost when overclocking the bus and RAM). Cache is also not the problem: 20% of the cache is gone, but we gain a >30% frequency bonus when overclocked, so the smaller cache runs at a higher speed together with the CPU cores, and we can fit less but process it more often and in less time. Also, compared to the 4440, if cache were the bottleneck, we have twice as much (12 vs 6MB); bearing in mind the much higher clock and the optimized pipeline, if only cache mattered we should see a 2x gain over the 4440.

I hope for a fix :)
full member
Activity: 420
Merit: 108
It does not depend on the algo; even when I try to get help, nothing is displayed. CPU: Celeron G3930

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-avx2.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-avx.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-4way.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-4way.exe

C:\App\cpuminer-opt-3.7.7-windows-v2>

First of all your CPU doesn't have AVX or AVX2 so stay away from those and 4way.
You should be using aes-sse42. But there's a bigger problem if it won't display help.
I've never seen that kind of a problem, it looks like it's your system. No one else is
complaining so the problem is at your end.

That's why I asked whether I'm missing some runtime libs ). I have never seen such behavior before: it starts, waits silently for several seconds, then exits. Quite unusual.
Thanks anyway
legendary
Activity: 1470
Merit: 1114
It does not depend on the algo; even when I try to get help, nothing is displayed. CPU: Celeron G3930

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-avx2.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-avx.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-4way.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-4way.exe

C:\App\cpuminer-opt-3.7.7-windows-v2>

First of all your CPU doesn't have AVX or AVX2 so stay away from those and 4way.
You should be using aes-sse42. But there's a bigger problem if it won't display help.
I've never seen that kind of a problem, it looks like it's your system. No one else is
complaining so the problem is at your end.
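
The advice above boils down to matching the build to the CPU's feature flags. A minimal sketch of that idea in Python follows. Everything in it is an assumption for illustration: the function name pick_build is hypothetical, the flag names follow the spelling used in /proc/cpuinfo on Linux, the build labels are just the ones mentioned in this thread, and the preference order is a guess rather than anything cpuminer-opt documents. The 4way build is left out because the thread only says to avoid it on CPUs without AVX or AVX2.

```python
def pick_build(flags):
    """Pick a build label from a set of CPU feature flags.

    Assumption: more capable instruction sets are preferred first.
    Returns None when not even SSE2 is present.
    """
    if "avx2" in flags:
        return "avx2"
    if "avx" in flags:
        return "avx"
    if "aes" in flags and "sse4_2" in flags:
        return "aes-sse42"
    if "sse2" in flags:
        return "sse2"
    return None


# Hypothetical flag set for a CPU with AES and SSE4.2 but no AVX.
print(pick_build({"sse2", "sse4_2", "aes"}))
```

On Linux the flag set could be read from the "flags" line of /proc/cpuinfo; on Windows the miner's own startup banner is the simpler source, as noted elsewhere in the thread.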
full member
Activity: 420
Merit: 108
It does not depend on the algo; even when I try to get help, nothing is displayed. CPU: Celeron G3930

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-avx2.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-avx.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-4way.exe --help

C:\App\cpuminer-opt-3.7.7-windows-v2>cpuminer-4way.exe

C:\App\cpuminer-opt-3.7.7-windows-v2>
legendary
Activity: 1470
Merit: 1114
I have a problem running cpuminer-opt v3.7.x on a couple of machines with win 10 x64. When started, it silently waits several seconds, then exits without writing a single symbol. Any advice? Am I missing some runtime libs?

Advice? Yes, provide proper information. What CPU, algo, command line options?
It's all displayed when the program starts, always provide that when reporting a problem.