
Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 134. (Read 444040 times)

legendary
Activity: 1470
Merit: 1114
How can I mine on my Core i5-2520M?
It hangs after
Code:
Switching to getwork, gbt version 6
Windows 7

From the requirements in the OP...

3. Stratum pool, cpuminer-opt only supports stratum mining.
legendary
Activity: 1470
Merit: 1114
cpuminer-opt v3.4.7 released

Source code: https://drive.google.com/file/d/0B0lVSGQYLJIZano2UmdqN2tfWXM/view?usp=sharing

Windows binaries: https://drive.google.com/file/d/0B0lVSGQYLJIZRGVIMkpZdjdWRm8/view?usp=sharing

Fixed benchmark except x11evo.

Added CPU temperature to share submission report on Linux.

Edit: Now on git: https://github.com/JayDDee/cpuminer-opt

legendary
Activity: 1456
Merit: 1014


Take a look at README.md for compile instructions, you're missing some configure options.

Ok, as always RTFM helped :) Thanks for pointing me at the README.
Compiling was no problem with the right command.
When I start the miner now I get a very low hashrate.

root@s92471:/home/customer/cpuminer-opt-3.4.6# ./cpuminer -a scrypt:1048576 -o stratum+tcp://vrm.cpuminers.com:3333 -u VHpXzBw5gafsCFWm8idSr1xxxxxxxxx -p x -t 32

         **********  cpuminer-multi 1.2-dev  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0, Jeff Garzik and Optiminer.

CPU:        Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
CPU features: SSE2 AES AVX
SW built on Sep 21 2016 with GCC 5.2.1
SW features: SSE2
Algo features: SSE2
Start mining with SSE2


[2016-09-22 11:09:50] CPU #26: 132 H, 2.13 H/s
[2016-09-22 11:09:50] CPU #29: 132 H, 2.13 H/s
[2016-09-22 11:09:52] CPU #3: 180 H, 2.80 H/s
[2016-09-22 11:10:04] CPU #28: 132 H, 2.12 H/s
[2016-09-22 11:10:26] Accepted 7/7 (100%), 4824 H, 77.64 H/s
[2016-09-22 11:10:45] CPU #22: 144 H, 2.60 H/s
[2016-09-22 11:10:46] CPU #0: 168 H, 2.76 H/s
[2016-09-22 11:10:46] CPU #6: 168 H, 2.75 H/s
[2016-09-22 11:10:47] CPU #9: 132 H, 2.16 H/s
[2016-09-22 11:10:47] CPU #10: 132 H, 2.18 H/s
[2016-09-22 11:10:47] CPU #15: 132 H, 2.17 H/s
[2016-09-22 11:10:48] CPU #5: 168 H, 2.75 H/s
[2016-09-22 11:10:48] CPU #24: 132 H, 2.13 H/s
[2016-09-22 11:10:48] CPU #13: 132 H, 2.18 H/s
[2016-09-22 11:10:48] CPU #27: 132 H, 2.13 H/s
[2016-09-22 11:10:49] CPU #25: 132 H, 2.16 H/s
[2016-09-22 11:10:49] CPU #31: 132 H, 2.16 H/s
[2016-09-22 11:10:49] CPU #8: 132 H, 2.16 H/s
[2016-09-22 11:10:49] CPU #7: 168 H, 2.69 H/s
[2016-09-22 11:10:49] CPU #11: 132 H, 2.20 H/s
[2016-09-22 11:10:49] CPU #26: 132 H, 2.24 H/s
[2016-09-22 11:10:50] CPU #2: 180 H, 2.83 H/s


I was getting around 2000 H/s with the wallet miner, did I miss something?
hero member
Activity: 1246
Merit: 708
How can I mine on my Core i5-2520M?
It hangs after
Code:
Switching to getwork, gbt version 6
Windows 7
legendary
Activity: 1470
Merit: 1114


I just did ./autogen.sh, ./configure and then make

CPU is this:


processor   : 31
vendor_id   : GenuineIntel
cpu family   : 6
model      : 45
model name   : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping   : 6
microcode   : 0x615
cpu MHz      : 1199.960
cache size   : 20480 KB
physical id   : 1
siblings   : 16
core id      : 7
cpu cores   : 8
apicid      : 47
initial apicid   : 47
fpu      : yes
fpu_exception   : yes
cpuid level   : 13
wp      : yes
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
bugs      :
bogomips   : 5201.40
clflush size   : 64
cache_alignment   : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


Take a look at README.md for compile instructions, you're missing some configure options.
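
For reference, a build sketch along the lines of the project README (treat the exact options as an assumption; the important part is `-march=native`, which enables the AES/AVX intrinsics that a plain `./configure` leaves off and which cause the `_mm_shuffle_epi8` "target specific option mismatch" error):

```shell
./autogen.sh
CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
make
```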
legendary
Activity: 1456
Merit: 1014
hi,

tried to compile it and it fails, does anyone know what the problem is?

In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:52:5: error: called from here
  t1 = _mm_shuffle_epi8(*((__m128i*)table + 1), t1);\
     ^
algo/echo/aes_ni/hash.c:385:4: note: in expansion of macro ‘TRANSFORM’
    TRANSFORM(_state[j], _k_opt, t1, t2);
    ^
Makefile:2445: recipe for target 'algo/echo/aes_ni/cpuminer-hash.o' failed
make[2]: *** [algo/echo/aes_ni/cpuminer-hash.o] Error 1
make[2]: Leaving directory '/home/customer/cpuminer-opt-3.4.6'
Makefile:3462: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/customer/cpuminer-opt-3.4.6'
Makefile:661: recipe for target 'all' failed
make: *** [all] Error 2


It looks like your CPU doesn't support AES. What model and how did you compile?

Hi,

I just did ./autogen.sh, ./configure and then make

CPU is this:


processor   : 31
vendor_id   : GenuineIntel
cpu family   : 6
model      : 45
model name   : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping   : 6
microcode   : 0x615
cpu MHz      : 1199.960
cache size   : 20480 KB
physical id   : 1
siblings   : 16
core id      : 7
cpu cores   : 8
apicid      : 47
initial apicid   : 47
fpu      : yes
fpu_exception   : yes
cpuid level   : 13
wp      : yes
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
bugs      :
bogomips   : 5201.40
clflush size   : 64
cache_alignment   : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
full member
Activity: 151
Merit: 100
Moar mining!!! .. oh wait, that's too much
Seeing some really strange results for 3.4.6 vs 3.3.8 using cryptonight (XMR) on my i7-3770 running Windows 10 x86-64:

https://bitcointalksearch.org/topic/m.16168098

Ah sorry, I thought I looked but I missed this.
legendary
Activity: 1470
Merit: 1114
Seeing some really strange results for 3.4.6 vs 3.3.8 using cryptonight (XMR) on my i7-3770 running Windows 10 x86-64:

https://bitcointalksearch.org/topic/m.16168098
full member
Activity: 151
Merit: 100
Moar mining!!! .. oh wait, that's too much
Seeing some really strange results for 3.4.6 vs 3.3.8 using cryptonight (XMR) on my i7-3770 running Windows 10 x86-64:

Code:
cpuminer-sandybridge-ivybridge.exe -a cryptonight --benchmark -q

         **********  cpuminer-opt 3.3.8  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

CPU:         Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
CPU features: SSE2 AES AVX
SW built on Jul 24 2016 with GCC 5.3.0
SW features: SSE2 AES AVX
Algo features: SSE2 AES
Start mining with AES-AVX optimizations...

[2016-09-21 23:31:00] 8 miner threads started, using 'cryptonight' algorithm.
[2016-09-21 23:31:03] Total: 132 H, 52.12 H/s
[2016-09-21 23:31:04] Total: 107 H, 76.56 H/s
[2016-09-21 23:31:05] Total: 376 H, 132.45 H/s
[2016-09-21 23:31:21] Total: 616 H, 138.54 H/s
[2016-09-21 23:31:26] Total: 627 H, 140.75 H/s
[2016-09-21 23:31:30] Total: 587 H, 174.55 H/s
[2016-09-21 23:31:35] Total: 675 H, 157.14 H/s
[2016-09-21 23:31:40] Total: 672 H, 149.26 H/s
[2016-09-21 23:31:45] Total: 440 H, 176.82 H/s
[2016-09-21 23:31:51] Total: 464 H, 155.15 H/s
[2016-09-21 23:31:53] CTRL_C_EVENT received, exiting

Code:
cpuminer-avx-i.exe -a cryptonight --benchmark -q

         **********  cpuminer-opt 3.4.6  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0, Jeff Garzik and Optiminer.

CPU:         Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
CPU features: SSE2 AES AVX
SW built on Sep  6 2016 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 AES
Start mining with SSE2 AES

[2016-09-21 23:31:57] 8 miner threads started, using 'cryptonight' algorithm.
[2016-09-21 23:31:57] Total: 4 H, 26.75 H/s
[2016-09-21 23:31:57] Total: 13 H, 102.67 H/s
[2016-09-21 23:31:57] Total: 8 H, 105.64 H/s
[2016-09-21 23:31:57] Total: 8 H, 112.06 H/s
[2016-09-21 23:31:58] Total: 8 H, 140.94 H/s
[2016-09-21 23:31:58] Total: 8 H, 138.38 H/s
[2016-09-21 23:31:58] Total: 8 H, 134.96 H/s
[2016-09-21 23:31:58] Total: 8 H, 133.02 H/s
[2016-09-21 23:31:58] Total: 8 H, 136.76 H/s
[2016-09-21 23:31:58] Total: 8 H, 136.55 H/s
[2016-09-21 23:31:58] Total: 8 H, 107.57 H/s
[2016-09-21 23:31:58] Total: 8 H, 110.47 H/s
[2016-09-21 23:31:58] Total: 8 H, 106.01 H/s
[2016-09-21 23:31:58] Total: 8 H, 106.42 H/s
[2016-09-21 23:31:58] Total: 8 H, 136.45 H/s
[2016-09-21 23:31:58] Total: 8 H, 134.75 H/s
[2016-09-21 23:31:59] Total: 8 H, 138.43 H/s
[2016-09-21 23:31:59] Total: 8 H, 133.90 H/s
[2016-09-21 23:31:59] Total: 8 H, 135.65 H/s
[2016-09-21 23:31:59] Total: 8 H, 137.36 H/s
...
[2016-09-21 23:32:05] CTRL_C_EVENT received, exiting

Without the -q option it's even worse; I suspect it loses a lot of CPU time dumping so much text to the console. I'm not sure why it's showing statistics after only 8 hashes, either.

Any ideas anyone? From this quick and dirty benchmark it looks like 3.3.8 is faster for XMR mining?
legendary
Activity: 1708
Merit: 1049
I don't anticipate a big improvement with AVX512. Reducing the instruction count doesn't reduce the data accesses,
so there is more chance of being I/O bound. Coding also becomes more complex trying to transform a buffer 512
bits at a time. Intel also needs to improve data conversion. I don't know if the register model is changing for AVX512,
I suspect not, but it needs to. Having separate registers for 64, 128, 256, and maybe 512 bit data creates conversion
overhead, as data needs to be moved to the appropriate register type before processing it. The data in registers needs
to be an overlay (union in C lingo) so there is no overhead when switching from scalar to vector processing on the same data.

The moves are usually relatively inexpensive in terms of time compared to the processing part, so I wouldn't worry that much. I was doing some googling to check SHA256 and SHA512 for AVX512, and I found this on the OpenSSL site: https://rt.openssl.org/Ticket/Display.html?id=4307

...where an Intel software engineer is submitting optimized algorithms... he says over 2x for AVX512.

The benchmarks are also quite interesting (the numbers only go up to AVX2*)

Architecture: SHA256 Serial | SHA256 MultiBlock
HSW: 7.82 cycles/round | 2.99 cycles/round
BDW: 7.82 cycles/round | 2.85 cycles/round
SKL: 7.73 cycles/round | 2.59 cycles/round

Architecture: SHA512 Serial | SHA512 MultiBlock
HSW: 5.43 | 3.79
BDW: 5.35 | 3.64
SKL: 5.23 | 3.42


* I had seen a PDF or PowerPoint, I can't find it right now, but it should be from the same Intel guy, where with AVX512 it went down to ~1.something cycles/round for multi-block SHA256. I think it was a projection/simulation run due to lack of chips.
legendary
Activity: 1470
Merit: 1114
hi,

tried to compile it and it fails, does anyone know what the problem is?

In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:52:5: error: called from here
  t1 = _mm_shuffle_epi8(*((__m128i*)table + 1), t1);\
     ^
algo/echo/aes_ni/hash.c:385:4: note: in expansion of macro ‘TRANSFORM’
    TRANSFORM(_state[j], _k_opt, t1, t2);
    ^
Makefile:2445: recipe for target 'algo/echo/aes_ni/cpuminer-hash.o' failed
make[2]: *** [algo/echo/aes_ni/cpuminer-hash.o] Error 1
make[2]: Leaving directory '/home/customer/cpuminer-opt-3.4.6'
Makefile:3462: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/customer/cpuminer-opt-3.4.6'
Makefile:661: recipe for target 'all' failed
make: *** [all] Error 2


It looks like your CPU doesn't support AES. What model and how did you compile?
legendary
Activity: 1456
Merit: 1014
hi,

tried to compile it and it fails, does anyone know what the problem is?

In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:52:5: error: called from here
  t1 = _mm_shuffle_epi8(*((__m128i*)table + 1), t1);\
     ^
algo/echo/aes_ni/hash.c:385:4: note: in expansion of macro ‘TRANSFORM’
    TRANSFORM(_state[j], _k_opt, t1, t2);
    ^
Makefile:2445: recipe for target 'algo/echo/aes_ni/cpuminer-hash.o' failed
make[2]: *** [algo/echo/aes_ni/cpuminer-hash.o] Error 1
make[2]: Leaving directory '/home/customer/cpuminer-opt-3.4.6'
Makefile:3462: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/customer/cpuminer-opt-3.4.6'
Makefile:661: recipe for target 'all' failed
make: *** [all] Error 2
legendary
Activity: 1470
Merit: 1114
I don't anticipate a big improvement with AVX512. Reducing the instruction count doesn't reduce the data accesses,
so there is more chance of being I/O bound. Coding also becomes more complex trying to transform a buffer 512
bits at a time. Intel also needs to improve data conversion. I don't know if the register model is changing for AVX512,
I suspect not, but it needs to. Having separate registers for 64, 128, 256, and maybe 512 bit data creates conversion
overhead, as data needs to be moved to the appropriate register type before processing it. The data in registers needs
to be an overlay (union in C lingo) so there is no overhead when switching from scalar to vector processing on the same data.

And fully parallel processors are a different beast, where you essentially code for one data stream and it gets replicated.
legendary
Activity: 1708
Merit: 1049
It must be a lot of work to find what code works best with which compiler.

It depends on how many speed-critical parts a program has. If you have 15 hashes, of which you can change 10 without the program hitting compilation problems, and you are experimenting with 4 compilers, it may go up to 60 times or even 200+ times once you start tuning cflags for each.

If you focus on tuning 2-3 pieces of code (files), it's way faster. At most 12 compilations. But then you are tempted to start playing around with different sets of CFLAGS for each, and you end up spending your time anyway Grin

Quote
Was it trial and error at the file
level or could you focus on individual functions?

Trial and error at the file level switching compilers (and flags), per file.

Quote
If the compiler selection can make that much of a difference, ASM could probably do even better.

Both intrinsics and asm implementations tend to exceed plain C by quite a margin. The compiler can only make a difference in less optimized pieces of code...

Quote
It would be an interesting exercise to analyze the different asm code produced by the two compilers, assuming most of the gains were local to the functions, and try to further optimize by hand. Unfortunately an algo like X11 is dead from a CPU mining point of view.

Indeed.

Quote
Also my Intel asm skills are non-existent. The last
time I wrote Intel asm was for an 8-bit 8085. I do understand asm very well but would have to spend a lot of time with my nose in the Intel manual. If there are viable CPU algos that can be targeted it might be worth spending that time.

Yeah, I'm similar in that regard, although with a shorter time gap (last time I was doing it seriously it was on 16 bits). In general, all algos have multiple implementations for multiple instruction sets, with C, intrinsics, asm, etc. So I wouldn't try to reinvent the wheel in this regard, as there are extremely competent people actively optimizing these hashes.

I anticipate that the big gains (for CPU mining) will come when hashing starts going in batches of 2 or 4 at the same time, instead of 1 per thread. Trying 2 or 4 candidates each round will allow proper vectorization, by packing the data from each step of the process into wide registers, and then processing them with packed SIMD instructions (instead of scalar, as is done most of the time right now).

Intel did a sha256 in that way, by loading multiple hashes together on the AVX registers so that all the steps can be executed with less commands: https://lwn.net/Articles/692605/

"On multi-buffer SHA256 computation with AVX2, we see throughput increase up to 2.2x over the existing x86_64 single buffer AVX2 algorithm."

...if we were to ever implement something like that, it would be quite a change, because most, if not all, implementations are for single-data rounds and the software itself is based on a logic of processing 1 hash at a time. As Xeons with AVX512 start coming online, these will need it even more, because those would be able to load twice the data on their registers and process them with just one instruction. This is a job that would probably pay very good money if one was to make such a miner, because it would take the throughput 2x+ upwards. I guess for many it would be just easier to port the code to OpenCL and create GPU miners instead. And now, thinking about this, it may even be possible to use a multiple-buffer implementation on each thread of the GPU to increase the throughput there too, provided the SIMD registers are wide enough to take multiple data (I guess it will depend on the GPU architecture).
legendary
Activity: 1470
Merit: 1114
10% just from the compiler?

Multiple compilers, each one taking a different hash. Actually I think it peaked around 13-14% for x11 (with the darkcoin 1.2 miner*), by combining gcc/clang/icc and having three sets of cflags, although it took me days to fine-tune it.

* Its performance is similar (-5%) for Core 2 & Core 2 Quad architectures to the current cpuminer-opt 3.4.6.

Quote
I'm not sure what you mean by "more of its algorithms were C-based". I haven't added any assembly code.

I was talking about a couple of years ago when there was less use of intrinsics and fewer advanced implementations embedded (darkcoin miner 1.2 and 1.3 days). I'm not currently mining x11 due to ASICs, but the principle is the same as it always has been: as long as there are some C implementations without asm/intrinsics (asm and intrinsics leave much less room for the compiler to make a difference), one can try alternative compilers per hash and pick the winners, to increase the throughput.


Oh yeah, your magic recipe, it's pretty fascinating. I'm surprised it makes that much of a difference.
It must be a lot of work to find what code works best with which compiler. Was it trial and error at the file
level or could you focus on individual functions?

The X11 code is mostly unchanged from darkcoin miner 1.3, the AES optimized version. The only significant change I made
was promoting the cubehash SSE2 code to AVX2, essentially turning a pile of intrinsic calls into a smaller pile of different
intrinsics.

If the compiler selection can make that much of a difference, ASM could probably do even better. It would be an interesting
exercise to analyze the different asm code produced by the two compilers, assuming most of the gains were local to the
functions, and try to further optimize by hand.

Unfortunately an algo like X11 is dead from a CPU mining point of view. Also my Intel asm skills are non-existent. The last
time I wrote Intel asm was for an 8-bit 8085. I do understand asm very well but would have to spend a lot of time with my nose in
the Intel manual. If there are viable CPU algos that can be targeted it might be worth spending that time.

legendary
Activity: 1708
Merit: 1049
10% just from the compiler?

Multiple compilers, each one taking a different hash. Actually I think it peaked around 13-14% for x11 (with the darkcoin 1.2 miner*), by combining gcc/clang/icc and having three sets of cflags, although it took me days to fine-tune it.

* Its performance is similar (-5%) for Core 2 & Core 2 Quad architectures to the current cpuminer-opt 3.4.6.

Quote
I'm not sure what you mean by "more of its algorithms were C-based". I haven't added any assembly code.

I was talking about a couple of years ago when there was less use of intrinsics and fewer advanced implementations embedded (darkcoin miner 1.2 and 1.3 days). I'm not currently mining x11 due to ASICs, but the principle is the same as it always has been: as long as there are some C implementations without asm/intrinsics (asm and intrinsics leave much less room for the compiler to make a difference), one can try alternative compilers per hash and pick the winners, to increase the throughput.
legendary
Activity: 1470
Merit: 1114
The next release will also add a CPU temperature display. I've considered a few options based on usefulness vs work involved. I'm thinking of
simply adding the temp to the share submission report; it's the simplest implementation although not architecturally sound. It only gets
displayed when a share is submitted, so if no shares are submitted for a long time it isn't very useful.

A periodic temperature report, independent of share submissions, would be better but it's more work and would add verbosity to
the output.


Are you talking about the temperature also available via the API? If so, on many systems this temp is either nonexistent (mostly Windows) or some mobo temp; only on laptops have I received the actual CPU temp this way.

Yes, same as the API; I forgot to mention it's Linux only.
hero member
Activity: 700
Merit: 500
If anyone wants to squeeze a bit more "easy" speed out of their binaries, I wrote a technique that I was using a couple of years ago by combining multiple compilers: https://steemit.com/development/@alexgr/creating-faster-c-c-binaries-without-changing-a-single-line-of-code

That's mainly for multi-algo setups where there are C implementations - and there might be speed differences in how these are executed from different compilers. Obviously, when we have machines running 24/7, even 2-5-10% makes a difference.
Guess that is for Linux?
Did you test it on your compile?
What are the speed diffs between the default compile and your method?

Yes, Linux.
Yes, I have tested it with X11, back when more of its algorithms were C-based. I haven't tried recently due to x11 ASICs and more files being assembly.
I got >10% gains on my PC.


10% just from the compiler?

I'm not sure what you mean by "more of its algorithms were C-based". I haven't added any assembly code.

I've been dithering about the benchmark issue while there were no other pressing issues to prompt a release. It has to do with the way the nonce
is initialized on startup. The change in v3.4.6 (init to 0) seems cleaner than the original (init pseudo-random) as it eliminates the
occasional 1-hash scans on startup, but it breaks benchmark. I'm likely to init the nonce differently depending on whether benchmark mode is used.

The next release will also add a CPU temperature display. I've considered a few options based on usefulness vs work involved. I'm thinking of
simply adding the temp to the share submission report; it's the simplest implementation although not architecturally sound. It only gets
displayed when a share is submitted, so if no shares are submitted for a long time it isn't very useful.

A periodic temperature report, independent of share submissions, would be better but it's more work and would add verbosity to
the output.


Are you talking about the temperature also available via the API? If so, on many systems this temp is either nonexistent (mostly Windows) or some mobo temp; only on laptops have I received the actual CPU temp this way.
legendary
Activity: 1470
Merit: 1114
If anyone wants to squeeze a bit more "easy" speed out of their binaries, I wrote a technique that I was using a couple of years ago by combining multiple compilers: https://steemit.com/development/@alexgr/creating-faster-c-c-binaries-without-changing-a-single-line-of-code

That's mainly for multi-algo setups where there are C implementations - and there might be speed differences in how these are executed from different compilers. Obviously, when we have machines running 24/7, even 2-5-10% makes a difference.
Guess that is for Linux?
Did you test it on your compile?
What are the speed diffs between the default compile and your method?

Yes, Linux.
Yes, I have tested it with X11, back when more of its algorithms were C-based. I haven't tried recently due to x11 ASICs and more files being assembly.
I got >10% gains on my PC.


10% just from the compiler?

I'm not sure what you mean by "more of its algorithms were C-based". I haven't added any assembly code.

I've been dithering about the benchmark issue while there were no other pressing issues to prompt a release. It has to do with the way the nonce
is initialized on startup. The change in v3.4.6 (init to 0) seems cleaner than the original (init pseudo-random) as it eliminates the
occasional 1-hash scans on startup, but it breaks benchmark. I'm likely to init the nonce differently depending on whether benchmark mode is used.

The next release will also add a CPU temperature display. I've considered a few options based on usefulness vs work involved. I'm thinking of
simply adding the temp to the share submission report; it's the simplest implementation although not architecturally sound. It only gets
displayed when a share is submitted, so if no shares are submitted for a long time it isn't very useful.

A periodic temperature report, independent of share submissions, would be better but it's more work and would add verbosity to
the output.
legendary
Activity: 1708
Merit: 1049
If anyone wants to squeeze a bit more "easy" speed out of their binaries, I wrote a technique that I was using a couple of years ago by combining multiple compilers: https://steemit.com/development/@alexgr/creating-faster-c-c-binaries-without-changing-a-single-line-of-code

That's mainly for multi-algo setups where there are C implementations - and there might be speed differences in how these are executed from different compilers. Obviously, when we have machines running 24/7, even 2-5-10% makes a difference.
Guess that is for Linux?
Did you test it on your compile?
What are the speed diffs between the default compile and your method?

Yes, Linux.
Yes, I have tested it with X11, back when more of its algorithms were C-based. I haven't tried recently due to x11 ASICs and more files being assembly.
I got >10% gains on my PC.
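
In essence the technique compiles different translation units with different compilers and links the resulting objects together. A hypothetical sketch (file names invented, not the real build):

```shell
# benchmark each hash file under each compiler, keep the fastest object
gcc   -O3 -march=native -c blake.c   -o blake.o
clang -O3 -march=native -c groestl.c -o groestl.o
# link the mixed objects with either compiler's driver
gcc blake.o groestl.o cpu-miner.o -o cpuminer -lcurl -lpthread
```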