Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 147. (Read 444067 times)

newbie
Activity: 62
Merit: 0
Actually I got it working all the way. Here is a link for download for a few architectures.
AMD compiled with 8350
btver1 compiled with t1100
skylake compiled with 6700hq
avx2 compiled with xeon e5 1620v4
https://github.com/KhryptorGraphics/cpuminer-opt-3.4.1/blob/Windows-Binary/cpuminer-opt-3.4.1.zip


Worked for me. Your package also worked with my native compile. It looks like you have the magic recipe, thanks for sharing.

I will provide binaries with the next release. This will be considered a test that my compilations work on other machines than
my own. I will also bundle the DLLs with the binaries. Moving forward I may omit binaries for certain lesser architectures
that don't provide any gains over the previous release. In short I want to serve all users but I don't want to make a full time
job of it.

I'm taking requests for the next release. Post your preferred architecture if you know it or your CPU model and I'll build them all
for the first release.

Thanks!!!

I would like to see Westmere sse4 - 4.2 added in addition to sse2, with aes working together (i.e. sse4 - 4.2 + aes). It might give the 5600 series Xeons a slightly more efficient and faster hashrate for some algos. I would also like to have a discussion on optimization while compiling. I used -flto and -O3, and tested -Ofast, but couldn't see a hashrate difference.
legendary
Activity: 1470
Merit: 1114
Actually I got it working all the way. Here is a link for download for a few architectures.
AMD compiled with 8350
btver1 compiled with t1100
skylake compiled with 6700hq
avx2 compiled with xeon e5 1620v4
https://github.com/KhryptorGraphics/cpuminer-opt-3.4.1/blob/Windows-Binary/cpuminer-opt-3.4.1.zip


Worked for me. Your package also worked with my native compile. It looks like you have the magic recipe, thanks for sharing.

I will provide binaries with the next release. This will be considered a test that my compilations work on other machines than
my own. I will also bundle the DLLs with the binaries. Moving forward I may omit binaries for certain lesser architectures
that don't provide any gains over the previous release. In short I want to serve all users but I don't want to make a full time
job of it.

I'm taking requests for the next release. Post your preferred architecture if you know it or your CPU model and I'll build them all
for the first release.
legendary
Activity: 1470
Merit: 1114
@Joblo You might want to update the m7m algo from this (based on wolf's optimization):  https://github.com/magi-project/m-cpuminer-v2

For comparison (i5-2500k)
cpuminer-opt v3.4.2 does approx. 39 KH/s (corrected from 27 KH/s)
m-cpuminer-v2 does approx. 53 KH/s

Edit: Sorry Joblo, my mistake, was half asleep when I typed this. However, the wolf-based version is still faster.

I have confirmed your results, i7-6700K 121 kH/s up from 85. It will be in the next release of cpuminer-opt.

Great find.
newbie
Activity: 62
Merit: 0
Actually I got it working all the way. Here is a link for download for a few architectures.
AMD compiled with 8350
btver1 compiled with t1100
skylake compiled with 6700hq
avx2 compiled with xeon e5 1620v4
https://github.com/KhryptorGraphics/cpuminer-opt-3.4.1/blob/master/cpuminer-opt-3.4.1.zip

Hello, I was wondering if you have any updates on your work on the cpuminer 3.4.1 Windows binary compile.
I can compile it using mingw and msys but it gives me DLL errors, and when I find all the DLLs it asks for, it gives a 0x0000007 error code and says the program must close.
Thanks.

I get the same problem if I run it from a dos shell, use the msys shell instead. If it compiles under msys/mingw it should also
run under msys/mingw. If not let me know.

Please do not use PM for general questions.
legendary
Activity: 1470
Merit: 1114
@Joblo You might want to update the m7m algo from this (based on wolf's optimization):  https://github.com/magi-project/m-cpuminer-v2

For comparison (i5-2500k)
cpuminer-opt v3.4.2 does approx. 27 KH/s
m-cpuminer-v2 does approx. 53 KH/s

That's curious, I get 37 kH/s on my i5-2400 with 3.4.2. I'll follow up on the wolf version.
sr. member
Activity: 292
Merit: 250
@Joblo You might want to update the m7m algo from this (based on wolf's optimization):  https://github.com/magi-project/m-cpuminer-v2

For comparison (i5-2500k)
cpuminer-opt v3.4.2 does approx. 39 KH/s (corrected from 27 KH/s)
m-cpuminer-v2 does approx. 53 KH/s

Edit: Sorry Joblo, my mistake, was half asleep when I typed this. However, the wolf-based version is still faster.
sr. member
Activity: 462
Merit: 250
Does anyone know the hashrate of a 3rd or 4th generation mobile CPU when mining with the cryptolight algo?
legendary
Activity: 1470
Merit: 1114
cpuminer-opt v3.4.2...

https://drive.google.com/file/d/0B0lVSGQYLJIZdF9Oelc4RlNYUFU/view?usp=sharing

- tweaked lyra2 AVX2/AVX code for small improvement
- added veltor algo
Love the new version.
I see slight increase of hashrate on the algos that I usually use.

And today i found this:
Code:
https://github.com/floodyberry/blake2s-opt

It seems to me that there are AVX, XOP and AVX2 optimised versions of blake2s.
Hope that helps boost blake2s hashrates if it can be implemented.

Thanks for the link.
That implementation of blake2s has a very different code structure, I'm not sure it can be used as a drop in
replacement for the existing one.
sr. member
Activity: 312
Merit: 250
cpuminer-opt v3.4.2...

https://drive.google.com/file/d/0B0lVSGQYLJIZdF9Oelc4RlNYUFU/view?usp=sharing

- tweaked lyra2 AVX2/AVX code for small improvement
- added veltor algo
Love the new version.
I see slight increase of hashrate on the algos that I usually use.

And today i found this:
Code:
https://github.com/floodyberry/blake2s-opt

It seems to me that there are AVX, XOP and AVX2 optimised versions of blake2s.
Hope that helps boost blake2s hashrates if it can be implemented.
legendary
Activity: 1470
Merit: 1114
cpuminer-opt v3.4.2...

https://drive.google.com/file/d/0B0lVSGQYLJIZdF9Oelc4RlNYUFU/view?usp=sharing

- tweaked lyra2 AVX2/AVX code for small improvement
- added veltor algo
legendary
Activity: 1470
Merit: 1114
Hello, I was wondering if you have any updates on your work on the cpuminer 3.4.1 Windows binary compile.
I can compile it using mingw and msys but it gives me DLL errors, and when I find all the DLLs it asks for, it gives a 0x0000007 error code and says the program must close.
Thanks.

I get the same problem if I run it from a dos shell, use the msys shell instead. If it compiles under msys/mingw it should also
run under msys/mingw. If not let me know.

Please do not use PM for general questions.
legendary
Activity: 1470
Merit: 1114
Tested it on big brother )) core-i3 haswell, win 8.1 64
It works without msys, just copied needed dlls and all ok!
0.5mhs@31watt )) mobile broadwell is more efficient ))


I'll have to try again. Last time I tried it I just kept copying DLLs it was complaining about then got a different error
and gave up.
legendary
Activity: 1510
Merit: 1003
Tested it on big brother )) core-i3 haswell, win 8.1 64
It works without msys, just copied needed dlls and all ok!
0.5mhs@31watt )) mobile broadwell is more efficient ))

Code:
         **********  cpuminer-opt 3.4.1  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0, Jeff Garzik and Optiminer.

CPU: Intel(R) Core(TM) i3-4360 CPU @ 3.70GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  9 2016 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-10 10:53:45] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-10 10:53:45] Starting Stratum on stratum+tcp://lyra2re.eu.nicehash.com:3342
[2016-08-10 10:53:46] Stratum difficulty set to 0.8
[2016-08-10 10:53:46] lyra2re.eu.nicehash.com:3342 lyra2re block 83674
[2016-08-10 10:53:47] CPU #3: 65.54 kH, 122.41 kH/s
[2016-08-10 10:53:47] CPU #1: 65.54 kH, 120.39 kH/s
[2016-08-10 10:53:47] CPU #2: 65.54 kH, 120.17 kH/s
[2016-08-10 10:53:47] CPU #0: 65.54 kH, 114.90 kH/s
[2016-08-10 10:54:14] CPU #3: 3527.63 kH, 130.86 kH/s
[2016-08-10 10:54:14] CPU #1: 3475.30 kH, 128.92 kH/s
[2016-08-10 10:54:14] CPU #0: 3363.77 kH, 124.90 kH/s
[2016-08-10 10:54:14] CPU #2: 3487.83 kH, 129.39 kH/s
[2016-08-10 10:55:13] CPU #0: 7494.16 kH, 126.37 kH/s
[2016-08-10 10:55:14] CPU #1: 7735.44 kH, 128.64 kH/s
[2016-08-10 10:55:14] CPU #3: 7851.33 kH, 130.24 kH/s
[2016-08-10 10:55:14] CPU #2: 7763.33 kH, 128.50 kH/s
[2016-08-10 10:56:14] CPU #0: 7582.22 kH, 125.87 kH/s
[2016-08-10 10:56:14] CPU #1: 7718.49 kH, 128.52 kH/s
[2016-08-10 10:56:14] CPU #3: 7814.49 kH, 130.32 kH/s
[2016-08-10 10:56:14] CPU #2: 7710.00 kH, 128.79 kH/s
[2016-08-10 10:56:52] Stratum difficulty set to 0.4
[2016-08-10 10:56:59] CPU #3: 5692.51 kH, 128.40 kH/s
[2016-08-10 10:56:59] accepted: 1/1 (100%), 28.70 MH, 511.58 kH/s yes!
[2016-08-10 10:57:07] CPU #3: 1024.97 kH, 127.01 kH/s
[2016-08-10 10:57:07] accepted: 2/2 (100%), 24.04 MH, 510.19 kH/s yes!
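To compare the per-thread readings in a log like the one above with the total on the "accepted" lines, the per-thread kH/s figures can be summed. The following is only a minimal sketch: it assumes the line format shown in this log, and the helper name `latest_total_khs` is made up for illustration and is not part of cpuminer-opt.

```python
import re

# Matches per-thread report lines in the format seen in the log above, e.g.
# [2016-08-10 10:56:14] CPU #0: 7582.22 kH, 125.87 kH/s
THREAD_LINE = re.compile(r"CPU #(\d+): [\d.]+ kH, ([\d.]+) kH/s")


def latest_total_khs(log_text):
    """Sum the most recent kH/s reading of each miner thread (hypothetical helper)."""
    latest = {}
    for line in log_text.splitlines():
        match = THREAD_LINE.search(line)
        if match:
            # Later lines overwrite earlier ones, so only the newest reading is kept.
            latest[int(match.group(1))] = float(match.group(2))
    return sum(latest.values())


# Four lines copied from the 10:56:14 report in the log above.
sample = """\
[2016-08-10 10:56:14] CPU #0: 7582.22 kH, 125.87 kH/s
[2016-08-10 10:56:14] CPU #1: 7718.49 kH, 128.52 kH/s
[2016-08-10 10:56:14] CPU #3: 7814.49 kH, 130.32 kH/s
[2016-08-10 10:56:14] CPU #2: 7710.00 kH, 128.79 kH/s
"""

print(round(latest_total_khs(sample), 2))
```

For these four lines the sum comes to about 513.5 kH/s, which is in the same range as the roughly 511 kH/s shown on the "accepted" lines.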
legendary
Activity: 1510
Merit: 1003
It was built exactly as you wrote in readme.md. I did use winbuild.sh.

Very strange it works from msys (mingw) shell but not as standalone app. Maybe we need some extra dlls or environment variables set?
member
Activity: 81
Merit: 1002
It was only the wind.

Theoretically yes if there exists any earlier executed code that contained compiler produced AVX2 instructions from regular source.
That isn't likely since the capabilities check is done early in main.

Let's not forget that we ask gcc to compile and optimize (all those -O2 -O3 -Ofast) for the cpu it's being run on. So regardless whether you actually include any explicit AVX/AVX2 assembler in the code, even a simple printf("hi"); may produce AVX2 instruction(s) if the compiler feels like it. That's the whole point of the compiler compiling for the given cpu (-march=native) - it's allowed to use all the capabilities (and thus instruction sets) of the cpu.


That's what I was referring to when I wrote "compiler produced AVX2". AVX(2) provides SIMD instructions and it's unlikely something
like a printf would use it. A memcpy wouldn't use it because of the overhead of loading/storing the data to/from the ymm regs.
It's only useful for vector arith, and apparently the compiler isn't smart enough to convert conventionally coded array processing
loops to AVX2. I'm not even sure *if* the compiler can optimize in this fashion, the existence of so much hand coded AVX2 suggests
otherwise.

Since we're playing semantic games would you care to explain your concerns with my use of the term cross compiling?
IMO cross compiling can mean any compilation not done on the target machine and not executable on the build machine.

And my comment about you maybe using a core2 was based on the symptoms you described and that some server CPUs can
be optimized for efficiency by removing/disabling unneeded features like floating point, AES or AVX.

Not true. memcpy() and friends CAN and do use SSE/AVX - if the source/dests are aligned properly.

It doesn't really matter in this context whether it crashes before or after the warning message.

I'll take your word for it, but it doesn't seem to make much sense. It is essentially load, move, store, 256 bits wide. Where
are the savings? I presume it takes longer to load data into the ymm regs than general purpose ones. The same amount of data
has to be moved around in memory. Using AVX seems to make sense if you're going to do a lot of processing of the data while
in vector format.

There's my strawman, rip it apart.

Speaking of alignment I need to fix that up in my avx code. I used all loadu/storeu for convenience.

If it's aligned, then the load/stores don't take nearly as long. Also keep in mind that there's no such thing as a mov memaddr, memaddr opcode in x86 that I know of. Therefore, it's gotta go in a register (this is simplified, I know about things like DMA, but they don't come into play for the purposes of this discussion) and if it's aligned, it makes one hell of a lot more sense to stuff it in an AVX register, because it's a lot wider than a GPR. Even better if you're doing some kind of gather-scatter shit, possibly.
legendary
Activity: 1470
Merit: 1114
Update:
strange, it works on laptop 2 but only under msys command terminal ))

Code:
         **********  cpuminer-opt 3.4.1  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0, Jeff Garzik and Optiminer.

CPU: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  9 2016 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-09 23:15:56] Starting Stratum on stratum+tcp://lyra2re.eu.nicehash.com:3342
[2016-08-09 23:15:56] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-09 23:15:57] Stratum difficulty set to 0.8
[2016-08-09 23:15:57] lyra2re.eu.nicehash.com:3342 lyra2re block 83403
[2016-08-09 23:15:58] CPU #0: 65.54 kH, 86.15 kH/s
[2016-08-09 23:15:58] CPU #1: 65.54 kH, 85.36 kH/s
[2016-08-09 23:15:58] CPU #3: 65.54 kH, 80.93 kH/s
[2016-08-09 23:15:58] CPU #2: 65.54 kH, 80.93 kH/s

350khs @ 13watt!!! great speedup. It was 280khs before ))


Got it, I suspected that when I started thinking of you copying it from L1. Thanks for testing.
legendary
Activity: 1470
Merit: 1114
Just for fun, like to build from source ))

Tried your latest 3.4.1 version. I have 2 laptops:
one (1) with core i5 3317U (Ivy Bridge) running win7 64. It has AVX only.
second (2) with core i5 5250U (Broadwell-U) running win8.1 64. It has AVX2.

I only tried Lyra2re on Nicehash.

First of all I built on laptop (1). It compiled and ran successfully; I get something like 240khs in Lyra2re.


Then I tried to compile your miner on laptop 2.
It compiled successfully but again would not run.


This is the part that concerns me. You successfully compiled but failed to run. I would like to follow up
if you can provide more information at your convenience. I successfully compiled and run this version on a
Haswell i7 with Win8.1. But it must be run from the mingw shell. I tried running it from a dos shell, even with
the dlls and it failed. It definitely can't be ported to a different CPU, especially one with a lower feature set.

When you have the time could you post info about the compile, such as any options if you didn't use winbuild.sh,
and some info on the run failure, any output, segfault, did it exit, hang?

member
Activity: 81
Merit: 1002
It was only the wind.

Theoretically yes if there exists any earlier executed code that contained compiler produced AVX2 instructions from regular source.
That isn't likely since the capabilities check is done early in main.

Let's not forget that we ask gcc to compile and optimize (all those -O2 -O3 -Ofast) for the cpu it's being run on. So regardless whether you actually include any explicit AVX/AVX2 assembler in the code, even a simple printf("hi"); may produce AVX2 instruction(s) if the compiler feels like it. That's the whole point of the compiler compiling for the given cpu (-march=native) - it's allowed to use all the capabilities (and thus instruction sets) of the cpu.


That's what I was referring to when I wrote "compiler produced AVX2". AVX(2) provides SIMD instructions and it's unlikely something
like a printf would use it. A memcpy wouldn't use it because of the overhead of loading/storing the data to/from the ymm regs.
It's only useful for vector arith, and apparently the compiler isn't smart enough to convert conventionally coded array processing
loops to AVX2. I'm not even sure *if* the compiler can optimize in this fashion, the existence of so much hand coded AVX2 suggests
otherwise.

Since we're playing semantic games would you care to explain your concerns with my use of the term cross compiling?
IMO cross compiling can mean any compilation not done on the target machine and not executable on the build machine.

And my comment about you maybe using a core2 was based on the symptoms you described and that some server CPUs can
be optimized for efficiency by removing/disabling unneeded features like floating point, AES or AVX.

Not true. memcpy() and friends CAN and do use SSE/AVX - if the source/dests are aligned properly.
legendary
Activity: 1510
Merit: 1003
Update:
strange, it works on laptop 2 but only under msys command terminal ))

Code:
         **********  cpuminer-opt 3.4.1  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0, Jeff Garzik and Optiminer.

CPU: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  9 2016 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-09 23:15:56] Starting Stratum on stratum+tcp://lyra2re.eu.nicehash.com:3342
[2016-08-09 23:15:56] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-09 23:15:57] Stratum difficulty set to 0.8
[2016-08-09 23:15:57] lyra2re.eu.nicehash.com:3342 lyra2re block 83403
[2016-08-09 23:15:58] CPU #0: 65.54 kH, 86.15 kH/s
[2016-08-09 23:15:58] CPU #1: 65.54 kH, 85.36 kH/s
[2016-08-09 23:15:58] CPU #3: 65.54 kH, 80.93 kH/s
[2016-08-09 23:15:58] CPU #2: 65.54 kH, 80.93 kH/s

350khs @ 13watt!!! great speedup. It was 280khs before ))
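
The efficiency claim in this thread (mobile Broadwell being more efficient than the desktop Haswell i3) can be checked with simple arithmetic. The sketch below uses only the hashrate and power figures posted here, taken at face value; the function name `khs_per_watt` is made up for illustration.

```python
def khs_per_watt(hashrate_khs, watts):
    """Return mining efficiency in kH/s per watt (hypothetical helper)."""
    return hashrate_khs / watts


# Figures as posted in the thread for lyra2re; how the power was measured is not stated.
haswell_i3 = khs_per_watt(500.0, 31.0)    # i3-4360: 0.5 MH/s @ 31 W
broadwell_u = khs_per_watt(350.0, 13.0)   # i5-5250U: 350 kH/s @ 13 W
speedup = 350.0 / 280.0                   # 350 kH/s versus the 280 kH/s reported before

print(f"i3-4360:  {haswell_i3:.1f} kH/s per watt")
print(f"i5-5250U: {broadwell_u:.1f} kH/s per watt")
print(f"i5-5250U speedup: {speedup:.2f}x")
```

On these numbers the i5-5250U comes out at about 26.9 kH/s per watt against about 16.1 for the i3-4360, and the jump from 280 to 350 kH/s is a 1.25x speedup.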