[ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 150.

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

Success!

I did this very ugly hack, joblo please don't get a heart attack:

Code:

--- scrypt-jane-romix-template.h.orig   2016-02-05 22:05:38.000000000 +0000
+++ scrypt-jane-romix-template.h 2016-08-05 00:37:48.949684265 +0000
@@ -86,9 +86,9 @@
  for (i = 0; i < /*N - 1*/511; i++, block += chunkWords) {
        /* 3: V_i = X */
        /* 4: X = H(X) */
-       SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
+//         SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
  }
- SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);
+//     SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);

  /* 6: for i = 0 to N - 1 do */
  for (i = 0; i < /*N*/512; i += 2) {
@@ -96,13 +96,13 @@
        j = X[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;

        /* 8: X = H(Y ^ V_j) */
-       SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);
+//         SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);

        /* 7: j = Integerify(Y) % N */
        j = Y[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;

        /* 8: X = H(Y ^ V_j) */
-       SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
+//         SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
  }

  /* 10: B' = X */

And now it does compile with -flto and here is the result:

Code:

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-05 00:58:03] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:58:04] CPU #0: 65.54 kH, 84.17 kH/s
[2016-08-05 00:58:04] CPU #1: 65.54 kH, 84.25 kH/s
[2016-08-05 00:58:04] CPU #3: 65.54 kH, 84.23 kH/s
[2016-08-05 00:58:04] Total: 196.61 kH, 252.64 kH/s
[2016-08-05 00:58:04] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 00:58:08] CPU #2: 335.45 kH, 84.02 kH/s
[2016-08-05 00:58:08] CPU #1: 336.99 kH, 84.25 kH/s
[2016-08-05 00:58:08] CPU #3: 336.92 kH, 84.24 kH/s
[2016-08-05 00:58:08] Total: 1074.89 kH, 336.68 kH/s
[2016-08-05 00:58:08] CPU #0: 336.67 kH, 84.04 kH/s
[2016-08-05 00:58:13] CPU #2: 420.12 kH, 84.16 kH/s
[2016-08-05 00:58:13] CPU #1: 421.26 kH, 84.35 kH/s
[2016-08-05 00:58:13] CPU #0: 420.18 kH, 84.19 kH/s
[2016-08-05 00:58:13] CPU #3: 421.18 kH, 84.34 kH/s
[2016-08-05 00:58:13] Total: 1682.74 kH, 337.04 kH/s
[2016-08-05 00:58:18] CPU #2: 420.78 kH, 84.16 kH/s
[2016-08-05 00:58:18] CPU #1: 421.77 kH, 84.31 kH/s
[2016-08-05 00:58:18] CPU #0: 420.97 kH, 84.19 kH/s
[2016-08-05 00:58:18] CPU #3: 421.69 kH, 84.26 kH/s
[2016-08-05 00:58:18] Total: 1685.21 kH, 336.92 kH/s
[2016-08-05 00:58:23] CPU #1: 421.54 kH, 84.37 kH/s
[2016-08-05 00:58:23] CPU #3: 421.31 kH, 84.32 kH/s
[2016-08-05 00:58:23] CPU #2: 420.81 kH, 83.99 kH/s
[2016-08-05 00:58:23] Total: 1684.63 kH, 336.87 kH/s
[2016-08-05 00:58:23] CPU #0: 420.93 kH, 84.01 kH/s
[2016-08-05 00:58:28] CPU #2: 419.96 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #0: 420.07 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #1: 421.87 kH, 84.17 kH/s
[2016-08-05 00:58:28] CPU #3: 421.58 kH, 84.09 kH/s
[2016-08-05 00:58:28] Total: 1683.49 kH, 336.46 kH/s

So using -flto gives another 2.75% speed increase. That's 7.7% speed increase in total over tpruvot.

Now this is with -flto and -fuse-linker-plugin:

Code:

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-05 00:55:15] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:55:16] CPU #0: 65.54 kH, 84.75 kH/s
[2016-08-05 00:55:16] CPU #1: 65.54 kH, 84.78 kH/s
[2016-08-05 00:55:16] CPU #2: 65.54 kH, 84.56 kH/s
[2016-08-05 00:55:16] CPU #3: 65.54 kH, 84.44 kH/s
[2016-08-05 00:55:16] Total: 262.14 kH, 338.53 kH/s
[2016-08-05 00:55:20] CPU #3: 337.77 kH, 84.06 kH/s
[2016-08-05 00:55:20] Total: 534.38 kH, 338.15 kH/s
[2016-08-05 00:55:20] CPU #2: 338.22 kH, 84.01 kH/s
[2016-08-05 00:55:20] CPU #1: 339.13 kH, 84.09 kH/s
[2016-08-05 00:55:20] CPU #0: 338.98 kH, 84.02 kH/s
[2016-08-05 00:55:25] CPU #0: 420.11 kH, 84.71 kH/s
[2016-08-05 00:55:25] CPU #2: 420.03 kH, 84.49 kH/s
[2016-08-05 00:55:25] CPU #3: 420.31 kH, 84.05 kH/s
[2016-08-05 00:55:25] Total: 1599.59 kH, 337.33 kH/s
[2016-08-05 00:55:25] CPU #1: 420.43 kH, 84.07 kH/s
[2016-08-05 00:55:30] CPU #3: 420.25 kH, 83.97 kH/s
[2016-08-05 00:55:30] Total: 1680.82 kH, 337.24 kH/s
[2016-08-05 00:55:30] CPU #2: 422.44 kH, 83.97 kH/s
[2016-08-05 00:55:30] CPU #0: 423.54 kH, 83.98 kH/s
[2016-08-05 00:55:30] CPU #1: 420.36 kH, 83.97 kH/s
[2016-08-05 00:55:35] CPU #0: 419.88 kH, 84.64 kH/s
[2016-08-05 00:55:35] CPU #2: 419.84 kH, 84.39 kH/s
[2016-08-05 00:55:35] CPU #3: 419.85 kH, 84.00 kH/s
[2016-08-05 00:55:35] Total: 1679.93 kH, 337.00 kH/s
[2016-08-05 00:55:35] CPU #1: 419.85 kH, 84.02 kH/s
[2016-08-05 00:55:40] CPU #0: 423.20 kH, 84.42 kH/s
[2016-08-05 00:55:40] CPU #3: 420.02 kH, 84.32 kH/s
[2016-08-05 00:55:40] Total: 1682.91 kH, 337.15 kH/s

Basically the same speed. Now what if I actually call tpruvot's build.sh, exactly the one I showed in my previous post:

Code:

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-05 01:10:02] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 01:10:03] CPU #0: 65.54 kH, 84.11 kH/s
[2016-08-05 01:10:03] CPU #1: 65.54 kH, 83.93 kH/s
[2016-08-05 01:10:03] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 01:10:03] CPU #3: 65.54 kH, 83.96 kH/s
[2016-08-05 01:10:03] Total: 262.14 kH, 335.86 kH/s
[2016-08-05 01:10:07] CPU #1: 335.71 kH, 84.00 kH/s
[2016-08-05 01:10:07] CPU #2: 335.44 kH, 83.92 kH/s
[2016-08-05 01:10:07] CPU #3: 335.85 kH, 83.99 kH/s
[2016-08-05 01:10:07] Total: 1072.54 kH, 336.02 kH/s
[2016-08-05 01:10:07] CPU #0: 336.45 kH, 83.93 kH/s
[2016-08-05 01:10:12] CPU #1: 420.00 kH, 84.00 kH/s
[2016-08-05 01:10:12] CPU #2: 419.62 kH, 83.92 kH/s
[2016-08-05 01:10:12] CPU #3: 419.93 kH, 83.99 kH/s
[2016-08-05 01:10:12] Total: 1596.00 kH, 335.82 kH/s
[2016-08-05 01:10:12] CPU #0: 419.64 kH, 83.91 kH/s
[2016-08-05 01:10:17] CPU #1: 419.98 kH, 84.05 kH/s
[2016-08-05 01:10:17] CPU #2: 419.58 kH, 83.98 kH/s
[2016-08-05 01:10:17] CPU #3: 419.93 kH, 84.03 kH/s
[2016-08-05 01:10:17] Total: 1679.12 kH, 335.98 kH/s
[2016-08-05 01:10:17] CPU #0: 419.53 kH, 83.99 kH/s
[2016-08-05 01:10:22] CPU #2: 419.92 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #1: 420.25 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #0: 419.93 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #3: 420.18 kH, 84.02 kH/s
[2016-08-05 01:10:22] Total: 1680.28 kH, 336.14 kH/s

Still the same (maximum) speed.

So I will be using joblo's cpuminer with tpruvot's (uncommented) build.sh because that build.sh has all those other flags (including -falign-*) which may or may not matter, so just to be safe..

EDIT: when I took the avx2 binary and tried to run it on a avx cpu I got this:

Code:

CPU:       Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
CPU features: SSE2 AES AVX
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX

Illegal instruction (core dumped)

But wasn't the whole idea that all the cpu features will be compiled in and what particular feature shall be used will be determined at the runtime? It's not a big deal, I just recompiled it and I will have two versions (avx and avx2) and run the one that's appropriate to the cpu. Just I thought I would report this.

ReiMomo

sr. member

Activity: 2366

Merit: 305

Duelbits - $100k Bonus/week

So when is the Windows bin out? Huh

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

joblo, that's an excellent improvement! Now you are definitely faster than tpruvot, at least by 4.8%.

I made a better repeatable benchmark of tpruvot's, the numbers are directly in the build.sh:

Code:

#!/bin/bash

if [ "$OS" = "Windows_NT" ]; then
    ./mingw64.sh
    exit 0
fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -w -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -fuse-linker-plugin -ftree-loop-if-convert-stores"

if [ ! "0" = `cat /proc/cpuinfo | grep -c avx` ]; then
    # march native doesn't always works, ex. some Pentium Gxxx (no avx)
    extracflags="$extracflags -march=native"
fi


# Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz 4 threads (d2.xlarge)


#309-310
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" ./configure --with-crypto --with-curl

#311-312
CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="-std=gnu++11" ./configure --with-crypto --with-curl

#281
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS" ./configure --with-crypto --with-curl

#280
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl


#269
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" ./configure --with-crypto --with-curl

#264
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="-std=gnu++11" ./configure --with-crypto --with-curl

#242
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS" ./configure --with-crypto --with-curl

#245
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl


make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer

So with him I get 312 at best on this particular machine and that config of flags is basically the default if you uncomment everything so I didn't make him any faster, I just proved he can be much slower if wrong flags are used.

Now with yours, without any change, untouched cpuminer-opt-3.4.0.tar.gz, I get this:

Code:

root@xxx:~/cpuminer-opt# ./cpuminer -a lyra2re --benchmark

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  4 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-04 23:24:25] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-04 23:24:26] CPU #1: 65.54 kH, 82.16 kH/s
[2016-08-04 23:24:26] CPU #0: 65.54 kH, 81.67 kH/s
[2016-08-04 23:24:26] CPU #3: 65.54 kH, 81.84 kH/s
[2016-08-04 23:24:26] Total: 196.61 kH, 245.67 kH/s
[2016-08-04 23:24:26] CPU #2: 65.54 kH, 81.68 kH/s
[2016-08-04 23:24:30] CPU #0: 326.68 kH, 81.80 kH/s
[2016-08-04 23:24:30] CPU #3: 327.37 kH, 82.00 kH/s
[2016-08-04 23:24:30] Total: 785.13 kH, 327.64 kH/s
[2016-08-04 23:24:30] CPU #2: 326.73 kH, 81.75 kH/s
[2016-08-04 23:24:30] CPU #1: 328.64 kH, 82.01 kH/s
[2016-08-04 23:24:35] CPU #0: 409.02 kH, 81.78 kH/s
[2016-08-04 23:24:35] CPU #3: 409.99 kH, 81.93 kH/s
[2016-08-04 23:24:35] Total: 1474.38 kH, 327.46 kH/s
[2016-08-04 23:24:35] CPU #2: 408.76 kH, 81.73 kH/s
[2016-08-04 23:24:35] CPU #1: 410.04 kH, 81.94 kH/s
[2016-08-04 23:24:40] CPU #0: 408.89 kH, 81.78 kH/s
[2016-08-04 23:24:40] CPU #3: 409.63 kH, 81.94 kH/s
[2016-08-04 23:24:40] Total: 1637.32 kH, 327.39 kH/s

But when I add -flto I get the following error at the final link:

Code:

g++  -O3 -march=native -w -flto -std=gnu++11 -Lyes/lib  -Lyes/lib  -o cpuminer cpuminer-cpu-miner.o cpuminer-util.o cpuminer-uint256.o cpuminer-api.o cpuminer-sysinfos.o cpuminer-algo-gate-api.o algo/groestl/cpuminer-sph_groestl.o algo/skein/cpuminer-sph_skein.o algo/bmw/cpuminer-sph_bmw.o algo/shavite/cpuminer-sph_shavite.o algo/shavite/cpuminer-shavite.o algo/echo/cpuminer-sph_echo.o algo/blake/cpuminer-sph_blake.o algo/heavy/cpuminer-sph_hefty1.o algo/blake/cpuminer-mod_blakecoin.o algo/luffa/cpuminer-sph_luffa.o algo/cubehash/cpuminer-sph_cubehash.o algo/simd/cpuminer-sph_simd.o algo/hamsi/cpuminer-sph_hamsi.o algo/fugue/cpuminer-sph_fugue.o algo/gost/cpuminer-sph_gost.o algo/jh/cpuminer-sph_jh.o algo/keccak/cpuminer-sph_keccak.o algo/keccak/cpuminer-keccak.o algo/sha3/cpuminer-sph_sha2.o algo/sha3/cpuminer-sph_sha2big.o algo/shabal/cpuminer-sph_shabal.o algo/whirlpool/cpuminer-sph_whirlpool.o crypto/cpuminer-blake2s.o crypto/cpuminer-oaes_lib.o crypto/cpuminer-c_keccak.o crypto/cpuminer-c_groestl.o crypto/cpuminer-c_blake256.o crypto/cpuminer-c_jh.o crypto/cpuminer-c_skein.o crypto/cpuminer-hash.o crypto/cpuminer-aesb.o crypto/cpuminer-magimath.o algo/argon2/cpuminer-argon2a.o algo/argon2/ar2/cpuminer-argon2.o algo/argon2/ar2/cpuminer-opt.o algo/argon2/ar2/cpuminer-cores.o algo/argon2/ar2/cpuminer-ar2-scrypt-jane.o algo/argon2/ar2/cpuminer-blake2b.o algo/cpuminer-axiom.o algo/blake/cpuminer-blake.o algo/blake/cpuminer-blake2.o algo/blake/cpuminer-blakecoin.o algo/blake/cpuminer-decred.o algo/blake/cpuminer-pentablake.o algo/bmw/cpuminer-bmw256.o algo/cubehash/sse2/cpuminer-cubehash_sse2.o algo/cryptonight/cpuminer-cryptolight.o algo/cryptonight/cpuminer-cryptonight-common.o algo/cryptonight/cpuminer-cryptonight-aesni.o algo/cryptonight/cpuminer-cryptonight.o algo/cpuminer-drop.o algo/echo/aes_ni/cpuminer-hash.o algo/cpuminer-fresh.o algo/groestl/cpuminer-groestl.o algo/groestl/cpuminer-myr-groestl.o algo/groestl/sse2/cpuminer-grso.o algo/groestl/sse2/cpuminer-grso-asm.o algo/groestl/aes_ni/cpuminer-hash-groestl.o algo/groestl/aes_ni/cpuminer-hash-groestl256.o algo/haval/cpuminer-haval.o algo/heavy/cpuminer-heavy.o algo/heavy/cpuminer-bastion.o algo/cpuminer-hmq1725.o algo/hodl/cpuminer-hodl.o algo/hodl/cpuminer-hodl-gate.o algo/hodl/cpuminer-hodl_arith_uint256.o algo/hodl/cpuminer-hodl_uint256.o algo/hodl/cpuminer-hash.o algo/hodl/cpuminer-hmac_sha512.o algo/hodl/cpuminer-sha256.o algo/hodl/cpuminer-sha512.o algo/hodl/cpuminer-utilstrencodings.o algo/hodl/cpuminer-hodl-wolf.o algo/hodl/cpuminer-aes.o algo/hodl/cpuminer-sha512_avx.o algo/hodl/cpuminer-sha512_avx2.o algo/cpuminer-lbry.o algo/luffa/cpuminer-luffa.o algo/luffa/sse2/cpuminer-luffa_for_sse2.o algo/lyra2/cpuminer-lyra2.o algo/lyra2/cpuminer-sponge.o algo/lyra2/cpuminer-lyra2rev2.o algo/lyra2/cpuminer-lyra2re.o algo/keccak/sse2/cpuminer-keccak.o algo/cpuminer-m7m.o algo/cpuminer-neoscrypt.o algo/cpuminer-nist5.o algo/cpuminer-pluck.o algo/quark/cpuminer-quark.o algo/qubit/cpuminer-qubit.o algo/ripemd/cpuminer-sph_ripemd.o algo/cpuminer-scrypt.o algo/scryptjane/cpuminer-scrypt-jane.o algo/sha2/cpuminer-sha2.o algo/simd/sse2/cpuminer-nist.o algo/simd/sse2/cpuminer-vector.o algo/skein/cpuminer-skein.o algo/skein/cpuminer-skein2.o algo/cpuminer-s3.o algo/tiger/cpuminer-sph_tiger.o algo/whirlpool/cpuminer-whirlpool.o algo/whirlpool/cpuminer-whirlpoolx.o algo/x11/cpuminer-x11.o algo/x11/cpuminer-x11evo.o algo/x11/cpuminer-x11gost.o algo/x11/cpuminer-c11.o algo/x13/cpuminer-x13.o algo/x14/cpuminer-x14.o algo/x15/cpuminer-x15.o algo/x17/cpuminer-x17.o algo/yescrypt/cpuminer-yescrypt.o algo/yescrypt/cpuminer-yescrypt-common.o algo/yescrypt/cpuminer-sha256_Y.o algo/yescrypt/cpuminer-yescrypt-simd.o algo/cpuminer-zr5.o asm/cpuminer-neoscrypt_asm.o  asm/cpuminer-sha2-x64.o asm/cpuminer-scrypt-x64.o asm/cpuminer-aesb-x64.o   -lcurl -lz -ljansson -lpthread  -lssl -lcrypto -lgmp
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx2':
:(.text+0x9712): undefined reference to `scrypt_ChunkMix_avx2'
:(.text+0x9729): undefined reference to `scrypt_ChunkMix_avx2'
:(.text+0x9760): undefined reference to `scrypt_ChunkMix_avx2'
:(.text+0x9785): undefined reference to `scrypt_ChunkMix_avx2'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_xop':
:(.text+0x99f2): undefined reference to `scrypt_ChunkMix_xop'
:(.text+0x9a09): undefined reference to `scrypt_ChunkMix_xop'
:(.text+0x9a40): undefined reference to `scrypt_ChunkMix_xop'
:(.text+0x9a65): undefined reference to `scrypt_ChunkMix_xop'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx':
:(.text+0x9cd2): undefined reference to `scrypt_ChunkMix_avx'
:(.text+0x9ce9): undefined reference to `scrypt_ChunkMix_avx'
:(.text+0x9d20): undefined reference to `scrypt_ChunkMix_avx'
:(.text+0x9d45): undefined reference to `scrypt_ChunkMix_avx'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_ssse3':
:(.text+0x9fb2): undefined reference to `scrypt_ChunkMix_ssse3'
:(.text+0x9fc9): undefined reference to `scrypt_ChunkMix_ssse3'
:(.text+0xa000): undefined reference to `scrypt_ChunkMix_ssse3'
:(.text+0xa025): undefined reference to `scrypt_ChunkMix_ssse3'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_sse2':
:(.text+0xa292): undefined reference to `scrypt_ChunkMix_sse2'
:(.text+0xa2a9): undefined reference to `scrypt_ChunkMix_sse2'
:(.text+0xa2e0): undefined reference to `scrypt_ChunkMix_sse2'
:(.text+0xa305): undefined reference to `scrypt_ChunkMix_sse2'
collect2: error: ld returned 1 exit status
Makefile:1333: recipe for target 'cpuminer' failed
make[2]: *** [cpuminer] Error 1
make[2]: Leaving directory '/root/cpuminer-opt'
Makefile:3453: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/root/cpuminer-opt'
Makefile:670: recipe for target 'all' failed
make: *** [all] Error 2

If you are unsure how to fix it could you at least guide me how to disable the whole scrypt (optimization) because I am really anxious to see what -flto will do.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: clipto on August 04, 2016, 03:08:54 PM

3.3.7 gave me better hashrate than 3.3.8 on Lyra2RE, so I'm still running that.
But looking forward to the increased performance, but are bound to Windows OS.

There should be a slight increase between 3.3.7 and 3.3.8.

I'm suspecting data alignment issues. I've noticed different hashrates on different runs of the same version.
It doesn't seem to be related to other CPU activity.

clipto

member

Activity: 311

Merit: 10

3.3.7 gave me better hashrate than 3.3.8 on Lyra2RE, so I'm still running that.
But looking forward to the increased performance, but are bound to Windows OS.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: clipto on August 04, 2016, 03:03:58 PM

Great stuff, will it be released for Windows too?

I'm hoping. CMB have been good, seems they skipped v3.3.9 because it didn't compile. I hope they pickup v3.4.0.

clipto

member

Activity: 311

Merit: 10

Great stuff, will it be released for Windows too?

joblo

legendary

Activity: 1470

Merit: 1114

cpuminer-opt v3.4.0 is released.

A compile error was introduced in Windows v3.3.9 and has been fixed. X11gost was also broken since
v3.3.7 and has also been fixed.

The big news is more AVX2 optimizations inspired by Optiminer's work on the Hodl algo. See OP for details.
The entire Cubehash function was converted from SSE2 to AVX2 and improved all algos that use it. Some
AVX2 optimizations were also done to the Lyra2 core, improving both Lyra2RE and Lyra2REv2. Those were
the easy ones, I don't know how much more I can find. See OP for list of improved algos.

Source:
https://drive.google.com/file/d/0B0lVSGQYLJIZbFB1WThUZ09JbVk/view?usp=sharing

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: johnsmithx on August 04, 2016, 03:45:32 AM

Quote from: joblo on August 03, 2016, 11:54:55 PM

I tested on a i7-6700K with 8 threads and saw no difference between multi with or without -flto.

One thing to keep in mind is that you are using a desktop cpu, I am using a server cpu. Those xeons are not multiple times more expensive for no reason (sure partly it's just branding, more reliability etc. etc., but there are also some functional differences).

Or it (the fact that -flto doesn't do anything to you) could simply be the compiler. Yours is 2 years old, mine 2 months. But the crucial information is that you can actually compile with -flto. Could you please give me the exact flags that work for you? I will take the effort and find out what's the problem on my end, I just need the starting (compilable) point.

I have no idea whether he improved the performance after you took the code from him. The very last commit is 3 days old but which one is the last one that could have had any real impact on Lyra2RE speed, or the overall speed, I am not going to investigate. But if you could remember, at least roughly, when did you take his code I can revert his tree to that date and try that version.

I considered things like larger cache and fatser memory interface as likely advantages of a server grade CPU but I can't
figure out how that would produce inconsistent results. I'm still leaning toward the compiler. My dusty memories seem to also
have LTO in them, I may have to dig back in the thread to refresh. Your compile errors might also jog some memories. I don't
think the problem was in code imported from multi, more likely one of the ugly SSE2 optimized macros. When I compiled i
used the flags from build.sh + -flto -pg.

It is an issue worth pursuing, I can't stay on the old compiler forever.

One way to determine if it's a compiler issue or CPU issue is with a VM. I don't know if you start up a VM or boot an older
version of Ubuntu but a direct comparison of different compilers on the same HW would help understanding what's going on.

I'm a little busy right now testing the latest AVX2 optimisations to get them released, but after that I'll look into it more.

Edit: I found the post from Wolf0 about his experience compiling cpuminer-opt with gcc 6.1.1. Look similar to yours?

https://bitcointalksearch.org/topic/m.15140799

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

Quote from: gimomars on August 04, 2016, 04:39:20 AM

Quote from: johnsmithx on August 04, 2016, 03:45:32 AM

Quote from: gimomars on August 04, 2016, 12:55:27 AM

Hi experts, I'm trying to build in my openSUSE linux. But I've got an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

This problem has nothing to do with cpuminer, your building environment is messed up. Glibc is the very core linux library, if make can't find the version it likes then there is something wrong with either. But if glibc was messed up the system would hardly even boot properly. Maybe try to update?

Thanks for the reply. I think my linux has old version of GLIBC_x.x.

Version information:
/bin/sh:
libdl.so.2 (GLIBC_2.2.5) => /lib64/libdl.so.2
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2. Cool

=> /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
/lib64/libreadline.so.5:
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6

Now you are looking what versions of glibc these two binaries (/bin/sh and /lib64/libreadline.so.5) require. To find out what version of glibc you actually have just run the library: type /lib64/libc.so.6 and press enter. Presumably it will be equal or higher than what /bin/sh requires and lower than what 'make' requires. But that would mean that you didn't install 'make' a standard way via a package system but somehow sideway. What did you do to your suse?!?

gimomars

newbie

Activity: 34

Merit: 0

Quote from: johnsmithx on August 04, 2016, 03:45:32 AM

Quote from: gimomars on August 04, 2016, 12:55:27 AM

Hi experts, I'm trying to build in my openSUSE linux. But I've got an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

This problem has nothing to do with cpuminer, your building environment is messed up. Glibc is the very core linux library, if make can't find the version it likes then there is something wrong with either. But if glibc was messed up the system would hardly even boot properly. Maybe try to update?

Thanks for the reply. I think my linux has old version of GLIBC_x.x.

Version information:
/bin/sh:
libdl.so.2 (GLIBC_2.2.5) => /lib64/libdl.so.2
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2. Cool

=> /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
/lib64/libreadline.so.5:
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

Quote from: joblo on August 03, 2016, 11:54:55 PM

I tested on a i7-6700K with 8 threads and saw no difference between multi with or without -flto.

One thing to keep in mind is that you are using a desktop cpu, I am using a server cpu. Those xeons are not multiple times more expensive for no reason (sure partly it's just branding, more reliability etc. etc., but there are also some functional differences).

Or it (the fact that -flto doesn't do anything to you) could simply be the compiler. Yours is 2 years old, mine 2 months. But the crucial information is that you can actually compile with -flto. Could you please give me the exact flags that work for you? I will take the effort and find out what's the problem on my end, I just need the starting (compilable) point.

I have no idea whether he improved the performance after you took the code from him. The very last commit is 3 days old but which one is the last one that could have had any real impact on Lyra2RE speed, or the overall speed, I am not going to investigate. But if you could remember, at least roughly, when did you take his code I can revert his tree to that date and try that version.

Quote from: gimomars on August 04, 2016, 12:55:27 AM

Hi experts, I'm trying to build in my openSUSE linux. But I've got an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

This problem has nothing to do with cpuminer, your building environment is messed up. Glibc is the very core linux library, if make can't find the version it likes then there is something wrong with either. But if glibc was messed up the system would hardly even boot properly. Maybe try to update?

gimomars

newbie

Activity: 34

Merit: 0

Hi experts, I'm trying to build in my openSUSE linux. But I've got an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: johnsmithx on August 03, 2016, 10:50:28 PM

I support very much your optimizing effort and whenever you need I will gladly do tests for you on various machines.

I appeciate the detailed report and the humour. It will take some time to digest it all but I have a couple of comments.
You clearly have all the optimization features in your CPU and the miner was running the optimized code, so that is eliminated.

I tested on a i7-6700K with 8 threads and saw no difference between multi with or without -flto.
Specifically I get:

multi: 920-922 kH/s
opt v3.3.9: 995
opt v3.3.8: 930
opt dev: 1025

Furthermore I can compile opt with both -flto and -pg, again with no performance difference.

I took the Lyra2RE code, and a lot more, directly from multi and I don't think he has made any changes to it. I've been tickering with
Lyra2RE for a while and only recently made any significant progress.

C++11 is required to support an algo not included in multi.

I can only speculate your compiler version may be the issue. I am using 4.8.4 and you 5.4.0. I seem to recall someone
else, Wolf maybe, having compile problems using a more recent version of gcc. I'm too lazy to look back in the thread to find it.
If you can find it you could compare notes.

I'll read through your report in more detail, If any ideas come to mind I'll post an update.

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

Quote from: joblo on August 03, 2016, 12:03:00 AM

Quote from: johnsmithx on August 02, 2016, 10:37:53 PM

Quote from: joblo on August 02, 2016, 04:52:19 PM

Things are looking up. I solved the alignment problem and made progress with performance. I've added 3%
to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2, I still have to do
AVX implementations of al functions. The improvements are to the lyra2 core so they should also help lyra2v2.

You are talking about Lyra2RE, right? So it will get sped up? Because so far tpruvot-cpuminer-multi's Lyra2RE is faster than yours by some 3.8% on my servers. I guess it might be because of -flto that I use with his but can't use with yours (doesn't compile) Sad

I think you are doing something wrong, lyra2RE in cpuminer-opt v3.3.7 was improved 7% faster than cpuminer-multi.
In the next release it will another 3% faster. If you have very old CPUs (ie core2) you won't get the benefits of my optimisations.

I know you didn't mean to but what you said was very amusing. The whole time I am talking about servers, the real servers in data centers, not some desktop pc in your home that you call "a server", and you come back at me with "core2" - a 10 years old super-obsolete desktop cpu. Hilarious!

But maybe you are right, maybe I am doing something wrong but it's not the hardware. I am using the up-to-date Ubuntu 16.04, if I messed something up then it must be the flags in the build.sh. Or maybe you did something wrong actually. Maybe you didn't compile tpruvot's cpuminer with -flto so now you are competing with a crippled sw because this flag does make the difference and it's not on by default, it's commented out in the build.sh so you have to enable it.

Either way, instead of accusing each other let's try to make things better. In this spirit I made a little test for you. I picked the most powerful server AWS has to offer, the x1.32xlarge (https://aws.amazon.com/ec2/instance-types/x1/) with 128 cores and 1952 GB memory. Here are the specs:

Code:

root@xxx:~/# grep -e name -e flags /proc/cpuinfo | head -n2
model name      : Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida

root@xxx:~/# grep processor /proc/cpuinfo | wc -l
128

root@xxx:~/# free -h
              total        used        free      shared  buff/cache   available
Mem:           1.9T        3.3G        1.9T        9.1M        624M        1.9T
Swap:            0B          0B          0B

Now I fresh recompiled tpruvot and here is the result:

(I noticed that with the many-cores machines it is sometimes actually more powerful to use just half the threads; here on x1.32xlarge the difference is marginal, on g2.8xlarge it's quite significant; the spike at the beginning of full cores I attribute to some throttling done by Amazon, it's just a vps after all, not a dedi)

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:44:55] Total: 8903 kH/s
[2016-08-04 02:44:59] Total: 8833 kH/s
[2016-08-04 02:45:04] Total: 8735 kH/s
[2016-08-04 02:45:09] Total: 8728 kH/s
[2016-08-04 02:45:14] Total: 8727 kH/s

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:45:21] Total: 11661 kH/s
[2016-08-04 02:45:25] Total: 11568 kH/s
[2016-08-04 02:45:30] Total: 8731 kH/s
[2016-08-04 02:45:35] Total: 8720 kH/s
[2016-08-04 02:45:40] Total: 8722 kH/s
[2016-08-04 02:45:45] Total: 8703 kH/s

Now let's look at you:

Code:

CPU: Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  4 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES
Start mining with AES-AVX optimizations...

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:47:57] Total: 4128.77 kH, 8660.05 kH/s
[2016-08-04 02:48:01] Total: 35.19 MH, 8668.56 kH/s
[2016-08-04 02:48:06] Total: 43.34 MH, 8590.78 kH/s
[2016-08-04 02:48:11] Total: 42.95 MH, 8600.93 kH/s
[2016-08-04 02:48:16] Total: 43.00 MH, 8621.90 kH/s

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:48:23] Total: 8323.07 kH, 11.96 MH/s
[2016-08-04 02:48:26] Total: 11.68 MH, 12.03 MH/s
[2016-08-04 02:48:31] Total: 32.93 MH, 8544.43 kH/s
[2016-08-04 02:48:36] Total: 39.36 MH, 8531.49 kH/s
[2016-08-04 02:48:41] Total: 42.52 MH, 8531.90 kH/s
[2016-08-04 02:48:46] Total: 42.33 MH, 8536.75 kH/s
[2016-08-04 02:48:51] Total: 41.88 MH, 8536.12 kH/s
[2016-08-04 02:48:56] Total: 42.41 MH, 8526.90 kH/s

Here tpruvot is faster by over 2%.

So let's look at the build.sh. Tpruvot's:

Code:

root@xxx:~/tpruvot-cpuminer-multi# cat build.sh
#!/bin/bash

if [ "$OS" = "Windows_NT" ]; then
    ./mingw64.sh
    exit 0
fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -flto -fuse-linker-plugin -ftree-loop-if-convert-stores"

if [ ! "0" = `cat /proc/cpuinfo | grep -c avx` ]; then
    # march native doesn't always works, ex. some Pentium Gxxx (no avx)
    extracflags="$extracflags -march=native"
fi

./configure --with-crypto --with-curl CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg"

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer

Yours ("customized" by me, but all I actually did was taking most of the flags from tpruvot's as long as it was compilable; maybe I messed it up?):

Code:

root@xxx:~/cpuminer-opt-3.3.9# cat build.sh
#!/bin/bash

#if [ "$OS" = "Windows_NT" ]; then
#    ./mingw64.sh
#    exit 0
#fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -fuse-linker-plugin -ftree-loop-if-convert-stores"

CFLAGS="-O3 $extracflags -march=native -DUSE_ASM" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer

You don't have -flto and -pg, neither is compilable, he doesn't have -std=gnu++11 (doesn't compile either).

Now let's see what all the -flto fuzz is about. What happens if I compile tpruvot without it:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:56:24] Total: 356.42 kH/s
[2016-08-04 02:56:29] Total: 352.18 kH/s
[2016-08-04 02:56:34] Total: 352.00 kH/s
[2016-08-04 02:56:39] Total: 352.03 kH/s
[2016-08-04 02:56:44] Total: 352.09 kH/s

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:57:29] Total: 358.48 kH/s
[2016-08-04 02:57:34] Total: 357.76 kH/s
[2016-08-04 02:57:39] Total: 357.96 kH/s

It turned into a snail. As if it couldn't manage the multiplexing or something. So let's try just 1 thread:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:57:45] Total: 124.22 kH/s
[2016-08-04 02:57:50] Total: 124.66 kH/s
[2016-08-04 02:57:55] Total: 125.81 kH/s
[2016-08-04 02:58:00] Total: 126.19 kH/s
[2016-08-04 02:58:05] Total: 126.84 kH/s

And yours:

Code:

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:58:25] Total: 65.54 kH, 136.75 kH/s
[2016-08-04 02:58:30] Total: 683.77 kH, 138.37 kH/s
[2016-08-04 02:58:35] Total: 691.85 kH, 140.37 kH/s
[2016-08-04 02:58:40] Total: 701.86 kH, 140.28 kH/s
[2016-08-04 02:58:45] Total: 701.42 kH, 140.33 kH/s
[2016-08-04 02:58:50] Total: 701.68 kH, 140.74 kH/s

You are clearly faster if he doesn't use -flto. But if I again turn -flto back on and recompile:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:59:54] Total: 138.24 kH/s
[2016-08-04 02:59:58] Total: 139.78 kH/s
[2016-08-04 03:00:03] Total: 140.41 kH/s
[2016-08-04 03:00:08] Total: 140.39 kH/s
[2016-08-04 03:00:13] Total: 141.38 kH/s
[2016-08-04 03:00:18] Total: 141.65 kH/s
[2016-08-04 03:00:23] Total: 141.64 kH/s
[2016-08-04 03:00:28] Total: 142.04 kH/s
[2016-08-04 03:00:33] Total: 141.98 kH/s

He is clearly faster after all.

Now how do you do with 8 threads?

Code:

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:00:53] Total: 524.29 kH, 1128.89 kH/s
[2016-08-04 03:00:58] Total: 5644.46 kH, 1130.25 kH/s
[2016-08-04 03:01:03] Total: 5651.25 kH, 1130.72 kH/s
[2016-08-04 03:01:08] Total: 5653.59 kH, 1130.44 kH/s
[2016-08-04 03:01:13] Total: 5652.21 kH, 1130.53 kH/s
[2016-08-04 03:01:18] Total: 5652.64 kH, 1130.45 kH/s

And him with -flto?

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:01:29] Total: 1143 kH/s
[2016-08-04 03:01:34] Total: 1144 kH/s
[2016-08-04 03:01:39] Total: 1144 kH/s
[2016-08-04 03:01:44] Total: 1144 kH/s
[2016-08-04 03:01:49] Total: 1144 kH/s
[2016-08-04 03:01:54] Total: 1145 kH/s

And him without -flto:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:03:13] Total: 637.08 kH/s
[2016-08-04 03:03:17] Total: 611.28 kH/s
[2016-08-04 03:03:22] Total: 605.97 kH/s
[2016-08-04 03:03:27] Total: 605.81 kH/s
[2016-08-04 03:03:32] Total: 605.90 kH/s

I support very much your optimizing effort and whenever you need I will gladly do tests for you on various machines.

joblo

legendary

Activity: 1470

Merit: 1114

I've found more AVX2 optimizations, converted cubehash SSE2 to AVX2, improved lyra2v2 19% and X algos 3-5%.
Cubehash will probbaly be the biggest single AVX2 optimization next to Hodl. I don't know how much more I can find.
Optiminer's code was an inspiration from which I learned a lot.

A lot more work to do before release.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: johnsmithx on August 02, 2016, 10:37:53 PM

Quote from: joblo on August 02, 2016, 04:52:19 PM

Things are looking up. I solved the alignment problem and made progress with performance. I've added 3%
to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2, I still have to do
AVX implementations of al functions. The improvements are to the lyra2 core so they should also help lyra2v2.

You are talking about Lyra2RE, right? So it will get sped up? Because so far tpruvot-cpuminer-multi's Lyra2RE is faster than yours by some 3.8% on my servers. I guess it might be because of -flto that I use with his but can't use with yours (doesn't compile) Sad

I think you are doing something wrong, lyra2RE in cpuminer-opt v3.3.7 was improved 7% faster than cpuminer-multi.
In the next release it will another 3% faster. If you have very old CPUs (ie core2) you won't get the benefits of my optimisations.

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

Quote from: joblo on August 02, 2016, 04:52:19 PM

Things are looking up. I solved the alignment problem and made progress with performance. I've added 3%
to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2, I still have to do
AVX implementations of al functions. The improvements are to the lyra2 core so they should also help lyra2v2.

You are talking about Lyra2RE, right? So it will get sped up? Because so far tpruvot-cpuminer-multi's Lyra2RE is faster than yours by some 3.8% on my servers. I guess it might be because of -flto that I use with his but can't use with yours (doesn't compile) Sad

johnsmithx

hero member

Activity: 589

Merit: 507

I don't buy nor sell anything here and never will.

Quote from: felixbrucker on August 02, 2016, 11:40:58 AM

yes its "more than a dozen *different* ips", same ip but different miner should not get blocked

Yes, all my machines have (obviously) each a different IP and they all connect to Nicehash using the same BTC address of mine. So that's the problem? So it would be ok if all those connections went from the same IP address (if I for example NATed them all through one single IP) but it's not ok if they all use their own real IPs and my real single BTC? So I should give each machine a different BTC address?

But why Nicehash does that? They have no reason to do such a blockage, it doesn't prevent anyone from doing anything, and if I actually did give each of my machine a different BTC address then I would be having many incoming little transactions and I would be loosing money on transaction fees. The Nicehash fee is proportional to the amount so that wouldn't change but when they send BTC they also have to pay a transaction fee (which is not proportional to the amount but fixed) for each transaction and that, I suppose, they deduce from the amount being sent (i.e. I am paying it) so with many little transactions the sum of their fees may stack up.

Well I think the best solution will be the stratum proxy. If I won't get that to work then I will just tunnel all those machines through one single IP.

joblo

legendary

Activity: 1470

Merit: 1114

I did some playing around with AVX, trying to use it to xor arrays 256 bits at a time.
My results were not encouraging. I got code working but with occasional segfaults,
likely due to alignmnet issues. That should be fixable. The discouraging part is there was
no performance improvement. I tried it looping and inline but it didn't seem to make a difference.
I'll poke around some more but it doesn't look good.

Edit: There are a coule of reasons why I may not have seen any performance increase. It is possible,
likely, the process was memory bound in which case optimizing CPU performance wouldn't help.
The other possibility is the compiler may be auto-vectorising which occurs with the -O3 flag.
The code I vectorised was almost identical to the example of auto-vectorising.

The coding is also quite strange with load and store functions like assembly language instructions.
I would expect the compiler to do that transparently.

Edit2:

Things are looking up. I solved the alignment problem and made progress with performance. I've added 3%
to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2, I still have to do
AVX implementations of al functions. The improvements are to the lyra2 core so they should also help lyra2v2.

Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 150. (Read 444131 times)