I did this very ugly hack, joblo please don't get a heart attack:
--- scrypt-jane-romix-template.h.orig 2016-02-05 22:05:38.000000000 +0000
+++ scrypt-jane-romix-template.h 2016-08-05 00:37:48.949684265 +0000
@@ -86,9 +86,9 @@
for (i = 0; i < /*N - 1*/511; i++, block += chunkWords) {
/* 3: V_i = X */
/* 4: X = H(X) */
- SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
+// SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
}
- SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);
+// SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);
/* 6: for i = 0 to N - 1 do */
for (i = 0; i < /*N*/512; i += 2) {
@@ -96,13 +96,13 @@
j = X[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;
/* 8: X = H(Y ^ V_j) */
- SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);
+// SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);
/* 7: j = Integerify(Y) % N */
j = Y[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;
/* 8: X = H(Y ^ V_j) */
- SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
+// SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
}
/* 10: B' = X */
And now it does compile with -flto and here is the result:
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2
[2016-08-05 00:58:03] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:58:04] CPU #0: 65.54 kH, 84.17 kH/s
[2016-08-05 00:58:04] CPU #1: 65.54 kH, 84.25 kH/s
[2016-08-05 00:58:04] CPU #3: 65.54 kH, 84.23 kH/s
[2016-08-05 00:58:04] Total: 196.61 kH, 252.64 kH/s
[2016-08-05 00:58:04] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 00:58:08] CPU #2: 335.45 kH, 84.02 kH/s
[2016-08-05 00:58:08] CPU #1: 336.99 kH, 84.25 kH/s
[2016-08-05 00:58:08] CPU #3: 336.92 kH, 84.24 kH/s
[2016-08-05 00:58:08] Total: 1074.89 kH, 336.68 kH/s
[2016-08-05 00:58:08] CPU #0: 336.67 kH, 84.04 kH/s
[2016-08-05 00:58:13] CPU #2: 420.12 kH, 84.16 kH/s
[2016-08-05 00:58:13] CPU #1: 421.26 kH, 84.35 kH/s
[2016-08-05 00:58:13] CPU #0: 420.18 kH, 84.19 kH/s
[2016-08-05 00:58:13] CPU #3: 421.18 kH, 84.34 kH/s
[2016-08-05 00:58:13] Total: 1682.74 kH, 337.04 kH/s
[2016-08-05 00:58:18] CPU #2: 420.78 kH, 84.16 kH/s
[2016-08-05 00:58:18] CPU #1: 421.77 kH, 84.31 kH/s
[2016-08-05 00:58:18] CPU #0: 420.97 kH, 84.19 kH/s
[2016-08-05 00:58:18] CPU #3: 421.69 kH, 84.26 kH/s
[2016-08-05 00:58:18] Total: 1685.21 kH, 336.92 kH/s
[2016-08-05 00:58:23] CPU #1: 421.54 kH, 84.37 kH/s
[2016-08-05 00:58:23] CPU #3: 421.31 kH, 84.32 kH/s
[2016-08-05 00:58:23] CPU #2: 420.81 kH, 83.99 kH/s
[2016-08-05 00:58:23] Total: 1684.63 kH, 336.87 kH/s
[2016-08-05 00:58:23] CPU #0: 420.93 kH, 84.01 kH/s
[2016-08-05 00:58:28] CPU #2: 419.96 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #0: 420.07 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #1: 421.87 kH, 84.17 kH/s
[2016-08-05 00:58:28] CPU #3: 421.58 kH, 84.09 kH/s
[2016-08-05 00:58:28] Total: 1683.49 kH, 336.46 kH/s
So using -flto gives another 2.75% speed increase. That's a 7.7% speed increase in total over tpruvot's miner.
Now this is with -flto and -fuse-linker-plugin:
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2
[2016-08-05 00:55:15] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:55:16] CPU #0: 65.54 kH, 84.75 kH/s
[2016-08-05 00:55:16] CPU #1: 65.54 kH, 84.78 kH/s
[2016-08-05 00:55:16] CPU #2: 65.54 kH, 84.56 kH/s
[2016-08-05 00:55:16] CPU #3: 65.54 kH, 84.44 kH/s
[2016-08-05 00:55:16] Total: 262.14 kH, 338.53 kH/s
[2016-08-05 00:55:20] CPU #3: 337.77 kH, 84.06 kH/s
[2016-08-05 00:55:20] Total: 534.38 kH, 338.15 kH/s
[2016-08-05 00:55:20] CPU #2: 338.22 kH, 84.01 kH/s
[2016-08-05 00:55:20] CPU #1: 339.13 kH, 84.09 kH/s
[2016-08-05 00:55:20] CPU #0: 338.98 kH, 84.02 kH/s
[2016-08-05 00:55:25] CPU #0: 420.11 kH, 84.71 kH/s
[2016-08-05 00:55:25] CPU #2: 420.03 kH, 84.49 kH/s
[2016-08-05 00:55:25] CPU #3: 420.31 kH, 84.05 kH/s
[2016-08-05 00:55:25] Total: 1599.59 kH, 337.33 kH/s
[2016-08-05 00:55:25] CPU #1: 420.43 kH, 84.07 kH/s
[2016-08-05 00:55:30] CPU #3: 420.25 kH, 83.97 kH/s
[2016-08-05 00:55:30] Total: 1680.82 kH, 337.24 kH/s
[2016-08-05 00:55:30] CPU #2: 422.44 kH, 83.97 kH/s
[2016-08-05 00:55:30] CPU #0: 423.54 kH, 83.98 kH/s
[2016-08-05 00:55:30] CPU #1: 420.36 kH, 83.97 kH/s
[2016-08-05 00:55:35] CPU #0: 419.88 kH, 84.64 kH/s
[2016-08-05 00:55:35] CPU #2: 419.84 kH, 84.39 kH/s
[2016-08-05 00:55:35] CPU #3: 419.85 kH, 84.00 kH/s
[2016-08-05 00:55:35] Total: 1679.93 kH, 337.00 kH/s
[2016-08-05 00:55:35] CPU #1: 419.85 kH, 84.02 kH/s
[2016-08-05 00:55:40] CPU #0: 423.20 kH, 84.42 kH/s
[2016-08-05 00:55:40] CPU #3: 420.02 kH, 84.32 kH/s
[2016-08-05 00:55:40] Total: 1682.91 kH, 337.15 kH/s
Basically the same speed. Now what if I actually call tpruvot's build.sh, exactly the one I showed in my previous post:
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2
[2016-08-05 01:10:02] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 01:10:03] CPU #0: 65.54 kH, 84.11 kH/s
[2016-08-05 01:10:03] CPU #1: 65.54 kH, 83.93 kH/s
[2016-08-05 01:10:03] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 01:10:03] CPU #3: 65.54 kH, 83.96 kH/s
[2016-08-05 01:10:03] Total: 262.14 kH, 335.86 kH/s
[2016-08-05 01:10:07] CPU #1: 335.71 kH, 84.00 kH/s
[2016-08-05 01:10:07] CPU #2: 335.44 kH, 83.92 kH/s
[2016-08-05 01:10:07] CPU #3: 335.85 kH, 83.99 kH/s
[2016-08-05 01:10:07] Total: 1072.54 kH, 336.02 kH/s
[2016-08-05 01:10:07] CPU #0: 336.45 kH, 83.93 kH/s
[2016-08-05 01:10:12] CPU #1: 420.00 kH, 84.00 kH/s
[2016-08-05 01:10:12] CPU #2: 419.62 kH, 83.92 kH/s
[2016-08-05 01:10:12] CPU #3: 419.93 kH, 83.99 kH/s
[2016-08-05 01:10:12] Total: 1596.00 kH, 335.82 kH/s
[2016-08-05 01:10:12] CPU #0: 419.64 kH, 83.91 kH/s
[2016-08-05 01:10:17] CPU #1: 419.98 kH, 84.05 kH/s
[2016-08-05 01:10:17] CPU #2: 419.58 kH, 83.98 kH/s
[2016-08-05 01:10:17] CPU #3: 419.93 kH, 84.03 kH/s
[2016-08-05 01:10:17] Total: 1679.12 kH, 335.98 kH/s
[2016-08-05 01:10:17] CPU #0: 419.53 kH, 83.99 kH/s
[2016-08-05 01:10:22] CPU #2: 419.92 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #1: 420.25 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #0: 419.93 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #3: 420.18 kH, 84.02 kH/s
[2016-08-05 01:10:22] Total: 1680.28 kH, 336.14 kH/s
Still the same (maximum) speed.
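To put a number on "the same speed", here is a minimal sketch that compares the last "Total" line of each of the three runs above. The helper name percent_change and the constant names are made up for this illustration; only the three kH/s figures come from the logs.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `rate` versus `baseline`, in percent.
 * Hypothetical helper, not part of cpuminer. */
static double percent_change(double baseline, double rate)
{
    assert(baseline > 0.0);
    return (rate - baseline) / baseline * 100.0;
}

/* Last "Total" line of each run quoted above, in kH/s. */
static const double RATE_BUILD_SH   = 336.14; /* tpruvot's build.sh */
static const double RATE_LTO        = 336.46; /* -flto */
static const double RATE_LTO_PLUGIN = 337.15; /* -flto -fuse-linker-plugin */

/* Print how far each LTO build is from the build.sh build. */
static void print_comparison(void)
{
    printf("-flto vs build.sh:                     %+.2f%%\n",
           percent_change(RATE_BUILD_SH, RATE_LTO));
    printf("-flto -fuse-linker-plugin vs build.sh: %+.2f%%\n",
           percent_change(RATE_BUILD_SH, RATE_LTO_PLUGIN));
}
```

Calling print_comparison() from main shows both differences are well under 1%, which is about the size of the run-to-run jitter visible in the per-thread numbers.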
So I will be using joblo's cpuminer with tpruvot's (uncommented) build.sh, because that build.sh has all those other flags (including -falign-*), which may or may not matter. Just to be safe.
EDIT: when I took the AVX2 binary and tried to run it on an AVX-only CPU, I got this:
CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
CPU features: SSE2 AES AVX
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX
Illegal instruction (core dumped)
But wasn't the whole idea that all the CPU features would be compiled in, and that the particular feature to use would be determined at runtime? It's not a big deal: I just recompiled it, so I will have two versions (AVX and AVX2) and run the one that's appropriate for the CPU. I just thought I would report this.
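For reference, this is roughly what runtime selection looks like when done with GCC's own builtins. It is only a sketch and not how cpuminer actually does it: the xor_blocks_* functions and select_xor_blocks are hypothetical names invented for the example. The point is that only the function marked with the target attribute is allowed to contain AVX2 instructions. If the whole binary is built with AVX2 enabled globally (for example via -march=native on an AVX2 host), the compiler is free to emit AVX2 anywhere, including before any feature check runs, which would be consistent with the "Illegal instruction" above.

```c
#include <stddef.h>
#include <stdint.h>

/* The CPU builtins used below are GCC/Clang extensions for x86 only. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Portable fallback: compiled with the baseline flags only. */
static void xor_blocks_generic(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

#if HAVE_X86_DISPATCH
/* Same loop, but the compiler may use AVX2 inside this one function,
 * even if the rest of the file is built without -mavx2. */
__attribute__((target("avx2")))
static void xor_blocks_avx2(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
#endif

typedef void (*xor_blocks_fn)(uint32_t *, const uint32_t *, size_t);

/* Pick an implementation based on what the running CPU reports. */
static xor_blocks_fn select_xor_blocks(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return xor_blocks_avx2;
#endif
    return xor_blocks_generic;
}
```

With this layout the same binary should run on both the E5-2676 v3 and the E5-2670 v2, as long as the global compiler flags do not go beyond what the older CPU supports.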