cpuminer-opt-3.7.3 is released.
git: https://github.com/JayDDee/cpuminer-opt
tarball: https://drive.google.com/file/d/1Nw-kHZ1bnkEjtuKlOdf3XSh-LNZReH4E/view?usp=sharing
Windows binaries: https://drive.google.com/file/d/1r6KLL_YBYvL3Lr0nfeVOEBlPXkATSIxw/view?usp=sharing
New in 3.7.3:
Added polytimos algo
Updated dockerfile
Introducing 4-way AVX2 optimization giving up to a 4x performance improvement
on many compute-bound algos. First supported algos: skein, skein2, blake &
keccak. This feature is only available when compiled from source. See
RELEASE_NOTES for instructions on how to enable 4-way during compilation
and which algos are currently supported.
Edit: It should be noted that not all algos have been fully tested using 4-way.
Skein and keccak have both been tested, but I couldn't find a pool for skein2 or blake.
For devs and geeks:
What is 4-way AVX2?
4-way AVX2 uses AVX2 instructions to hash 4 nonces in parallel per CPU thread instead of
just one, yielding a theoretical 4x increase. The realized improvement will be lower due
to AVX2 operating at a reduced clock speed and some overhead. The 4-way AVX2 optimization
uses vertical vectoring, similar to GPU miners, instead of the horizontal vectoring already
used by some algos. Horizontal vectoring uses AVX2 instructions to hash a single nonce faster.
Only CPUs with AVX2 will support 4-way.
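As a rough illustration of vertical vectoring (a minimal sketch, not code from this release; the function names are made up), a single __m256i register holds the same state word for four independent nonces, one per 64-bit lane, so one instruction advances all four hashes at once:

  #include <immintrin.h>

  /* Rotate each 64-bit lane left by 25 bits (fixed count, purely for illustration). */
  static inline __m256i mm256_rol64_25( __m256i x )
  {
     return _mm256_or_si256( _mm256_slli_epi64( x, 25 ),
                             _mm256_srli_epi64( x, 64 - 25 ) );
  }

  /* A Skein/Blake-style mix step performed on 4 nonces at once:
   * lane i of a and b belongs to nonce i. */
  static inline void mix_4way( __m256i *a, __m256i *b )
  {
     *a = _mm256_add_epi64( *a, *b );                     /* four 64-bit adds in one instruction */
     *b = _mm256_xor_si256( mm256_rol64_25( *b ), *a );   /* four rotate-xors in parallel */
  }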
Which algos can benefit?
This is the bad news. 4-way will help compute-bound algos that use SPH functions the most,
the same algos that already have very efficient GPU miners, or even ASICs, available.
They include the entire x11 family as well as more recent algos like hsr and polytimos.
Typical CPU algos will see little benefit because they usually don't rely on SPH, or SPH is only
a small part of the algo.
Hashing 4 nonces in parallel creates the possibility of submitting multiple nonces per scan,
though I haven't seen this in my testing. It will likely be rare but is supported.
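A minimal sketch of what such a scan loop could look like (hypothetical names, not the actual cpuminer-opt scanhash code): four nonces are hashed per pass and every lane is tested against the target, so a single pass can return more than one share.

  #include <stdint.h>
  #include <stdbool.h>

  /* Assumed helpers, declared here for illustration only. */
  extern void hash_4way( uint32_t hash[4][8], const uint32_t *data, uint32_t first_nonce );
  extern bool fulltest( const uint32_t *hash, const uint32_t *target );
  extern void submit_lane( int thr_id, uint32_t nonce );

  void scan_4way( int thr_id, const uint32_t *data, const uint32_t *target,
                  uint32_t first_nonce, uint32_t max_nonce )
  {
     uint32_t hash[4][8];
     for ( uint32_t n = first_nonce; n < max_nonce; n += 4 )
     {
        hash_4way( hash, data, n );              /* lane i hashes nonce n + i */
        for ( int lane = 0; lane < 4; lane++ )
           if ( hash[lane][7] <= target[7] && fulltest( hash[lane], target ) )
              submit_lane( thr_id, n + lane );   /* multiple hits per pass are possible */
     }
  }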
What overhead can reduce the gain?
As previously mentioned, AVX2 code runs at a reduced CPU clock rate, but there are also coding overhead
and workarounds that will affect some algos more than others.
Algos that are pure SPH chains will improve the most. The only overhead is to interleave the input data
and then deinterleave the vectorized hash.
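For illustration only (an assumed layout, not the actual cpuminer-opt interleave code), interleaving four independent 64-bit inputs means packing word i of each input into four consecutive lanes, so the vectored hash sees lane 0 = nonce 0, lane 1 = nonce 1, and so on; deinterleaving just reverses the copy:

  #include <stdint.h>

  /* Pack word i of four separate inputs into 4 consecutive 64-bit lanes. */
  static void interleave_4x64( uint64_t *dst, const uint64_t *s0, const uint64_t *s1,
                               const uint64_t *s2, const uint64_t *s3, int len )
  {
     for ( int i = 0; i < len; i++ )
     {
        dst[ 4*i     ] = s0[i];
        dst[ 4*i + 1 ] = s1[i];
        dst[ 4*i + 2 ] = s2[i];
        dst[ 4*i + 3 ] = s3[i];
     }
  }

  /* Unpack the vectorized hash back into four separate outputs. */
  static void deinterleave_4x64( uint64_t *d0, uint64_t *d1, uint64_t *d2, uint64_t *d3,
                                 const uint64_t *src, int len )
  {
     for ( int i = 0; i < len; i++ )
     {
        d0[i] = src[ 4*i     ];
        d1[i] = src[ 4*i + 1 ];
        d2[i] = src[ 4*i + 2 ];
        d3[i] = src[ 4*i + 3 ];
     }
  }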
Some SPH functions work on 32-bit data for their 256-bit versions while others work on 64-bit data for both the 512-bit
and 256-bit versions. If an algo has a mix of 32-bit and 64-bit data, extra interleaving/deinterleaving will be
required. This is not expected to have a significant impact.
Algos that use HW acceleration (AES or SHA) pose additional challenges for vertical vectoring,
and it may be very difficult or impossible to implement vertical vectoring for these functions. These
functions will have to be run 4 times serially, completely nullifying the benefit of 4-way for those
functions. Other parts of such algos can still use 4-way, with intermediate interleaving/deinterleaving.
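Schematically (hypothetical function names, building on the interleave sketch above), a hardware accelerated stage can be dropped into a 4-way chain by splitting the vectored state, hashing each lane on its own, then re-interleaving for the next 4-way stage:

  /* Hypothetical AES-NI based 512-bit hash of one lane, run 4 times serially. */
  extern void aes_hash_512( uint64_t out[8], const uint64_t in[8] );

  static void hw_stage_4way( uint64_t *vhash )   /* vhash: interleaved 4-way state */
  {
     uint64_t lane[4][8];                        /* one 512-bit state per nonce */

     deinterleave_4x64( lane[0], lane[1], lane[2], lane[3], vhash, 8 );
     for ( int i = 0; i < 4; i++ )
        aes_hash_512( lane[i], lane[i] );        /* serial: no 4-way gain in this stage */
     interleave_4x64( vhash, lane[0], lane[1], lane[2], lane[3], 8 );
  }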
Algos that already use horizontal vectoring are not likely to see any further improvement
when rewritten for vertical vectoring.
What's the implementation plan?
Around 20 SPH functions need to be rewritten for 4-way, but algos only become supported when the entire
function chain is optimized. The shorter chains will be implemented first and the longer ones last. There
is some flexibility in some instances, and I can be convinced to modify my initial plan with some motivation.
The pure SPH algos will be done first as they receive the biggest gains. Algos with existing SIMD code or
HW acceleration will be done next, with the existing optimizations run serially.
Next are algos that only use SPH for a portion of the chain. Since SPH is only part of the algo, the gains will be
limited.
Then I will try to rewrite the functions that use AES_NI to see if they can be improved with 4-way.
SHA looks too daunting as it means rewriting OpenSSL.
I may eventually look at rewriting the existing SIMD optimized functions to work with 4-way. The gains
will be limited due to losing the existing optimizations. It's a matter of hashing 4 nonces in parallel using
vertical vectoring vs hashing 4 nonces serially using horizontal vectoring.
What about the future?
All of the functions that will support 4-way with AVX2 can easily be extended to 8-way with AVX-512.
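For example (purely illustrative, assuming AVX-512F), the 4-way mix step sketched earlier widens naturally to 8 lanes, and AVX-512 even adds a native 64-bit rotate:

  #include <immintrin.h>

  /* Same mix step as the 4-way sketch, now advancing 8 nonces per instruction. */
  static inline void mix_8way( __m512i *a, __m512i *b )
  {
     *a = _mm512_add_epi64( *a, *b );                         /* eight 64-bit adds at once */
     *b = _mm512_xor_si512( _mm512_rol_epi64( *b, 25 ), *a ); /* native rotate, then xor */
  }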