[ANN] cpuminer-opt v3.14.2, open source optimized multi-algo CPU miner - page 13.

adamvp

hero member

Activity: 1246

Merit: 708

Is there any coin worth cpu mining right now?

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: nsummy on November 10, 2020, 03:10:28 PM

I have a feature/documentation request. I think it would be good to document which algos can take advantage of some of these newer CPU instruction sets. I've been mostly a GPU miner but CPU mining really intrigues me and I would like to do it as a side project. I also think documenting which algos are no longer "supported" would be beneficial, or somehow segregating them from the rest. Just looking at the algo list it's apparent that most of the included algos will never be seriously mined by a CPU. I definitely appreciate the work though, I have been using cpuminer-opt off and on for years now.

The short answer is any algo derivative of x11 or containing parts of x11, otherwise known as sha3 based.
The irony is that the algos that can use AVX512 etc can't compete with GPUs.

The long answer requires searching through the code of each algo.

I'm not sure what you mean by no longer supported. I focus more on newer CPUs and algos that are still mineable
with a CPU (although difficult for some) but all algos are still supported.

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: sech1 on November 10, 2020, 02:16:01 PM

Quote from: JayDDee on November 10, 2020, 12:10:47 PM

RandomX is a little different. It can benefit proportionally more with VAES512. Some of the AES sequences
alternate AESENC with AESDEC so they can't be paired. VAES512 can still provide a near 2x improvement
in the AES performance, while the pure AESENC or AESDEC sequences get near 4x. I don't know how much
AES factors in the performance of RandomX as a whole.

RandomX is limited by AES instruction latency. Main AES loop has 8 128-bit AES instructions and runs in 4 clock cycles per iteration on Ryzen. With VAES it's 4 256-bit AES instructions but still 4 clock cycles per iteration. It can't be parallelized because each iteration depends on the previous one. AESENC/AESDEC interleaving can be worked around with some clever use of _mm256_permute2x128_si256().

The extra permutes would kill the advantage of 2 way parallel AES, but yes, it can be done.
4 way parallel (AVX512) might overcome the penalty.

I looked at RandomX VAES a while ago but couldn't figure out how to enable AVX512 to compile with cmake. I'm not good with c++ either.

I'm playing with AVX2+VAES on my Icelake laptop; it looks like x17 will get a 7% boost by using VAES for groestl, Shavite & Echo.
I assume similar for Zen3.

If things work out the next release may include a zen3 build with VAES in addition to AVX2 & SHA.

nsummy

full member

Activity: 1179

Merit: 131

I have a feature/documentation request. I think it would be good to document which algos can take advantage of some of these newer CPU instruction sets. I've been mostly a GPU miner but CPU mining really intrigues me and I would like to do it as a side project. I also think documenting which algos are no longer "supported" would be beneficial, or somehow segregating them from the rest. Just looking at the algo list it's apparent that most of the included algos will never be seriously mined by a CPU. I definitely appreciate the work though, I have been using cpuminer-opt off and on for years now.

sech1

member

Activity: 116

Merit: 66

Quote from: JayDDee on November 10, 2020, 12:10:47 PM

RandomX is a little different. It can benefit proportionally more with VAES512. Some of the AES sequences
alternate AESENC with AESDEC so they can't be paired. VAES512 can still provide a near 2x improvement
in the AES performance, while the pure AESENC or AESDEC sequences get near 4x. I don't know how much
AES factors in the performance of RandomX as a whole.

RandomX is limited by AES instruction latency. Main AES loop has 8 128-bit AES instructions and runs in 4 clock cycles per iteration on Ryzen. With VAES it's 4 256-bit AES instructions but still 4 clock cycles per iteration. It can't be parallelized because each iteration depends on the previous one. AESENC/AESDEC interleaving can be worked around with some clever use of _mm256_permute2x128_si256().
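
A back-of-envelope model of why the wider instructions don't help here. It assumes the AES instructions inside one iteration are independent of each other but all have to wait for the previous iteration, so an iteration costs whichever is larger: one instruction latency, or the time to issue everything at the unit's throughput. The 4 cycle latency and 2 instructions/cycle are the figures measured earlier in this thread; the function name is made up for illustration and this is a sketch, not a simulation of the real pipeline.

```python
def cycles_per_iteration(n_instr, latency=4, per_cycle=2):
    """Simplified cost of one loop iteration, in clock cycles.

    Assumption: the n_instr AES instructions of an iteration are
    independent of each other, but none can start before the previous
    iteration has finished. The iteration then takes the larger of:
      - the latency of one instruction (dependency bound)
      - the time needed to issue all of them (throughput bound)
    """
    issue_time = n_instr / per_cycle
    return max(latency, issue_time)


aes128 = cycles_per_iteration(8)   # 8 x 128-bit AESENC/AESDEC
vaes256 = cycles_per_iteration(4)  # 4 x 256-bit VAESENC/VAESDEC
print(aes128, vaes256)
```

With 8 instructions the two bounds happen to coincide at 4 cycles. With 4 wider instructions the issue time drops to 2 cycles, but the latency floor stays at 4, so the iteration is no faster.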

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: sech1 on November 10, 2020, 10:15:27 AM

Official slide from AMD: https://www.techpowerup.com/review/amd-ryzen-5-3600/images/arch1.jpg
"Now supports single-op AVX256"

Ryzen 7 4700U (Zen2) laptop (lots of stuff running in there, but the difference is obvious):
cpuminer-aes-sse42 --benchmark --algo=blake2s: 106 MH/s
cpuminer-avx.exe --benchmark --algo=blake2s: 113 MH/s
cpuminer-avx2.exe --benchmark --algo=blake2s: 195 MH/s

Thanks very much, the numbers are convincing.

That means there would be some benefit from 256 bit VAES on Zen3. I'll look into it further.
It could improve the old X algos with faster groestl, shavite and echo.

RandomX is a little different. It can benefit proportionally more with VAES512. Some of the AES sequences
alternate AESENC with AESDEC so they can't be paired. VAES512 can still provide a near 2x improvement
in the AES performance, while the pure AESENC or AESDEC sequences get near 4x. I don't know how much
AES factors in the performance of RandomX as a whole.

sech1

member

Activity: 116

Merit: 66

Quote from: JayDDee on November 10, 2020, 08:22:49 AM

Quote from: sech1 on November 10, 2020, 02:00:54 AM

Quote from: JayDDee on November 09, 2020, 06:07:48 PM

Can you test AVX2 performance vs AVX? If VAES gives full 256 bit throughput I would expect AVX2 to do so as well.
In previous generations of Ryzen it did not.
Should be close to double AVX. It would be nice if AMD fixed that for Zen3.

It is full 256 bit since Zen2, no need to test it. Only Zen1/Zen+ had a 128-bit FP ALU and split AVX instructions in 2.
Read: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Key_changes_from_Zen.2B "2x wider datapath (256-bit, up from 128-bit)"

If I've been wrong it's because I believed what someone told me. To quote Roger Daltrey, "I won't get fooled again". Please test.

Official slide from AMD: https://www.techpowerup.com/review/amd-ryzen-5-3600/images/arch1.jpg
"Now supports single-op AVX256"

Ryzen 7 4700U (Zen2) laptop (lots of stuff running in there, but the difference is obvious):
cpuminer-aes-sse42 --benchmark --algo=blake2s: 106 MH/s
cpuminer-avx.exe --benchmark --algo=blake2s: 113 MH/s
cpuminer-avx2.exe --benchmark --algo=blake2s: 195 MH/s
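
To put the numbers side by side, a small sketch that computes the AVX2/AVX ratio for the blake2s result above and for the R7-1700 results posted earlier in this thread. The machines, OS and background load differ, so only the ratios are worth comparing, not the absolute hashrates. The helper name is just for illustration.

```python
# Hashrates in MH/s, copied from the benchmarks posted in this thread.
results = {
    "R7-1700 (Zen1) sha256t": {"avx": 25.8, "avx2": 26.5},
    "R7-1700 (Zen1) blake2s": {"avx": 129.7, "avx2": 141.5},
    "R7 4700U (Zen2) blake2s": {"avx": 113.0, "avx2": 195.0},
}


def avx2_speedup(avx, avx2):
    """Ratio of AVX2 to AVX hashrate; close to 2.0 would be ideal scaling."""
    return avx2 / avx


for name, r in results.items():
    print(f"{name}: {avx2_speedup(r['avx'], r['avx2']):.2f}x")
```

That works out to roughly 1.03x and 1.09x on the R7-1700 against about 1.73x on the 4700U.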

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: sech1 on November 10, 2020, 02:00:54 AM

Quote from: JayDDee on November 09, 2020, 06:07:48 PM

Can you test AVX2 performance vs AVX? If VAES gives full 256 bit throughput I would expect AVX2 to do so as well.
In previous generations of Ryzen it did not.
Should be close to double AVX. It would be nice if AMD fixed that for Zen3.

It is full 256 bit since Zen2, no need to test it. Only Zen1/Zen+ had a 128-bit FP ALU and split AVX instructions in 2.
Read: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Key_changes_from_Zen.2B "2x wider datapath (256-bit, up from 128-bit)"

If I've been wrong it's because I believed what someone told me. To quote Roger Daltrey, "I won't get fooled again". Please test.

sech1

member

Activity: 116

Merit: 66

Quote from: JayDDee on November 09, 2020, 06:07:48 PM

Can you test AVX2 performance vs AVX? If VAES gives full 256 bit throughput I would expect AVX2 to do so as well.
In previous generations of Ryzen it did not.
Should be close to double AVX. It would be nice if AMD fixed that for Zen3.

It is full 256 bit since Zen2, no need to test it. Only Zen1/Zen+ had a 128-bit FP ALU and split AVX instructions in 2.
Read: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Key_changes_from_Zen.2B "2x wider datapath (256-bit, up from 128-bit)"

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: sech1 on November 09, 2020, 04:43:27 PM

Quote from: JayDDee on November 09, 2020, 01:45:04 PM

Once again the question of AVX2 performance on Ryzen has surfaced, this time with a new twist: VAES.
Although Zen 3 includes some significant architectural changes none of them seem to apply to the execution engine.
The ease with which 256 bit VAES was added without any additional instructions suggests VAES was implemented
using the existing 128 bit AES hardware. This would be consistent with the previous AVX2 implementation in Ryzen
which performs 128 bit instructions internally and results in no performance increase.

No actual testing was done on a Zen3 CPU, I don't have one.

VAES is full 256 bit in Zen3, I've tested it. VAES 256-bit instructions have the same latency/twice the throughput compared to 128-bit AES instructions. I've also tested two different VAES implementations for RandomX and they both didn't give any speedup. The AES part of RandomX is limited by AES instruction latency, not bandwidth.

Edit: the measured latency was 4 cycles for both AESENC and VAESENC. Throughput was 2 instructions/cycle for both.

Can you test AVX2 performance vs AVX? If VAES gives full 256 bit throughput I would expect AVX2 to do so as well.
In previous generations of Ryzen it did not.

I have to admit I'm skeptical because 256 bit VAES would be very simple to add and would only require a change
to the instruction decoder to support the new opcode. Supporting 256 bit throughput is a more radical change and would
require a redesigned vector execution unit that would also benefit AVX2.

If AVX2 performance improved, it will convince me that Zen3 has an improved vector unit.

Edit: I ran a test to demonstrate the poor AVX2 performance:

CPU: R7-1700
OS: Ubuntu-20.04, cpuminer compiled using build-allarch.sh. Windows binaries can also be used.

Benchmark test sha256t and blake2s using cpuminer-avx2 & cpuminer-avx. These algos are 100% SIMD
and the AVX2 code is identical to AVX except for the vector size.

Code:

          sha256t        blake2s
AVX2      26.5 Mh/s    141.5 Mh/s
AVX       25.8 Mh/s    129.7 Mh/s

AVX2 should be close to double AVX. It would be nice if AMD fixed that for Zen3.

sech1

member

Activity: 116

Merit: 66

Quote from: JayDDee on November 09, 2020, 01:45:04 PM

Once again the question of AVX2 performance on Ryzen has surfaced, this time with a new twist: VAES.
Although Zen 3 includes some significant architectural changes none of them seem to apply to the execution engine.
The ease with which 256 bit VAES was added without any additional instructions suggests VAES was implemented
using the existing 128 bit AES hardware. This would be consistent with the previous AVX2 implementation in Ryzen
which performs 128 bit instructions internally and results in no performance increase.

No actual testing was done on a Zen3 CPU, I don't have one.

VAES is full 256 bit in Zen3, I've tested it. VAES 256-bit instructions have the same latency/twice the throughput compared to 128-bit AES instructions. I've also tested two different VAES implementations for RandomX and they both didn't give any speedup. The AES part of RandomX is limited by AES instruction latency, not bandwidth.

Edit: the measured latency was 4 cycles for both AESENC and VAESENC. Throughput was 2 instructions/cycle for both.

JayDDee

full member

Activity: 1436

Merit: 232

cpuminer-opt-3.15.1

Fix compile on AMD Zen3 CPUs with VAES.
Force new work immediately after solving a block solo.

https://github.com/JayDDee/cpuminer-opt

Notes for Zen3:

Zen 3 adds VAES for 256 bit vectors. Although cpuminer-opt supports VAES with 512 bit vectors it does
not support 256 bit VAES. This will result in an algo's VAES optimizations not being used on Zen3.

Compilers don't yet support the new znver3 architecture flag. Using znver2 is recommended when compiling
from source code.

It is possible to compile Zen3 with VAES by using "-march=znver2 -mvaes", but it makes no difference to cpuminer-opt.

Windows users should use the cpuminer-zen build on Zen3 CPUs.

Once again the question of AVX2 performance on Ryzen has surfaced, this time with a new twist: VAES.
Although Zen 3 includes some significant architectural changes none of them seem to apply to the execution engine.
The ease with which 256 bit VAES was added without any additional instructions suggests VAES was implemented
using the existing 128 bit AES hardware. This would be consistent with the previous AVX2 implementation in Ryzen
which performs 128 bit instructions internally and results in no performance increase.

No actual testing was done on a Zen3 CPU, I don't have one.

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: sech1 on November 06, 2020, 03:42:38 AM

Quote from: JayDDee on November 05, 2020, 09:18:28 PM

AMD will require a different CPUID configuration to distinguish Intel VAES from AMD VAES-256.

A new compile architecture is required to generate the new 256 bit AES instructions.

There is no performance gain because Zen AVX2 256 bit instructions are implemented internally as
2 128 bit instructions.

In cpuminer-opt all VAES is 4-way 512 bits and requires proper AVX512 support (AVX512F for 512 bit)
with expected performance gains.

There is nothing to be gained in cpuminer-opt by coding for 256 bit AES or recompiling the existing code for Zen3.

Same CPUID bit as far as I can see: "VAES CPUID Fn0000_0007_ECX[VAES]_x0 (bit 9)" from https://www.amd.com/system/files/TechDocs/26568.pdf
"Zen AVX2 256 bit instructions are implemented internally as 2 128 bit instructions" what? Zen2/Zen3 has full 256-bit FP unit and AES instructions run there. VAES also runs there at full 256 bit throughput.

Throughput of AVX2 is easy to test. Just run a cpuminer-opt benchmark of sha256t algo using cpuminer-avx vs cpuminer-avx2.
The initial Zen1 AVX2 is known to be a hack that uses 128 bits internally and there's no indication in any reviews that Zen3
is any better. But a test will confirm it.

I'd appreciate if you could provide the exact feature list and do some performance testing of AVX vs AVX2 when you get yours.
Regardless, there is currently no code in cpuminer-opt that does 2 way parallel hashing, with or without AES. 2 way hashing, even
on Intel, doesn't provide enough performance gain to overcome the extra overhead.

Edit: there may be some issues compiling for Zen3 with vaes.
There is some code that assumes AVX512 is included if VAES is present. This will require a new release of cpuminer-opt.
In addition, there is no compiler support for znver3 yet.
The workaround is to compile with -march=znver2 until both are fixed.
There are no performance implications because cpuminer-opt has no code that can take advantage of 256 bit VAES.
Prebuilt Windows binaries are not affected, znver1 is used for the zen build.

sech1

member

Activity: 116

Merit: 66

Quote from: JayDDee on November 05, 2020, 09:18:28 PM

AMD will require a different CPUID configuration to distinguish Intel VAES from AMD VAES-256.

A new compile architecture is required to generate the new 256 bit AES instructions.

There is no performance gain because Zen AVX2 256 bit instructions are implemented internally as
2 128 bit instructions.

In cpuminer-opt all VAES is 4-way 512 bits and requires proper AVX512 support (AVX512F for 512 bit)
with expected performance gains.

There is nothing to be gained in cpuminer-opt by coding for 256 bit AES or recompiling the existing code for Zen3.

Same CPUID bit as far as I can see: "VAES CPUID Fn0000_0007_ECX[VAES]_x0 (bit 9)" from https://www.amd.com/system/files/TechDocs/26568.pdf
"Zen AVX2 256 bit instructions are implemented internally as 2 128 bit instructions" what? Zen2/Zen3 has full 256-bit FP unit and AES instructions run there. VAES also runs there at full 256 bit throughput.
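
Since it is the same CPUID bit, software that wants to tell 256-bit-only VAES apart from 512-bit VAES has to look at AVX512F as well. A minimal sketch of that decision, working on raw EBX/ECX values from CPUID leaf 7, subleaf 0, obtained some other way (Python does not read CPUID itself). VAES is ECX bit 9 as in the AMD document; AVX512F is EBX bit 16. The function names and the sample register values are made up for illustration, and OS support for the wider registers is ignored here.

```python
# Bit positions in CPUID leaf 7, subleaf 0 (EAX=7, ECX=0).
AVX512F_EBX_BIT = 16
VAES_ECX_BIT = 9


def bit(reg, n):
    """True if bit n of the register value is set."""
    return (reg >> n) & 1 == 1


def vaes_width(ebx, ecx):
    """Widest VAES vector size in bits implied by the flags, 0 if no VAES.

    Simplification: VAES together with AVX512F is treated as 512 bit,
    VAES alone as 256 bit. OS-enabled state (XGETBV) is not checked.
    """
    if not bit(ecx, VAES_ECX_BIT):
        return 0
    return 512 if bit(ebx, AVX512F_EBX_BIT) else 256


# Hypothetical register values, not dumps from real CPUs.
zen3_like = vaes_width(ebx=1 << 5, ecx=1 << 9)                  # AVX2 + VAES
icelake_like = vaes_width(ebx=(1 << 5) | (1 << 16), ecx=1 << 9)  # + AVX512F
print(zen3_like, icelake_like)
```

Code that treats the VAES flag alone as implying 512 bit vectors would take the wrong path on the first of these.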

JayDDee

full member

Activity: 1436

Merit: 232

Comment regarding AMD Zen3 (Ryzen 5000)

Zen 3 includes VAES for 256 bit (AVX2) vectors. This is a 2 way parallel operation backported
from Intel AVX512VL. All Intel CPUs require AVX512VL and VAES to implement 256 bit parallel AES.

A few bullet points:

AMD will require a different CPUID configuration to distinguish Intel VAES from AMD VAES-256.

A new compile architecture is required to generate the new 256 bit AES instructions.

There is no performance gain because Zen AVX2 256 bit instructions are implemented internally as
2 128 bit instructions.

In cpuminer-opt all VAES is 4-way 512 bits and requires proper AVX512 support (AVX512F for 512 bit)
with expected performance gains.

There is nothing to be gained in cpuminer-opt by coding for 256 bit AES or recompiling the existing code for Zen3.

JayDDee

full member

Activity: 1436

Merit: 232

cpuminer-opt-3.15.0

Fugue optimized with AES, improves many sha3 algos.
Minotaur algo optimized for all architectures.
Fixed neoscrypt BUG log.

https://github.com/JayDDee/cpuminer-opt/releases/tag/v3.15.0

BoozyTalking

newbie

Activity: 315

Merit: 0

This is QAC coin, solo mining with a local wallet.
It happens after a share is found.

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: BoozyTalking on August 13, 2020, 07:11:31 AM

Hi, JayDDee. What does this error mean:

Code:

[2020-08-13 15:08:46] JSON-RPC call failed: Block does not start with a coinbase
[2020-08-13 15:08:46] submit_upstream_work json_rpc_call failed
[2020-08-13 15:08:46] ...retry after 10 seconds

That is an error reported by the stratum server, never seen it before. Need more context.

BoozyTalking

newbie

Activity: 315

Merit: 0

Hi, JayDDee. What does this error mean:

Code:

[2020-08-13 15:08:46] JSON-RPC call failed: Block does not start with a coinbase
[2020-08-13 15:08:46] submit_upstream_work json_rpc_call failed
[2020-08-13 15:08:46] ...retry after 10 seconds

JayDDee

full member

Activity: 1436

Merit: 232

Quote from: webhead on August 10, 2020, 12:34:14 PM

Hello, can you add the XLA Panthera algo? Many thanks.

The algo is a modified randomx and the existing miner is a fork of xmrig. I can't do better.
