[ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 78.

coinbutter

newbie

Activity: 25

Merit: 0

Quote from: ZenFr on April 05, 2017, 07:19:31 AM

Quote from: joblo on April 05, 2017, 07:06:15 AM

Quote from: onedeveloper on April 05, 2017, 02:47:06 AM

You can also use 0x0F0F or other combinations, providing there are only 4 CPUs (bits) active on first two digits and 4 on second.

I hope all is clear now.

Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?

What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?

ZenFr,
2C/4T 0xA or 0x5
4C/8T 0xAA or 0x55
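As a rough illustration of where those values come from, here is a minimal Python sketch that sets a bit for every second logical CPU. It assumes the two SMT siblings of a core are numbered consecutively (0/1, 2/3, ...), which is not confirmed for every CPU or OS; the function name is made up for this example.

```python
def every_other_cpu_mask(logical_cpus: int, offset: int = 0) -> int:
    """Set one bit for every second logical CPU, starting at `offset` (0 or 1).

    Assumption: SMT siblings are adjacent logical CPUs, so picking every
    second one lands one thread on each physical core.
    """
    mask = 0
    for cpu in range(offset, logical_cpus, 2):
        mask |= 1 << cpu
    return mask


for cpus in (4, 8):
    even = every_other_cpu_mask(cpus, 0)
    odd = every_other_cpu_mask(cpus, 1)
    print(f"{cpus} logical CPUs: {even:#x} or {odd:#x}")
```

For 4 and 8 logical CPUs this prints 0x5/0xa and 0x55/0xaa, matching the masks above.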

edit: onedeveloper, I was misremembering the L3 cache structure on the Ryzen. It's a unified 8 MB victim cache per CCX, not a 2 MB L3 victim cache per core. Still, there is a cache bandwidth issue, and the nature of the victim cache means that once the 8 MB per CCX is full it starts moving data to the other CCX or, if that is full as well, to system RAM. If you mask 0x5 on Ryzen you bind the process to logical processors 0 and 2, cores 1 and 2 of CCX 1. If you mask 0xA on Ryzen you bind the process to logical processors 1 and 3, cores 1 and 2 of CCX 1.

ZenFr

legendary

Activity: 1260

Merit: 1046

Quote from: joblo on April 05, 2017, 07:06:15 AM

Quote from: onedeveloper on April 05, 2017, 02:47:06 AM

You can also use 0x0F0F or other combinations, providing there are only 4 CPUs (bits) active on first two digits and 4 on second.

I hope all is clear now.

Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?

What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: onedeveloper on April 05, 2017, 02:47:06 AM

You can also use 0x0F0F or other combinations, providing there are only 4 CPUs (bits) active on first two digits and 4 on second.

I hope all is clear now.

Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?

onedeveloper

full member

Activity: 143

Merit: 100

I must "enter the fray" because I see coinbutter is making a mistake. Take these notes into account.

CPU affinity treats the bits of the mask as flags, one for each logical CPU. In Windows 8 and later you can look in Task Manager to see how many logical CPUs it recognizes.
Each bit represents a CPU, as joblo said: bit 0 is CPU 0, bit 1 is CPU 1, and so on.
In the Ryzen case, as you know, the 16 logical CPUs are organized in two blocks of 8, one block per CCX, and each CCX shares a common 8 MB cache. This limits the number of cryptonight threads per CCX to 8 MB / 2 MB = 4 threads.
It doesn't matter which logical CPUs get the threads on each CCX, as long as there are no more than 4.
As each thread is a bit, Ryzen masks have 16 bits, i.e. 4 hexadecimal digits.
Hexadecimal numbers are written from high bit to low bit, so the leftmost bit is the 16th CPU (CPU 15) in the Ryzen case.
If you want to use the maximum of 8 threads on Ryzen 7, you must use 4 threads on the 1st CCX and the other 4 on the 2nd.
A mask like 0xF0F0 is enough. This means that 8 threads will be assigned to the logical CPUs 15, 14, 13, 12 (second CCX) and 7, 6, 5, 4 (first CCX). The rest of the CPUs will be left free and usable for any other task.
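To check a mask against the 4-threads-per-CCX rule, a small Python sketch can decode it. It assumes, as described above, that logical CPUs 0-7 sit on the first CCX and 8-15 on the second; the actual mapping on a given system may differ, and the helper names are invented for this example.

```python
def cpus_in_mask(mask: int) -> list:
    """Return the logical CPU numbers whose bits are set, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def threads_per_ccx(mask: int, cpus_per_ccx: int = 8) -> dict:
    """Count selected logical CPUs per CCX.

    Assumption: CPUs are numbered CCX by CCX (0-7 first CCX, 8-15 second).
    """
    counts = {}
    for cpu in cpus_in_mask(mask):
        ccx = cpu // cpus_per_ccx
        counts[ccx] = counts.get(ccx, 0) + 1
    return counts


print(cpus_in_mask(0xF0F0))     # [4, 5, 6, 7, 12, 13, 14, 15]
print(threads_per_ccx(0xF0F0))  # {0: 4, 1: 4}
```

Under the same assumption, 0x0F0F also gives 4 per CCX, while a mask such as 0x00FF would put all 8 on the first CCX.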

How to mine:

cpuminer-aes-avx2 -a cryptonight -o -u -p -t 8 --cpu-affinity 0xF0F0

You can also use 0x0F0F or other combinations, providing there are only 4 CPUs (bits) active on first two digits and 4 on second.

I hope all is clear now.

andy75

full member

Activity: 141

Merit: 100

Has anyone tried changing the JSON parser? Do you think it would help performance?

There are faster and more efficient parsers out there.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: coinbutter on April 04, 2017, 10:07:11 PM

Quote from: joblo on April 04, 2017, 09:06:47 PM

A mask of 0x54 doesn't seem to make sense, it's not symmetric.

That mask is logical processors 2,4 and 6. 00101010

I don't load the first core for testing until I reach four threads and since this is cryptonight I was increasing it one thread at a time to find out where the L3 cache saturated.

I think the cryptographic functions in Ryzen are shared in each core, so even though Simultaneous MultiThreading would allow two threads to execute on a core they get bottlenecked waiting for a thread to finish its hash. I think it's most efficient to only put one thread on each core instead of loading every logical processor.

I'd be very interested in having the thread count match the number of processors in the mask and automatically assigning each thread to its own processor. Call it Ryzen mode!

00101010 is not 0x54, it's 0x2a, and it's logical cpu 1, 3 & 5 the way I read it. But how to achieve one thread on
every physical core still depends on how AMD maps logical CPUs to physical cores and CCXs. If the default isn't optimum,
then AMD maps differently than Intel and cpuminer requires a way to specify the optimum mapping for AMD.

I also made a mistake in my examples, every other cpu is 0x5555.
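The hex/binary readings in this exchange are easy to double-check. The following Python sketch (the `describe` helper is made up for this example) prints each mask in hex and binary together with the logical CPUs it selects, counting bit 0 as CPU 0.

```python
def describe(mask: int, width: int = 8) -> str:
    """Show a mask as hex, as a fixed-width bit string, and as a CPU list."""
    bits = format(mask, f"0{width}b")
    cpus = [i for i in range(width) if (mask >> i) & 1]
    return f"{mask:#04x} = {bits} -> logical CPUs {cpus}"


print(describe(int("00101010", 2)))  # 0x2a = 00101010 -> logical CPUs [1, 3, 5]
print(describe(0x54))                # 0x54 = 01010100 -> logical CPUs [2, 4, 6]
print(describe(0x5555, 16))          # every other CPU out of 16, starting at 0
```

So the CPU list 2, 4, 6 does match 0x54, but the bit string written for it is the one for 0x2a.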

It also needs to be confirmed whether multiple bits set in the mask mean that the thread can be mapped to any
of the associated logical cpus as well as be moved among them.

The optimum configuration for cryptonight on an 8-core/16-thread Ryzen is to have 8 threads, one nailed to each physical core, which also means 4 per CCX.
The question is how to do that. Need to understand AMD's mapping first.

coinbutter

newbie

Activity: 25

Merit: 0

Quote from: joblo on April 04, 2017, 09:06:47 PM

A mask of 0x54 doesn't seem to make sense, it's not symmetric.

That mask is logical processors 2,4 and 6. 00101010

I don't load the first core for testing until I reach four threads and since this is cryptonight I was increasing it one thread at a time to find out where the L3 cache saturated.

I think the cryptographic functions in Ryzen are shared in each core, so even though Simultaneous MultiThreading would allow two threads to execute on a core they get bottlenecked waiting for a thread to finish its hash. I think it's most efficient to only put one thread on each core instead of loading every logical processor.

I'd be very interested in having the thread count match the number of processors in the mask and automatically assigning each thread to its own processor. Call it Ryzen mode!

joblo

legendary

Activity: 1470

Merit: 1114

Thanks for the testing, lots of info to digest.

deep: slower with avx2, may be an issue with cpu affinity (see below), maybe retest.

sha256t: all rejects, it doesn't use the code that broke groestl so needs more investigation.

Edit: please try sha256t with v3.5.9.1

CPU affinity:

Code:

[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54

Here's a description of the Windows function SetThreadAffinityMask

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx

The default with no affinity arg is to set one bit in the mask to match the thread with the cpu#:
cpu 0 = mask 1, cpu 1 = mask 2, cpu 2 = mask 4, cpu 3 = mask 8. Each thread is assigned to a different cpu.
On an Intel i7 running 4 threads this works, with one thread on each core and no core with 2 threads.
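A single-bit mask for CPU n is simply 1 shifted left by n. The Python sketch below illustrates that default behaviour as described here; it is not taken from the cpuminer source, and the wrap-around with modulo when there are more threads than CPUs is an assumption made for the example.

```python
def default_thread_mask(thread_id: int, num_cpus: int) -> int:
    """One bit per thread: thread n is bound to logical CPU n.

    Assumption: thread ids wrap around when they exceed the CPU count.
    """
    return 1 << (thread_id % num_cpus)


for thread_id in range(4):
    mask = default_thread_mask(thread_id, num_cpus=8)
    print(f"thread {thread_id} -> cpu {thread_id} -> mask {mask:#x}")
```

This prints masks 0x1, 0x2, 0x4 and 0x8 for threads 0 to 3.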

When multiple bits are set in the mask the thread can be assigned to any of the cpus represented in the mask.
I don't know what happens if multiple threads are assigned using the same multibit mask.

The code as written doesn't seem to allow a different mask for each thread. Maybe it relies on the OS to sort
it out. I am speculating that if multiple cpus are allowed the thread may be moved to another permitted cpu.

A mask of 0x54 doesn't seem to make sense, it's not symmetric.

If i understand correctly a mask of...

0xffff: assign to any cpu
0x1111: assign 4 threads to either cpu 0, 4, 8 or 12
Edit: correction
0x5555 (not 0x3333): assign 8 threads to cpu 0, 2, 4, 6, 8, 10, 12 or 14
0x000f: assign to 0, 1, 2, or 3

On Ryzen you must consider the CCX as well as SMT (HT). Depending on how the CPUs are mapped,
CPUs 0 & 1 may be:
- two HT threads on the first core on the first CCX,
- the first thread on 2 different cores on the first CCX
- the first thread on the first core of 2 different CCXs
- something else

It's going to take a lot of playing around to figure it all out. Once the mapping is understood it can be determined
whether a multibit mask allows the system to move threads to another CPU. Previous observations of better performance
with multiple miner instances suggest it can.

Any suggestions on how to specify a custom single bit mask for each thread? One thought is to use the affinity arg as
spacing between cpus. For instance an arg of:

0 = consecutive CPUs, ie 0,1,2,3,4...
1 = alternate cpus, 0, 2, 4, 6, 8...
2 = every third cpu, 0, 3, 6, 9... useful for 6/12 core Ryzen maybe
3 = every 4th cpu

Edit:

Another thought is to have a binary option for affinity, consecutive or distributed. With distributed the spacing
would be calculated automatically based on the cpu and thread counts.
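Both ideas can be sketched in a few lines of Python. This is only an illustration of the proposal, not existing cpuminer behaviour: the function names, the wrap-around with modulo, and the choice of cpu count divided by thread count as the automatic step are all assumptions.

```python
def spaced_thread_masks(threads: int, num_cpus: int, spacing: int) -> list:
    """Single-bit mask per thread, skipping `spacing` CPUs between threads.

    spacing 0 = consecutive CPUs, 1 = alternate CPUs, 2 = every third CPU.
    """
    step = spacing + 1
    return [1 << ((t * step) % num_cpus) for t in range(threads)]


def distributed_thread_masks(threads: int, num_cpus: int) -> list:
    """'Distributed' option: derive the step from the CPU and thread counts."""
    step = max(1, num_cpus // threads)
    return [1 << ((t * step) % num_cpus) for t in range(threads)]


# 8 threads on 16 logical CPUs, alternate CPUs: 0, 2, 4, ... 14
masks = spaced_thread_masks(threads=8, num_cpus=16, spacing=1)
print([f"{m:#x}" for m in masks])

# Every third CPU on a 12-CPU part: 0, 3, 6, 9
print([f"{m:#x}" for m in spaced_thread_masks(4, 12, 2)])

# Automatic spacing gives the same result as spacing=1 for 8 threads on 16 CPUs
print(distributed_thread_masks(8, 16) == masks)
```

Whether alternate CPUs really means one thread per physical core still depends on how the logical CPUs are mapped, as discussed above.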

coinbutter

newbie

Activity: 25

Merit: 0

Quote from: joblo on April 04, 2017, 12:53:22 PM

I retested v3.5.9.1 using the Windows AVX2 binary on Win 8.1 at suprnova groestlcoin and it works.
I can't explain your rejects.

I tried running it on my X5687 with -a dmd-gr (and -a groestl) and get the same issues with 3.6.1 and 3.5.9.1 (aes-sse42). Perhaps it's something dmd-gr related. I know it's the same algo but there has to be something that causes the (reject reason: low difficulty share of 1.3270640174699655e-7) error.

I'll have some sha256t and deepcoin results on my R7 in a bit.

edit: I'm solo mining dmd-gr and I'll let it run for a few days on the X5687.
edit2: It works fine pool mining GRS. Definitely something DMD related.
edit3: Ryzen R7 1800X @ 3.8 2993 DDR4

Deep algo pool: aes-sse42 ~533 kH/s/core, aes-avx ~543 kH/s/core, aes-avx2 ~492 kH/s/core
sha256t algo pool: aes-sse42 ~1333 kH/s/core, aes-avx ~1353 kH/s/core, aes-avx2 ~1639 kH/s/core (all rejected due to low difficulty)
groestl algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
groestl algo pool 3.5.9.1: aes-sse42 ~656 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all accepted)
dmd-gr algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
dmd-gr algo pool 3.5.9.1: aes-sse42 ~655 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all rejected due to low difficulty)

edit4: It looks like thread binding goes to the processor mask instead of to a physical core. The Windows scheduler is probably moving the threads:

Code:

[2017-04-04 18:27:47] Binding process to cpu mask 54
[2017-04-04 18:27:47] Starting Stratum on stratum+tcp://xmr-usa.dwarfpool.com:8005
[2017-04-04 18:27:47] 3 miner threads started, using 'cryptonight' algorithm.
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 1 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 0 to cpu mask 54

Running an instance on one physical core (one thread) ~60 H/s/thread
Running an instance on three physical cores (same CCX) (three threads) ~52 H/s/thread
I'm going to try to run an instance across both CCXs and see if worse performance results.
edit5: Running an instance on four physical cores (2Cx2 CCX) (four threads) ~53 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (four threads) ~60 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (six threads) ~55 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (six threads) ~57 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (eight threads) ~49 H/s/thread
Additionally, I could see the scheduler moving the threads when more cores than threads were assigned. Unsurprisingly, the scheduler kept a thread off of core 0 even when it was allowed in the processor mask. I think the cross-core cache bandwidth gets filled up at some point. Also, once I hit 4 threads/CCX the L3 cache was probably stuffed and sending data to RAM. The hashrate got as low as 38.84 H/s on one thread at one point, possibly indicating that the cache size was exceeded or that the data was on the other CCX due to the design of Ryzen's victim cache.
Perhaps assigning the threads on the cryptonight algorithm to a physical processor would help alleviate some of the architectural limitations. I'm done with cryptonight for tonight.
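The per-thread figures above are easier to compare as totals. A quick Python sketch, using only the rounded H/s/thread values from this post, so the totals are approximate:

```python
# (configuration, threads, approximate H/s per thread) taken from the runs above
runs = [
    ("1 core, 1 thread", 1, 60),
    ("3 cores (same CCX), 3 threads", 3, 52),
    ("4 cores (2C x 2 CCX), 4 threads", 4, 53),
    ("6 cores (3C x 2 CCX), 4 threads", 4, 60),
    ("6 cores (3C x 2 CCX), 6 threads", 6, 55),
    ("8 cores (4C x 2 CCX), 6 threads", 6, 57),
    ("8 cores (4C x 2 CCX), 8 threads", 8, 49),
]

totals = [(label, threads * per_thread) for label, threads, per_thread in runs]
for label, total in totals:
    print(f"{label}: ~{total} H/s total")
```

On these numbers the total still rises up to 8 threads (about 392 H/s) even though the per-thread rate drops.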

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: giagge on April 04, 2017, 02:38:02 PM

I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news of a boost in a new update? I use Windows 10 x64.

Hexxcoin is pure Lyra2, it's pretty much maxed out. Post your results.

giagge

legendary

Activity: 1134

Merit: 1001

I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news of a boost in a new update? I use Windows 10 x64.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: onedeveloper on April 04, 2017, 01:42:58 PM

Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.

It's also faster than the openssl version (without HW SHA) which also surprised me. I would have expected
openssl to have AVX and AVX2 optimizations but it's slower than the SPH implementation included in cpuminer.

onedeveloper

full member

Activity: 143

Merit: 100

Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: onedeveloper on April 04, 2017, 10:42:22 AM

Doing the same tests, this time with the "sha256t" algo, I got these results. AVX-only version:

Code:

CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:35:52] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:35:53] CPU #1: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #2: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #3: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] Total: 786.43 kH, 963.75 kH/s
[2017-04-04 16:35:53] CPU #0: 262.14 kH, 293.18 kH/s
[2017-04-04 16:35:57] CPU #0: 1172.72 kH, 297.42 kH/s
[2017-04-04 16:35:57] CPU #2: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] CPU #3: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] Total: 4004.85 kH, 1250.43 kH/s
[2017-04-04 16:35:57] CPU #1: 1284.99 kH, 313.47 kH/s
[2017-04-04 16:36:02] CPU #3: 1579.41 kH, 316.94 kH/s
[2017-04-04 16:36:02] Total: 5322.12 kH, 1243.72 kH/s
[2017-04-04 16:36:02] CPU #0: 1487.11 kH, 294.72 kH/s
[2017-04-04 16:36:02] CPU #2: 1579.41 kH, 312.05 kH/s
[2017-04-04 16:36:02] CPU #1: 1567.37 kH, 305.89 kH/s
[2017-04-04 16:36:06] CPU #0: 1473.59 kH, 303.84 kH/s
[2017-04-04 16:36:07] CPU #2: 1560.23 kH, 314.61 kH/s
[2017-04-04 16:36:07] CPU #3: 1584.69 kH, 313.62 kH/s
[2017-04-04 16:36:07] Total: 6185.88 kH, 1237.96 kH/s
[2017-04-04 16:36:07] CPU #1: 1529.45 kH, 301.75 kH/s
[2017-04-04 16:36:07] CTRL_C_EVENT received, exiting

And now the AVX2-optimized version:

Code:

CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:36:14] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:36:15] CPU #2: 262.14 kH, 430.18 kH/s
[2017-04-04 16:36:15] CPU #1: 262.14 kH, 419.43 kH/s
[2017-04-04 16:36:15] CPU #0: 262.14 kH, 409.20 kH/s
[2017-04-04 16:36:15] CPU #3: 262.14 kH, 399.46 kH/s
[2017-04-04 16:36:15] Total: 1048.58 kH, 1658.27 kH/s
[2017-04-04 16:36:19] CPU #3: 1597.83 kH, 422.56 kH/s
[2017-04-04 16:36:19] Total: 2384.26 kH, 1681.37 kH/s
[2017-04-04 16:36:19] CPU #1: 1677.72 kH, 416.17 kH/s
[2017-04-04 16:36:19] CPU #0: 1636.80 kH, 404.45 kH/s
[2017-04-04 16:36:19] CPU #2: 1720.74 kH, 417.14 kH/s
[2017-04-04 16:36:24] CPU #3: 2112.80 kH, 420.91 kH/s
[2017-04-04 16:36:24] Total: 7148.05 kH, 1658.68 kH/s
[2017-04-04 16:36:24] CPU #0: 2022.27 kH, 409.25 kH/s
[2017-04-04 16:36:24] CPU #2: 2085.72 kH, 419.44 kH/s
[2017-04-04 16:36:24] CPU #1: 2080.86 kH, 409.45 kH/s
[2017-04-04 16:36:29] CPU #3: 2104.57 kH, 422.75 kH/s
[2017-04-04 16:36:29] Total: 8293.42 kH, 1660.89 kH/s
[2017-04-04 16:36:29] CPU #0: 2046.24 kH, 414.94 kH/s
[2017-04-04 16:36:29] CPU #2: 2097.18 kH, 417.33 kH/s
[2017-04-04 16:36:29] CPU #1: 2047.27 kH, 406.14 kH/s
[2017-04-04 16:36:29] CTRL_C_EVENT received, exiting

This time I went from 1245 kH/s to 1660 kH/s, a surprising 33.33% increase in speed!

With this algorithm, I would really like to see the performance with native HW SHA acceleration.

I can't explain this. There is no AES or AVX optimized code in cpuminer for sha256t.

joblo

legendary

Activity: 1470

Merit: 1114

On the bright side I can check off another test case for SHA. Ryzen CPU and features are correctly detected.

CPU: AMD Ryzen 7 1800X Eight-Core Processor
CPU features: SSE2 AES AVX AVX2 SHA

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: coinbutter on April 03, 2017, 10:15:38 PM

cpuminer-aes-avx2 -a dmd-gr -o stratum+tcp://us.miningfield.com:3377 -u x -p x --cpu-affinity 43690 --cpu-priority 0 --threads=8 --api-bind 127.0.0.1:4050

   ********** cpuminer-opt 3.5.9.1 ***********
   A CPU miner with multi algo support and optimized for CPUs
   with AES_NI and AVX extensions.
   BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
   Forked from TPruvot's cpuminer-multi with credits
   to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
   Wolf0, Jeff Garzik and Optiminer.

CPU: AMD Ryzen 7 1800X Eight-Core Processor
CPU features: SSE2 AES AVX AVX2
SW built on Mar 4 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES
Start mining with SSE2 AES

[2017-04-03 22:14:15] Binding process to cpu mask aaaa
[2017-04-03 22:14:15] Starting Stratum on stratum+tcp://us.miningfield.com:3377
[2017-04-03 22:14:15] 8 miner threads started, using 'groestl' algorithm.
[2017-04-03 22:14:15] Stratum difficulty set to 2
[2017-04-03 22:14:18] groestl block 2095712, diff 25.841
[2017-04-03 22:14:20] CPU #4: 262.14 kH, 232.48 kH/s
-
[2017-04-03 22:14:20] Rejected 1/1 (100.0%), 1871.29 kH, 1841.03 kH/s
[2017-04-03 22:14:20] reject reason: low difficulty share of 1.030424417006305e-7
[2017-04-03 22:14:20] factor reduced to : 0.67

edit: Also tried -a groestl
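The command passes the affinity in decimal while the log prints it in hex. A short Python check shows they are the same mask and which logical CPUs it selects (bit 0 counted as CPU 0):

```python
affinity = 43690  # value given to --cpu-affinity in the command above

print(f"{affinity:x}")  # aaaa, as shown in the "Binding process" log line
cpus = [bit for bit in range(16) if (affinity >> bit) & 1]
print(cpus)             # [1, 3, 5, 7, 9, 11, 13, 15]
print(len(cpus))        # 8 bits set, matching --threads=8
```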

I retested v3.5.9.1 using the Windows AVX2 binary on Win 8.1 at suprnova groestlcoin and it works.
I can't explain your rejects.

Can you (or maybe someone else) try with another CPU and/or another pool like suprnova?
I want to get to the bottom of this before releasing the next version.

Use the AES builds in the legacy release here: https://drive.google.com/file/d/0B0lVSGQYLJIZT0tlY3o4ZjEycXM/view?usp=sharing
v3.6.1 is broken. The non-AES builds of 3.5.9.1 are likely broken as well.

onedeveloper

full member

Activity: 143

Merit: 100

Your selection of algos piqued my interest, so I decided to try them on my Windows 8.1 machine. This is the result for the AVX version of the "deep" algo:

Code:

CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX

[2017-04-04 16:31:20] 4 miner threads started, using 'deep' algorithm.
[2017-04-04 16:31:31] CPU #3: 2097.15 kH, 182.74 kH/s
[2017-04-04 16:31:31] Total: 2097.15 kH, 182.74 kH/s
[2017-04-04 16:31:31] CPU #2: 2097.15 kH, 182.49 kH/s
[2017-04-04 16:31:31] CPU #0: 2097.15 kH, 177.66 kH/s
[2017-04-04 16:31:32] CPU #1: 2097.15 kH, 175.57 kH/s
[2017-04-04 16:31:36] CPU #1: 702.27 kH, 171.87 kH/s
[2017-04-04 16:31:36] CPU #2: 912.45 kH, 183.92 kH/s
[2017-04-04 16:31:36] CPU #3: 913.69 kH, 182.45 kH/s
[2017-04-04 16:31:36] Total: 4625.55 kH, 715.90 kH/s
[2017-04-04 16:31:36] CPU #0: 888.29 kH, 180.76 kH/s
[2017-04-04 16:31:41] CPU #1: 859.36 kH, 172.99 kH/s
[2017-04-04 16:31:41] CPU #2: 919.62 kH, 183.39 kH/s
[2017-04-04 16:31:41] CPU #3: 912.25 kH, 183.06 kH/s
[2017-04-04 16:31:41] Total: 3579.51 kH, 720.20 kH/s
[2017-04-04 16:31:41] CPU #0: 903.81 kH, 179.12 kH/s
[2017-04-04 16:31:44] CTRL_C_EVENT received, exiting

And this is the same test using AVX2:

Code:

CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX2

[2017-04-04 16:30:52] 4 miner threads started, using 'deep' algorithm.
[2017-04-04 16:31:01] CPU #3: 2097.15 kH, 230.87 kH/s
[2017-04-04 16:31:01] Total: 2097.15 kH, 230.87 kH/s
[2017-04-04 16:31:01] CPU #2: 2097.15 kH, 228.12 kH/s
[2017-04-04 16:31:01] CPU #1: 2097.15 kH, 219.54 kH/s
[2017-04-04 16:31:02] CPU #0: 2097.15 kH, 206.06 kH/s
[2017-04-04 16:31:06] CPU #0: 824.23 kH, 225.80 kH/s
[2017-04-04 16:31:06] CPU #3: 1154.34 kH, 229.71 kH/s
[2017-04-04 16:31:06] Total: 6172.87 kH, 903.17 kH/s
[2017-04-04 16:31:06] CPU #2: 1140.60 kH, 226.97 kH/s
[2017-04-04 16:31:06] CPU #1: 1097.69 kH, 226.17 kH/s
[2017-04-04 16:31:11] CPU #0: 1129.01 kH, 221.69 kH/s
[2017-04-04 16:31:11] CPU #3: 1148.54 kH, 231.20 kH/s
[2017-04-04 16:31:11] Total: 4515.84 kH, 906.03 kH/s
[2017-04-04 16:31:11] CPU #2: 1134.87 kH, 229.17 kH/s
[2017-04-04 16:31:11] CPU #1: 1130.86 kH, 220.70 kH/s
[2017-04-04 16:31:13] CTRL_C_EVENT received, exiting

This shows that the AVX2-optimized version is about 25.8% faster than the AVX-only version on the same architecture. Ryzen is said to have only AVX2 emulation, not real 256 bits, so it will be interesting to see the results there.
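For reference, the percentage can be recomputed from the last "Total" line of each run (720.20 kH/s for AVX, 906.03 kH/s for AVX2). A minimal Python sketch; the helper name is made up for this example:

```python
def speedup_percent(baseline: float, improved: float) -> float:
    """Relative speed increase of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


# Final totals from the two 'deep' logs above, in kH/s
avx_total = 720.20
avx2_total = 906.03
print(f"{speedup_percent(avx_total, avx2_total):.1f}%")  # 25.8%
```

Short runs like these fluctuate by a few kH/s between reports, so the figure is only good to about one decimal place.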

Doing the same tests, this time with the "sha256t" algo, I got these results. AVX-only version:

Code:

CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:35:52] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:35:53] CPU #1: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #2: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #3: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] Total: 786.43 kH, 963.75 kH/s
[2017-04-04 16:35:53] CPU #0: 262.14 kH, 293.18 kH/s
[2017-04-04 16:35:57] CPU #0: 1172.72 kH, 297.42 kH/s
[2017-04-04 16:35:57] CPU #2: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] CPU #3: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] Total: 4004.85 kH, 1250.43 kH/s
[2017-04-04 16:35:57] CPU #1: 1284.99 kH, 313.47 kH/s
[2017-04-04 16:36:02] CPU #3: 1579.41 kH, 316.94 kH/s
[2017-04-04 16:36:02] Total: 5322.12 kH, 1243.72 kH/s
[2017-04-04 16:36:02] CPU #0: 1487.11 kH, 294.72 kH/s
[2017-04-04 16:36:02] CPU #2: 1579.41 kH, 312.05 kH/s
[2017-04-04 16:36:02] CPU #1: 1567.37 kH, 305.89 kH/s
[2017-04-04 16:36:06] CPU #0: 1473.59 kH, 303.84 kH/s
[2017-04-04 16:36:07] CPU #2: 1560.23 kH, 314.61 kH/s
[2017-04-04 16:36:07] CPU #3: 1584.69 kH, 313.62 kH/s
[2017-04-04 16:36:07] Total: 6185.88 kH, 1237.96 kH/s
[2017-04-04 16:36:07] CPU #1: 1529.45 kH, 301.75 kH/s
[2017-04-04 16:36:07] CTRL_C_EVENT received, exiting

And now the AVX2-optimized version:

Code:

CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:36:14] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:36:15] CPU #2: 262.14 kH, 430.18 kH/s
[2017-04-04 16:36:15] CPU #1: 262.14 kH, 419.43 kH/s
[2017-04-04 16:36:15] CPU #0: 262.14 kH, 409.20 kH/s
[2017-04-04 16:36:15] CPU #3: 262.14 kH, 399.46 kH/s
[2017-04-04 16:36:15] Total: 1048.58 kH, 1658.27 kH/s
[2017-04-04 16:36:19] CPU #3: 1597.83 kH, 422.56 kH/s
[2017-04-04 16:36:19] Total: 2384.26 kH, 1681.37 kH/s
[2017-04-04 16:36:19] CPU #1: 1677.72 kH, 416.17 kH/s
[2017-04-04 16:36:19] CPU #0: 1636.80 kH, 404.45 kH/s
[2017-04-04 16:36:19] CPU #2: 1720.74 kH, 417.14 kH/s
[2017-04-04 16:36:24] CPU #3: 2112.80 kH, 420.91 kH/s
[2017-04-04 16:36:24] Total: 7148.05 kH, 1658.68 kH/s
[2017-04-04 16:36:24] CPU #0: 2022.27 kH, 409.25 kH/s
[2017-04-04 16:36:24] CPU #2: 2085.72 kH, 419.44 kH/s
[2017-04-04 16:36:24] CPU #1: 2080.86 kH, 409.45 kH/s
[2017-04-04 16:36:29] CPU #3: 2104.57 kH, 422.75 kH/s
[2017-04-04 16:36:29] Total: 8293.42 kH, 1660.89 kH/s
[2017-04-04 16:36:29] CPU #0: 2046.24 kH, 414.94 kH/s
[2017-04-04 16:36:29] CPU #2: 2097.18 kH, 417.33 kH/s
[2017-04-04 16:36:29] CPU #1: 2047.27 kH, 406.14 kH/s
[2017-04-04 16:36:29] CTRL_C_EVENT received, exiting

This time I went from 1245 kH/s to 1660 kH/s, a surprising 33.33% increase in speed!

With this algorithm, I would really like to see the performance with native HW SHA acceleration.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: coinbutter on April 04, 2017, 06:54:25 AM

onedeveloper, your posts about Ryzen are what initially brought me to this forum.

cpuminer-multi didn't seem very optimized for my hardware and I'm looking for something better. If someone wants me to try out an algorithm or something, I'm happy to give it a try.

Regarding your groestl/dmd-gr problem, the error you see in v3.5.9.1 is the same as the bug I introduced in a later release,
suggesting the Windows binaries were built incorrectly. I will retest that.

Other algos with Ryzen I am curious about:

deep: compute bound, good test of AVX2

sha256t: compute bound without any AVX2 code*

lyra2z330: I/O bound due to large data array

* sha256t would benefit from HW SHA acceleration available with Ryzen but requires a supported Linux compile environment.
See release announcement for v3.6.1 for details.

https://bitcointalksearch.org/topic/m.18406368

coinbutter

newbie

Activity: 25

Merit: 0

onedeveloper, your posts about Ryzen are what initially brought me to this forum.

cpuminer-multi didn't seem very optimized for my hardware and I'm looking for something better. If someone wants me to try out an algorithm or something, I'm happy to give it a try.

onedeveloper

full member

Activity: 143

Merit: 100

Coinbutter: you can check my messages in this thread on how to get more out of your Ryzen CPU with the cryptonight algo:

https://bitcointalksearch.org/topic/m.18239788
https://bitcointalksearch.org/topic/m.18240239

Finally, check how this worked for member giagge:

https://bitcointalksearch.org/topic/m.18279882

I am sorry I cannot help with groestl algo and your pool ...
