Thanks for the testing, lots of info to digest.
deep: slower with avx2, may be an issue with cpu affinity (see below), maybe retest.
sha256t: all rejects. It doesn't use the code that broke groestl, so it needs more investigation.
Edit: please try sha256t with v3.5.9.1
CPU affinity:
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54
Here's a description of the Windows function SetThreadAffinityMask
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
The default with no affinity arg is to set one bit in the mask to match the thread with the cpu#:
cpu 0 = mask 1, cpu 1 = mask 2, cpu 2 = mask 4, cpu 3 = mask 8. Each thread is assigned to a different cpu.
On Intel i7 running 4 threads this works with one thread on each core and no core with 2 threads.
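To make the default concrete, here is a minimal sketch of the one-bit-per-thread rule. The function name is hypothetical and the wrap-around when there are more threads than cpus is my assumption, not necessarily what the miner does.

```python
def default_affinity_mask(thread_id, num_cpus):
    """Return a single-bit mask pinning thread_id to one cpu.

    Hypothetical helper: bit n of the mask stands for cpu n, so the
    mask for cpu n is 1 << n. Wrapping with modulo is an assumption
    for the case of more threads than cpus.
    """
    cpu = thread_id % num_cpus
    return 1 << cpu


# Four threads on a four-cpu machine, one thread per cpu.
for thread_id in range(4):
    print("thread %d -> mask 0x%x" % (thread_id, default_affinity_mask(thread_id, 4)))
```

The mask is passed as a plain integer, which is why it shows up in the log as a hex-looking number.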
When multiple bits are set in the mask the thread can be assigned to any of the cpus represented in the mask.
I don't know what happens if multiple threads are assigned using the same multibit mask.
The code as written doesn't seem to allow a different mask for each thread. Maybe it relies on the OS to sort
it out. I am speculating that if multiple cpus are allowed the thread may be moved to another permitted cpu.
A mask of 0x54 doesn't seem to make sense, it's not symmetric.
If I understand correctly a mask of...
0xffff: assign to any cpu
0x1111: assign 4 threads to either cpu 0, 4, 8 or 12
Edit: correction
0x5555 (was 0x3333): assign 8 threads to 0, 2, 4, 6, 8, 10, 12, or 14
0x000f: assign to 0, 1, 2, or 3
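A quick way to check masks like these is to list which bits are set. This is only an illustrative helper (the name cpus_in_mask is made up), assuming bit n of the mask stands for cpu n.

```python
def cpus_in_mask(mask):
    """List the cpu numbers whose bits are set in an affinity mask.

    Hypothetical helper, assuming bit n corresponds to cpu n.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


for mask in (0x54, 0x1111, 0x5555, 0x000f):
    print("0x%04x -> cpus %s" % (mask, cpus_in_mask(mask)))
```

Decoded this way, 0x54 comes out as cpus 2, 4 and 6, which shows why it looks odd: it skips cpu 0 and covers only three cpus.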
On Ryzen you must consider the CCX as well as SMT (HT). Depending on how the CPUs are mapped,
CPUs 0 & 1 may be:
- two HT threads on the first core on the first CCX,
- the first thread on 2 different cores on the first CCX
- the first thread on the first core of 2 different CCXs
- something else
It's going to take a lot of playing around to figure it all out. Once the mapping is understood it can be determined
if a multibit mask allows the system to move threads to another CPU. Previous observations of better performance
with multiple miner instances suggest it can.
Any suggestions on how to specify a custom single bit mask for each thread? One thought is to use the affinity arg as
spacing between cpus. For instance an arg of:
0 = consecutive CPUs, ie 0,1,2,3,4...
1 = alternate cpus, 0, 2, 4, 6, 8...
2 = every third cpu, 0, 3, 6, 9... useful for 6/12 core Ryzen maybe
3 = every 4th cpu
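Here is a rough sketch of how that spacing arg could map threads to cpus. Nothing here exists in the miner; the function name is hypothetical, and the handling of wrap-around (shifting over by one cpu on each pass so threads don't pile onto the same cpus) is my own assumption.

```python
def spaced_cpus(num_threads, num_cpus, spacing):
    """Assign each thread a cpu, skipping `spacing` cpus between threads.

    Hypothetical scheme: spacing 0 = consecutive, 1 = alternate, etc.
    When the sequence runs past the last cpu it wraps and shifts by
    one, which is an assumption, not existing miner behaviour.
    """
    stride = spacing + 1
    cpus = []
    for thread_id in range(num_threads):
        pos = thread_id * stride
        cpus.append((pos % num_cpus + pos // num_cpus) % num_cpus)
    return cpus


def single_bit_masks(cpus):
    """Convert cpu numbers to single-bit affinity masks (bit n = cpu n)."""
    return [1 << cpu for cpu in cpus]


print(spaced_cpus(4, 8, 0))   # consecutive
print(spaced_cpus(4, 8, 1))   # alternate
print(spaced_cpus(4, 12, 2))  # every third cpu
print([hex(m) for m in single_bit_masks(spaced_cpus(4, 8, 1))])
```

The wrap-with-shift only avoids collisions when the stride divides the cpu count evenly, so odd combinations would need more thought.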
Edit:
Another thought is to have a binary option for affinity, consecutive or distributed. With distributed the spacing
would be calculated automatically based on the cpu and thread counts.
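A minimal sketch of the binary option, assuming the automatic spacing is simply the cpu count divided by the thread count, rounded down. Both the function name and that formula are assumptions for illustration; the right calculation may differ once the Ryzen cpu mapping is understood.

```python
def affinity_cpus(num_threads, num_cpus, distributed):
    """Pick one cpu per thread, either consecutive or spread out.

    Hypothetical: in distributed mode the stride is
    num_cpus // num_threads (at least 1), which is an assumed formula.
    """
    stride = max(1, num_cpus // num_threads) if distributed else 1
    return [(thread_id * stride) % num_cpus for thread_id in range(num_threads)]


print(affinity_cpus(8, 16, True))   # 8 threads spread over 16 cpus
print(affinity_cpus(4, 12, True))   # 4 threads spread over 12 cpus
print(affinity_cpus(4, 16, False))  # consecutive
```

With 8 threads on 16 cpus this gives every other cpu, the same set of cpus that a 0x5555 mask covers, but with each thread pinned to its own cpu.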