Anyone planning on testing the 7980XE?
The specs don't impress me and it's overpriced. It has a low base clock and a relatively small cache, both
critical for CPU mining. Intel is known for better single-threaded performance but that doesn't matter when
mining.
At 2 MB of scratchpad per thread, the 24 MB cache limits cryptonight to 12 mining threads, not even enough to load all 18 physical cores.
A Threadripper 1920X (12C/24T, 32MB cache) will likely perform better for less than half the price.
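A quick sketch of that cache arithmetic in C. It assumes the usual 2 MB cryptonight scratchpad per thread and that the whole L3 is available to the miner; `max_cryptonight_threads` is a made-up helper name, not something taken from any miner.

```c
#include <assert.h>
#include <stddef.h>

#define MIB ((size_t)1024 * 1024)

/* Assumption: each cryptonight thread needs a 2 MiB scratchpad
   that should stay resident in L3. */
#define CRYPTONIGHT_SCRATCHPAD (2 * MIB)

/* Hypothetical helper: how many scratchpads fit in l3_bytes of cache. */
size_t max_cryptonight_threads(size_t l3_bytes)
{
    assert(l3_bytes > 0);
    return l3_bytes / CRYPTONIGHT_SCRATCHPAD;
}
```

Feeding it 24 MiB gives 12 threads, and 32 MiB gives 16, which is already more than the 1920X's 12 physical cores.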
Compute-intensive algos are irrelevant because GPUs are much more efficient and CPUs can't compete.
The 7980XE doesn't yet have SHA support, unlike Ryzen, but that's less of an issue because there are few algos
that can use it.
It does have AVX512 but I don't see much benefit in that because it only improves compute performance. On those
algos that could potentially use it the gain would be small, less than the gain from AVX to AVX2. There are fewer opportunities
to promote AVX2 to AVX512 because AVX512 works on larger vectors: only algos that use vectors of 512 bits or greater
can use it.
I'm curious to see some real results to compare, but I won't be buying one.
Keep in mind that it extends the registers to 32 (xmm16-xmm31 / ymm16-ymm31 / zmm16-zmm31). If register pressure is an issue, it can help. It also offers masking with K registers which might be useful in some cases.
One problem though is that avx512 gets underclocked... on a Xeon system with a nominal clock of around 2 - 2.1 GHz, where typical code was boosted to 2.6 GHz, avx512 code was running at ~1.8 GHz.
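To put rough numbers on that trade-off, here is a back-of-envelope sketch in C. It assumes throughput scales linearly with vector width and with clock, which real code rarely manages, so treat the result as a best case. The function name is made up for illustration.

```c
#include <assert.h>

/* Idealized speedup of a wide vector path over a narrow one:
   (width ratio) x (clock ratio). Ignores memory bandwidth, port
   limits and frequency-transition costs, so it is an upper bound. */
double ideal_vector_speedup(int wide_bits, double wide_ghz,
                            int narrow_bits, double narrow_ghz)
{
    assert(wide_bits > 0 && narrow_bits > 0);
    assert(wide_ghz > 0.0 && narrow_ghz > 0.0);
    return ((double)wide_bits / narrow_bits) * (wide_ghz / narrow_ghz);
}
```

With the Xeon figures above, and assuming for the sake of argument that a 256-bit path could hold the full 2.6 GHz boost, `ideal_vector_speedup(512, 1.8, 256, 2.6)` comes out at roughly 1.38 instead of the 2.0 the wider vectors suggest on paper.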
Google cloud has some servers with avx512 which you can play on without buying avx512 CPUs, but they kind of suck at benchmarking due to being VMs with unstable performance (resource sharing).
Thanks for sharing your thoughts.
I have not seen any register issues with the existing vectored code.
Having more registers at your disposal is always nice - it allows for new possibilities in how you write the code - especially if there are a lot of variables or tables.
The x, y & z regs are also overlaid in the 7980XE but only the lower 256 or 128
bits can be accessed by ymm or xmm respectively. This creates a lot of overhead when an app needs to revert to smaller vectors for some operations.
I think the problem is from xmm->ymm due to having 1x128 or 2x128 lanes which create a dependency issue, requiring the zeroing of the upper ymm part to use the xmm without a perf penalty. There's a lot of avx code that sucks without vzeroupper for this reason. IIRC ymm->zmm don't have the same issue, even if they overlap, but I may be wrong on this.
But having +16 more registers is good for such scenarios also... in case you want to reuse a register which was previously overlapped, you just use a new one thus avoiding false dependencies altogether (assuming at least a vzeroupper or a vzeroall at the start of the function).
AVX & AVX2 are also underclocked, AVX512 is underclocked more.
Something like that...
The K registers seem interesting. I don't fully understand them but they appear to be able to reduce the number of instructions when shuffling vector elements.
There are a lot of things they can be used for. Essentially they let you perform an operation on only part of a register's full width, which can be pretty useful. You can avoid computing two results and blending them afterwards (you do it in one go), you can read only the first X bytes from memory by using the appropriate mask, etc.
If you are using, say, a 512bit vector of 64bit elements, and want to perform something on 384 bits (6x64) you just put 0b00111111 in the k register and then use the k register alongside the instruction. Or you can do stuff like 0b01010101, thus working on the first, third, fifth and seventh elements and leaving the rest unchanged (or having them overwritten with zeroes in one go, depending on the z flag setting, which is also new).
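For anyone who wants to see the semantics without AVX512 hardware, here is a scalar model of a masked 64-bit add in plain C. It only mimics the behaviour described above (it is not how the hardware is implemented), and both function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* 8 x 64-bit elements = one 512-bit vector */

/* Mask with the low n bits set, e.g. n = 6 -> 0b00111111. */
uint8_t low_lanes_mask(unsigned n)
{
    assert(n <= LANES);
    return (uint8_t)((1u << n) - 1u);
}

/* Scalar model of a masked add: dst = a + b in lanes whose mask bit
   is set. Other lanes keep their old dst value (merge-masking) or
   are set to 0 (zeroing, the z flag). */
void masked_add64(uint64_t dst[LANES], const uint64_t a[LANES],
                  const uint64_t b[LANES], uint8_t k, int zeroing)
{
    for (int i = 0; i < LANES; i++) {
        if (k & (1u << i))
            dst[i] = a[i] + b[i];
        else if (zeroing)
            dst[i] = 0;
        /* merge-masking: dst[i] is left unchanged */
    }
}
```

Calling it with `low_lanes_mask(6)` touches only the first 384 bits, and a mask of 0x55 (0b01010101) with `zeroing` set updates every other element while clearing the rest.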
Now that I'm thinking about this, and this is relevant to what you said earlier on xmm/ymm/zmm overlap and performance issues, one can use just one type of register (like zmm or ymm) for all types of operations, whether small or large, assuming they also use the appropriate k register to set the width they want. In this way false dependencies should be nullified even between xmm/ymm. You want to do a 128bit op? You use a ymm register with a 0b00001111 (32bit elements) or 0b0011 (64bit elements) k-mask and only the lower 128 bits of the ymm are operated on. The opcode will probably be somewhat larger though.
K-regs are not too hard to use, but I dislike the fact that they can't be loaded with immediate values like general purpose registers, so I have to go immediate=>gpr=>k register or load the values from memory.
The only thing that took me a while to find out (I thought I was hitting a gcc bug) is how to properly write the instruction with the proper syntax, for gcc assembly-within-c...
For example if I want to move 320 bits from memory to a zmm register it goes like this:
"mov $0b1111111111, %%eax\n" (10 x 1 bit = 10 x 32 bit elements)
"kmovd %%eax, %%k1\n"
"vmovdqu32 0(%0), %%zmm0 %{%%k1%}%{z%}\n"
The z flag is there to zero out the rest of the bits (if there was anything on zmm0).
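If you want to check what those three instructions should leave in the register without having the hardware, a scalar model in plain C could look like this. It is only a sketch of the semantics as described above, and `maskz_load_dwords` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ZMM_DWORDS 16  /* a zmm register holds 16 x 32-bit elements */

/* Scalar model of a zero-masked vmovdqu32 load: element i is read
   from memory only if bit i of k is set, otherwise it is zeroed.
   Masked-off elements are never read from src. */
void maskz_load_dwords(uint32_t zmm[ZMM_DWORDS], const void *src, uint16_t k)
{
    const unsigned char *p = (const unsigned char *)src;

    assert(src != NULL);
    for (int i = 0; i < ZMM_DWORDS; i++) {
        if (k & (1u << i))
            memcpy(&zmm[i], p + 4 * i, sizeof zmm[i]);  /* unaligned-safe */
        else
            zmm[i] = 0;
    }
}
```

With k = 0x3FF (the same 0b1111111111 as above) only the first 40 bytes are read and elements 10 to 15 come out as zero. If you prefer intrinsics over inline asm, `_mm512_maskz_loadu_epi32` should express the same load, assuming your compiler is targeting AVX512F.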