hi ghostlander,
are you still monitoring this thread?
hi all,
oops, I've posted my comment in the 'wrong' thread
https://bitcointalksearch.org/topic/m.62832733
reposting it here
recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64)
I'm using somewhat older code (version 2.4) from
https://github.com/ghostlander/cpuminer-neoscrypt
I couldn't figure out how to get it to build with the ARM assembly code; scrypt-arm.S appears to be written for armhf (32-bit ARM processors).
Hence, there could be issues compiling it on aarch64 (64-bit ARM instructions and OS).
Among the things I tried, I added the following flags
Code:
-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations
I checked by compiling with the "-S" option, which makes GCC emit assembly; with the suite of flags above, GCC generates assembly that uses NEON SIMD.
This is without any hand-optimized assembly. GCC may still emit NEON assembly with a few fewer flags (e.g. without -ftree-slp-vectorize, though I think that one is useful even without NEON), but when some of the above flags are missing, no NEON assembly is generated.
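To make the check concrete, here is a toy, trivially vectorizable loop (a hypothetical example of mine, not code from cpuminer) that GCC's vectorizer will typically turn into NEON on aarch64 with flags like the above; compiling with -S and looking through the .s file for the 128-bit v registers (operands like v0.4s) shows whether SIMD was actually emitted.
Code:
/* toy_vec.c: hypothetical example, not from cpuminer.
 * e.g. gcc -O2 -ftree-vectorize -ffast-math -S toy_vec.c
 * then look for v-register (e.g. "v0.4s") instructions in toy_vec.s */
#include <stdint.h>
#include <stddef.h>

void xor_add(uint32_t *restrict out, const uint32_t *restrict a,
             const uint32_t *restrict b, size_t n)
{
    /* independent per-element work, no loop-carried dependency,
     * so the tree vectorizer can process 4 x uint32_t per NEON register */
    for (size_t i = 0; i < n; i++)
        out[i] = (a[i] ^ b[i]) + a[i];
}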
I tried with and without the above flags (e.g. just -O2). There is at least a noticeable difference in hash rates: from about 4 khash/s across all 4 cores mining Neoscrypt (Feathercoin) to about 5+ khash/s with the NEON code generated by GCC's -ftree-vectorize vectorizer, some 20-30% improvement. The CPU also runs hotter during mining along with the higher hash rate, which suggests the cores are being kept busier. This is probably a useful thing to have around, as hand-writing optimized assembly (e.g. for scrypt-arm.S) would likely take a lot of effort and be less portable. Granted, -ftree-vectorize won't produce the fastest possible code, but the improvement is decent for much less manual effort than hand-optimized assembly.
note that NEON code may not work on some ARM CPUs that don't support NEON; I think I came across specs saying the SIMD extensions are possibly *optional* on A53 cores.
e.g. it is quite possible that some A53 parts in the wild, say the 'cheap' ones, don't have NEON even though they are A53 CPUs. A simple runtime check is sketched below.
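Here is a minimal sketch (assuming Linux on aarch64; on 32-bit ARM the capability bit would be HWCAP_NEON instead) of how a miner could test for Advanced SIMD at runtime before using NEON code paths:
Code:
/* hypothetical runtime check, Linux/aarch64 only */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ASIMD */

int main(void)
{
    unsigned long hwcaps = getauxval(AT_HWCAP);

    if (hwcaps & HWCAP_ASIMD)
        printf("Advanced SIMD (NEON) is available\n");
    else
        printf("no Advanced SIMD; fall back to plain scalar code\n");

    return 0;
}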
It used to be that Raspberry Pis were deemed 'too slow' for mining, but the Raspberry Pi 4 with its A72 ARM cores is just borderline and 'punches above its weight' mining alongside the big Mhash/s GPUs, though the difference is easily 1:1000.
--
By just using the flags mentioned, GCC builds binaries with NEON SIMD via that -ftree-vectorize flag together with the other flags; without them it doesn't emit SIMD code.
This is a 'quick and easy' way to get at least some NEON SIMD on aarch64, and it isn't too bad, as described above.
off-topic:
just to add a note, I tried to 'hand optimize' it by writing a C version where I re-arranged the arrays in salsa20 to fall into 'lanes', using the same -ftree-vectorize flags. Instead of being faster, the original code was optimised better, even though it actually uses fewer NEON SIMD instructions. Looking closer at the generated NEON code, I think the problem is that simply 're-arranging' the arrays won't cut it: between the iterations/loops the array is permuted, so the values get streamed out to memory and loaded back into a differently permuted set of registers, which I'd think means lots of stalls.
With the original code there is actually less SIMD; it seems -ftree-vectorize and the other optimizations simply keep part of the work in the normal registers and pass values into SIMD registers only for some sections of the code, and that by itself is faster than the 'rearranged array' code.
--
there is a minor gain with -ftree-vectorize for Neoscrypt, as it spends a large number of loops in salsa20 (1000 x some 200 rounds?)
hence NEON SIMD could potentially speed that up significantly. The trouble is that salsa20
https://en.wikipedia.org/wiki/Salsa20
permutes the arrays between the quarter rounds in each loop. I made a naive attempt by simply re-arranging the arrays in the C code so that they look like they fall into 'lanes' (common for SIMD).
That oversimplified approach doesn't cut it with -ftree-vectorize: the registers get streamed out to memory (lots of wait states and CPU stalls on small Raspberry Pi type boards and CPUs).
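To show what I mean by the permutation, here is a minimal sketch of the Salsa20 double round (following the reference description on the Wikipedia page linked above, not the exact cpuminer source): the column half groups the state words one way and the row half regroups the very same words another way, so no fixed 'lane' layout of the array survives from one half-round to the next, and the compiler ends up shuffling or spilling.
Code:
/* sketch of the Salsa20 double round, per the reference description;
 * not the cpuminer implementation */
#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

#define QR(a, b, c, d)          \
    do {                        \
        b ^= ROTL32(a + d, 7);  \
        c ^= ROTL32(b + a, 9);  \
        d ^= ROTL32(c + b, 13); \
        a ^= ROTL32(d + c, 18); \
    } while (0)

static void salsa20_double_round(uint32_t x[16])
{
    /* column round: words grouped down the columns of the 4x4 state */
    QR(x[ 0], x[ 4], x[ 8], x[12]);
    QR(x[ 5], x[ 9], x[13], x[ 1]);
    QR(x[10], x[14], x[ 2], x[ 6]);
    QR(x[15], x[ 3], x[ 7], x[11]);
    /* row round: the same words regrouped along the rows, so a lane
     * layout that suited the column round no longer lines up here */
    QR(x[ 0], x[ 1], x[ 2], x[ 3]);
    QR(x[ 5], x[ 6], x[ 7], x[ 4]);
    QR(x[10], x[11], x[ 8], x[ 9]);
    QR(x[15], x[12], x[13], x[14]);
}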
but hand-optimized assembly won't be easy to write and would take quite a lot of effort,
and this won't be the only thing that needs to be optimized.
Hence, for now the 'easy' way is simply
-ftree-vectorize with the other flags in the bundle, so that at least some form of NEON SIMD is achieved.
There is a decent gain, around 20% (for Neoscrypt), with vs. without the compiler-generated SIMD code.