Topic: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner (Read 221582 times)

full member
Activity: 1386
Merit: 220
Among the things I tried, I added the following flags
Code:
-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

This doesn't apply to asm, only to vectorizing C code. There are flags that can be set to enable and disable ASM for various architectures.
32-bit ARM asm is likely disabled on AArch64.

Stay tuned...
https://github.com/JayDDee/cpuminer-opt/wiki/Support-for-AARCH64
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
ag1233, I have written only i386 and amd64 asm code, with and without SSE2 support. What's in scrypt-arm.S isn't of much importance to NeoScrypt in general, as Salsa20 constitutes a rather small part of it. Sure, NEON can speed things up even if compiler generated. Memory bandwidth is another question. When I last checked, an RPi 4B with 32-bit LPDDR4 couldn't reach 5GB/s on memory reads or writes. Although a quad core 1.5GHz Cortex-A72 with 1MB of L2 cache doesn't seem a poor performer, I don't think it's much faster than my old Jetson TX1. Modern high end smartphones are much better in this regard.
newbie
Activity: 7
Merit: 0
hi ghostlander,
are you still monitoring this thread?

hi all,

oops, I've posted my comment in the 'wrong' thread
https://bitcointalksearch.org/topic/m.62832733
reposting here

Recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64).
I'm using somewhat older code (version 2.4) from
https://github.com/ghostlander/cpuminer-neoscrypt

I couldn't figure out how to get it to build with the ARM assembly code; scrypt-arm.S appears to be written for armhf (32-bit ARM).
Hence, there could be issues compiling it for aarch64 (64-bit ARM instructions and OS).

Among the things I tried, I added the following flags
Code:
-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

I checked by compiling with the "-S" option, which makes GCC emit assembly. With the suite of flags above, GCC generates assembly that uses NEON SIMD.
This is without any hand-optimized assembly. It may still produce NEON code with fewer flags (e.g. possibly without -ftree-slp-vectorize, though I think that one is useful even without NEON), but when some of the above flags were missing, NEON assembly wasn't generated.

I tried with and without the above flags (e.g. just -O2), and there is at least a slight difference in hash rates: from about 4 khash per sec on all 4 cores doing NeoScrypt (mining Feathercoin) to about 5+ khash per sec with NEON code generated by GCC's -ftree-vectorize vectorizer, some 20-30% improvement. The cpu also runs hotter during mining at the higher hash rate, which suggests the cores are doing more work. This is probably a useful thing to have around, as writing hand-optimized assembly (e.g. for scrypt-arm.S) would likely take a lot of effort and is likely less portable. Granted, -ftree-vectorize won't produce the fastest code, but the improvement is decent for much less manual effort.

Note that NEON code may not work on ARM cpus that lack NEON support. I think I came across some specs saying that on A53 cpus the SIMD extension is possibly *optional*,
so it is quite possible that some A53s in the wild (e.g. the 'cheap' ones) don't have NEON, even though they are A53 cpus.

Raspberry Pis used to be deemed 'too slow' for mining, but the Raspberry Pi 4 with its A72 ARM cores is just borderline and 'punches above its weight' mining alongside gpus doing Mhash per second, though the difference is easily 1:1000.

--
By just using the flags mentioned, gcc builds binaries with NEON SIMD: -ftree-vectorize does the work, but it needs the other flags alongside, as otherwise SIMD code isn't turned on.
This is a 'quick and easy' way to get at least some NEON SIMD on aarch64, and it isn't too bad, as I've described.

off-topic:
Just to add a note, I tried to 'hand optimize' it by writing a C source where I re-arranged the C arrays in salsa20 to fall into 'lanes', using the same -ftree-vectorize flags. However, instead of being faster, it was slower: the original code is optimised better, even though it actually uses fewer NEON SIMD instructions. I looked closer at the generated NEON code, and I think the problem is that simply 're-arranging' the arrays won't cut it, because the array is permuted between the iterations/loops. So the data gets streamed out to memory (a bummer, I'd think lots of stalls), then loaded back from memory into a differently permuted set of registers.
With the original code there is actually less SIMD. It seems -ftree-vectorize and the other optimizations simply used the normal registers for part of the code and passed values into SIMD registers for some sections, and that in itself is faster than the 'rearranged array' code.
--
There is a minor gain with -ftree-vectorize for NeoScrypt, as it spends a large number of loops in salsa20 (1000 x some 200 rounds?).
Hence, NEON SIMD could potentially speed that up significantly. The trouble is that salsa20
https://en.wikipedia.org/wiki/Salsa20
permutes the arrays between the quarter rounds in each loop. I made a naive attempt by simply re-arranging the arrays in the C code so that they looked like they fall into 'lanes' (common to SIMD).
That oversimplified approach doesn't cut it with -ftree-vectorize: the registers get streamed out to memory (lots of wait states and cpu stalls for the small Raspberry Pi type boards and cpus).
Hand-optimized assembly won't be easy to write and would take quite a lot of effort,
and this won't be the only thing that needs to be optimized.
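
To make the permutation problem concrete, here is a minimal Python sketch of the Salsa20 quarter round and double round, following the public Salsa20 specification. It is not taken from cpuminer or scrypt-arm.S, and the helper names (rotl32, quarterround, apply_round, doubleround, lanes) are made up for illustration. The lanes() helper groups the state indices by argument position, which is roughly how a 4-wide SIMD register would have to be filled.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarterround(y0, y1, y2, y3):
    """Salsa20 quarter round on four 32-bit words."""
    z1 = y1 ^ rotl32((y0 + y3) & MASK32, 7)
    z2 = y2 ^ rotl32((z1 + y0) & MASK32, 9)
    z3 = y3 ^ rotl32((z2 + z1) & MASK32, 13)
    z0 = y0 ^ rotl32((z3 + z2) & MASK32, 18)
    return z0, z1, z2, z3


# Indices of the 16-word state fed to each of the four quarter rounds.
COLUMN_QRS = [(0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11)]
ROW_QRS = [(0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14)]


def apply_round(state, groups):
    """Apply four independent quarter rounds to the index groups given."""
    out = list(state)
    for idx in groups:
        results = quarterround(*(state[i] for i in idx))
        for i, value in zip(idx, results):
            out[i] = value
    return out


def doubleround(state):
    """One Salsa20 double round: column round followed by row round."""
    return apply_round(apply_round(state, COLUMN_QRS), ROW_QRS)


def lanes(groups):
    """Group indices by argument position (one tuple per SIMD-style lane set)."""
    return list(zip(*groups))


if __name__ == "__main__":
    print("column round lanes:", lanes(COLUMN_QRS))
    print("row round lanes:   ", lanes(ROW_QRS))
```

The first lane set (0, 5, 10, 15) is the same in both rounds, but the other three hold the same words in a rotated order and in a different argument position. A vectorized version therefore has to shuffle lanes between the column round and the row round, so a fixed re-arrangement of the arrays is not enough on its own.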

Hence, for now the 'easy' way is to simply use -ftree-vectorize bundled with the other flags, so that at least some form of NEON SIMD is achieved.
There is a decent gain of around 20% (for NeoScrypt) with vs without the compiler-generated SIMD code.
jr. member
Activity: 42
Merit: 1
@ghostlander can u share the instructions file for windows build:

Native WIN32 build instructions: see windows-build.txt

They were obsolete, so I removed them. There is nothing special for MinGW. Have never tried MSVC with it.


Why I'm asking is that the xaya team reserves the bounty for your open source miner,
and the algo they use is a modified neoscrypt. I notice you are running your own coin, so I'm not sure if it's good to ask u for help,
but you can get the bounty if u can get it to work.

https://forum.xaya.io/topic/27-bounties-pools-xaya-core-exploits/

XAYA Core Bounties :
Mining Software:
build nsgminer for windows > 2000 CHI - https://github.com/xaya/nsgminer
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
@ghostlander can u share the instructions file for windows build:

Native WIN32 build instructions: see windows-build.txt

They were obsolete, so I removed them. There is nothing special for MinGW. Have never tried MSVC with it.
jr. member
Activity: 42
Merit: 1
@ghostlander can u share the instructions file for windows build:

Native WIN32 build instructions: see windows-build.txt

legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
The speed for graphics cards will not be very important either, the first neoscrypt asic miner is out on the bittech site, if it's not fake.
https://bittech.cn.com/#hproduct4

The miner can be customised to support ASICs as well. NeoScrypt may also evolve in the future.
hero member
Activity: 526
Merit: 502
The speed for graphics cards will not be very important either, the first neoscrypt asic miner is out on the bittech site, if it's not fake.
https://bittech.cn.com/#hproduct4
newbie
Activity: 2
Merit: 0
600 khs ?? sry.. but it looks to me like a rly bad joke. With NSGminer 0.9.4 my result for each vega64 is ~80-100khs, depending on powerlimit. I'm running my rigs on Claymore's NeoScrypt AMD GPU Miner v1.2 with powlim -25, and each vega64 is running at 1.7Mhs. Together, 1 rig with 6 cards does ~10.2 Mhs
You mean that Claymore is better?
newbie
Activity: 2
Merit: 0
Code:
OCL 0:                |   5.0   5.0   0.0 KH/s | A:0 R:0 HW:0 U:0.0/m

[21:46:17] Pool 1 priority changed from 0 to 1
[21:46:17] Probing for an alive pool
[21:46:17] The network difficulty has been set to 1319726
[21:46:17] Stratum from pool 0 detected new block
[21:46:17] Stratum from pool 0 requested work restart
[21:46:34] Stratum from pool 0 requested work restart
[21:46:47] Pool 0 is hiding block contents from us
[21:46:47] The network difficulty has been set to 958337
[21:46:47] Stratum from pool 0 detected new block
[21:46:55] Stratum from pool 0 requested work restart
[21:46:58] Stratum from pool 0 requested work restart
[21:47:01] Stratum from pool 0 requested work restart

please tell me, what am I doing wrong?


start.bat
Code:
nsgminer --neoscrypt -o stratum+tcp://p2p-spb.xyz:2944 -u Your_GBX_PayoutAddress -p x
newbie
Activity: 16
Merit: 0
Really, no matter how I tried, I could not get a hashrate above 3.4 mh/s. Claymore gives me 6.1 mh/s on the rig with 6 RX580 8GB cards.
newbie
Activity: 22
Merit: 0
600 khs ?? sry.. but it looks to me like a rly bad joke. With NSGminer 0.9.4 my result for each vega64 is ~80-100khs, depending on powerlimit. I'm running my rigs on Claymore's NeoScrypt AMD GPU Miner v1.2 with powlim -25, and each vega64 is running at 1.7Mhs. Together, 1 rig with 6 cards does ~10.2 Mhs
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
Compiled it on Ubuntu 16.04, using CUDA 9.0.
When running, I'm getting an awful lot of:

GPU0 t0: nonce *somehash* fails CPU verification!

The amount varies with the --mode used. --mode 1 -> doesn't even submit a single share, and nvidia-smi shows 0% GPU utilization.
--mode 2 -> works, but does ~600kh/s.
--mode 3 -> works, and does 1mh/s.

Using driver 390 and a Pascal GPU.

There are many Pascal family GPUs. Every successful nonce reported by a GPU is verified by a CPU before being submitted to a daemon or pool. If the verification fails, there is a problem. Could be wrong settings, too much overclocking, inappropriate drivers, etc.

Seems like the issue is for some reason too low intensity. Anything below 13 and it starts failing. Also, anything higher than 16, just spams the GPU model.

If either the miner is 32-bit or the GPU has less than 3GB of memory, intensities higher than 16 are not an option.
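
A back-of-the-envelope reading of that limit, under two assumptions that are not stated in this thread: that the miner runs 2^intensity hashes concurrently, and that each NeoScrypt hash needs about 32 KiB of scratchpad. The function names below are hypothetical; this is only arithmetic on that assumed model, not the miner's actual allocation logic.

```python
KIB = 1024
GIB = 1024 ** 3


def scratchpad_bytes(intensity, bytes_per_thread=32 * KIB):
    """Total scratchpad size if 2**intensity hashes run at once (assumed model)."""
    return (1 << intensity) * bytes_per_thread


def fits(intensity, limit_bytes):
    """True if the assumed scratchpad total stays within limit_bytes."""
    return scratchpad_bytes(intensity) <= limit_bytes


if __name__ == "__main__":
    for intensity in (13, 16, 17):
        print(intensity, scratchpad_bytes(intensity) / GIB, "GiB")
```

Under these assumptions intensity 16 needs 2 GiB and intensity 17 needs 4 GiB, which would be too much both for a GPU with less than 3GB and for the 4 GiB address space of a 32-bit process.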
jr. member
Activity: 194
Merit: 4
Compiled it on Ubuntu 16.04, using CUDA 9.0.
When running, I'm getting an awful lot of:

GPU0 t0: nonce *somehash* fails CPU verification!

The amount varies with the --mode used. --mode 1 -> doesn't even submit a single share, and nvidia-smi shows 0% GPU utilization.
--mode 2 -> works, but does ~600kh/s.
--mode 3 -> works, and does 1mh/s.

Using driver 390 and a Pascal GPU.

There are many Pascal family GPUs. Every successful nonce reported by a GPU is verified by a CPU before being submitted to a daemon or pool. If the verification fails, there is a problem. Could be wrong settings, too much overclocking, inappropriate drivers, etc.

Seems like the issue is for some reason too low intensity. Anything below 13 and it starts failing. Also, anything higher than 16, just spams the GPU model.
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
Compiled it on Ubuntu 16.04, using CUDA 9.0.
When running, I'm getting an awful lot of:

GPU0 t0: nonce *somehash* fails CPU verification!

The amount varies with the --mode used. --mode 1 -> doesn't even submit a single share, and nvidia-smi shows 0% GPU utilization.
--mode 2 -> works, but does ~600kh/s.
--mode 3 -> works, and does 1mh/s.

Using driver 390 and a Pascal GPU.

There are many Pascal family GPUs. Every successful nonce reported by a GPU is verified by a CPU before being submitted to a daemon or pool. If the verification fails, there is a problem. Could be wrong settings, too much overclocking, inappropriate drivers, etc.
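
The check described above can be sketched roughly as follows. This is a generic illustration of re-hashing a device-reported nonce on the host, not NSGminer's actual code: SHA-256 stands in for NeoScrypt, and toy_hash, cpu_verify and filter_shares are hypothetical names.

```python
import hashlib
import struct


def toy_hash(header, nonce):
    """Stand-in proof-of-work hash (SHA-256, NOT NeoScrypt), read as a little-endian integer."""
    digest = hashlib.sha256(header + struct.pack("<I", nonce)).digest()
    return int.from_bytes(digest, "little")


def cpu_verify(header, nonce, target, hash_fn=toy_hash):
    """Recompute the hash on the CPU and check that it meets the target."""
    return hash_fn(header, nonce) <= target


def filter_shares(header, candidate_nonces, target):
    """Split device-reported nonces into (accepted, rejected) after CPU re-hashing."""
    accepted, rejected = [], []
    for nonce in candidate_nonces:
        if cpu_verify(header, nonce, target):
            accepted.append(nonce)
        else:
            rejected.append(nonce)  # a real miner would log this as a failed verification
    return accepted, rejected
```

Only the accepted nonces would go on to the daemon or pool. A steady stream of rejects means the device is computing wrong hashes, which fits the listed causes (settings, overclocking, drivers).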
jr. member
Activity: 194
Merit: 4
Compiled it on Ubuntu 16.04, using CUDA 9.0.
When running, I'm getting an awful lot of:

GPU0 t0: nonce *somehash* fails CPU verification!

The amount varies with the --mode used. --mode 1 -> doesn't even submit a single share, and nvidia-smi shows 0% GPU utilization.
--mode 2 -> works, but does ~600kh/s.
--mode 3 -> works, and does 1mh/s.

Using driver 390 and a Pascal GPU.
legendary
Activity: 1884
Merit: 1005
hi, I am using the nvidia gtx 1050 for the miner.
but when I run the .bat file, it just goes blank before it force closes.
any idea why? thanks so much

add pause at the end, on a new line of your .bat file; that will help to identify the problem.
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
hi, I am using the nvidia gtx 1050 for the miner.
but when I run the .bat file, it just goes blank before it force closes.
any idea why? thanks so much

Run in text only mode with -T to locate the issue.
jr. member
Activity: 310
Merit: 1
hi, I am using the nvidia gtx 1050 for the miner.
but when I run the .bat file, it just goes blank before it force closes.
any idea why? thanks so much
legendary
Activity: 1884
Merit: 1005
Titan V is performing pretty well, giving nearly 3mh/s per card.