Topic: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner - page 25. (Read 221777 times)

legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
Well, with 7950 at 435kh/s and 280X at 500kh/s, it seems my Neoscrypt is straight up compute dependent, as that's very close to the same percentage increase as the CU count... the wall I'm running into is compute, then, not memory. Damn.
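
As a quick sanity check on that reasoning, the two ratios can be compared directly. This is only a sketch: the compute unit counts (28 for the HD 7950, 32 for the R9 280X) are taken from AMD's public specs rather than from this thread, and any clock difference between the two cards is ignored.

```python
# Hashrates quoted in the post above (KH/s).
hashrate_7950 = 435.0
hashrate_280x = 500.0

# Assumption: compute unit counts from AMD's public specs, not from this thread.
cus_7950 = 28
cus_280x = 32

hashrate_ratio = hashrate_280x / hashrate_7950
cu_ratio = cus_280x / cus_7950

print(f"hashrate ratio: {hashrate_ratio:.3f}")
print(f"CU ratio:       {cu_ratio:.3f}")
```

If the two ratios stay close, the hashrate is tracking the CU count, which is what a compute-bound kernel would do.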

500KH/s on a reference R9 280X is about what I get now after optimising my v6 kernel, and I think it can do more. That's with vector Salsa, ChaCha and BLAKE2s; the scalar versions are about 10KH/s slower.



EDIT: Updated the screen shot with a longer run time.

ANN: If anyone wants NVIDIA OpenCL support in NSGminer with optimisations and hardware monitoring, donate to the addresses in the OP and post to this thread. The target is 0.5 BTC to cover the hardware cost at least.
sr. member
Activity: 506
Merit: 252
In future you should advise these coin devs on their algos cuz quite often they fail on their own.

A memory-intensive algo being compute-constrained, lol.
member
Activity: 81
Merit: 1002
It was only the wind.
Attempting to run the miner on Win 7 64-bit, I get the "libwinpthread-1.dll is missing" error.

Try copying it from any other miner.
legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
Looks very good. Have you tweaked the kernel settings or left the defaults there?

I actually rewrote most of it:

- Chacha and Salsa are now done vectorized on GCN. Unroll level is still three for both.
- Blake2S is done parallel, too
- Your bytewise copies were left for now - the bytewise XORs are now done by uints
- Removed your little AND operation on bufptr
- Replaced your if/else structure for creating the output with a single loop doing a bytewise XOR (yes, it works in 100% of cases)
- Created a BlkMix() function for cleanliness
- Split the work over several kernels
- Added ScratchpadLoad/ScratchpadStore/ScratchpadMix functions for cleanliness and a better striped access pattern in memory
- Parallelized the SMix() calls
- Abused the TMTO vulnerability, and made it configurable in the miner
- Shrunk code size by a lot
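
For readers unfamiliar with the TMTO mentioned in the list, here is a minimal sketch of the general idea on a scrypt-style SMix loop. It is not the kernel's code: `mix` is a stand-in built on hashlib's BLAKE2s rather than the real Salsa/ChaCha block mix, and `gap` is a hypothetical name for the configurable trade-off factor.

```python
import hashlib

def mix(block: bytes) -> bytes:
    # Stand-in for the real block mix (assumption: any fixed-size mixing function works here).
    return hashlib.blake2s(block).digest()

def integerify(block: bytes, n: int) -> int:
    # Pick a scratchpad index from the current state.
    return int.from_bytes(block[:4], "little") % n

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def smix_full(x: bytes, n: int) -> bytes:
    # Reference version: store all n scratchpad entries.
    v = []
    for _ in range(n):
        v.append(x)
        x = mix(x)
    for _ in range(n):
        x = mix(xor(x, v[integerify(x, n)]))
    return x

def smix_tmto(x: bytes, n: int, gap: int) -> bytes:
    # Trade-off version: store every gap-th entry, recompute the rest on demand.
    v = []
    for i in range(n):
        if i % gap == 0:
            v.append(x)
        x = mix(x)
    for _ in range(n):
        j = integerify(x, n)
        vj = v[j // gap]
        for _ in range(j % gap):   # up to gap - 1 extra mixes per lookup
            vj = mix(vj)
        x = mix(xor(x, vj))
    return x

seed = hashlib.blake2s(b"example input").digest()
print(smix_full(seed, 128) == smix_tmto(seed, 128, 2))
```

With `gap = 2` the scratchpad is half the size, at the cost of extra `mix` calls during the second loop; both versions return the same result.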

Well, we can make much better progress if you upload your work somewhere so I can take a closer look. I'm very flexible on NSGminer and can do things SGminer will not, since they need to keep compatibility with their bunch of various algos and kernels. NSGminer isn't my private project; you can also commit your changes.

While optimising for GCN, I also try not to break support for VLIW. For example, this kernel is about 2x faster than yours on the VLIW5 & VLIW4 hardware. I admit most miners are on GCN now, but it's a good thing to keep the older hardware useful.

BLAKE2S_COMPACT just butchered the hashrate. About the miner, though... one thing bugs me. I know it's based on BFGMiner, but it terminates my X server with *extreme* prejudice - killing it and then NSGMiner dies in an uncontrolled fashion. I can tell because of the error from the X server dumped right before NSGMiner dies without taking care of ncurses, meaning I can't see what I type in that shell until I do a reset of the shell, reboot, etc.

Well, BLAKE2S_COMPACT is just an option which doesn't do any good right now apart from reducing the compiled kernel size. NSGminer is a fork of BFGminer v2.10.14, which was last updated 2 years ago. It had no ASIC-related code, which I would otherwise have had to remove, and it has a few good things not found in CGminer/SGminer. I have rewritten many parts anyway to make it really work rather than just work. For example, SGminer displays incorrect block hashes, a bogus network diff and best share, the share diff is inflated by a factor of 16, and solo mining may still be broken. diff1 for NeoScrypt (and Scrypt, too) is 0.00024414 of the BTC diff1; that's nBits = 0x1E0FFFF0 (big endian), which decompresses to this uint256 target:

0x00000FFFF0000000000000000000000000000000000000000000000000000000

SGminer lost one of the most significant zeros for some arcane reason; I guess it was also wrong in the initial CGminer port to NeoScrypt. Dividing this target by a share's hash (or target) gives the share diff, so with one hex zero missing all share diffs come out 16x higher than they actually are.
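
The decompression and the 16x effect can be checked with a few lines. This is a sketch of the standard compact-target expansion, not NSGminer's code; the function names are illustrative, and 0x1D00FFFF is assumed to be Bitcoin's diff1 nBits.

```python
def compact_to_target(nbits: int) -> int:
    """Expand a compact nBits value into the full 256-bit target."""
    exponent = nbits >> 24           # length of the target in bytes
    mantissa = nbits & 0x007FFFFF    # most significant three bytes
    if exponent <= 3:
        return mantissa >> (8 * (3 - exponent))
    return mantissa << (8 * (exponent - 3))

NEOSCRYPT_DIFF1 = compact_to_target(0x1E0FFFF0)
BITCOIN_DIFF1 = compact_to_target(0x1D00FFFF)   # assumption: Bitcoin's diff1 nBits

def share_diff(hash_value: int, diff1_target: int = NEOSCRYPT_DIFF1) -> float:
    """Share difficulty is the diff1 target divided by the share's hash."""
    return diff1_target / hash_value

print(f"{NEOSCRYPT_DIFF1:064x}")
print(BITCOIN_DIFF1 / NEOSCRYPT_DIFF1)

# Dropping one leading hex zero makes the target 16 times larger.
example_hash = NEOSCRYPT_DIFF1 // 8
print(share_diff(example_hash), share_diff(example_hash, NEOSCRYPT_DIFF1 * 16))
```

The second printed value is the ratio between the two diff1 definitions, and the last line shows the same share reported at 16 times its real difficulty when the target has lost a hex zero.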

I run NSGminer in lxterminal usually with a cron powered watchdog script, though I don't recall it crashing. Maybe it needs an update in this area to make X happy.
legendary
Activity: 1302
Merit: 1000
ORB has a good chance to grow.
Improving hash by working on my aligned copy funcs - they need amd_bfm, amd_bitalign, etc.

you changed the kernel?

I rewrote the entire thing, and had to make a good amount of changes to the CPU code to get it to run my new kernel. Actually kernels, plural. Didn't you read above?

Give me the kernel to test.

I know I've heard that one before...

Haha, no, I'd test it only for myself.
member
Activity: 81
Merit: 1002
It was only the wind.
I'm reading your whitepaper now, because I'm really curious about something...

What exactly was your goal when creating Neoscrypt? You list GPU/ASIC computation costs (or lack thereof) seemingly in a bad light - yet you haven't made Neoscrypt actually resistant to such.

You also state "A single instance of NeoScrypt utilises (N + 3) * r * 128 bytes of memory space, i.e. 32.75 Kb, in series mode or (2 * N + 3) * r * 128 bytes, i.e. 64.75Kb, in parallel mode." This is technically correct because you didn't say Neoscrypt *requires* this much, only that it *utilizes* that much - which in the case of your implementation is quite true. But you don't make a case for it being memory-hard in the paper, so I'm assuming that wasn't a design goal?
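
The quoted figures can be reproduced from the formulas. The sketch below assumes N = 128 and r = 2, which are the values that yield 32.75 and 64.75; the function name is illustrative.

```python
def neoscrypt_memory_bytes(n: int, r: int, parallel: bool = False) -> int:
    """Memory per instance, using the two formulas quoted from the whitepaper."""
    blocks = (2 * n + 3) if parallel else (n + 3)
    return blocks * r * 128

N, R = 128, 2   # assumption: parameters that match the quoted sizes

series_bytes = neoscrypt_memory_bytes(N, R)
parallel_bytes = neoscrypt_memory_bytes(N, R, parallel=True)

print(series_bytes / 1024, "KiB in series mode")
print(parallel_bytes / 1024, "KiB in parallel mode")
```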
member
Activity: 81
Merit: 1002
It was only the wind.
EDIT: How are you running intensity 16 and 17? Not even my 290X or Fury will enqueue the kernel with that setting.

Hope your OS is 64-bit? It's possible if you have a lot of free system memory. The driver allocates huge buffers there for swapping. CL_DEVICE_GLOBAL_MEM_SIZE should tell how much is available. clinfo outputs it under "Global memory size".
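
A rough back-of-the-envelope estimate shows why high intensities can spill past on-board memory. This sketch assumes the miner launches 2^I work items at intensity I and that each one keeps a full 32.75 KB scratchpad (the series-mode figure quoted from the whitepaper); neither assumption is confirmed in this thread, and a TMTO setting would shrink the per-item figure.

```python
SCRATCHPAD_BYTES = 33536   # 32.75 KiB per hash in series mode (assumed per work item)

def scratchpad_total_gib(intensity: int, per_item: int = SCRATCHPAD_BYTES) -> float:
    work_items = 1 << intensity   # assumption: 2^I concurrent work items
    return work_items * per_item / 2**30

for intensity in (15, 16, 17):
    print(f"I={intensity}: {scratchpad_total_gib(intensity):.2f} GiB")
```

Compare the printed totals with the "Global memory size" that clinfo reports for the card.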


It is indeed. 32GB of DDR3 in the case of the 290X and Fury... Odd.
legendary
Activity: 1302
Merit: 1000
ORB has a good chance to grow.

I achieved the following after 5 minutes of mining from various AMD drivers when placed in your miner's folder:


Let it run longer; with the ghost miner you will get very good performance.
member
Activity: 81
Merit: 1002
It was only the wind.
NSGminer v0.9.1 released with my NeoScrypt OpenCL kernel v6. Should be compatible with the latest AMD Catalyst drivers. Also delivers a little performance improvement over the previous release.

Tried to use it, but this one complains about libcurl not supporting stratum+tcp, and if I remove it, the miner just tries http...

EDIT: Sorry, I'm an idiot; my test pool is down, but their website is up...

You have debug enabled probably. Of course libcurl doesn't support stratum+tcp. Disregard it.


No, it wouldn't run at all, which is why I reported it. Reason being it wasn't getting work, reason being stupid pool backend being down.

EDIT: How are you running intensity 16 and 17? Not even my 290X or Fury will enqueue the kernel with that setting.
legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
I haven't seen any real improvement on vector BLAKE2s, though I have added it as a feature.

Yeah, that makes sense.  Lots of ways to represent success.

Could you switch to vector code for Salsa and ChaCha to see if it makes a positive difference on GCN with the 15.x drivers?

Code:
#elif (__Tahiti__) || (__Pitcairn__) || (__Capeverde__) || \
(__Oland__) || (__Hainan__) || \
(__Hawaii__) || (__Bonaire__) || \
(__Kalindi__) || (__Mullins__) || (__Spectre__) || (__Spooky__) || \
(__Tonga__) || (__Iceland__)
#define SALSA_SCALAR 0
#define CHACHA_SCALAR 0
#define BLAKE2S_SCALAR 1
#define FASTKDF_SCALAR 0

FASTKDF_COMPACT 1 also seems to improve performance a little on GCN. Maybe SALSA_UNROLL_LEVEL and CHACHA_UNROLL_LEVEL are better if set to 3 like previously instead of 4.
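
For anyone wanting to A/B test these switches without hand-editing each time, a small helper along these lines could rewrite the `#define` values in a copy of the kernel source. It is only a sketch: `set_define` is a hypothetical helper, not part of NSGminer, and it assumes each switch appears as a plain `#define NAME value` line.

```python
import re

def set_define(source: str, name: str, value: int) -> str:
    """Return source with every '#define NAME <old>' line changed to the new value."""
    pattern = re.compile(
        rf"^([ \t]*#define[ \t]+{re.escape(name)}[ \t]+)\S+", re.MULTILINE
    )
    patched, count = pattern.subn(rf"\g<1>{value}", source)
    if count == 0:
        raise KeyError(f"#define {name} not found")
    return patched

# Excerpt of the switches quoted above, used here as sample input.
kernel_excerpt = """#define SALSA_SCALAR 0
#define CHACHA_SCALAR 0
#define BLAKE2S_SCALAR 1
#define FASTKDF_SCALAR 0
"""

patched = set_define(kernel_excerpt, "SALSA_SCALAR", 1)
patched = set_define(patched, "CHACHA_SCALAR", 1)
print(patched)
```

Note that this changes every matching line, so in a kernel with one block per GPU family it would flip the switch for all of them. The cached .bin file has to be deleted afterwards so that the kernel is rebuilt.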
hero member
Activity: 935
Merit: 1001
I don't always drink...
Here's what I got with your new Windows 64 miner, all GPU voltages at 1.2, I used Sapphire HD 7970s (Hynix), and your recommended configuration:

Code:
@echo off
setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1
nsgminer 2>logfile.txt --neoscrypt -g 1 -w 128 -I 16 --gpu-engine 1000 --gpu-memclock 1500 -o stratum+tcp://neoscrypt.usa.nicehash.com:3341 -O [ADDRESS]:[x]

I rebooted the rig between each test and deleted the existing .bin file.  I didn't use any type of driver cleaner utility.
I achieved the following after 5 minutes of mining from various AMD drivers when placed in your miner's folder:
------------------------------------------------------------------------------------------
AMD 15.7.1 (stock installation on rig):

[08:20:20] OCL0                | 5s:  0.0 avg:226.3 u:316.5 KH/s | A:24 R:0 HW:0 WU:18.1/m
[08:20:20] OCL1                | 5s:  0.0 avg:230.4 u:184.6 KH/s | A:14 R:0 HW:0 WU:10.6/m
[08:20:20] OCL2                | 5s:  0.0 avg:226.7 u:224.2 KH/s | A:17 R:0 HW:0 WU:12.8/m
[08:20:20] OCL3                | 5s:  0.0 avg:226.9 u:158.3 KH/s | A:12 R:0 HW:0 WU:9.1/m

------------------------------------------------------------------------------------------
AMD 14.12:

[08:29:23] OCL0                | 5s:  0.0 avg:239.2 u:228.7 KH/s | A:18 R:0 HW:0 WU:13.1/m
[08:29:23] OCL1                | 5s:  0.0 avg:239.4 u:282.5 KH/s | A:21 R:0 HW:0 WU:16.2/m
[08:29:23] OCL2                | 5s:  0.0 avg:239.2 u:228.7 KH/s | A:17 R:0 HW:0 WU:13.1/m
[08:29:23] OCL3                | 5s:  0.0 avg:239.6 u:336.3 KH/s | A:25 R:0 HW:0 WU:19.2/m

--------------------------------------------------------------------------------------------
AMD 14.9:

[08:13:28] OCL0                | 5s:  0.0 avg:379.2 u:459.8 KH/s | A:37 R:0 HW:0 WU:26.3/m
[08:13:28] OCL1                | 5s:  0.0 avg:378.2 u:410.1 KH/s | A:33 R:0 HW:0 WU:23.5/m
[08:13:28] OCL2                | 5s:  0.0 avg:378.6 u:360.4 KH/s | A:29 R:0 HW:0 WU:20.6/m
[08:13:28] OCL3                | 5s:  0.0 avg:378.4 u:298.2 KH/s | A:24 R:0 HW:0 WU:17.1/m

-------------------------------------------------------------------------------------------
AMD 14.7rc3:

[09:11:02] OCL0                | 5s:  0.0 avg:370.1 u:409.0 KH/s | A:31 R:0 HW:0 WU:23.4/m
[09:11:02] OCL1                | 5s:  0.0 avg:370.7 u:277.1 KH/s | A:21 R:0 HW:0 WU:15.9/m
[09:11:02] OCL2                | 5s:  0.0 avg:370.1 u:303.5 KH/s | A:23 R:0 HW:0 WU:17.4/m
[09:11:02] OCL3                | 5s:  0.0 avg:371.1 u:422.2 KH/s | A:32 R:0 HW:0 WU:24.2/m

--------------------------------------------------------------------------------------------
AMD 14.6:

[08:51:58] OCL0                | 5s:  0.0 avg:368.9 u:290.3 KH/s | A:22 R:0 HW:0 WU:16.6/m
[08:51:58] OCL1                | 5s:  0.0 avg:369.0 u:395.9 KH/s | A:30 R:0 HW:0 WU:22.7/m
[08:51:58] OCL2                | 5s:  0.0 avg:368.9 u:382.7 KH/s | A:29 R:0 HW:0 WU:21.9/m
[08:51:58] OCL3                | 5s:  0.0 avg:368.3 u:435.5 KH/s | A:33 R:0 HW:0 WU:24.9/m

--------------------------------------------------------------------------------------------
AMD 14.4:

[08:42:44] OCL0                | 5s:  0.0 avg:351.1 u:255.0 KH/s | A:19 R:0 HW:0 WU:14.6/m
[08:42:44] OCL1                | 5s:  0.0 avg:351.3 u:241.6 KH/s | A:18 R:0 HW:0 WU:13.8/m
[08:42:44] OCL2                | 5s:  0.0 avg:351.1 u:308.7 KH/s | A:23 R:0 HW:0 WU:17.7/m
[08:42:44] OCL3                | 5s:  0.0 avg:350.9 u:295.3 KH/s | A:22 R:0 HW:0 WU:16.9/m

--------------------------------------------------------------------------------------------
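
To compare drivers at a glance, the per-GPU "avg" figures can be pulled out of the status lines and averaged. A minimal sketch, assuming the line format shown above; only two of the driver runs are included to keep it short, and the helper name is illustrative.

```python
import re
from statistics import mean

AVG_FIELD = re.compile(r"avg:\s*([\d.]+)")

def average_hashrate(log_lines):
    """Mean of the 'avg:' field (KH/s) over the given status lines."""
    values = [float(m.group(1)) for m in map(AVG_FIELD.search, log_lines) if m]
    return mean(values)

runs = {
    "15.7.1": [
        "[08:20:20] OCL0 | 5s:  0.0 avg:226.3 u:316.5 KH/s | A:24 R:0 HW:0 WU:18.1/m",
        "[08:20:20] OCL1 | 5s:  0.0 avg:230.4 u:184.6 KH/s | A:14 R:0 HW:0 WU:10.6/m",
        "[08:20:20] OCL2 | 5s:  0.0 avg:226.7 u:224.2 KH/s | A:17 R:0 HW:0 WU:12.8/m",
        "[08:20:20] OCL3 | 5s:  0.0 avg:226.9 u:158.3 KH/s | A:12 R:0 HW:0 WU:9.1/m",
    ],
    "14.9": [
        "[08:13:28] OCL0 | 5s:  0.0 avg:379.2 u:459.8 KH/s | A:37 R:0 HW:0 WU:26.3/m",
        "[08:13:28] OCL1 | 5s:  0.0 avg:378.2 u:410.1 KH/s | A:33 R:0 HW:0 WU:23.5/m",
        "[08:13:28] OCL2 | 5s:  0.0 avg:378.6 u:360.4 KH/s | A:29 R:0 HW:0 WU:20.6/m",
        "[08:13:28] OCL3 | 5s:  0.0 avg:378.4 u:298.2 KH/s | A:24 R:0 HW:0 WU:17.1/m",
    ],
}

for driver, lines in runs.items():
    print(f"{driver}: {average_hashrate(lines):.1f} KH/s per GPU")
```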

Okay, so that was for a rig of 4 HD 7970s, all Hynix memory.  Next I tested the 14.9 drivers on
4 Gigabyte Windforce R9 280x GPUs, all Elpida.  They wouldn't start with I 16, so I bumped them
down to I 15.

AMD 14.9 at I=15:

[09:36:18] OCL0                | 5s:  0.0 avg:374.0 u:284.6 KH/s | A:22 R:0 HW:0 WU:16.3/m
[09:36:18] OCL1                | 5s:361.6 avg:374.9 u:271.7 KH/s | A:21 R:0 HW:0 WU:15.5/m
[09:36:18] OCL2                | 5s:361.7 avg:374.0 u:323.5 KH/s | A:25 R:0 HW:0 WU:18.5/m
[09:36:18] OCL3                | 5s:361.4 avg:373.9 u:258.8 KH/s | A:20 R:0 HW:0 WU:14.8/m

That looks like comparable output for Hynix memory but at lower intensity.


==============================================================================

Now I will test Wolf0's neoscrypt.cl and .bin from the most recent Nicehash miner bin folder with sgminer-5-0-1, xI=2, gpu-threads=2, 1000/1500:


280x Elpida, with Wolf0's neoscrypt kernel with AMD 14.7rc3 drivers I get:

[09:53:00] GPU0                | (5s):403.8K (avg):424.4Kh/s | A:2496 R:0 HW:0 WU:403.230/m
[09:53:00] GPU1                | (5s):406.7K (avg):426.6Kh/s | A:2432 R:0 HW:0 WU:399.363/m
[09:53:00] GPU2                | (5s):406.3K (avg):425.7Kh/s | A:1728 R:0 HW:0 WU:253.722/m
[09:53:00] GPU3                | (5s):405.3K (avg):425.6Kh/s | A:2048 R:0 HW:0 WU:331.606/m


And going back to the HD7970 cards with 14.7rc3 drivers I get:

[09:57:30] GPU0                | (5s):407.0K (avg):417.7Kh/s | A:2368 R:0 HW:0 WU:373.366/m
[09:57:30] GPU1                | (5s):408.4K (avg):419.9Kh/s | A:2368 R:0 HW:0 WU:380.408/m
[09:57:30] GPU2                | (5s):410.5K (avg):419.4Kh/s | A:1856 R:0 HW:0 WU:291.159/m
[09:57:30] GPU3                | (5s):406.9K (avg):416.4Kh/s | A:2368 R:0 HW:0 WU:387.625/m


So, in summary, from the perspective of this dead-end user, it seems that for the moment Wolf0's kernel and .bin file from the most recent Nicehash miner bin folder, together with the AMD 14.7rc3 drivers, give the better hashrate.

Also, I cannot explain the apparent discrepancy in sgminer's accepted as compared to nsgminer's accepted.  I assume that there is a factor of 100 somewhere.
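
One way to reconcile the two miners' numbers is to convert WU back into a hashrate. The sketch below assumes WU is the number of diff1 shares per minute, and that one diff1 share costs about 2^20 hashes in NSGminer (2^32 for Bitcoin's diff1, scaled by the 0.00024414 factor mentioned elsewhere in this thread) versus 2^16 hashes in SGminer (the usual Scrypt convention). Both are assumptions about the miners' internals, not something confirmed here.

```python
def wu_to_khs(wu_per_min: float, hashes_per_diff1_share: int) -> float:
    """Convert work utility (diff1 shares per minute) to KH/s."""
    return wu_per_min * hashes_per_diff1_share / 60 / 1000

# NSGminer, OCL0 on 15.7.1: WU 18.1/m, reported u:316.5 KH/s
nsgminer_khs = wu_to_khs(18.1, 2**20)

# SGminer, GPU0 on the 280X rig: WU 403.230/m, reported avg 424.4 Kh/s
sgminer_khs = wu_to_khs(403.230, 2**16)

print(f"nsgminer: {nsgminer_khs:.1f} KH/s, sgminer: {sgminer_khs:.1f} KH/s")
```

Under those assumptions the WU figures differ by a factor of 16 per share rather than 100. The accepted counts also depend on the pool's share difficulty and on how long each miner ran, which the logs above do not show.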
legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
EDIT: Odd, with the latest on 7950, I'm getting ~155kH/s on 7950 clocked at 950/1250. Intensity 15, worksize 128. Is this correct?

I have run some tests and it seems the GCN cards are more comfortable with scalar Salsa and ChaCha. The AMD compiler optimises them better. For example, I get 680KH/s with vector and 690KH/s with scalar on 14.6. By default, the v6 kernel uses scalar for GCN and vector for VLIW5 and VLIW4. You can edit the kernel to run tests with different settings. Are you on Crimson or 15.11?


Both of those suck - 15.7.x seems to work best for me. What hashrate did you get with a 7950? Surely not 680 - 690kh/s?

That's for HD7990, dual Tahiti, downvolted and downclocked to 850/1250. I have no HD7950, though most people o/c them from 800/1250 to 1000/1500 or higher where they perform like a single GPU of my HD7990, so 350KH/s should be achievable. HD7970 (R9 280X) delivers 380 to 450KH/s depending on overclock and memory (Elpida sucks, Hynix wins).
member
Activity: 81
Merit: 1002
It was only the wind.
I have compiled and uploaded the 64-bit Windows binaries per numerous requests. Tested fine on a Win7 notebook I got recently.

https://github.com/ghostlander/nsgminer/releases/tag/nsgminer-v0.9.0

Wolf0, your kernel distributed by NiceHash with their miner is also well done; I could use an idea or two from it. One of my primary concerns was to get rid of scratch register usage in order to run more than one wavefront concurrently, but FastKDF seems to be too complicated to fit in VRegs and SRegs alone. I have cut the scratch reg usage down by half, though, which also helps.


You can run more than one wavefront concurrently while using scratch regs - I'm doing it now - but the cause of this issue isn't FastKDF's complexity, I think, it's the stupid accesses. Since VGPRs on GCN are 4 bytes wide, it shits itself trying to keep the data in regs while you access it in such a misaligned fashion - so it dumps it to global.
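
To illustrate the alignment problem: when data lives in 4-byte registers, a 32-bit read at an arbitrary byte offset has to be stitched together from two neighbouring words. The sketch below emulates that in Python with a funnel shift, which is the operation `amd_bitalign` from the `cl_amd_media_ops` extension provides. The helper names are illustrative and this is not code from either kernel.

```python
import struct

MASK32 = 0xFFFFFFFF

def bitalign(hi: int, lo: int, shift: int) -> int:
    """Emulate amd_bitalign: low 32 bits of the 64-bit value hi:lo shifted right."""
    return (((hi << 32) | lo) >> (shift & 31)) & MASK32

def load_u32_unaligned(words, byte_offset: int) -> int:
    """Read a little-endian 32-bit value at any byte offset from aligned words."""
    index, rem = divmod(byte_offset, 4)
    # The caller must make sure words[index + 1] exists (pad the buffer if needed).
    return bitalign(words[index + 1], words[index], 8 * rem)

data = bytes(range(16))
words = struct.unpack("<4I", data)   # four aligned 32-bit words

print(hex(load_u32_unaligned(words, 5)))
print(hex(int.from_bytes(data[5:9], "little")))
```

Both lines print the same value. With a byte offset known only at run time, indexing the words dynamically is the part that tends to push data out of registers.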