
Topic: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner - page 23. (Read 221582 times)

legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
iBeLink DM384M Dash Miner

If what they advertise is true, 384MH/s for 715W and $2098 in cash, say good-bye to X11 GPU mining. A reference R9 280X outputs 2MH/s for maybe 150W. If they produce enough of these ASICs for themselves, they can do a 51% attack on any X11 coin including Dash.
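
To put those numbers side by side, here is a minimal sketch in C that computes hashrate per watt. The figures are the advertised and approximate ones from the post above; the function names are made up for illustration.

```c
/* Hashrate per watt, in MH/s per W. */
double mhs_per_watt(double mhs, double watts)
{
    return mhs / watts;
}

/* How many times more efficient device A is than device B. */
double efficiency_ratio(double mhs_a, double watts_a,
                        double mhs_b, double watts_b)
{
    return mhs_per_watt(mhs_a, watts_a) / mhs_per_watt(mhs_b, watts_b);
}
```

Calling efficiency_ratio(384.0, 715.0, 2.0, 150.0) gives roughly 40, so about 40 times more hashes per watt if the advertised specs hold.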
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
I don't say this often, but Ghostlander - really well done on the improvements to FastKDF and everything it calls. I know I gave the idea for the aligned copies/XORs, but your implementation is quite nice. I do wish you would use loops with #pragma unroll just a little more often, but your code is decently readable regardless. Extremely well done implementation on that. In fact, I wouldn't be surprised if, comparing only FastKDF, your implementation surpasses my own.

However, I must say, you still haven't optimized quite an important bit here... the main loop.

FastKDF was a major bottleneck in v6, so I had to fix it first. Catalyst drivers newer than 14.7 lost the ability to align reads and writes properly on their own, so bitalign was the way to go. I know there are other places which need optimisations. That's for the next release.


Thanks for all your work Ghost, but just a heads up: the R9 Nano gets hardware errors with the same settings I used in 0.9.0. BIG jumps on my other GPUs, though.

I wish I had a Nano or Fury for testing. I have added their ID as well as the Carrizo ID (the last AMD APU) to the kernel. Hope they work well with the default GCN settings. The ISA code looks good at least. Pull it from my GitHub and let me know.


nice work @Ghostlander

thanks

does this miner work solo ??

Of course it does. That's how I use it most of the time.
hero member
Activity: 528
Merit: 500
nice work @Ghostlander

thanks

does this miner work solo ??
newbie
Activity: 33
Merit: 0
@Ghostlander, thanks for all your work on NSGminer. Triple kudos!
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
NSGminer v0.9.2 released with my NeoScrypt OpenCL kernel v7 and other enhancements.

1. All GCN based AMD Radeons get a hash rate increase of 20% to 100% depending on driver version. The difference between 14.6 and 15.7 drivers is less than 2% now, so R9 280X @ 1000MHz may deliver 500KH/s.

2. Performance of the older VLIW based AMD Radeons has simply doubled. HD6970 @ 925MHz delivers 255KH/s now.

3. Added support for the very old VLIW based AMD Radeons of HD4000 series. HD4870 @ 750MHz can do 60KH/s. Not very much, but what do you expect of a card 8 years old?

4. Added initial support for the NVIDIA hardware. Thanks to the Feathercoin community for their donation of 0.3 BTC spent on a GTX 750 Ti. Performance improved from 50KH/s to 185KH/s @ 1400MHz shaders. Older GeForce cards down to the very old 8000 series are also known to work.

5. NVIDIA Management Library (NVML) may be used to provide temperature and fan speed data. Copy nvml.dll from your driver distribution package to the miner's directory.

A good deal of work has been put into this release, so consider a donation. The addresses and download links are in the OP.
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
cl_amd_media_ops is for bitalign/bytealign mostly which are not used in v6 directly. The compiler is supposed to take care of this, but it doesn't do well in the drivers newer than 14.7. It won't be an issue in the next release.
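
For anyone wondering what bitalign buys you here, below is a rough host-side model in C. The bitalign() function follows the definition given in the cl_amd_media_ops extension text (concatenate two 32-bit words and shift right by up to 31 bits). The copy_unaligned_words() helper and its name are made up for this sketch and are not the NSGminer kernel code; it assumes little-endian byte order and that src has one extra readable word at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of amd_bitalign(): take the 64-bit value hi:lo,
 * shift it right by (shift & 31) bits and keep the low 32 bits. */
static uint32_t bitalign(uint32_t hi, uint32_t lo, uint32_t shift)
{
    return (uint32_t)((((uint64_t)hi << 32) | lo) >> (shift & 31));
}

/* Copy `words` 32-bit words that start at byte offset `off` (0..3)
 * inside `src`, using aligned word reads only.
 * `src` must provide words + 1 readable entries. */
void copy_unaligned_words(uint32_t *dst, const uint32_t *src,
                          uint32_t off, size_t words)
{
    assert(off < 4);
    uint32_t shift = off * 8;
    for (size_t i = 0; i < words; i++)
        dst[i] = bitalign(src[i + 1], src[i], shift);
}
```

With src = { 0x44332211, 0x88776655, 0xCCBBAA99 } and off = 1, the first two output words are 0x55443322 and 0x99887766, i.e. the same bytes shifted by one position without any byte-sized loads or stores.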
member
Activity: 81
Merit: 1002
It was only the wind.
Improving hash by working on my aligned copy funcs - they need amd_bfm, amd_bitalign, etc.

you changed the kernel?

I rewrote the entire thing, and had to make a good amount of changes to the CPU code to get it to run my new kernel. Actually kernels, plural. Didn't you read above?

give me the kernel for testing ;)

I know I've heard that one before... :P
member
Activity: 181
Merit: 11
Hi ghostlander,

many thanks for your help, I appreciate it! Unfortunately I still can't get it running (see comments below), but I believe there is a "right way" to do it.



That's interesting. I haven't tried it myself with an open source Radeon driver even though I support open source development in many ways.

First of all, -g 2 -I 8 is no good. Start with -g 1 -I 10. As far as I can tell, it keeps the desktop interactive enough for most office tasks, watching online videos, etc. without significant discomfort.

Yes, of course, it was just an example. In a "production environment" I'll change these values to something realistic.


Second, it would have been kind of you to tell us that you are attempting to use Wolf0's old kernel rather than mine bundled with NSGminer.

"Device does not support unaligned stores" refers to the cl_khr_byte_addressable_store extension disabled or missing. It's required for all released NeoScrypt kernels, though will be unnecessary for my upcoming v7 kernel. Try to enable this extension:

Code:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

I added this line wherever I found the following line:
Code:
#pragma OPENCL EXTENSION cl_amd_media_ops : enable

In fact, I found it in four source files: diablo.cl, diakgcn.cl, phatk.cl, poclbm.cl.

But after compilation, the error is still present...



Next, "error: OpenCL does not support the 'static' storage class specifier". It refers to the following code:

Code:
/* Initialisation vector */
static const __constant uint8 blake2s_IV4[1] = {
    (uint8)(0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
            0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
};

static const __constant uchar blake2s_sigma[10][16] = {
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 } ,
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 } ,
    { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 } ,
    {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 } ,
    {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 } ,
    {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 } ,
    { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 } ,
    { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 } ,
    {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 } ,
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0 } ,
};

Well, this is a bug of your compiler actually. According to Khronos, Storage-class Qualifiers, static is allowed for global variables and constants. However you may remove it safely from the source code as const __constant is good enough to describe this data. The LLVM based AMD compiler doesn't care.

I tried to remove this code from neoscrypt.cl in two variants: removing both paragraphs, and removing just the second one (beginning with ...
Code:
static const __constant uchar blake2s_sigma[10][16] = {
), but after running make and then ./nsgminer, that error still persists too...


"error: use of undeclared identifier 'MAX_GLOBAL_THREADS'" -- neither NSGminer nor my kernel uses it.

The last warning may be disregarded.


Yep, I believe this error is just a minor issue. Once I'm able to solve the previous two, this should be "a quick action", I hope!


Many thanks again for your help and co-operation!
legendary
Activity: 2814
Merit: 1091
--- ChainWorks Industries ---
Couldn't even compile his ccminer for Windows XP where I have my 750 Ti running now. NVCC rejects MinGW and insists on M$ Visual Studio. VS2013 doesn't produce valid code even with vs120_xp target. Missing entry points in kernel32.dll like GetTickCount64 or InitializeCriticalSectionEx. So I had to strip ccminer down to NeoScrypt only and patch for VS2010 compatibility.


damn i like your style ...

and the resultant strip down and test? ...

btw - its one of the many reasons why i dont use windows ... too messy with compilations ...

#crysx

GTX 750 Ti @ 1400 = 285KH/s with CUDA 6.5 or 280KH/s with CUDA 7.5

Although I've somehow broken Stratum in the process, so it's solo mining through Getwork now.


i actually get higher with c75 than with c65 - when compile under fedora 23 x64 ...

i dunno what it is at the moment- but if you want to know - i can pull a miner off decred for testing neoscrypt if you like ...

but without stratum - its a pretty mess on getwork only Smiley ...

but you seem to be making leaps with this ...

btw - i dont oc anything ... just factory clocks ... my cards are gigabyte 750ti oc lp ( non powered ) ...

#crysx
member
Activity: 81
Merit: 1002
It was only the wind.
Improving hash by working on my aligned copy funcs - they need amd_bfm, amd_bitalign, etc.

you changed the kernel?

I rewrote the entire thing, and had to make a good amount of changes to the CPU code to get it to run my new kernel. Actually kernels, plural. Didn't you read above?
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
Couldn't even compile his ccminer for Windows XP where I have my 750 Ti running now. NVCC rejects MinGW and insists on M$ Visual Studio. VS2013 doesn't produce valid code even with vs120_xp target. Missing entry points in kernel32.dll like GetTickCount64 or InitializeCriticalSectionEx. So I had to strip ccminer down to NeoScrypt only and patch for VS2010 compatibility.


damn i like your style ...

and the resultant strip down and test? ...

btw - its one of the many reasons why i dont use windows ... too messy with compilations ...

#crysx

GTX 750 Ti @ 1400 = 285KH/s with CUDA 6.5 or 280KH/s with CUDA 7.5

Although I've somehow broken Stratum in the process, so it's solo mining through Getwork now.
legendary
Activity: 2814
Merit: 1091
--- ChainWorks Industries ---
More improvements.

GTX 750 Ti @ 1400 = 180KH/s

R9 280X @ 1000 = 500KH/s

Older Radeons boosted up again: HD6970 @ 925 = 255KH/s


thats some nice improvements ...

checkout sp thread for ccminer ( which has djm34 neoscrypt included ) and see what you get from that? ...

djm34 released his neoscrypt kernel only recently mate ...

so a lot can be had ( and maybe even improved - though djm34 does a very thorough job ) from the kernel ...

it is cuda based - but im sure the opensource kernel could help in some ways ...

really impressed with your improvements ... when you decide you would like to get some work outside of here done - let me know ... i could use your help and optimizations with granite ...

tanx ...

#crysx

Couldn't even compile his ccminer for Windows XP where I have my 750 Ti running now. NVCC rejects MinGW and insists on M$ Visual Studio. VS2013 doesn't produce valid code even with vs120_xp target. Missing entry points in kernel32.dll like GetTickCount64 or InitializeCriticalSectionEx. So I had to strip ccminer down to NeoScrypt only and patch for VS2010 compatibility.


damn i like your style ...

and the resultant strip down and test? ...

btw - its one of the many reasons why i dont use windows ... too messy with compilations ...

#crysx
member
Activity: 81
Merit: 1002
It was only the wind.
Improving hash by working on my aligned copy funcs - they need amd_bfm, amd_bitalign, etc.
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
More improvements.

GTX 750 Ti @ 1400 = 180KH/s

R9 280X @ 1000 = 500KH/s

Older Radeons boosted up again: HD6970 @ 925 = 255KH/s


thats some nice improvements ...

checkout sp thread for ccminer ( which has djm34 neoscrypt included ) and see what you get from that? ...

djm34 released his neoscrypt kernel only recently mate ...

so a lot can be had ( and maybe even improved - though djm34 does a very thorough job ) from the kernel ...

it is cuda based - but im sure the opensource kernel could help in some ways ...

really impressed with your improvements ... when you decide you would like to get some work outside of here done - let me know ... i could use your help and optimizations with granite ...

tanx ...

#crysx

Couldn't even compile his ccminer for Windows XP where I have my 750 Ti running now. NVCC rejects MinGW and insists on M$ Visual Studio. VS2013 doesn't produce valid code even with vs120_xp target. Missing entry points in kernel32.dll like GetTickCount64 or InitializeCriticalSectionEx. So I had to strip ccminer down to NeoScrypt only and patch for VS2010 compatibility.
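
One common way to live without GetTickCount64 on older Windows is to widen the 32-bit GetTickCount value by counting wraparounds. Below is a minimal sketch of that idea in portable C. It is not the actual patch applied to ccminer; tick64_state and tick64_extend are names made up here, and the 32-bit reading is passed in as a plain argument so the logic can be tried anywhere. It assumes the function is called at least once per wrap period (about 49.7 days for a millisecond counter) and from a single thread.

```c
#include <stdint.h>

/* Running state for widening a 32-bit tick counter to 64 bits. */
typedef struct {
    uint32_t last;  /* previous 32-bit reading */
    uint64_t high;  /* accumulated multiples of 2^32 */
} tick64_state;

/* Feed the current 32-bit reading; returns a 64-bit tick count
 * that keeps growing when the 32-bit counter wraps around. */
uint64_t tick64_extend(tick64_state *st, uint32_t now)
{
    if (now < st->last)              /* counter wrapped since last call */
        st->high += (uint64_t)1 << 32;
    st->last = now;
    return st->high | now;
}
```

Start with a zeroed tick64_state and pass each new 32-bit reading through tick64_extend().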
legendary
Activity: 2814
Merit: 1091
--- ChainWorks Industries ---
More improvements.

GTX 750 Ti @ 1400 = 180KH/s

R9 280X @ 1000 = 500KH/s

Older Radeons boosted up again: HD6970 @ 925 = 255KH/s


thats some nice improvements ...

checkout sp thread for ccminer ( which has djm34 neoscrypt included ) and see what you get from that? ...

djm34 released his neoscrypt kernel only recently mate ...

so a lot can be had ( and maybe even improved - though djm34 does a very thorough job ) from the kernel ...

it is cuda based - but im sure the opensource kernel could help in some ways ...

really impressed with your improvements ... when you decide you would like to get some work outside of here done - let me know ... i could use your help and optimizations with granite ...

tanx ...

#crysx
member
Activity: 81
Merit: 1002
It was only the wind.
Looks very good. Have you tweaked the kernel settings or left the defaults there?

I actually rewrote most of it:

- Chacha and Salsa are now done vectorized on GCN. Unroll level is still three for both.
- Blake2S is done parallel, too
- Your bytewise copies were left for now - the bytewise XORs are now done by uints
- Removed your little AND operation on bufptr
- Replaced your if/else structure for creating the output with a single loop doing a bytewise XOR (yes, it works in 100% of cases)
- Created a BlkMix() function for cleanliness
- Split the work over several kernels
- Added ScratchpadLoad/ScratchpadStore/ScratchpadMix functions for cleanliness and a better striped access pattern in memory
- Parallelized the SMix() calls
- Abused the TMTO vulnerability, and made it configurable in the miner
- Shrunk code size by a lot
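
A rough sketch of the time-memory trade-off (TMTO) idea from the list above, as it applies to scrypt-style SMix loops in general: keep only every gap-th scratchpad entry and recompute the missing ones on demand. Everything named here is an assumption for illustration: toy_mix is a made-up stand-in for the real block mix, and smix_toy and gap are invented names. This is not the kernel code discussed in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy 32-bit mixing step standing in for the real block mix. */
static uint32_t toy_mix(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x * 2654435761u + 1u;
}

/* SMix-style loop over n scratchpad entries.
 * gap == 1 stores every entry; gap > 1 stores every gap-th entry
 * and recomputes the others, trading memory for extra mixing. */
uint32_t smix_toy(uint32_t x, uint32_t n, uint32_t gap)
{
    assert(n >= 1 && gap >= 1);
    uint32_t stored = (n + gap - 1) / gap;
    uint32_t *v = malloc(stored * sizeof *v);
    if (v == NULL)
        abort();

    /* Fill phase: walk the chain, keep only every gap-th value. */
    for (uint32_t i = 0; i < n; i++) {
        if (i % gap == 0)
            v[i / gap] = x;
        x = toy_mix(x);
    }

    /* Read phase: rebuild entry j from the nearest stored value. */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t j = x % n;
        uint32_t vj = v[j / gap];
        for (uint32_t k = 0; k < j % gap; k++)
            vj = toy_mix(vj);
        x = toy_mix(x ^ vj);
    }

    free(v);
    return x;
}
```

The result is the same for any gap; only the memory footprint (n / gap entries, rounded up) and the amount of recomputation change, which is why a configurable gap is a tuning knob rather than a correctness issue.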

Well, we can make much better progress if you upload your work somewhere to take a closer look. I'm very flexible on NSGminer and can do things SGminer will not in order to keep compatibility with their bunch of various algos and kernels. NSGminer isn't my private project, you can also commit your changes.

While optimising for GCN, I also try not to break support for VLIW. For example, this kernel is about 2x faster than yours on the VLIW5 & VLIW4 hardware. I admit most miners are on GCN now, but it's a good thing to keep the older hardware useful.

BLAKE2S_COMPACT just butchered the hashrate. About the miner, though... one thing bugs me. I know it's based on BFGMiner, but it terminates my X server with *extreme* prejudice - killing it and then NSGMiner dies in an uncontrolled fashion. I can tell because of the error from the X server dumped right before NSGMiner dies without taking care of ncurses, meaning I can't see what I type in that shell until I do a reset of the shell, reboot, etc.
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
More improvements.

GTX 750 Ti @ 1400 = 180KH/s

R9 280X @ 1000 = 500KH/s

Older Radeons boosted up again: HD6970 @ 925 = 255KH/s
legendary
Activity: 1239
Merit: 1020
No surrender, no retreat, no regret.
Code:
./nsgminer --neoscrypt -g 2 -I 8 -o stratum+tcp://...

I got this error :

Code:
[05:52:21] Probing for an alive pool
[05:52:23] Error -11: Building Program (clBuildProgram)
[05:52:23] input.cl:21:2: error: "Device does not support unaligned stores"
input.cl:68:1: error: OpenCL does not support the 'static' storage class specifier
input.cl:74:1: error: OpenCL does not support the 'static' storage class specifier
input.cl:495:86: error: use of undeclared identifier 'MAX_GLOBAL_THREADS'
input.cl:513:11: warning: incompatible pointer types passing '__global ulong16 *' to parameter of type '__global uint16 *'
input.cl:469:39: note: passing argument to parameter 'V' here

[05:52:23] Failed to init GPU thread 0, disabling device 0
[05:52:23] Restarting the GPU from the menu will not fix this.
[05:52:23] Try to restart the miner.

Is there any way to fix this? Thanks for any suggestions!

That's interesting. I haven't tried it myself with an open source Radeon driver even though I support open source development in many ways.

First of all, -g 2 -I 8 is no good. Start with -g 1 -I 10. As far as I can tell, it keeps the desktop interactive enough for most office tasks, watching online videos, etc. without significant discomfort.

Second, it would have been kind of you to tell us that you are attempting to use Wolf0's old kernel rather than mine bundled with NSGminer.

"Device does not support unaligned stores" refers to the cl_khr_byte_addressable_store extension disabled or missing. It's required for all released NeoScrypt kernels, though will be unnecessary for my upcoming v7 kernel. Try to enable this extension:

Code:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
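
Whether the pragma can help depends on the driver actually advertising the extension. A minimal host-side sketch in C for checking that: it scans a space-separated extension list, which is the form of the string that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS. The helper name has_extension is made up here and is not part of NSGminer or OpenCL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `name` appears as a whole token in the
 * space-separated extension list `ext_list`. */
bool has_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == ext_list) || (p[-1] == ' ');
        bool ends = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends)
            return true;
        p += len;  /* skip partial match, keep scanning */
    }
    return false;
}
```

If has_extension(list, "cl_khr_byte_addressable_store") comes back false for the device, enabling the pragma in the kernel source is unlikely to change anything.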

Next, "error: OpenCL does not support the 'static' storage class specifier". It refers to the following code:

Code:
/* Initialisation vector */
static const __constant uint8 blake2s_IV4[1] = {
    (uint8)(0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
            0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
};

static const __constant uchar blake2s_sigma[10][16] = {
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 } ,
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 } ,
    { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 } ,
    {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 } ,
    {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 } ,
    {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 } ,
    { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 } ,
    { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 } ,
    {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 } ,
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0 } ,
};

Well, this is a bug of your compiler actually. According to Khronos, Storage-class Qualifiers, static is allowed for global variables and constants. However you may remove it safely from the source code as const __constant is good enough to describe this data. The LLVM based AMD compiler doesn't care.
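
To be clear about what "remove it" means: only the static keyword goes, the tables themselves stay. A minimal sketch with a made-up two-row table (demo_sigma is hypothetical, not the real blake2s_sigma). The #ifndef shim is an assumption added so the fragment also builds with an ordinary C compiler; an OpenCL C compiler defines __OPENCL_VERSION__ and skips it.

```c
/* Host-only shim: lets a plain C compiler accept the OpenCL
 * address space qualifier and the uchar type. */
#ifndef __OPENCL_VERSION__
#define __constant
typedef unsigned char uchar;
#endif

/* Before: static const __constant uchar demo_sigma[2][4] = { ... };
 * After:  the same declaration with only `static` dropped. */
const __constant uchar demo_sigma[2][4] = {
    { 0, 1, 2, 3 },
    { 3, 2, 1, 0 },
};
```

The same one-word change applies to both declarations quoted above.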

"error: use of undeclared identifier 'MAX_GLOBAL_THREADS'" -- neither NSGminer nor my kernel uses it.

The last warning may be disregarded.
member
Activity: 81
Merit: 1002
It was only the wind.
I haven't seen any real improvement on vector BLAKE2s, though I have added it as a feature.

Yeah, that makes sense.  Lots of ways to represent success.

Could you switch to vector code for Salsa and ChaCha to see if it makes a positive difference on GCN with the 15.x drivers?

Code:
#elif (__Tahiti__) || (__Pitcairn__) || (__Capeverde__) || \
(__Oland__) || (__Hainan__) || \
(__Hawaii__) || (__Bonaire__) || \
(__Kalindi__) || (__Mullins__) || (__Spectre__) || (__Spooky__) || \
(__Tonga__) || (__Iceland__)
#define SALSA_SCALAR 0
#define CHACHA_SCALAR 0
#define BLAKE2S_SCALAR 1
#define FASTKDF_SCALAR 0

FASTKDF_COMPACT 1 also seems to improve performance a little on GCN. Maybe SALSA_UNROLL_LEVEL and CHACHA_UNROLL_LEVEL are better if set to 3 like previously instead of 4.


I did; this is what my code is using now (my own vector implementations) - very little difference - really within the margin of error. My deciding factor in using it was the fact that my code for it is far cleaner and nicely readable. Let me play with FastKDF some more, though. I'm working my way out from the "hottest" (most executed/runtime) parts of the code out to the coldest.
member
Activity: 181
Merit: 11
Hi gents,

I tried mining with the CPU version (of the NeoScrypt algo), and it works like a charm. I'd also like to mine on a GPU, but with the OSS driver (radeon.ko), not the closed one (Catalyst). Of course I understand that the proprietary driver is (and probably always will be) more efficient for this purpose, but a/ I'd like to see progress in the OSS drivers, and b/ I don't want to (and cannot) build closed drivers into a server, etc. In general, would it be possible? My system is Fedora 23 with the latest packages (Mesa 11.1.0-2, kernel 3.4.3-300, etc.).

I grabbed the latest NeoScrypt GPU Miner from the git repo
Code:
git clone https://github.com/ghostlander/nsgminer
and everything went fine.
Code:
./autogen.sh
also ran without trouble:

Code:
------------------------------------------------------------------------
nsgminer 0.9.1
------------------------------------------------------------------------


Configuration Options Summary:

  curses TUI...........: FOUND: ncursesw5

  NeoScrypt............: Enabled
  Scrypt...............: Enabled

  OpenCL...............: Enabled
    ADL monitoring.....: Enabled

  BitForce FPGAs.......: Disabled
  Icarus FPGAs.........: Disabled
  ModMiner FPGAs.......: Disabled
  X6500 FPGAs..........: Disabled
  ZTEX FPGAs...........: Disabled
  libudev detection....: yes

...and the same for
Code:
make
. But when I run nsgminer :

Code:
./nsgminer --neoscrypt -g 2 -I 8 -o stratum+tcp://...

I got this error :

Code:
[05:52:21] Probing for an alive pool
[05:52:23] Error -11: Building Program (clBuildProgram)
[05:52:23] input.cl:21:2: error: "Device does not support unaligned stores"
input.cl:68:1: error: OpenCL does not support the 'static' storage class specifier
input.cl:74:1: error: OpenCL does not support the 'static' storage class specifier
input.cl:495:86: error: use of undeclared identifier 'MAX_GLOBAL_THREADS'
input.cl:513:11: warning: incompatible pointer types passing '__global ulong16 *' to parameter of type '__global uint16 *'
input.cl:469:39: note: passing argument to parameter 'V' here

[05:52:23] Failed to init GPU thread 0, disabling device 0
[05:52:23] Restarting the GPU from the menu will not fix this.
[05:52:23] Try to restart the miner.

Is there any way to fix this? Thanks for any suggestions!