Looks very good. Have you tweaked the kernel settings or left the defaults there?
I actually rewrote most of it:
- ChaCha and Salsa are now vectorized on GCN; the unroll level is still three for both (see the sketch after this list)
- Blake2s is computed in parallel, too
- Your bytewise copies were left alone for now - the bytewise XORs are now done with uints
- Removed your little AND operation on bufptr
- Replaced your if/else structure for creating the output with a single loop doing a bytewise XOR (yes, it works in 100% of cases)
- Created a BlkMix() function for cleanliness
- Split the work over several kernels
- Added ScratchpadLoad/ScratchpadStore/ScratchpadMix functions for cleanliness and a better striped access pattern in memory (see the striped-store sketch after this list)
- Parallelized the SMix() calls
- Abused the TMTO (time-memory trade-off) vulnerability and made the trade-off factor configurable in the miner (see the sketch after this list)
- Shrunk code size by a lot
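
To give an idea of what "vectorized" means here, this is a rough sketch of a column-parallel ChaCha double-round done with uint4 lanes - it is not the actual kernel code, the state layout and names are just an illustration:

    // Sketch only: the four columns of the ChaCha state live in the lanes of
    // four uint4 rows, so each quarter-round runs on all columns at once.
    #define ROTL32x4(x, n) rotate((x), (uint4)(n))

    void chacha_doubleround_sketch(uint4 *a, uint4 *b, uint4 *c, uint4 *d)
    {
        /* Column round - every lane is one column. */
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 16u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 12u);
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 8u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 7u);
        /* Diagonal round - swizzle the rows so the diagonals line up in the
         * lanes, run the same quarter-round, then swizzle back. */
        *b = (*b).yzwx; *c = (*c).zwxy; *d = (*d).wxyz;
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 16u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 12u);
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 8u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 7u);
        *b = (*b).wxyz; *c = (*c).zwxy; *d = (*d).yzwx;
    }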
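The striped access pattern, again only as a hedged sketch - the indexing and names are illustrative, not the actual ScratchpadStore:

    // Block-major, then element, then work-item: neighbouring work-items in a
    // wavefront hit neighbouring uint4s, so the global accesses coalesce.
    void scratchpad_store_striped_sketch(__global uint4 *V, const uint4 X[16],
                                         const uint blk, const uint gid,
                                         const uint threads)
    {
        for (uint i = 0; i < 16u; i++)
            V[(blk * 16u + i) * threads + gid] = X[i];
    }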
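And the TMTO idea in very rough form - block counts, names, and the BlkMix signature below are assumptions, the real SMix is structured quite differently:

    void BlkMix(uint4 X[16]);   // assumed prototype for the block-mix step

    // lookup_gap is the configurable trade-off factor: 1 = full scratchpad,
    // 2 = half the memory, 4 = a quarter, at the cost of recomputing blocks.
    void smix_tmto_sketch(uint4 X[16], __global uint4 *V, const uint lookup_gap)
    {
        uint4 T[16];
        /* Fill phase: only every lookup_gap-th block is actually stored. */
        for (uint i = 0; i < 128u; i++) {
            if ((i % lookup_gap) == 0)
                for (uint b = 0; b < 16u; b++) V[(i / lookup_gap) * 16u + b] = X[b];
            BlkMix(X);
        }
        /* Mix phase: load the nearest stored block, recompute the skipped steps. */
        for (uint i = 0; i < 128u; i++) {
            const uint j = X[15].w & 127u;     // data-dependent scratchpad index
            for (uint b = 0; b < 16u; b++) T[b] = V[(j / lookup_gap) * 16u + b];
            for (uint k = 0; k < (j % lookup_gap); k++) BlkMix(T);
            for (uint b = 0; b < 16u; b++) X[b] ^= T[b];
            BlkMix(X);
        }
    }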
EDIT:
Forgot some things:
- SMix() itself has been redesigned from the ground up - it has to fit in the AMD GCN code cache or it shits all over the hashrate (see side note 0)
- Added a macro that moves the Salsa permutation needed for parallel computation outside of the actual Salsa implementation. It's disabled because, for some reason, it sucks ass.
- Worksizes and kernel local size dimensions had to be changed - the miner host code was fixed to accommodate the new kernel
- Host code now allocates twice the size it used to per work item IF you don't use the TMTO - when that option is used, it adjusts the buffer accordingly (halving it, cutting it to one fourth, and so on; rough sketch below)
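
Very roughly, the host-side sizing now looks like this - a hedged sketch only, the 32 KiB-per-stream figure and the names are my illustration, not the miner's actual code:

    #include <CL/cl.h>

    /* Two parallel SMix streams -> twice the old per-item size, then divided
     * by the TMTO factor (1 = full size, 2 = half, 4 = a quarter, and so on). */
    cl_mem alloc_scratchpad_sketch(cl_context ctx, size_t global_threads,
                                   unsigned lookup_gap)
    {
        size_t per_item = (size_t)(2u * 32768u) / lookup_gap;
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                              per_item * global_threads, NULL, &err);
    }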
Next up:
- Salsa, ChaCha, Blake2s, SMix, and BlkMix are now all my own code - FastKDF I have heavily modified, but I think it could still use some rewrite work (see side note 1)
- Support could be added for devices that can't do unaligned stores - I'm a lot more likely to do this if the current code performs well on GCN
- Code cleanups - there are a lot of warnings, unused vars, and unused functions lying around, many of them remnants of the original code
Side note 0:
SMix is now done in two separate work-items for one NeoScrypt hash - I did this by adding another kernel dimension (rough sketch below). Code size was a bitch: if the kernel doesn't fit in the code cache, you take a substantial hit. That's not too hard to manage without TMTO support, but I had to structure the code EXTREMELY oddly to FINALLY convince the stupid AMD OpenCL compiler not to repeat the BlkMix code in the binary. It's easy to tell when it does, because code size doubles or more. As an aside, I *could* have dropped to GCN assembly here, since the hardware has true support for function calls, which the nigh-useless OpenCL compiler doesn't fucking use.
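
The work split, as a hedged sketch only - the real kernel looks nothing like this structurally, and the names, sizes, SMix signature, and the Salsa/ChaCha-per-lane assignment are my assumptions:

    void SMix(uint4 X[16], __global uint4 *V, const uint lane);   // assumed prototype

    // Launched with a 2D NDRange where dimension 1 has size 2: lane 0 runs one
    // mixing stream (say Salsa-based), lane 1 the other (ChaCha-based), both
    // for the same nonce, side by side instead of back to back.
    __kernel void neoscrypt_smix_sketch(__global uint4 *V, __global uint4 *state)
    {
        const uint gid  = get_global_id(0);   // which nonce
        const uint lane = get_global_id(1);   // which of the two SMix streams
        uint4 X[16];

        __global uint4 *my_state = state + (gid * 2u + lane) * 16u;
        __global uint4 *my_V     = V     + (gid * 2u + lane) * 16u * 128u;

        for (uint b = 0; b < 16u; b++) X[b] = my_state[b];
        SMix(X, my_V, lane);
        for (uint b = 0; b < 16u; b++) my_state[b] = X[b];
    }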
Side note 1:
FastKDF could probably benefit from a little more LDS (or a different usage of it) - not sure how yet, but it's a hunch; I'll experiment with it. The annoying shit is the scratch registers still present, which force global memory reads/writes. It could also do with dropping some registers on Tahiti/Pitcairn chips. Code size on both runs of FastKDF has room to breathe, so if I have to raise code size a bit to gain in other areas, so be it, as long as it fits in the cache. The annoying and stupid bytewise addressing done by FastKDF can be mitigated, allowing full uint-sized loads/stores (rough sketch below). I've replaced the bytewise XOR with an implementation that does this; the results are excellent, but the scratch regs remain regardless. Code size and branching can likely be cut somewhat too - I'm hoping that making all accesses uint-sized removes the scratch registers the OpenCL compiler keeps using.
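
The uint-sized access trick, sketched very roughly - little-endian assumed (which GCN is), and this is an illustration, not the actual FastKDF code:

    // Build an unaligned 32-bit word from two aligned ones instead of four bytes.
    uint load_u32_unaligned_sketch(const uint *buf, const uint byte_off)
    {
        const uint w     = byte_off >> 2;          // aligned word index
        const uint shift = (byte_off & 3u) << 3;   // 0, 8, 16 or 24 bits
        if (shift == 0) return buf[w];
        return (buf[w] >> shift) | (buf[w + 1] << (32u - shift));
    }

    // The bytewise XOR then becomes one uint per iteration instead of four
    // bytes; only the aligned-destination case is shown, the store side needs
    // the same merging trick.
    void xor_region_sketch(uint *dst, const uint *src, const uint src_byte_off,
                           const uint words)
    {
        for (uint i = 0; i < words; i++)
            dst[i] ^= load_u32_unaligned_sketch(src, src_byte_off + (i << 2));
    }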