SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070 - page 73.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: scavern on November 10, 2016, 04:33:00 PM

How close are we to Claymore speeds? I just can't get myself to move my rigs to Windows...

About 25% behind Claymore's V4 miner right now. There are at least 3 other people (including myself) besides Marc working on improving the SA kernel. A number of optimization ideas are being discussed that I expect should get performance up to 100 sol/s on an Rx 480. Longer term, I'm now convinced that 150 sol/s is possible.

krnlx

full member

Activity: 243

Merit: 105

Quote from: mrb on November 10, 2016, 03:03:29 PM

Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.

Low CPU usage on a Celeron with 6 GTX 1070 cards now

https://bitcointalksearch.org/topic/m.16818120

But it is more correct to preload the library from Python:

Code:

    @asyncio.coroutine
    def start_solvers(self, devid):
        verbose('Solver %s: launching' % devid)
        # the solver subprocess inherits LD_PRELOAD from this environment
        os.environ["LD_PRELOAD"] = "./libtime.so"
        # execute "sa-solver --mining --use "
        create = asyncio.create_subprocess_exec(
                self.solver_binary, '--mining', '--use', devid.split('.')[0],
                stdin=asyncio.subprocess.PIPE, stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.STDOUT)
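A possible variant of the snippet above, as a rough sketch: rather than changing os.environ for the whole Python process, build a copy of the environment and pass it only to the solver. solver_env is a hypothetical helper name, not part of silentarmy; env= is the standard subprocess keyword, which asyncio.create_subprocess_exec forwards to the child.

```python
import os


def solver_env(preload_lib, base_env=None):
    """Return a copy of the environment with LD_PRELOAD set.

    Hypothetical helper: the parent process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # LD_PRELOAD takes a list separated by colons or spaces;
    # keep any library that was already being preloaded.
    env["LD_PRELOAD"] = preload_lib if not existing else preload_lib + ":" + existing
    return env


# Assumed usage inside start_solvers (not tested against the real solver):
# create = asyncio.create_subprocess_exec(
#         self.solver_binary, '--mining', '--use', devid.split('.')[0],
#         env=solver_env("./libtime.so"),
#         stdin=asyncio.subprocess.PIPE, stdout=asyncio.subprocess.PIPE,
#         stderr=asyncio.subprocess.STDOUT)

fresh = solver_env("./libtime.so", {"PATH": "/bin"})
chained = solver_env("./libtime.so", {"LD_PRELOAD": "./other.so"})
```

With this approach only the solver sees the preload, so other commands launched from the same script are unaffected.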

jstefanop

legendary

Activity: 2182

Merit: 1401

Quote from: mrb on November 10, 2016, 03:51:59 PM

Quote from: jeanjean15 on November 10, 2016, 03:17:28 AM

Quote from: du44 on November 09, 2016, 11:57:22 PM

Can the new version with 45 sol/s per GTX 1070 be run on Windows (10 x64)?

+1

Is it so hard to add Windows support?

Sorry, I am still not working on Windows at the moment, only on optimizations and various improvements.

Yeah, not exactly sure why Windows users are coming on here and demanding a Windows version. Linux users got screwed over by this whole shit, and I have rigs that can't be booted under Windows and require Linux. Be happy with your Claymore 100 H/s miner.

@mrb I'd rather you get close to Claymore performance, charge a dev fee to get that done, and then worry about Windows.

scavern

full member

Activity: 120

Merit: 100

How close are we to Claymore speeds? I just can't get myself to move my rigs to Windows...

hypercrypto

newbie

Activity: 19

Merit: 0

Quote from: mrb on November 10, 2016, 03:03:29 PM

Dramatic CPU usage savings and PCIe bandwidth savings are now committed, thanks to on-device filtering of invalid solutions: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a

PCIe bandwidth usage dropped from ~100 MB/s to 500 kB/s per GPU! This should really help those with PCIe ×1 risers. MAX_SOLS is now reduced from 2000 to 10.

CPU usage should also now be close to zero. (Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.)

As always, check the changelog which I always update in real-time: https://github.com/mbevand/silentarmy/blob/master/CHANGELOG.md

I just tried this update and I can confirm that CPU usage is near zero.

I also see a 2~5% speed improvement on my 8 rigs of Rx 470. Good work.

eXtremal

sr. member

Activity: 2106

Merit: 282


Quote from: nerdralph on November 10, 2016, 03:36:06 PM

Quote from: eXtremal on November 10, 2016, 03:12:01 PM

nerdralph
Did you try to use local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?

Each compute unit has 64KB of LDS, so an Rx 470 with 32 CUs has 2MB of LDS. 1 million (2^20) 32-bit counters need 4MB. atomic_inc works only with ints, so even if the counters are packed into 8 bits each so they'll all fit in LDS, there doesn't seem to be a way in OpenCL to atomically increment them.

See pm.

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: nerdralph on November 10, 2016, 03:48:25 PM

I'm seeing a 7-9% speed improvement between 2 GPUs. One is in a 16x slot and the other on a 1x riser. For the card on the 1x riser the speed improvement is ~10%, and for the card in the 16x slot ~5%.

Nice, thanks for confirming.

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: jeanjean15 on November 10, 2016, 03:17:28 AM

Quote from: du44 on November 09, 2016, 11:57:22 PM

Can the new version with 45 sol/s per GTX 1070 be run on Windows (10 x64)?

+1

Is it so hard to add Windows support?

Sorry, I am still not working on Windows at the moment, only on optimizations and various improvements.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: mrb on November 10, 2016, 03:03:29 PM

Dramatic CPU usage savings and PCIe bandwidth savings are now committed, thanks to on-device filtering of invalid solutions: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a

PCIe bandwidth usage dropped from ~100 MB/s to 500 kB/s per GPU! This should really help those with PCIe ×1 risers. MAX_SOLS is now reduced from 2000 to 10.

CPU usage should also now be close to zero. (Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.)

As always, check the changelog which I always update in real-time: https://github.com/mbevand/silentarmy/blob/master/CHANGELOG.md

I'm seeing a 7-9% speed improvement between 2 GPUs. One is in a 16x slot and the other on a 1x riser. For the card on the 1x riser the speed improvement is ~10%, and for the card in the 16x slot ~5%.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: eXtremal on November 10, 2016, 03:12:01 PM

nerdralph
Did you try to use local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?

Each compute unit has 64KB of LDS, so an Rx 470 with 32 CUs has 2MB of LDS. 1 million (2^20) 32-bit counters need 4MB. atomic_inc works only with ints, so even if the counters are packed into 8 bits each so they'll all fit in LDS, there doesn't seem to be a way in OpenCL to atomically increment them.
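The memory arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the numbers quoted above (64KB per CU, 32 CUs, 2^20 counters); the variable names are made up for illustration. Keep in mind that the 2MB figure is an aggregate: each CU can only address its own 64KB of LDS.

```python
KB = 1024
MB = 1024 * KB

lds_per_cu = 64 * KB      # LDS per compute unit, from the post
compute_units = 32        # Rx 470, from the post
counters = 2 ** 20        # "1 million" counters

total_lds = lds_per_cu * compute_units   # aggregate LDS across all CUs
need_32bit = counters * 4                # 4 bytes per 32-bit counter
need_8bit = counters * 1                 # 1 byte per packed 8-bit counter

print("total LDS:       %d MB" % (total_lds // MB))
print("32-bit counters: %d MB" % (need_32bit // MB))
print("8-bit counters:  %d MB" % (need_8bit // MB))
```

So 32-bit counters are twice the aggregate LDS, while 8-bit counters would use half of it.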

panv

newbie

Activity: 5

Merit: 0

How do I run this Windows version on Windows?

eXtremal

sr. member

Activity: 2106

Merit: 282


nerdralph
Did you try to use local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?

mrb

legendary

Activity: 1512

Merit: 1028

Dramatic CPU usage savings and PCIe bandwidth savings are now committed, thanks to on-device filtering of invalid solutions: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a

PCIe bandwidth usage dropped from ~100 MB/s to 500 kB/s per GPU! This should really help those with PCIe ×1 risers. MAX_SOLS is now reduced from 2000 to 10.

CPU usage should also now be close to zero. (Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.)

As always, check the changelog which I always update in real-time: https://github.com/mbevand/silentarmy/blob/master/CHANGELOG.md
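As a quick sanity check on the figures in this post, the bandwidth drop and the MAX_SOLS reduction can be compared. This sketch assumes decimal units (1 MB = 10^6 bytes) and treats "~100 MB/s" as exactly 100; the variable names are illustrative only.

```python
old_bandwidth = 100e6   # ~100 MB/s per GPU, before the commit (approximate)
new_bandwidth = 500e3   # 500 kB/s per GPU, after the commit

old_max_sols = 2000
new_max_sols = 10

bandwidth_factor = old_bandwidth / new_bandwidth
max_sols_factor = old_max_sols / new_max_sols

print("bandwidth reduced by a factor of %.0f" % bandwidth_factor)
print("MAX_SOLS reduced by a factor of %.0f" % max_sols_factor)
```

Both factors come out the same, which would be consistent with the per-run transfer size scaling with MAX_SOLS. That is an inference from the numbers, not something the post states.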

laik2

sr. member

Activity: 652

Merit: 266

Quote from: Genoil on November 10, 2016, 02:35:04 PM

Quote from: adaseb on November 10, 2016, 10:04:41 AM

Quote from: Genoil on November 10, 2016, 09:41:52 AM

Quote from: enerbyte on November 10, 2016, 09:38:24 AM

Quote from: Nikolaj on November 10, 2016, 08:54:15 AM

Quote from: marvykkio on November 10, 2016, 08:52:03 AM

Is there a version for Windows? I have 2 x GTX 1070 and only Windows.

+1 Windows please

Support Windows please

https://github.com/Genoil/silentarmy/tree/windows

WHOA!

Thanks man

EDIT: I see there is no binary there yet. You got me all excited.

There is a binary solver and a python script. That's what SA is.

I think there should be a roadmap for app enhancement and better usability on Linux, Windows, and even OSX + embedded hw.
A unified C/C++ app with daemonize, syslog, and API/CLI support for command/monitor. A simplified OpenCL switch between AMD/Intel/Nvidia with hwmon support.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: eXtremal on November 10, 2016, 02:15:07 PM

Quote from: laik2 on November 10, 2016, 07:59:26 AM

Quote from: eXtremal on November 10, 2016, 07:29:04 AM

I have another idea for optimizing the equihash round kernel; results will come in the next 12-24h

Quote

You must have o/c'd. I don't believe 16.30 is 37% faster than 16.40.

Only memory, 1100/2160 and low DRAM timings preset (modded ROM).

Excellent! We believe in you!

Got only 2%, because another place needs optimizing: the function ht_store in the kernel. This one line of code:

Quote

124 cnt = atomic_inc((__global uint *)p);

Takes half of all iteration time!

Good work, but you're a little behind. Here's part of an email I sent to JW and Marc 4 days ago:
"I think the atomic_inc in ht_store is a bottleneck. As you probably already know, incrementing it non-atomically (even if it is a volatile) fails to maintain data consistency between the threads. On fglrx the atomic_inc compiles to flat_atomic_inc and s_waitcnt:
flat_atomic_inc v24, v[24:25], v26 glc // 000000000270: DD2D0000 18001A18
s_waitcnt vmcnt(0) & lgkmcnt(0) // 000000000278: BF8C0070"
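To illustrate why the non-atomic increment mentioned in the email loses data, here is a small deterministic Python simulation. It is not the kernel code: store_non_atomic and store_atomic are made-up names, and the "threads" are simply two items whose reads are forced to interleave in the worst way. It relies only on the documented behavior that OpenCL's atomic_inc returns the old value.

```python
def store_non_atomic(counter, slots, items):
    """Simulate two threads that both read the counter before either writes."""
    # Race: every thread sees the same old counter value.
    seen = [counter[0] for _ in items]
    for idx, item in zip(seen, items):
        slots[idx] = item        # later thread overwrites the earlier one
        counter[0] = idx + 1     # each thread writes back old + 1
    return counter[0]


def store_atomic(counter, slots, items):
    """Simulate atomic_inc: read-and-increment is one indivisible step."""
    for item in items:
        idx = counter[0]         # atomic_inc returns the old value...
        counter[0] += 1          # ...and increments with no gap in between
        slots[idx] = item
    return counter[0]


racy_slots = [None] * 4
racy_count = store_non_atomic([0], racy_slots, ["a", "b"])

safe_slots = [None] * 4
safe_count = store_atomic([0], safe_slots, ["a", "b"])

print(racy_count, racy_slots)
print(safe_count, safe_slots)
```

In the racy version one of the two entries is silently overwritten and the count ends at 1 instead of 2, which is the data inconsistency the email describes.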

Genoil

sr. member

Activity: 438

Merit: 250

Quote from: adaseb on November 10, 2016, 10:04:41 AM

Quote from: Genoil on November 10, 2016, 09:41:52 AM

Quote from: enerbyte on November 10, 2016, 09:38:24 AM

Quote from: Nikolaj on November 10, 2016, 08:54:15 AM

Quote from: marvykkio on November 10, 2016, 08:52:03 AM

Is there a version for Windows? I have 2 x GTX 1070 and only Windows.

+1 Windows please

Support Windows please

https://github.com/Genoil/silentarmy/tree/windows

WHOA!

Thanks man

EDIT: I see there is no binary there yet. You got me all excited.

There is a binary solver and a python script. That's what SA is.

toptek

legendary

Activity: 1274

Merit: 1000

I hope there is a Windows version. Anyway, this can and will run faster than Claymore over time. No comment on why we need Windows; we do.

eXtremal

sr. member

Activity: 2106

Merit: 282


Quote from: laik2 on November 10, 2016, 07:59:26 AM

Quote from: eXtremal on November 10, 2016, 07:29:04 AM

I have another idea for optimizing the equihash round kernel; results will come in the next 12-24h

Quote

You must have o/c'd. I don't believe 16.30 is 37% faster than 16.40.

Only memory, 1100/2160 and low DRAM timings preset (modded ROM).

Excellent! We believe in you!

Got only 2%, because another place needs optimizing: the function ht_store in the kernel. This one line of code:

Quote

124 cnt = atomic_inc((__global uint *)p);

Takes half of all iteration time!

osnwt

sr. member

Activity: 353

Merit: 251

Quote from: laik2 on November 10, 2016, 11:53:03 AM

Just booted windows disk and tested newer Claymore 4.0
GPU #0: Ellesmere, 8192 MB available, 36 compute units
GPU #1: Ellesmere, 8192 MB available, 36 compute units
ZEC - Total Speed: 135.354 H/s, Total Shares: 151, Rejected: 1, Time: 00:05
ZEC: GPU0 68.841 H/s, GPU1 66.452 H/s
Pool switches: ZEC - 0
Current ZEC pool share target: 0x0025d4c3 (diff: 1732H)
GPU1 t=49C fan=62%, GPU2 t=60C fan=65%
//
This is with intensity set to 2, and my quad-core AMD A8-7600 APU is 50% loaded. Using -i 0 I get almost the same results as with silentarmy's miner.

CZM 4.0: PowerColor 390X Devil (hybrid cooling) OC'ed: 100 Sol/s for the primary card, 93-98 for the others (and I don't know why: all risers are x1, Gen.2 PCI-e mode, CPU load with -i 0 is low, using 1-2 slowdowns with my weak G3240 CPU).

Agreed, we need a Linux miner first; Windows users are happy with CZM.

PS. Host is Windows, just forgot to update the label in my web monitor config.

Dr_Victor

sr. member

Activity: 954

Merit: 250

When can we get a Windows version?

Topic: SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070 - page 73. (Read 209337 times)