
Topic: SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070 - page 73. (Read 209309 times)

sr. member
Activity: 588
Merit: 251
How close are we to Claymore speeds? I just can't get myself to move my rigs to Windows...

About 25% behind Claymore's V4 miner right now.  There are at least 3 other people (including myself) besides Marc working on improving the SA kernel.  There are a number of optimization ideas being discussed that I expect should get performance up to 100 sol/s on an Rx 480.  Longer term I'm now convinced that 150 sol/s is possible.


full member
Activity: 243
Merit: 105
Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.

Low CPU usage on a Celeron with 6 GTX 1070 cards now

https://bitcointalksearch.org/topic/m.16818120

But it is more correct to preload the library from Python:

Code:
    @asyncio.coroutine
    def start_solvers(self, devid):
        verbose('Solver %s: launching' % devid)
        # the solver child process inherits this environment, so it preloads libtime.so
        os.environ["LD_PRELOAD"] = "./libtime.so"
        # execute "sa-solver --mining --use "
        create = asyncio.create_subprocess_exec(
                self.solver_binary, '--mining', '--use', devid.split('.')[0],
                stdin=asyncio.subprocess.PIPE, stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.STDOUT)
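A variation on the snippet above, in case it helps: instead of changing `os.environ` for the whole Python process, the preload can be passed only to the solver child through the `env` argument of `asyncio.create_subprocess_exec`. This is just a sketch; `solver_env` and `launch_solver` are names made up for illustration and are not part of silentarmy. It uses `async def` because `@asyncio.coroutine` was removed in Python 3.11.

```python
import asyncio
import os


def solver_env(preload, base=None):
    """Return a copy of the environment with LD_PRELOAD set (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_PRELOAD")
    # Keep any library that was already being preloaded; ':' separates entries.
    env["LD_PRELOAD"] = preload if not existing else preload + ":" + existing
    return env


async def launch_solver(solver_binary, devid, preload="./libtime.so"):
    """Start the solver with the preload applied to the child process only."""
    return await asyncio.create_subprocess_exec(
        solver_binary, "--mining", "--use", devid.split(".")[0],
        stdin=asyncio.subprocess.PIPE, stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
        env=solver_env(preload))
```

The parent script's own environment stays untouched, so anything else it spawns is not affected by the preload.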
legendary
Activity: 2174
Merit: 1401
Can the new version with 45 sol/s per GTX 1070 be run on Windows (10 x64)?

+1

Is it so hard to add Windows support? :'(

Sorry, I am still not working on Windows at the moment, only on optimizations and various improvements.

Yea, not exactly sure why Windows users are coming on here and demanding a Windows version. Linux users got screwed over by this whole shit, and I have rigs that can't be booted under Windows and require Linux. Be happy with your Claymore 100 H/s miner.

@mrb I'd rather you get close to Claymore performance, charge a devfee to get that done, and then worry about Windows.
full member
Activity: 120
Merit: 100
How close are we to Claymore speeds? I just can't get myself to move my rigs to Windows...
newbie
Activity: 19
Merit: 0
Dramatic CPU usage savings and PCIe bandwidth savings are now committed, thanks to on-device filtering of invalid solutions: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a

PCIe bandwidth usage dropped from ~100 MB/s to 500 kB/s per GPU! This should really help those with PCIe ×1 risers. MAX_SOLS is now reduced from 2000 to 10 :) CPU usage should also now be close to zero. (Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.)

As always, check the changelog which I always update in real-time: https://github.com/mbevand/silentarmy/blob/master/CHANGELOG.md

I just tried this update and I can confirm that CPU usage is near zero :)

I also see a 2-5% speed improvement on my 8 RX 470 rigs. Good work.
sr. member
Activity: 2106
Merit: 282
nerdralph
Did you try using local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?

Each compute unit has 64KB of LDS, so an Rx 470 with 32 CUs has 2MB of LDS.  1 million (2^20) 32-bit counters need 4MB.  atomic_inc works only with ints, so even if the counters are packed into 8 bits each so they'll all fit in LDS, there doesn't seem to be a way in OpenCL to atomically increment them.

See pm.
mrb
legendary
Activity: 1512
Merit: 1028
I'm seeing a 7-9% speed improvement between 2 GPUs.  One is in a 16x slot and the other on a 1x riser.  For the card on the 1x riser the speed improvement is ~10%, and for the card in the 16x slot ~5%.

Nice, thanks for confirming.
mrb
legendary
Activity: 1512
Merit: 1028
Can the new version with 45 sol/s per GTX 1070 be run on Windows (10 x64)?

+1

Is it so hard to add Windows support? :'(

Sorry, I am still not working on Windows at the moment, only on optimizations and various improvements.
sr. member
Activity: 588
Merit: 251
Dramatic CPU usage savings and PCIe bandwidth savings are now committed, thanks to on-device filtering of invalid solutions: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a

PCIe bandwidth usage dropped from ~100 MB/s to 500 kB/s per GPU! This should really help those with PCIe ×1 risers. MAX_SOLS is now reduced from 2000 to 10 :) CPU usage should also now be close to zero. (Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.)

As always, check the changelog which I always update in real-time: https://github.com/mbevand/silentarmy/blob/master/CHANGELOG.md

I'm seeing a 7-9% speed improvement between 2 GPUs.  One is in a 16x slot and the other on a 1x riser.  For the card on the 1x riser the speed improvement is ~10%, and for the card in the 16x slot ~5%.
sr. member
Activity: 588
Merit: 251
nerdralph
Did you try using local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?

Each compute unit has 64KB of LDS, so an Rx 470 with 32 CUs has 2MB of LDS.  1 million (2^20) 32-bit counters need 4MB.  atomic_inc works only with ints, so even if the counters are packed into 8 bits each so they'll all fit in LDS, there doesn't seem to be a way in OpenCL to atomically increment them.
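For anyone who wants to check the sizes, here is the arithmetic from the paragraph above as a few lines of Python. The figures (64KB of LDS per CU, 32 CUs, 2^20 counters) are the ones quoted in the post; the variable names are only for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

lds_per_cu = 64 * KIB      # LDS per compute unit, as quoted above
compute_units = 32         # Rx 470
counters = 2 ** 20         # about 1 million counters

total_lds = lds_per_cu * compute_units   # LDS summed over all CUs
need_32bit = counters * 4                # one 32-bit counter each
need_8bit = counters * 1                 # counters packed into 8 bits each

print(total_lds // MIB, "MiB of LDS in total")
print(need_32bit // MIB, "MiB needed with 32-bit counters")
print(need_8bit // MIB, "MiB needed with 8-bit counters")
```

So the 32-bit counters need twice the total LDS, while 8-bit counters would fit in half of it, which is why the lack of an 8-bit atomic increment is the sticking point.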
newbie
Activity: 5
Merit: 0
How do I run this windows version on windows?
sr. member
Activity: 2106
Merit: 282
nerdralph
Did you try using local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?
mrb
legendary
Activity: 1512
Merit: 1028
Dramatic CPU usage savings and PCIe bandwidth savings are now committed, thanks to on-device filtering of invalid solutions: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a

PCIe bandwidth usage dropped from ~100 MB/s to 500 kB/s per GPU! This should really help those with PCIe ×1 risers. MAX_SOLS is now reduced from 2000 to 10 :) CPU usage should also now be close to zero. (Well except Nvidia because their OpenCL implementation implements busy waits, but I'll check in a workaround soon.)

As always, check the changelog which I always update in real-time: https://github.com/mbevand/silentarmy/blob/master/CHANGELOG.md
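A rough Python sketch of the general idea behind filtering before the transfer: drop candidates that cannot be valid, and cap how many are handed back. The only check shown is that all indices in a candidate are distinct, which Equihash requires of a solution. The function names and the reading of MAX_SOLS as a cap on returned solutions are assumptions for illustration; the actual filtering runs on the GPU in the OpenCL kernel and may check more than this.

```python
MAX_SOLS = 10  # value quoted in the post; its use as a cap here is an assumption


def has_distinct_indices(candidate):
    """A candidate that reuses an index cannot be a valid Equihash solution."""
    return len(set(candidate)) == len(candidate)


def filter_candidates(candidates, max_sols=MAX_SOLS):
    """Keep only plausible candidates, at most max_sols of them (hypothetical helper)."""
    valid = [c for c in candidates if has_distinct_indices(c)]
    return valid[:max_sols]


# Toy data with 4 indices per candidate (real Zcash solutions have 512).
candidates = [
    [3, 7, 1, 9],   # distinct indices: kept
    [3, 7, 3, 9],   # index 3 repeated: dropped
    [5, 5, 5, 5],   # dropped
    [2, 4, 6, 8],   # kept
]
kept = filter_candidates(candidates)
print(len(kept), "of", len(candidates), "candidates would be transferred")
```

Only the survivors need to travel over PCIe and be verified on the CPU, which is where the bandwidth and CPU savings come from.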
sr. member
Activity: 652
Merit: 266

WHOA!

Thanks man

EDIT: I see there is no binary there yet. You got me all excited.

There is a binary solver and a python script. That's what SA is.

I think there should be a roadmap for app enhancement and better usability on Linux, Windows, and even OSX and embedded hardware: a unified C/C++ app with daemonization, syslog, and API/CLI support for commands and monitoring, plus a simplified OpenCL switch between AMD/Intel/Nvidia with hwmon support.
sr. member
Activity: 588
Merit: 251
I have another idea for optimizing the equihash round kernel; results will come in the next 12-24h.

Quote
You must have o/c'd. I don't believe 16.30 is 37% faster than 16.40.
Only memory, 1100/2160 and low DRAM timings preset (modded ROM).
Excellent! We believe in you! :)
Got only 2%, because another place needs optimizing: the ht_store function in the kernel. This one line of code:
Quote
cnt = atomic_inc((__global uint *)p);
takes half of the total iteration time! :)

Good work, but you're a little behind.  Here's part of an email I sent to JW and Marc 4 days ago:
"I think the atomic_inc in ht_store is a bottleneck.  As you probably already know, incrementing it non-atomically (even if it is a volatile) fails to maintain data consistency between the threads.  On fglrx the atomic_inc compiles to flat_atomic_inc and s_waitcnt:
  flat_atomic_inc  v24, v[24:25], v26 glc               // 000000000270: DD2D0000 18001A18
  s_waitcnt     vmcnt(0) & lgkmcnt(0)                   // 000000000278: BF8C0070"
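To illustrate why the increment has to be atomic, here is a small deterministic Python simulation (not OpenCL, and not code from the SA kernel). It assumes, as a guess, that the value returned by atomic_inc is used as the slot index within a hash table row; atomic_inc itself does return the old value per the OpenCL spec. The function names are made up for the example.

```python
def store_atomic(n_writers):
    """Each writer does an indivisible read-and-increment, like atomic_inc."""
    cnt = 0
    slots = []
    for _ in range(n_writers):
        old = cnt          # read ...
        cnt = old + 1      # ... and write back with nothing in between
        slots.append(old)  # the old value is this writer's slot
    return cnt, slots


def store_racy(n_writers):
    """Worst-case interleaving: every writer reads before any writes back."""
    cnt = 0
    reads = [cnt for _ in range(n_writers)]  # all writers see the same value
    slots = []
    for old in reads:
        cnt = old + 1      # each writer overwrites the others' update
        slots.append(old)
    return cnt, slots


print(store_atomic(4))  # every writer gets its own slot
print(store_racy(4))    # all writers land on the same slot, count is wrong
```

In the racy case the writers overwrite each other's entry and the counter ends up too low, which is the data consistency failure described in the email.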
sr. member
Activity: 438
Merit: 250

WHOA!

Thanks man

EDIT: I see there is no binary there yet. You got me all excited.

There is a binary solver and a python script. That's what SA is.
legendary
Activity: 1274
Merit: 1000
I hope there will be a Windows version anyway; this can and will run faster than Claymore over time. No comment on why we need Windows, we just do.
sr. member
Activity: 2106
Merit: 282
I have another idea for optimizing the equihash round kernel; results will come in the next 12-24h.

Quote
You must have o/c'd. I don't believe 16.30 is 37% faster than 16.40.
Only memory, 1100/2160 and low DRAM timings preset (modded ROM).
Excellent! We believe in you! :)
Got only 2%, because another place needs optimizing: the ht_store function in the kernel. This one line of code:
Quote
cnt = atomic_inc((__global uint *)p);
takes half of the total iteration time! :)
sr. member
Activity: 353
Merit: 251
Just booted windows disk and tested newer Claymore 4.0
GPU #0: Ellesmere, 8192 MB available, 36 compute units
GPU #1: Ellesmere, 8192 MB available, 36 compute units
ZEC - Total Speed: 135.354 H/s, Total Shares: 151, Rejected: 1, Time: 00:05
ZEC: GPU0 68.841 H/s, GPU1 66.452 H/s
Pool switches: ZEC - 0
Current ZEC pool share target: 0x0025d4c3 (diff: 1732H)
GPU1 t=49C fan=62%, GPU2 t=60C fan=65%
//
This is with intensity set to 2, and my quad-core AMD A8-7600 APU is 50% loaded. Using -i 0 I get almost the same results as with the silentarmy miner.

CZM 4.0: PowerColor 390X Devil (hybrid cooling) OC'ed: 100 Sol/s for the primary card, 93-98 for the others (and I don't know why: all risers are x1 in Gen.2 PCI-e mode, and CPU load with -i 0 is low; using 1-2 causes slowdowns with my weak G3240 CPU).



Agreed we need a Linux miner first, Windows users are happy with CZM.

PS. Host is Windows, just forgot to update the label in my web monitor config.
sr. member
Activity: 954
Merit: 250
When can we get a windows version?