SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070 - page 22.

reb0rn21

1070, win10

instances=2 gives 99% GPU usage and ~135 sol/s.
threads=2 does not work on Windows: 0 sol/s.

antantti

4x 970 with zawawawa-r12-nv doing ~400 S/s on Win7 with some OC. Power consumption is down a bit from the sp_ version.

r6 was hashing ~340 and sp_1 ~380 with the same clocks.

As for competing against AMD in Equihash, I am afraid we haven't seen anything yet from the older high-end AMD cards.

zawawa

Miner Developer

Quote from: nerdralph on November 22, 2016, 03:18:50 PM

Quote from: laik2 on November 22, 2016, 02:41:19 PM

Quote from: zawawa on November 22, 2016, 02:39:26 PM

Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.

I wish I could do 130S/s with my AMDs.

Here's how to get a lot more than 130:
https://bitcointalksearch.org/topic/m.16957668

Marvelous! An excellent analysis! I will take up the challenge with the GCN assembly.
This is so much fun!!

laik2

Quote from: nerdralph on November 22, 2016, 03:18:50 PM

Quote from: laik2 on November 22, 2016, 02:41:19 PM

Quote from: zawawa on November 22, 2016, 02:39:26 PM

Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.

I wish I could do 130S/s with my AMDs.

Here's how to get a lot more than 130:
https://bitcointalksearch.org/topic/m.16957668

This is basically what I could understand from your writings.
wrong paste

Quote

Using LDS or L1 Cache

There are a number of considerations when deciding between LDS and L1 cache for a given algorithm.

LDS supports read/modify/write operations, as well as atomics. It is well-suited for code that requires fast read/write, read/modify/write, or scatter operations that otherwise are directed to global memory. On current AMD hardware, L1 is part of the read path; hence, it is suited to cache-read-sensitive algorithms, such as matrix multiplication or convolution.

LDS is typically larger than L1 (for example: 64 kB vs 16 kB on Southern Islands devices). If it is not possible to obtain a high L1 cache hit rate for an algorithm, the larger LDS size can help. On the AMD Radeon HD 7970 device, the theoretical LDS peak bandwidth is 3.8 TB/s, compared to L1 at 1.9 TB/s.

The native data type for L1 is a four-vector of 32-bit words. On L1, fill and read addressing are linked. It is important that L1 is initially filled from global memory with a coalesced access pattern; once filled, random accesses come at no extra processing cost.

Currently, the native format of LDS is a 32-bit word. The theoretical LDS peak bandwidth is achieved when each thread operates on a two-vector of 32-bit words (16 threads per clock operate on 32 banks). If an algorithm requires coalesced 32-bit quantities, it maps well to LDS. The use of four-vectors or larger can lead to bank conflicts, although the compiler can mitigate some of these.

From an application point of view, filling LDS from global memory, and reading from it, are independent operations that can use independent addressing. Thus, LDS can be used to explicitly convert a scattered access pattern to a coalesced pattern for read and write to global memory. Or, by taking advantage of the LDS read broadcast feature, LDS can be filled with a coalesced pattern from global memory, followed by all threads iterating through the same LDS words simultaneously.

LDS reuses the data already pulled into cache by other wavefronts. Sharing across work-groups is not possible because OpenCL does not guarantee that LDS is in a particular state at the beginning of work-group execution. L1 content, on the other hand, is independent of work-group execution, so that successive work-groups can share the content in the L1 cache of a given Vector ALU. However, it currently is not possible to explicitly control L1 sharing across work-groups.

The use of LDS is linked to GPR usage and wavefront-per-Vector ALU count. Better sharing efficiency requires a larger work-group, so that more work-items share the same LDS. Compiling kernels for larger work-groups typically results in increased register use, so that fewer wavefronts can be scheduled simultaneously per Vector ALU. This, in turn, reduces memory latency hiding. Requesting larger amounts of LDS per work-group results in fewer wavefronts per Vector ALU, with the same effect.

LDS typically involves the use of barriers, with a potential performance impact. This is true even for read-only use cases, as LDS must be explicitly filled in from global memory (after which a barrier is required before reads can commence).

nerdralph

Quote from: laik2 on November 22, 2016, 02:41:19 PM

Quote from: zawawa on November 22, 2016, 02:39:26 PM

Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.

I wish I could do 130S/s with my AMDs.

Here's how to get a lot more than 130:
https://bitcointalksearch.org/topic/m.16957668

zawawa

Miner Developer

Quote from: laik2 on November 22, 2016, 02:41:19 PM

Quote from: zawawa on November 22, 2016, 02:39:26 PM

Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.

I wish I could do 130S/s with my AMDs.

I'm pretty sure we will get there. It is just that there are no "easy" optimizations left for AMD cards because they were the first targets of this miner. The next optimization requires a massive rewrite, but it can be done, methinks.

laik2

Quote from: Amph on November 22, 2016, 02:41:43 PM

Quote from: ioglnx on November 22, 2016, 02:38:56 PM

That was weeks ago, before all these optimizations took place.
For me it's working and gives me 10 sol/s more :-D

@AMPH: Did you notice lower power consumption too?

Yeah, only 650 W or thereabouts, but without 200 sol/s per GPU it's still not competitive enough against AMD... sadly.

200 with Claymore, and the price is a much higher electricity bill... I can't do more than 110 with my RX 480s, but wattage is only 450 W (4 cards).

Amph

Quote from: ioglnx on November 22, 2016, 02:38:56 PM

That was weeks ago, before all these optimizations took place.
For me it's working and gives me 10 sol/s more :-D

@AMPH: Did you notice lower power consumption too?

Yeah, only 650 W or thereabouts, but without 200 sol/s per GPU it's still not competitive enough against AMD... sadly.

laik2

Quote from: zawawa on November 22, 2016, 02:39:26 PM

Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.

I wish I could do 130S/s with my AMDs.

zawawa

Miner Developer

Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
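
For anyone wondering what a 5 min average means in practice, here is a minimal Python sketch of a sliding-window hashrate counter. It is only an illustration, not the code from the Windows port: the class name `RollingHashrate`, its methods, and the one-report-per-second pattern are all assumptions made up for this example.

```python
from collections import deque


class RollingHashrate:
    """Average solutions per second over a sliding time window.

    Hypothetical helper for illustration; not taken from the miner.
    """

    def __init__(self, start_time, window_seconds=300.0):
        self.start = start_time
        self.window = window_seconds
        self.events = deque()  # (timestamp, solutions found) pairs, oldest first
        self.total = 0         # solutions currently inside the window

    def record(self, timestamp, solutions):
        """Register solutions found at the given time (seconds)."""
        self.events.append((timestamp, solutions))
        self.total += solutions
        self._expire(timestamp)

    def _expire(self, now):
        """Drop reports that have fallen out of the window."""
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.total -= old

    def rate(self, now):
        """Solutions per second over the window ending at `now`."""
        self._expire(now)
        # During warm-up, divide by the time actually elapsed.
        elapsed = min(self.window, now - self.start)
        return self.total / elapsed if elapsed > 0 else 0.0


def simulate(first_rate, second_rate, switch_at, end_at):
    """Feed one report per second and return the meter."""
    meter = RollingHashrate(start_time=0)
    for t in range(1, end_at + 1):
        meter.record(t, first_rate if t <= switch_at else second_rate)
    return meter
```

With a window like this, a short spike or dip moves the displayed number only slightly, which is why the figure takes a few minutes to settle after startup or after a clock change.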

ioglnx

That was weeks ago, before all these optimizations took place.
For me it's working and gives me 10 sol/s more :-D

@AMPH: Did you notice lower power consumption too?

Amph

That is a good boost over the sp one: I'm getting 120 sol/s per GPU (1070) with my -502 mem setting and zero core. In my case OCing gives a very small boost over underclocking, so it's not worth it.

laik2

Quote from: ioglnx on November 22, 2016, 02:36:24 PM

Many of us share that dream... but at least some 3Dfx tech had a revival in Maxwell GPUs, and Pascal again has some additions from 3Dfx :-D
Maybe NVIDIA started sorting through the idea boards / paper bags from the 3Dfx offices.

I think someone said that on NVIDIA one must use 1 instance; 2 won't work (or will be the same).
@zawawa - tell 'em that they can

but it's not worth it at all... they don't have the capacity of NV/AMD.

ioglnx

Many of us share that dream... but at least some 3Dfx tech had a revival in Maxwell GPUs, and Pascal again has some additions from 3Dfx :-D
Maybe NVIDIA started sorting through the idea boards / paper bags from the 3Dfx offices.

zawawa

Miner Developer

You would be surprised to know that some people asked me in the past if they could use Intel HD Graphics for GPGPU...
You are mostly right about "other vendors," though. I wish these companies were still around.

ioglnx

Lol, do you need to say "not other vendors"? :-D Just saying "NV only" is much shorter :-P
There is basically just AMD left... S3 gone, VIA gone, 3Dfx long gone, SGI gone :-D so...

For the last 5 min, poolside reports 601.30 Sol/s.

zawawa

Miner Developer

I just added Windows binaries with krnlx's optimized kernel for NVIDIA cards. Thank you, krnlx!

https://github.com/zawawawa/silentarmy/releases/tag/v5-win64standalone-r12

Please note that the new NVIDIA version is not compatible with GPUs from other vendors.

zawawa

Miner Developer

testing2 is a keeper, then. Very well.

zawawa

Miner Developer

Quote from: ioglnx on November 22, 2016, 02:05:58 PM

Look at my update :-D
The increase is there: hitting 500 Sol/s on 4 cards (2 GTX 1080 and 2 GTX 1070) @ 601-619 W. Before, with all miners, it was around 450 Sol/s @ 750 W.

It's instances=2 which is working... threads is not :-D

VERY interesting... It's the other way around with AMD's drivers for Windows.
Gotta love those crazy OpenCL implementations...
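
To put the quoted power figures in perspective, here is a small Python sketch that turns them into Sol/W. The numbers (450 Sol/s at 750 W before, 500 Sol/s at 601-619 W after) come straight from the quoted post; the function name `sols_per_watt` is just a hypothetical helper for this example, and the wattages are taken as reported, however they were measured.

```python
def sols_per_watt(sols_per_sec, watts):
    """Mining efficiency in solutions per second per watt."""
    return sols_per_sec / watts


# Figures as reported in the quoted post.
before = sols_per_watt(450, 750)       # all earlier miners
after_worst = sols_per_watt(500, 619)  # r12, upper end of the power range
after_best = sols_per_watt(500, 601)   # r12, lower end of the power range

# Relative efficiency gain over the earlier miners, in percent.
gain_worst = (after_worst / before - 1) * 100
gain_best = (after_best / before - 1) * 100

print(f"before: {before:.3f} Sol/W")
print(f"after:  {after_worst:.3f} to {after_best:.3f} Sol/W")
print(f"gain:   {gain_worst:.0f}% to {gain_best:.0f}%")
```

So the roughly 11% higher hashrate understates the change: per watt, the reported numbers work out to around 35-39% better.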

TIKCrazy

testing 2 is working with 104 sol/s on a 1060
