
Topic: SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070 - page 22. (Read 209263 times)

legendary
Activity: 1898
Merit: 1024
1070, win10

instances=2 gives 99% GPU usage and ~135 sol/s
threads=2 does not work on Windows (0 sol/s)
legendary
Activity: 1176
Merit: 1015
4x 970 with zawawawa-r12-nv doing ~400 S/s, Win7 and some OC. Power consumption is down a bit from the sp_ version.

r6 was hashing ~340 and sp_1 ~380 with the same clocks.

About competition against AMD in Equihash, I am afraid that we haven't seen anything yet from the older high-end AMD cards.
sr. member
Activity: 728
Merit: 304
Miner Developer
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.

Here's how to get a lot more than 130:
https://bitcointalksearch.org/topic/m.16957668


Marvelous! An excellent analysis! I will take up the challenge with the GCN assembly.
This is so much fun!!
sr. member
Activity: 652
Merit: 266
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.

Here's how to get a lot more than 130:
https://bitcointalksearch.org/topic/m.16957668

This is basically what I could understand from your writings.
wrong paste :)
Quote
Using LDS or L1 Cache

There are a number of considerations when deciding between LDS and L1 cache for a given algorithm.

LDS supports read/modify/write operations, as well as atomics. It is well-suited for code that requires fast read/write, read/modify/write, or scatter operations that otherwise are directed to global memory. On current AMD hardware, L1 is part of the read path; hence, it is suited to cache-read-sensitive algorithms, such as matrix multiplication or convolution.

LDS is typically larger than L1 (for example: 64 kB vs 16 kB on Southern Islands devices). If it is not possible to obtain a high L1 cache hit rate for an algorithm, the larger LDS size can help. On the AMD Radeon HD 7970 device, the theoretical LDS peak bandwidth is 3.8 TB/s, compared to L1 at 1.9 TB/s.

The native data type for L1 is a four-vector of 32-bit words. On L1, fill and read addressing are linked. It is important that L1 is initially filled from global memory with a coalesced access pattern; once filled, random accesses come at no extra processing cost.

Currently, the native format of LDS is a 32-bit word. The theoretical LDS peak bandwidth is achieved when each thread operates on a two-vector of 32-bit words (16 threads per clock operate on 32 banks). If an algorithm requires coalesced 32-bit quantities, it maps well to LDS. The use of four-vectors or larger can lead to bank conflicts, although the compiler can mitigate some of these.

From an application point of view, filling LDS from global memory, and reading from it, are independent operations that can use independent addressing. Thus, LDS can be used to explicitly convert a scattered access pattern to a coalesced pattern for read and write to global memory. Or, by taking advantage of the LDS read broadcast feature, LDS can be filled with a coalesced pattern from global memory, followed by all threads iterating through the same LDS words simultaneously.

LDS reuses the data already pulled into cache by other wavefronts. Sharing across work-groups is not possible because OpenCL does not guarantee that LDS is in a particular state at the beginning of work-group execution. L1 content, on the other hand, is independent of work-group execution, so that successive work-groups can share the content in the L1 cache of a given Vector ALU. However, it currently is not possible to explicitly control L1 sharing across work-groups.

The use of LDS is linked to GPR usage and wavefront-per-Vector ALU count. Better sharing efficiency requires a larger work-group, so that more work-items share the same LDS. Compiling kernels for larger work-groups typically results in increased register use, so that fewer wavefronts can be scheduled simultaneously per Vector ALU. This, in turn, reduces memory latency hiding. Requesting larger amounts of LDS per work-group results in fewer wavefronts per Vector ALU, with the same effect.

LDS typically involves the use of barriers, with a potential performance impact. This is true even for read-only use cases, as LDS must be explicitly filled in from global memory (after which a barrier is required before reads can commence).
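
For anyone who wants to check the 3.8 TB/s vs 1.9 TB/s figures in the quote, here is a rough back-of-the-envelope sketch. The 16 threads per clock and two 32-bit words per thread come from the quoted text. The 32 compute units, the 925 MHz clock and the 64 bytes per clock for L1 are my assumptions about the HD 7970 and are not stated anywhere in this thread, so treat the result as a sanity check only.

```python
# Sanity check of the quoted peak bandwidths (not a measurement).

def peak_bandwidth_tb_s(compute_units, bytes_per_clock_per_cu, clock_hz):
    """Theoretical peak bandwidth in TB/s, summed over all compute units."""
    return compute_units * bytes_per_clock_per_cu * clock_hz / 1e12

# From the quote: 16 threads per clock, each on a two-vector of 32-bit words.
LDS_BYTES_PER_CLOCK = 16 * 2 * 4  # 128 bytes per clock per compute unit

# Assumptions (not from the thread): HD 7970 with 32 CUs at 925 MHz,
# and an L1 path of 64 bytes per clock per compute unit.
COMPUTE_UNITS = 32
CLOCK_HZ = 925e6
L1_BYTES_PER_CLOCK = 64

lds_tb_s = peak_bandwidth_tb_s(COMPUTE_UNITS, LDS_BYTES_PER_CLOCK, CLOCK_HZ)
l1_tb_s = peak_bandwidth_tb_s(COMPUTE_UNITS, L1_BYTES_PER_CLOCK, CLOCK_HZ)
```

Under those assumptions the numbers come out at about 3.79 TB/s for LDS and 1.89 TB/s for L1, which rounds to the quoted 3.8 and 1.9.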
sr. member
Activity: 588
Merit: 251
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.

Here's how to get a lot more than 130:
https://bitcointalksearch.org/topic/m.16957668
sr. member
Activity: 728
Merit: 304
Miner Developer
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.

I'm pretty sure we will get there. It is just that there are no "easy" optimizations left for AMD cards because they were the first targets of this miner. The next optimization requires a massive rewrite, but it can be done, methinks.
sr. member
Activity: 652
Merit: 266
It's from weeks ago, before all these optimizations took place.
For me it's working and gives me 10 sol/s more :-D

@AMPH: Did you notice less power consumption too?

Yeah, only 650 W or thereabouts, but without 200 sol/s per GPU it's still not competitive enough against AMD... sadly
200 with Claymore, and the price is a much higher electricity bill... I can't do more than 110 with my RX 480s, but wattage is only 450 (4 cards)
legendary
Activity: 3206
Merit: 1069
It's from weeks ago, before all these optimizations took place.
For me it's working and gives me 10 sol/s more :-D

@AMPH: Did you notice less power consumption too?

Yeah, only 650 W or thereabouts, but without 200 sol/s per GPU it's still not competitive enough against AMD... sadly
sr. member
Activity: 652
Merit: 266
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.
sr. member
Activity: 728
Merit: 304
Miner Developer
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate.
You have to wait a little, but you get a more accurate number that way.
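
A 5 min average like the one described above can be sketched roughly as follows. This is only an illustration of a sliding-window average, not the actual code of the Windows port; the class name `RollingHashrate` and its methods are hypothetical.

```python
from collections import deque

class RollingHashrate:
    """Hypothetical sketch: solutions per second over a sliding time window."""

    def __init__(self, start_time, window_s=300.0):
        self.start_time = start_time  # when mining started, in seconds
        self.window_s = window_s      # 300 s = 5 min
        self.samples = deque()        # (timestamp, solutions found)

    def add(self, timestamp, solutions):
        """Record solutions found in the interval ending at `timestamp`."""
        self.samples.append((timestamp, solutions))

    def rate(self, now):
        """Average sol/s over the last window (or since start, if shorter)."""
        cutoff = now - self.window_s
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        elapsed = min(self.window_s, now - self.start_time)
        if elapsed <= 0:
            return 0.0
        return sum(s for _, s in self.samples) / elapsed

# Usage: 100 sol/s for the first 5 minutes, then 160 sol/s for the next 5.
meter = RollingHashrate(start_time=0.0)
for t in range(1, 601):
    meter.add(float(t), 100 if t <= 300 else 160)
```

The trade-off is the one mentioned in the post: the number reacts slowly to changes, but short bursts and dips no longer dominate it.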
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
It's from weeks ago, before all these optimizations took place.
For me it's working and gives me 10 sol/s more :-D

@AMPH: Did you notice less power consumption too?
legendary
Activity: 3206
Merit: 1069
That is a good boost over the sp one: I'm getting 120 sol/s per GPU (1070) with my -502 mem setting and zero core. In my case OCing gives only a very small boost over underclocking, so it's not worth it.
sr. member
Activity: 652
Merit: 266
Many of us wish that dream..but at least some 3Dfx tech made another revival in Maxwell GPUs and also in Pascal there are some additions again from 3Dfx :-D
Maybe nvidia started to sort the ideas board / paper bags of 3Dfx offices.
I think someone said that on NVIDIA one must use 1 instance; 2 won't work (or will be the same).
@zawawa - tell 'em that they can :) but it's not worth it at all... they don't have the capacity of NV/AMD.
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
Many of us wish that dream..but at least some 3Dfx tech made another revival in Maxwell GPUs and also in Pascal there are some additions again from 3Dfx :-D
Maybe nvidia started to sort the ideas board / paper bags of 3Dfx offices.
sr. member
Activity: 728
Merit: 304
Miner Developer
You would be surprised to know that some people asked me in the past if they could use Intel HD Graphics for GPGPU...
You are mostly right about "other vendors," though. I wish these companies were still around.
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
Lol do you need to mention not other vendors :-D just say NV only is much shorter :-P
There is basically just AMD left..S3 gone, VIA gone. 3Dfx long gone..SGi gone :-D so..

Since 5min poolside reports 601.30 Sol/s
sr. member
Activity: 728
Merit: 304
Miner Developer
I just added Windows binaries with krnlx's optimized kernel for NVIDIA cards. Thank you, krnlx!

https://github.com/zawawawa/silentarmy/releases/tag/v5-win64standalone-r12

Please note that the new NVIDIA version is not compatible with GPUs from other vendors.
sr. member
Activity: 728
Merit: 304
Miner Developer
testing2 is a keeper, then. Very well.
sr. member
Activity: 728
Merit: 304
Miner Developer
Look at my update :-D
The increase is there: hitting 500 Sol/s on 4 cards (2 GTX 1080 and 2 GTX 1070) @ 601-619 W; before, with all miners, around 450 Sol/s @ 750 W.

It's instances=2 which is working ..threads is not.. :-D

VERY interesting... It's the other way around with AMD's drivers for Windows.
Gotta love those crazy OpenCL implementations...
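
To put the power figures quoted in this post into perspective, here is a small sketch that converts them to efficiency. The inputs are the numbers from the quote (500 Sol/s at 601-619 W now, about 450 Sol/s at 750 W before); the helper name `sol_per_watt` is just for illustration.

```python
def sol_per_watt(sol_per_s, watts):
    """Efficiency as solutions per second per watt of wall power."""
    return sol_per_s / watts

# Figures from the quoted post.
before = sol_per_watt(450, 750)    # older miners
after_low = sol_per_watt(500, 619) # new kernel, upper end of the power range
after_high = sol_per_watt(500, 601) # new kernel, lower end of the power range

# Relative efficiency gain over the old setup, in percent.
gain_low_pct = (after_low / before - 1) * 100
gain_high_pct = (after_high / before - 1) * 100
```

That works out to roughly 0.81 to 0.83 Sol/s per watt against 0.60 before, so somewhere around 35% to 39% better efficiency on that rig, assuming the wattage readings are comparable.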
member
Activity: 73
Merit: 10
testing2 is working with 104 sol/s on 1060