Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 861. (Read 2347659 times)

legendary
Activity: 2940
Merit: 1091
--- ChainWorks Industries ---


I have tested some more. Release 74 uses more power and runs hotter, so the card throttles and performance is lost.
My test was conducted in a closed-case rig with an EVGA Superclocked card (2x 6-pin power).

But my dev card:
The Gigabyte 970 OC G1 comes with 1x 8-pin and 1x 6-pin; it doesn't throttle and gives better performance in release 74 than in 66.



SP_, all 5.2 maxwells do at least 1500MHz under load. In open case, limiting factor is TDP. Nothing to do with power pins.

-edit

Testing how far I can go with reference 970. You need to give this baby almost +300 to get there (1500). Around 10500 kH/s mining lyra2v2.

-edit2

Same settings with quark, hitting TDP and only hashing about 17200 kH/s. TDP mod is all you need if you want to get all out of these.

It has to do with power pins: max TDP depends on how the card is powered. I have two kinds of 970, and the max TDP is higher on the one that has one 8-pin and one 6-pin connector instead of 2x 6-pin.
Furthermore, the power you need to reach a certain frequency varies based on a lot of variables like room temperature, efficiency of cooling, fan speed, the chip itself, etc.

Correlation does not imply causation. I have some cards with fewer pins that draw more power than ones with more pins. It just happens that in this case you sometimes get cards with more pins that draw more power than cards with fewer pins. Take some low-clocked Gigabyte cards (not Windforce) and compare them to Asus Strix cards...

They can program the BIOS to draw as much power as they want, and it's up to the manufacturer to decide how many aux connectors they want on the board.

It's actually scary to think that some people are using 750tis without aux connectors and pulling all the power through the board because they think that won't have any adverse effects. I don't think I'd ever do that or buy a 750ti without an aux connector.

the farm in its ( almost ) entirety is made up of gigabyte 750ti oc lp ( low profile ) cards which have NO aux power connectors ...

BUT - they draw their power from the powered usb 3.0 risers as well as from the pcie bus ...

so even though they have no aux power connector - they run really well for a 750ti card ...

there was a time when the whole farm was based on gigabyte 280x oc cards which were ALL run off the ribbon risers - 5 x 280x oc cards per machine ... cable risers melted and burned through their power connection pins - and some destroyed cards like they were firecrackers ...

changing to the usb 3.0 powered risers fixed the majority of power issues - but the gigabyte 7970 oc / 280x oc cards ended up dying due to their fan malfunctions ... so not a power issue but a manufacturing / heat dissipation issue on that end ...

#crysx
legendary
Activity: 1764
Merit: 1024


I have tested some more. Release 74 uses more power and runs hotter, so the card throttles and performance is lost.
My test was conducted in a closed-case rig with an EVGA Superclocked card (2x 6-pin power).

But my dev card:
The Gigabyte 970 OC G1 comes with 1x 8-pin and 1x 6-pin; it doesn't throttle and gives better performance in release 74 than in 66.



SP_, all 5.2 maxwells do at least 1500MHz under load. In open case, limiting factor is TDP. Nothing to do with power pins.

-edit

Testing how far I can go with reference 970. You need to give this baby almost +300 to get there (1500). Around 10500 kH/s mining lyra2v2.

-edit2

Same settings with quark, hitting TDP and only hashing about 17200 kH/s. TDP mod is all you need if you want to get all out of these.

It has to do with power pins: max TDP depends on how the card is powered. I have two kinds of 970, and the max TDP is higher on the one that has one 8-pin and one 6-pin connector instead of 2x 6-pin.
Furthermore, the power you need to reach a certain frequency varies based on a lot of variables like room temperature, efficiency of cooling, fan speed, the chip itself, etc.

Correlation does not imply causation. I have some cards with fewer pins that draw more power than ones with more pins. It just happens that in this case you sometimes get cards with more pins that draw more power than cards with fewer pins. Take some low-clocked Gigabyte cards (not Windforce) and compare them to Asus Strix cards...

They can program the BIOS to draw as much power as they want, and it's up to the manufacturer to decide how many aux connectors they want on the board.

It's actually scary to think that some people are using 750tis without aux connectors and pulling all the power through the board because they think that won't have any adverse effects. I don't think I'd ever do that or buy a 750ti without an aux connector.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer


I have tested some more. Release 74 uses more power and runs hotter, so the card throttles and performance is lost.
My test was conducted in a closed-case rig with an EVGA Superclocked card (2x 6-pin power).

But my dev card:
The Gigabyte 970 OC G1 comes with 1x 8-pin and 1x 6-pin; it doesn't throttle and gives better performance in release 74 than in 66.



SP_, all 5.2 maxwells do at least 1500MHz under load. In open case, limiting factor is TDP. Nothing to do with power pins.

-edit

Testing how far I can go with reference 970. You need to give this baby almost +300 to get there (1500). Around 10500 kH/s mining lyra2v2.

-edit2

Same settings with quark, hitting TDP and only hashing about 17200 kH/s. TDP mod is all you need if you want to get all out of these.

It has to do with power pins: max TDP depends on how the card is powered. I have two kinds of 970, and the max TDP is higher on the one that has one 8-pin and one 6-pin connector instead of 2x 6-pin.
Furthermore, the power you need to reach a certain frequency varies based on a lot of variables like room temperature, efficiency of cooling, fan speed, the chip itself, etc.
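
For reference, the nominal limits of the power delivery path can be added up as in the sketch below. The 75 W (slot), 75 W (6-pin) and 150 W (8-pin) figures are the nominal PCI Express values; the function and variable names are made up for illustration. These sums are only ceilings for the connectors. The power target a given card actually runs at is programmed in its BIOS, so this does not settle the argument either way.

```python
# Nominal PCI Express power limits in watts (spec values, not measurements).
SLOT_W = 75
AUX_W = {"6pin": 75, "8pin": 150}


def board_power_budget(aux_connectors, slot=True):
    """Sum the nominal limits of the slot and the auxiliary connectors.

    aux_connectors: list of "6pin" / "8pin" strings (hypothetical labels).
    """
    total = SLOT_W if slot else 0
    for connector in aux_connectors:
        total += AUX_W[connector]
    return total


if __name__ == "__main__":
    print("2x 6-pin      :", board_power_budget(["6pin", "6pin"]), "W")
    print("8-pin + 6-pin :", board_power_budget(["8pin", "6pin"]), "W")
    print("no aux (750ti):", board_power_budget([]), "W")
```

A card with no aux connector has only the slot budget to work with, which is the situation the 750ti comments in this thread are about.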
legendary
Activity: 2940
Merit: 1091
--- ChainWorks Industries ---
IBM (Xilinx) and Intel (Altera) are both working with FPGA makers to produce CPUs with FPGAs built in.

Nvidia Pascal looks to be 10 times faster than Maxwell and is expected to be released in 2016

If the Pascal specs are true, it would breathe some new life into GPU mining



I already have an ARM dual-core with an FPGA on the same chip.

that can process x11? ... ...

;) ...

#crysx

The Cyclone V itself isn't big enough. I do know how to get boards that are on the cheap.

would they be difficult to code to do x11 optimized efficiently? ...

#crysx

It's not really coding, it's chip design. And it'd be VERY tedious, but doable.

tedious and doable - but worth doing? ...

#crysx

DEFINITELY.

well - that says it all doesnt it :) ...

ill pm you for any details you wish to share - and whether you are interested in maybe doing it as a project ...

you know all the details - so its just a matter of when where and how much? ...

hang on a moment ... thats a proposition for a service - but not this one ... ok ...

:P ...

#crysx

Why would he share it with small-time GPU miners? People are already doing this and making cloud mining services like Genesis Mining. They say they're using GPUs, but for their prices they definitely aren't. A two-year ROI on cryptos at extremely large investments is suicide; there are obviously alternative options.

It doesn't help to have 1.5x faster hardware when the software is 2x slower. So you will need someone to create a good compiler, and someone to mod the code.

The 980ti is around 3x faster than the 780ti mining quark.
The 980ti is around 2x faster than the 780ti mining x11.
The 980ti is around 1.5x faster than the 780ti mining lyra2v2.


Yeah, take a look at how fast code has been designed for Fury as a baseline... Although there isn't much demand for Fury based optimizations. There may be for Pascal.

a project such as this is very challenging - i must admit - but would be a satisfactory accomplishment if it were to be successful ... and can be quite 'profitable' ( in a fiat sense ) to have running for a short while before any sort of release is made ...

in any respect - i would be hard pressed to think it would be an overnight project - let alone an overnight success ... and it would have its trials in some massive ways - as i agree with you in that large farms would normally keep it to themselves ... but i have yet to ask the questions to wolf about all of this - so its all still vaporware for the time being ...

it would take a bit to get it out to the public anyway - so im guessing it would be an expensive affair to undertake ... as well as a dedicated amount of time and effort ... it would be up to those that take it on - as i am interested in such projects that would enhance mining in a massive way ... gpu is still the way to go with this though - so the focus is still gpu based ...

pascal based hardware will be interesting ...

#crysx

Been trying to find you on IRC.

apologies mate ...

when im away from the office - i have no connection with irc ... i use irc only in the office ...

skype is the next best thing ( as i have that everywhere ) and here ...

i will be back there the day after tomorrow ... needed to get myself out for a couple of days to sort a whole heap of 'personal business' ...

if you can skype - please do ...

#crysx
full member
Activity: 201
Merit: 100


I have tested some more. Release 74 uses more power and runs hotter, so the card throttles and performance is lost.
My test was conducted in a closed-case rig with an EVGA Superclocked card (2x 6-pin power).

But my dev card:
The Gigabyte 970 OC G1 comes with 1x 8-pin and 1x 6-pin; it doesn't throttle and gives better performance in release 74 than in 66.



SP_, all 5.2 maxwells do at least 1500MHz under load. In open case, limiting factor is TDP. Nothing to do with power pins.

-edit

Testing how far I can go with reference 970. You need to give this baby almost +300 to get there (1500). Around 10500 kH/s mining lyra2v2.

-edit2

Same settings with quark, hitting TDP and only hashing about 17200 kH/s. TDP mod is all you need if you want to get all out of these.

Yeah right... ::)

I have 6 maxwells 5.2 + 3x 750ti and only 1 of them is able to do a stable 1500+. All of them are over 1400, but there is no way to do a stable 1500 without a vcore raise.
Each of them was slowly overclocked in 20 MHz steps. Whenever I noticed a miner crash, I set the clock back by 20 MHz. The Gigabyte G1 970 was also benchmarked in some Windows "gaming" benchmarks.
These are the clocks I'm using 24/7 without any stability problems, at temperatures of 65-74°C:
Gigabyte G1 980: 1450 MHz
Gigabyte G1 970: 1560 MHz
3x Asus Strix DC2OC 970: 1440/1450/1480 MHz
960: 1411 MHz
3x 750ti: 1400/1420/1440 MHz

Btw, I don't care about optimizing the H/W ratio, because the electricity price is still low at my location.

With the r.74 I get more or less the same speeds as in r.66.
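
The step-and-back-off procedure described in this post can be written down as a small loop. This is only a sketch: `is_stable` is a hypothetical callback standing in for "run the miner at this clock and see whether it crashes", and the 1600 MHz ceiling is an arbitrary safety limit chosen for the example, not a value from the thread.

```python
def find_stable_clock(start_mhz, is_stable, step_mhz=20, limit_mhz=1600):
    """Raise the clock in fixed steps while the next step still passes.

    Returns the last clock that passed, which is the same as stepping up
    until the first crash and then going back by one step.
    is_stable: hypothetical callable taking a clock in MHz, returning bool.
    """
    clock = start_mhz
    while clock + step_mhz <= limit_mhz and is_stable(clock + step_mhz):
        clock += step_mhz
    return clock


if __name__ == "__main__":
    # Stand-in for a real stress test: pretend this card crashes above 1455 MHz.
    def fake_card(mhz):
        return mhz <= 1455

    print(find_stable_clock(1300, fake_card), "MHz")
```

In practice each `is_stable` call would mean hours of mining at that clock, so the step size is a trade-off between test time and how close you get to the real limit.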
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
Does anyone know the rpcport to solo-mine MONA?

The RPC port doesn't matter; it's whatever you give it, as long as you point the miner to it.
But don't waste too much time trying to solo-mine Mona: ccminer can't communicate with its daemon, apparently because it uses a different JSON-RPC protocol or something like that:

< HTTP/1.1 404 Not Found
< Connection: close
< Content-Length: 76
< Content-Type: application/json
< Server: monacoin-json-rpc/v0.10.2.2-3dc2e6a-hotfix
<
* Closing connection 0

JSON protocol response:
{
   "error": {
      "code": -32601,
      "message": "Method not found"
   },
   "result": null,
   "id": 0
}

JSON-RPC call failed: Method not found
json_rpc_call failed, retry after x seconds


Mona source: https://github.com/monacoinproject/monacoin/blob/master-0.10/src/rpcserver.cpp#L982

I always only solomine so I asked sp_, djm and tpruvot about it but it's not something they're interested in investigating.
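
Code -32601 in the reply above is the standard JSON-RPC "Method not found" error, so the daemon was reachable and answered; it just does not expose the method the miner asked for (the log does not show which one that was). A minimal sketch of telling that case apart from other failures, with a made-up helper name:

```python
import json

METHOD_NOT_FOUND = -32601  # standard JSON-RPC error code


def explain_rpc_reply(body):
    """Classify a JSON-RPC reply body (hypothetical helper, not ccminer code)."""
    reply = json.loads(body)
    error = reply.get("error")
    if error is None:
        return "ok"
    if error.get("code") == METHOD_NOT_FOUND:
        return "daemon does not implement the requested method"
    return "rpc error {}: {}".format(error.get("code"), error.get("message"))


if __name__ == "__main__":
    # The reply body quoted in the post above.
    body = """
    {
       "error": {"code": -32601, "message": "Method not found"},
       "result": null,
       "id": 0
    }
    """
    print(explain_rpc_reply(body))
```

Retrying does not help with this class of error; the miner would have to speak a method the daemon actually implements.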
sr. member
Activity: 427
Merit: 250
Does anyone know the rpcport to solo-mine MONA?
legendary
Activity: 2940
Merit: 1091
--- ChainWorks Industries ---
IBM (Xilinx) and Intel (Altera) are both working with FPGA makers to produce CPUs with FPGAs built in.

Nvidia Pascal looks to be 10 times faster than Maxwell and is expected to be released in 2016

If the Pascal specs are true, it would breathe some new life into GPU mining



I already have an ARM dual-core with an FPGA on the same chip.

that can process x11? ... ...

;) ...

#crysx

The Cyclone V itself isn't big enough. I do know how to get boards that are on the cheap.

would they be difficult to code to do x11 optimized efficiently? ...

#crysx

It's not really coding, it's chip design. And it'd be VERY tedious, but doable.

tedious and doable - but worth doing? ...

#crysx

DEFINITELY.

well - that says it all doesnt it :) ...

ill pm you for any details you wish to share - and whether you are interested in maybe doing it as a project ...

you know all the details - so its just a matter of when where and how much? ...

hang on a moment ... thats a proposition for a service - but not this one ... ok ...

:P ...

#crysx

Why would he share it with small-time GPU miners? People are already doing this and making cloud mining services like Genesis Mining. They say they're using GPUs, but for their prices they definitely aren't. A two-year ROI on cryptos at extremely large investments is suicide; there are obviously alternative options.

It doesn't help to have 1.5x faster hardware when the software is 2x slower. So you will need someone to create a good compiler, and someone to mod the code.

The 980ti is around 3x faster than the 780ti mining quark.
The 980ti is around 2x faster than the 780ti mining x11.
The 980ti is around 1.5x faster than the 780ti mining lyra2v2.


Yeah, take a look at how fast code has been designed for Fury as a baseline... Although there isn't much demand for Fury based optimizations. There may be for Pascal.

a project such as this is very challenging - i must admit - but would be a satisfactory accomplishment if it were to be successful ... and can be quite 'profitable' ( in a fiat sense ) to have running for a short while before any sort of release is made ...

in any respect - i would be hard pressed to think it would be an overnight project - let alone an overnight success ... and it would have its trials in some massive ways - as i agree with you in that large farms would normally keep it to themselves ... but i have yet to ask the questions to wolf about all of this - so its all still vaporware for the time being ...

it would take a bit to get it out to the public anyway - so im guessing it would be an expensive affair to undertake ... as well as a dedicated amount of time and effort ... it would be up to those that take it on - as i am interested in such projects that would enhance mining in a massive way ... gpu is still the way to go with this though - so the focus is still gpu based ...

pascal based hardware will be interesting ...

#crysx
legendary
Activity: 1176
Merit: 1015


I have tested some more. Release 74 uses more power and runs hotter, so the card throttles and performance is lost.
My test was conducted in a closed-case rig with an EVGA Superclocked card (2x 6-pin power).

But my dev card:
The Gigabyte 970 OC G1 comes with 1x 8-pin and 1x 6-pin; it doesn't throttle and gives better performance in release 74 than in 66.



SP_, all 5.2 maxwells do at least 1500MHz under load. In open case, limiting factor is TDP. Nothing to do with power pins.

-edit

Testing how far I can go with reference 970. You need to give this baby almost +300 to get there (1500). Around 10500 kH/s mining lyra2v2.

-edit2

Same settings with quark, hitting TDP and only hashing about 17200 kH/s. TDP mod is all you need if you want to get all out of these.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Sorry for the dumb question: is "GPU-Z - PerfCap Reason - VOp" normal? (GTX750TIOC2GD5)

You should add a little overclocking for optimal performance.
newbie
Activity: 26
Merit: 0
Sorry for the dumb question: is "GPU-Z - PerfCap Reason - VOp" normal? (GTX750TIOC2GD5)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.
I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.
Is there anyone with a 780ti card who can compile the latest version (add compute 3.5 in the project file or makefile)?
What hashrates are you getting?
EVGA 780ti SC +100 GPU OC
Quark:   11.7 MH/s
X11:       6.35 MH/s
Lyra2v2: 7.7 MH/s
Neo:       330 KH/s   (375 with r58)
I think djm34's original lyra2v2 does around 9 in lyra2v2?
must be about that. However it is (I think) related to the 64bit instruction which is a lot faster on the 780ti for some reason...

I think it's because the compute 3.5 kernel has a memshift variable that aligns all reads to a 4x32-bit boundary, since the vector instructions need the memory to be aligned.

Or it could be the level 1 cache, which works on memory banks just like shared mem. If two threads are reading from the same memory block you get a stall. Djm's compute 3.5 core uses 25% more memory in the random access matrix and is not faster on Maxwell.


legendary
Activity: 1764
Merit: 1024
IBM (Xilinx) and Intel (Altera) are both working with FPGA makers to produce CPUs with FPGAs built in.

Nvidia Pascal looks to be 10 times faster than Maxwell and is expected to be released in 2016

If the Pascal specs are true, it would breathe some new life into GPU mining



I already have an ARM dual-core with an FPGA on the same chip.

that can process x11? ... ...

;) ...

#crysx

The Cyclone V itself isn't big enough. I do know how to get boards that are on the cheap.

would they be difficult to code to do x11 optimized efficiently? ...

#crysx

It's not really coding, it's chip design. And it'd be VERY tedious, but doable.

tedious and doable - but worth doing? ...

#crysx

DEFINITELY.

well - that says it all doesnt it :) ...

ill pm you for any details you wish to share - and whether you are interested in maybe doing it as a project ...

you know all the details - so its just a matter of when where and how much? ...

hang on a moment ... thats a proposition for a service - but not this one ... ok ...

:P ...

#crysx

Why would he share it with small-time GPU miners? People are already doing this and making cloud mining services like Genesis Mining. They say they're using GPUs, but for their prices they definitely aren't. A two-year ROI on cryptos at extremely large investments is suicide; there are obviously alternative options.

It doesn't help to have 1.5x faster hardware when the software is 2x slower. So you will need someone to create a good compiler, and someone to mod the code.

The 980ti is around 3x faster than the 780ti mining quark.
The 980ti is around 2x faster than the 780ti mining x11.
The 980ti is around 1.5x faster than the 780ti mining lyra2v2.


Yeah, take a look at how fast code has been designed for Fury as a baseline... Although there isn't much demand for Fury based optimizations. There may be for Pascal.
member
Activity: 81
Merit: 10
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.
I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.
Is there anyone with a 780ti card who can compile the latest version (add compute 3.5 in the project file or makefile)?
What hashrates are you getting?
EVGA 780ti SC +100 GPU OC
Quark:   11.7 MH/s
X11:       6.35 MH/s
Lyra2v2: 7.7 MH/s
Neo:       330 KH/s   (375 with r58)

I think djm34's original lyra2v2 does around 9 in lyra2v2?

must be about that. However it is (I think) related to the 64bit instruction which is a lot faster on the 780ti for some reason...
sr. member
Activity: 438
Merit: 250
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.
I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.
Is there anyone with a 780ti card who can compile the latest version (add compute 3.5 in the project file or makefile)?
What hashrates are you getting?
EVGA 780ti SC +100 GPU OC
Quark:   11.7 MH/s
X11:       6.35 MH/s
Lyra2v2: 7.7 MH/s
Neo:       330 KH/s   (375 with r58)

I think djm34's original lyra2v2 does around 9 in lyra2v2?

Gigabyte 980ti OC g1 +100mhz

Quark:   28.5 MH/s
X11:     13.5 MH/s
Lyra2v2: 18 MH/s
Neo:     650 KH/s

Must be LOP3.LUT, Nvidia's answer (and more) to AMD's native bitselect. A while ago I tried using inline PTX lop3.lut on some of your algos, only to find out that the PTX -> SASS compiler already took care of that :-\
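
For anyone wondering what the lop3.lut immediate looks like: the 8-bit table for any three-input bitwise function can be derived by applying that function to the constants 0xF0, 0xCC and 0xAA, as described in the PTX documentation. The sketch below is a software model in Python, not GPU code; `select_bits` and `xor3` are illustrative names, and the argument order of `select_bits` is my own choice rather than AMD's or OpenCL's.

```python
TA, TB, TC = 0xF0, 0xCC, 0xAA  # truth-table constants for inputs a, b, c


def lop3_lut(func):
    """Derive the 8-bit lookup table for a three-input bitwise function."""
    return func(TA, TB, TC) & 0xFF


def lop3(a, b, c, lut, width=32):
    """Software model of lop3: look up each output bit in the table."""
    out = 0
    for i in range(width):
        index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((lut >> index) & 1) << i
    return out


def select_bits(mask, x, y):
    """Take bits from x where mask is 1, from y where mask is 0."""
    return (mask & x) | (~mask & y)


def xor3(a, b, c):
    return a ^ b ^ c


if __name__ == "__main__":
    print(hex(lop3_lut(select_bits)))  # table for the bit select
    print(hex(lop3_lut(xor3)))         # table for a three-way xor
    a, b, c = 0x0F0F1234, 0xDEADBEEF, 0x12345678
    assert lop3(a, b, c, lop3_lut(select_bits)) == select_bits(a, b, c)
```

So a bit select or a three-way xor, which would otherwise take two or three logic instructions, fits into a single one, which fits with the compiler already emitting it on its own.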
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.
I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.
Is there anyone with a 780ti card who can compile the latest version (add compute 3.5 in the project file or makefile)?
What hashrates are you getting?
EVGA 780ti SC +100 GPU OC
Quark:   11.7 MH/s
X11:       6.35 MH/s
Lyra2v2: 7.7 MH/s
Neo:       330 KH/s   (375 with r58)

I think djm34's original lyra2v2 does around 9 in lyra2v2?

Gigabyte 980ti OC g1 +100mhz (release 74)

Quark:   28.5 MH/s
X11:     13.5 MH/s
Lyra2v2: 18 MH/s
Neo:     650 KH/s
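
Putting the two sets of numbers side by side gives the 980ti / 780ti ratio per algo. A quick sketch using only the hashrates posted here (both cards at +100 MHz, Neo converted to MH/s); the dictionary and function names are made up, and the builds and clocks of the two cards are not otherwise matched, so treat the result as rough.

```python
# Hashrates in MH/s as reported in this thread.
GTX_780TI = {"quark": 11.7, "x11": 6.35, "lyra2v2": 7.7, "neo": 0.330}
GTX_980TI = {"quark": 28.5, "x11": 13.5, "lyra2v2": 18.0, "neo": 0.650}


def speedups(new, old):
    """Return new/old hashrate ratio for every algo present in both dicts."""
    return {algo: new[algo] / old[algo] for algo in old if algo in new}


if __name__ == "__main__":
    for algo, ratio in speedups(GTX_980TI, GTX_780TI).items():
        print("{:8s} {:.2f}x".format(algo, ratio))
```

With these figures every algo lands between roughly 2x and 2.4x, which is a narrower spread than the 3x / 2x / 1.5x estimates quoted earlier in the thread.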

legendary
Activity: 1470
Merit: 1114
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.

I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.


Is there anyone with a 780ti card who can compile the latest version (add compute 3.5 in the project file or makefile)?

What hashrates are you getting?

EVGA 780ti SC +100 GPU OC

Quark:   11.7 MH/s
X11:       6.35 MH/s
Lyra2v2: 7.7 MH/s
Neo:       330 KH/s   (375 with r58)


sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.
I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.
I did most of the work on ethminer on a 780. It's mostly the same as Maxwell, but on SASS level (or PTX if you use Cuda 7.5) the biggest difference is in the absence of LOP3.LUT. Another big difference is that for kernels with low reg counts, you can do double the amount of blocks per SM on Maxwell (32 vs 16). And more shared mem per SM.

Would you be so kind and test the speed of the different algos in my fork?
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
1.5.66(sp-MOD) - Lyra2v2 - GTX750Ti  - 1431/1440 - 5177kh/s
1.5.69(sp-MOD) - Lyra2v2 - GTX750Ti  - 1431/1440 - 5160kh/s
1.5.73(sp-MOD) - Lyra2v2 - GTX750Ti  - 1431/1440 - 5100kh/s
1.5.74(sp-MOD) - Lyra2v2 - GTX750Ti  - 1431/1440 - 5145kh/s
1.5.66(sp-MOD) - Quark - GTX750Ti  - 1431/1440 - 7195kh/s
1.5.69(sp-MOD) - Quark - GTX750Ti  - 1431/1440 - 7251kh/s
1.5.73(sp-MOD) - Quark - GTX750Ti  - 1431/1440 - 7238kh/s
1.5.74(sp-MOD) - Quark - GTX750Ti  - 1431/1440 - 7190kh/s
Thanks for testing. Something happened between release 66 and 69.
Looks like 66 is the fastest, but its clocks are higher.

lyra2v2:
release 66: 9805
gtx 970 (core 1354Mhz, mem 1502MHz)

release 69: 9282
gtx 970 (core 1328.5Mhz, mem 1502MHz)

release 73: 9204
gtx 970 (core 1328.5Mhz, mem 1502MHz)

release 74: 9550
gtx 970 (core 1316.3Mhz, mem 1502MHz)
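
Since the core clock differs between runs, one crude way to compare the releases is hashrate per MHz of core clock. A sketch using the four gtx 970 results above; the unit of the hashrate figures is not stated (presumably kH/s), and dividing by the clock assumes the hashrate scales linearly with it, which is only an approximation.

```python
# (release, lyra2v2 hashrate, core clock in MHz) from the gtx 970 runs above.
RUNS = [
    (66, 9805, 1354.0),
    (69, 9282, 1328.5),
    (73, 9204, 1328.5),
    (74, 9550, 1316.3),
]


def per_mhz(runs):
    """Return {release: hashrate divided by core clock}."""
    return {release: rate / clock for release, rate, clock in runs}


if __name__ == "__main__":
    normalized = per_mhz(RUNS)
    for release, value in normalized.items():
        print("release {}: {:.2f} per MHz".format(release, value))
    print("best per MHz: release", max(normalized, key=normalized.get))
```

Normalized this way, release 74 comes out slightly ahead of 66, and 69 and 73 are the slow ones.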

I have tested some more. Release 74 uses more power and runs hotter, so the card throttles and performance is lost.
My test was conducted in a closed-case rig with an EVGA Superclocked card (2x 6-pin power).

But my dev card:
The Gigabyte 970 OC G1 comes with 1x 8-pin and 1x 6-pin; it doesn't throttle and gives better performance in release 74 than in 66.

sr. member
Activity: 438
Merit: 250
Nope. Compute 3.5 also has max 255 regs per thread. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities Table 13.

I can see it in the link. I don't have a compute 3.5 card. Maybe there are some possible speedups to be made on the 780ti.

I did most of the work on ethminer on a 780. It's mostly the same as Maxwell, but on SASS level (or PTX if you use Cuda 7.5) the biggest difference is in the absence of LOP3.LUT. Another big difference is that for kernels with low reg counts, you can do double the amount of blocks per SM on Maxwell (32 vs 16). And more shared mem per SM.
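
The "32 vs 16 blocks per SM" point can be illustrated with a simplified occupancy estimate. The per-SM limits below (16 or 32 resident blocks, 2048 resident threads, 64K registers) are the values listed for compute 3.5 and 5.2 in the CUDA programming guide's compute capability table. The function is a rough sketch that ignores register allocation granularity and shared memory, so real numbers can come out lower.

```python
# Per-SM limits by compute capability (from the CUDA programming guide).
LIMITS = {
    "3.5": {"max_blocks": 16, "max_threads": 2048, "registers": 65536},
    "5.2": {"max_blocks": 32, "max_threads": 2048, "registers": 65536},
}


def resident_blocks(cc, threads_per_block, regs_per_thread):
    """Estimate how many blocks of a kernel can be resident on one SM."""
    lim = LIMITS[cc]
    by_threads = lim["max_threads"] // threads_per_block
    by_registers = lim["registers"] // (regs_per_thread * threads_per_block)
    return min(lim["max_blocks"], by_threads, by_registers)


if __name__ == "__main__":
    # Small blocks with a low register count: the block limit is what binds.
    for cc in ("3.5", "5.2"):
        print(cc, resident_blocks(cc, threads_per_block=64, regs_per_thread=32))
    # Register-heavy kernel: both architectures end up register-bound.
    for cc in ("3.5", "5.2"):
        print(cc, resident_blocks(cc, threads_per_block=64, regs_per_thread=128))
```

The difference only shows up for small, register-light blocks; once a kernel uses many registers per thread, the two architectures are limited the same way.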