Topic: artforz and coblee gpu mining litecoin since the start? - page 17. (Read 32574 times)

sr. member
Activity: 406
Merit: 257
Well, from talking with Ahimoth on btc-e, sounds like mtrlt managed to find a workaround for the issues I was having with scrypt miner kernels on GPU.
Short version... any of my kernels that got speeds like that also got 100% invalids. Yet the same kernel worked fine on CPU or if I dropped global and/or local worksizes to silly small levels... which of course made it dog slow again...
And of course RS turning the whole thing into "OMG CONSPIRACY!!1one"... duh, it's RS.
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
What's funny to me about all this is that gpu mining the coin early before anyone else could is exactly what coinhunter did with SC Cheesy.   Such a douche.  You should be banned from this forum.

 Huh
Well.. yeah, SC started out as a GPU chain.

Nope.  It started as CPU only, then a few weeks after launch they miraculously "turned on" GPU mining.

https://bitcointalksearch.org/topic/solidcoin-v20-features-new-hashing-algorithm-faster-on-cpus-44423

then

https://bitcointalksearch.org/topic/solidcoin-2-gpu-mining-and-the-future-49219
SC2.0 did, SC was GPU Smiley
legendary
Activity: 924
Merit: 1004
Firstbits: 1pirata
In Litecoins, hashes mine themselves, and distribute to worthy miners automatically with gift of free electricity and unicorn-shatted skittles delivered by fairy Godmothers.

In ShortBusCoins, coins themselves are shit, as is Glorious Leader Dear DoucheBag/RS/CH King of All Delusion and FUD, Hysterical Chicken Little Bitch Princess of the Realm.

Who cares? RS/CH/Douchebag coming up with the scandal-of-the-moment announcement is a lot like a fish pissing in the ocean, it doesn't amount to much, and nobody really notices.

epic, like always  Cheesy
sr. member
Activity: 574
Merit: 250
In Litecoins, hashes mine themselves, and distribute to worthy miners automatically with gift of free electricity and unicorn-shatted skittles delivered by fairy Godmothers.

In ShortBusCoins, coins themselves are shit, as is Glorious Leader Dear DoucheBag/RS/CH King of All Delusion and FUD, Hysterical Chicken Little Bitch Princess of the Realm.

Who cares? RS/CH/Douchebag coming up with the scandal-of-the-moment announcement is a lot like a fish pissing in the ocean, it doesn't amount to much, and nobody really notices.
sr. member
Activity: 350
Merit: 250
What's funny to me about all this is that gpu mining the coin early before anyone else could is exactly what coinhunter did with SC Cheesy.   Such a douche.  You should be banned from this forum.

 Huh
Well.. yeah, SC started out as a GPU chain.

Nope.  It started as CPU only, then a few weeks after launch they miraculously "turned on" GPU mining.

https://bitcointalksearch.org/topic/solidcoin-v20-features-new-hashing-algorithm-faster-on-cpus-44423

then

https://bitcointalksearch.org/topic/solidcoin-2-gpu-mining-and-the-future-49219
sr. member
Activity: 392
Merit: 250
So I guess CH is really telling us it's time to quickly buy some litecoins, because difficulty is about to explode Smiley.

So fast GPU mining of LTC would drive the price of Litecoins up?


sr. member
Activity: 266
Merit: 250
The king and the pawn go in the same box @ endgame
Oh and i have made some dozens of full custom chips at 28nm specifically tailored to mine bitcoin and litecoins. 1 terahash per chip. True story.

Ships in 4-6 weeks?

Laughed so hard!
hero member
Activity: 497
Merit: 500

It currently gets ~250KH/s on a 6990 .


The LTC mining efficiency of CPUs and GPUs, in W per kH/s, seems to be similar:
core i3 / 3.1 GHz / 3 threads / ~15 W for mining alone / 12.5 kH/s => 1.2 W per kH/s
core i7 / 3.6 GHz / 7 threads / ~46 W for mining alone / 32 kH/s => 1.44 W per kH/s
6990 / ~350 W (I don't know the exact wattage) / 250 kH/s => 1.4 W per kH/s

But with further improvements the 6990 and the other mining cards will reach an even better W per kH/s figure.

So the end of LTC CPU mining is not too far away, because
one 6990 does the job of ~8 Core i7s or ~20 Core i3s.

A unicorn can run 500 miles on one gallon (of beer).  A gallon of beer is more expensive than a gallon of gas but likely we are still seeing the end of hybrid cars because one unicorn is equal to almost 10x Toyota Prius.  Plus with improvement in unicorn-beer technology you are getting even better m/goa rate (that's miles to gallons of alcohol).


OMG I laughed for about 4 mins from that one. Priceless!
hero member
Activity: 518
Merit: 500
So I guess CH is really telling us it's time to quickly buy some litecoins, because difficulty is about to explode Smiley.
hero member
Activity: 896
Merit: 1000
Seal Cub Clubbing Club
What's funny to me about all this is that gpu mining the coin early before anyone else could is exactly what coinhunter did with SC Cheesy.   Such a douche.  You should be banned from this forum.

 Huh
Well.. yeah, SC started out as a GPU chain.
sr. member
Activity: 350
Merit: 250
What's funny to me about all this is that gpu mining the coin early before anyone else could is exactly what coinhunter did with SC Cheesy.   Such a douche.  You should be banned from this forum.
newbie
Activity: 39
Merit: 0
Bandwidth has nothing to do w/ scrypt.  LATENCY does.  Which is why the amount of L1 cache is so important.
L1 cache is just less important than you think Smiley For example, my scrypt miner optimizations for Cell do not use 256KB of fast local memory at all. It is insufficient for 4x unrolling which is needed in order to eliminate pipeline stalls and at least half of the performance would be lost. But scrypt is not memory heavy enough, so I can easily get away working with the main memory and still have a lot of memory bandwidth headroom. LATENCY is not important in my case, because memory accesses are pipelined, get executed asynchronously and do not block execution. But you can check scrypt_spu_core8 function in the code yourself.

If GPUs have excess computational resources, then even spending a lot of time waiting for memory (80% or so per execution core) is likely not a problem, as long as all the cores are competing for the precious memory bandwidth and fully saturating it. I did not think about GPU mining earlier simply because I did not have any experience with GPU programming and honestly did not expect GPUs to have that much memory bandwidth (more than a 10x advantage over Cell).
legendary
Activity: 1484
Merit: 1005
I am no GPGPU expert, but I think ArtForz made some very good points in the following thread:
https://bitcointalksearch.org/topic/tenebrix-scaling-questions-45849
CoinHunter could make his claims convincing by simply explaining how to address the GPU limitations outlined by ArtForz.
At least ArtForz was mistaken about Cell earlier Smiley

Just let's do some simple math. Playstation3 has 6 SPE cores, each clocked at 3.2GHz and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance critical part of salsa20/8 which are executed in the even pipe (shuffles and the other instructions are executed in the odd pipe). Also calculating one hash needs 256KB of memory bandwidth (128KB is written sequentially, 128KB is read in scattered 128-byte chunks). So taking into account that SPE core can execute one instruction from the even pipe each cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB / 256KB ~= 95.4 khash/s. There is a lot of headroom for the memory bandwidth and arithmetic calculations are the bottleneck. Though Cell has precise control over memory operations by scheduling DMA transfers and can overlap DMA transfers with calculations. This allows to utilize memory bandwidth very efficiently for scrypt algorithm.

This page seems to say that HD 6990 has 320GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% peak BW with GPU. Doing some math again, we get 320GB * 0.2 / 256KB ~= 244 khash/s. Looks rather believable to me.

edit: corrected HD 6990 memory bandwidth (it is 320GB/s and not 350GB/s)

What about smix and the mul operations in scrypt?  I thought the speed of Cell as implemented in the PS3 (~35 kh/s) was due to the 256KB of onboard local registers...  The slowdown in scrypt(1024,1,1) has little to do with the speed of the memory and everything to do with the speed of random accesses to that memory.  Cache (or onboard memory in the case of Cell) is way, way faster in terms of random access to data (L1 and L2 are 4 and 10 clock cycles respectively for an i7).

Quote
With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access - in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… - another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on “the degree of randomness”. This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it - similarly to any other processor.

http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

GPUs generally have little onboard cache (16-32KB) because the data they process is intended to be sequential (and it usually is for 3D applications).
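
To make the coalescing point in the quote concrete, here is a toy model (a sketch, not how any particular GPU actually works). It counts how many memory segments a group of threads touches in one step. The 128-byte segment size, the 32-thread group, and the names `segments_touched`, `sequential` and `scattered` are all assumptions chosen for illustration; the scattered pattern just mimics picking pseudo-random 128-byte entries out of a 1024-entry scratchpad.

```python
# Toy model of memory coalescing: count the distinct segments that one
# group of threads touches in a single step.
SEGMENT_BYTES = 128   # assumed size of one memory transaction
THREADS = 32          # assumed number of threads issuing addresses together
WORD_BYTES = 4        # each thread reads one 4-byte word

def segments_touched(addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct segments covered by the given byte addresses."""
    return len({addr // segment_bytes for addr in addresses})

# Thread i reads word i: the addresses form one contiguous 128-byte range.
sequential = [i * WORD_BYTES for i in range(THREADS)]

# Thread i reads the first word of a pseudo-random 128-byte entry out of 1024.
# Multiplying by an odd constant modulo 1024 never maps two threads to the same entry.
scattered = [((i * 7919) % 1024) * 128 for i in range(THREADS)]

print("sequential:", segments_touched(sequential), "segment(s)")
print("scattered: ", segments_touched(scattered), "segment(s)")
```

In this model the sequential pattern needs a single transaction while the scattered one needs one per thread, which is the "degree of randomness" cost the quote describes.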
donator
Activity: 1218
Merit: 1079
Gerald Davis
I am no GPGPU expert, but I think ArtForz made some very good points in the following thread:
https://bitcointalksearch.org/topic/tenebrix-scaling-questions-45849
CoinHunter could make his claims convincing by simply explaining how to address the GPU limitations outlined by ArtForz.
At least ArtForz was mistaken about Cell earlier Smiley

Just let's do some simple math. Playstation3 has 6 SPE cores, each clocked at 3.2GHz and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance critical part of salsa20/8 which are executed in the even pipe (shuffles and the other instructions are executed in the odd pipe). Also calculating one hash needs 256KB of memory bandwidth (128KB is written sequentially, 128KB is read in scattered 128-byte chunks). So taking into account that SPE core can execute one instruction from the even pipe each cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB / 256KB ~= 95.4 khash/s. There is a lot of headroom for the memory bandwidth and arithmetic calculations are the bottleneck. Though Cell has precise control over memory operations by scheduling DMA transfers and can overlap DMA transfers with calculations. This allows to utilize memory bandwidth very efficiently for scrypt algorithm.

This page seems to say that HD 6990 has 320GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% peak BW with GPU. Doing some math again, we get 320GB * 0.2 / 256KB ~= 244 khash/s. Looks rather believable to me.

edit: corrected HD 6990 memory bandwidth (it is 320GB/s and not 350GB/s)

Bandwidth has nothing to do w/ scrypt.  LATENCY does.  Which is why the amount of L1 cache is so important.
hero member
Activity: 896
Merit: 1000
Seal Cub Clubbing Club
At least ArtForz was mistaken about Cell earlier Smiley

Just let's do some simple math. Playstation3 has 6 SPE cores, each clocked at 3.2GHz and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance critical part of salsa20/8 which are executed in the even pipe (shuffles and the other instructions are executed in the odd pipe). Also calculating one hash needs 256KB of memory bandwidth (128KB is written sequentially, 128KB is read in scattered 128-byte chunks). So taking into account that SPE core can execute one instruction from the even pipe each cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB / 256KB ~= 95.4 khash/s. There is a lot of headroom for the memory bandwidth and arithmetic calculations are the bottleneck. Though Cell has precise control over memory operations by scheduling DMA transfers and can overlap DMA transfers with calculations. This allows to utilize memory bandwidth very efficiently for scrypt algorithm.

This page seems to say that HD 6990 has 350GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% peak BW with GPU. Doing some math again, we get 350GB * 0.2 / 256KB ~= 267 khash/s. Looks rather believable to me.

Check out the big brain on Brett! Shocked
newbie
Activity: 39
Merit: 0
I am no GPGPU expert, but I think ArtForz made some very good points in the following thread:
https://bitcointalksearch.org/topic/tenebrix-scaling-questions-45849
CoinHunter could make his claims convincing by simply explaining how to address the GPU limitations outlined by ArtForz.
At least ArtForz was mistaken about Cell earlier Smiley

Let's just do some simple math. The PlayStation 3 has 6 SPE cores, each clocked at 3.2GHz, and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance-critical part of salsa20/8; these are executed in the even pipe (shuffles and the other instructions are executed in the odd pipe). Calculating one hash also needs 256KB of memory traffic (128KB is written sequentially, 128KB is read in scattered 128-byte chunks). Since an SPE core can execute one even-pipe instruction per cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB / 256KB ~= 95.4 khash/s. So there is a lot of memory bandwidth headroom, and arithmetic is the bottleneck. Cell also has precise control over memory operations: it schedules DMA transfers explicitly and can overlap them with calculations. This allows the scrypt algorithm to use memory bandwidth very efficiently.

This page seems to say that HD 6990 has 320GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% peak BW with GPU. Doing some math again, we get 320GB * 0.2 / 256KB ~= 244 khash/s. Looks rather believable to me.

edit: corrected HD 6990 memory bandwidth (it is 320GB/s and not 350GB/s)
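
The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch under the post's own figures. The function names are made up for illustration, and the unit convention (GB as 10^9 bytes, KB as 1024 bytes) is an assumption picked because it matches the quoted results. The 256KB per hash follows from scrypt(1024,1,1): 1024 blocks of 128 bytes written once and read once.

```python
# Back-of-the-envelope throughput limits for scrypt(1024,1,1).
KB = 1024        # assumption: KB means 1024 bytes
GB = 10**9       # assumption: GB/s means 10^9 bytes per second

SCRYPT_N = 1024      # number of scratchpad blocks
BLOCK_BYTES = 128    # 128 * r bytes per block, with r = 1

def bytes_per_hash(n=SCRYPT_N, block_bytes=BLOCK_BYTES):
    """Scratchpad traffic per hash: n blocks written, then n blocks read."""
    return 2 * n * block_bytes

def compute_limit(cores, clock_hz, ops_per_hash):
    """Hashes per second if every core retires one arithmetic op per cycle."""
    return cores * clock_hz / ops_per_hash

def bandwidth_limit(bytes_per_sec, utilisation=1.0):
    """Hashes per second if only memory traffic limits throughput."""
    return bytes_per_sec * utilisation / bytes_per_hash()

cell_compute = compute_limit(cores=6, clock_hz=3.2e9, ops_per_hash=434176)
cell_bandwidth = bandwidth_limit(25 * GB)
gpu_bandwidth = bandwidth_limit(320 * GB, utilisation=0.2)  # 20% of peak

print(f"Cell, compute bound:      {cell_compute / 1000:.1f} khash/s")
print(f"Cell, bandwidth bound:    {cell_bandwidth / 1000:.1f} khash/s")
print(f"HD 6990, 20% of peak BW:  {gpu_bandwidth / 1000:.0f} khash/s")
```

The lower of the two Cell limits is the compute bound, which is why arithmetic rather than memory is the bottleneck there.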
hero member
Activity: 896
Merit: 1000
Seal Cub Clubbing Club
At 250 kH/s with your 6990 your daily yield is right around 160 LTC at current difficulty (estimating with http://www.litecoinpool.org/stats, and ignoring pool fees of course).  At an exchange rate of .002802 BTC per LTC that means you can exchange your daily take of LTC for roughly .44 BTC. 

Now if you were mining straight BTC with that 6990, you'd be doing around 820 MH/s which yields around .60 BTC per day, maybe a little more.  I'm guessing that you'd need that GPU Litecoin miner to hit around 350 kH/s or so before you start breaking even versus mining straight BTC.
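
A quick sketch of that break-even estimate, using only the numbers in this post. It assumes LTC yield scales linearly with hashrate at fixed difficulty and exchange rate, and it ignores pool fees and power cost. The function names are hypothetical.

```python
# Break-even estimate: how fast must the LTC miner be to match BTC mining income?
def daily_btc_from_ltc(ltc_per_day, btc_per_ltc):
    """Daily LTC yield converted to BTC at the given exchange rate."""
    return ltc_per_day * btc_per_ltc

def break_even_khs(current_khs, ltc_income_btc, btc_income_btc):
    """Hashrate at which LTC income equals BTC income.

    Assumes income is proportional to hashrate (fixed difficulty and price).
    """
    return current_khs * btc_income_btc / ltc_income_btc

ltc_income = daily_btc_from_ltc(ltc_per_day=160, btc_per_ltc=0.002802)
needed = break_even_khs(current_khs=250, ltc_income_btc=ltc_income, btc_income_btc=0.60)

print(f"LTC mining income: {ltc_income:.3f} BTC/day")
print(f"Break-even speed:  {needed:.0f} kH/s")
```

With these inputs the break-even comes out a little under the 350 kH/s guessed above, so the rough estimate is in the right range.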
donator
Activity: 1218
Merit: 1079
Gerald Davis

It currently gets ~250KH/s on a 6990 .


The LTC mining efficiency of CPUs and GPUs, in W per kH/s, seems to be similar:
core i3 / 3.1 GHz / 3 threads / ~15 W for mining alone / 12.5 kH/s => 1.2 W per kH/s
core i7 / 3.6 GHz / 7 threads / ~46 W for mining alone / 32 kH/s => 1.44 W per kH/s
6990 / ~350 W (I don't know the exact wattage) / 250 kH/s => 1.4 W per kH/s

But with further improvements the 6990 and the other mining cards will reach an even better W per kH/s figure.

So the end of LTC CPU mining is not too far away, because
one 6990 does the job of ~8 Core i7s or ~20 Core i3s.

A unicorn can run 500 miles on one gallon (of beer).  A gallon of beer is more expensive than a gallon of gas but likely we are still seeing the end of hybrid cars because one unicorn is equal to almost 10x Toyota Prius.  Plus with improvement in unicorn-beer technology you are getting even better m/goa rate (that's miles to gallons of alcohol).
sr. member
Activity: 309
Merit: 250

It currently gets ~250KH/s on a 6990 .


The LTC mining efficiency of CPUs and GPUs, in W per kH/s, seems to be similar:
core i3 / 3.1 GHz / 3 threads / ~15 W for mining alone / 12.5 kH/s => 1.2 W per kH/s
core i7 / 3.6 GHz / 7 threads / ~46 W for mining alone / 32 kH/s => 1.44 W per kH/s
6990 / ~350 W (I don't know the exact wattage) / 250 kH/s => 1.4 W per kH/s

But with further improvements the 6990 and the other mining cards will reach an even better W per kH/s figure.

So the end of LTC CPU mining is not too far away, because
one 6990 does the job of ~8 Core i7s or ~20 Core i3s.
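
For anyone who wants to plug in their own hardware, the comparison above can be sketched like this. The wattages and hashrates are the estimates from this post (the 6990 wattage in particular is a guess), and the `rigs` table and function names are made up for illustration.

```python
# Power efficiency comparison: watts per kH/s, plus how many CPUs equal one GPU.
rigs = {
    # name: (watts used for mining, hashrate in kH/s)
    "core i3": (15, 12.5),
    "core i7": (46, 32),
    "6990": (350, 250),   # wattage is a rough guess
}

def watts_per_khs(watts, khs):
    """Power cost of one kH/s; lower is better."""
    return watts / khs

def equivalent_count(target_khs, single_khs):
    """How many of the slower device match the target hashrate."""
    return target_khs / single_khs

for name, (watts, khs) in rigs.items():
    print(f"{name:8s} {watts_per_khs(watts, khs):.2f} W per kH/s")

gpu_khs = rigs["6990"][1]
for cpu in ("core i7", "core i3"):
    count = equivalent_count(gpu_khs, rigs[cpu][1])
    print(f"one 6990 ~= {count:.1f} x {cpu}")
```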
legendary
Activity: 889
Merit: 1000
Bitcoin calls me an Orphan
Ohh my goodness.. look out.. We have found the ultimate GPU miner for litecoin!!!! We did this with a GeForce mx2 Smiley
