At least ArtForz was mistaken about Cell earlier.
Let's just do some simple math. The PlayStation 3 has 6 SPE cores, each clocked at 3.2GHz, and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance-critical part of salsa20/8, and these are executed in the even pipe (shuffles and the other instructions go to the odd pipe). Calculating one hash also needs 256KB of memory traffic (128KB is written sequentially, 128KB is read in scattered 128-byte chunks).

Taking into account that an SPE core can execute one instruction from the even pipe each cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB/s / 256KB ~= 95.4 khash/s. So there is a lot of headroom in the memory bandwidth, and the arithmetic is the bottleneck. Moreover, Cell has precise control over memory operations via explicitly scheduled DMA transfers and can overlap those transfers with computation, which lets it use memory bandwidth very efficiently for the scrypt algorithm.
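To make the arithmetic easy to check, here is the same back-of-the-envelope calculation as a tiny C program (the figures are just the ones quoted above, not measurements of mine):

```c
/* Rough throughput bounds for scrypt(1024,1,1) on the PS3's Cell,
 * using the numbers quoted above. A sketch, not a benchmark. */
#include <stdio.h>

int main(void)
{
    const double spe_cores  = 6;            /* SPE cores on a PS3 */
    const double clock_hz   = 3.2e9;        /* per-SPE clock */
    const double even_ops   = 434176;       /* vector ADD/ROL/XOR ops per hash (even pipe) */
    const double mem_bw     = 25e9;         /* total memory bandwidth, bytes/s */
    const double bytes_hash = 256 * 1024;   /* 128KB written + 128KB read per hash */

    /* one even-pipe instruction per cycle per SPE caps the compute rate */
    printf("compute-bound limit:   %.1f khash/s\n",
           spe_cores * clock_hz / even_ops / 1e3);      /* ~44.2 */
    printf("bandwidth-bound limit: %.1f khash/s\n",
           mem_bw / bytes_hash / 1e3);                  /* ~95.4 */
    return 0;
}
```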
This page seems to say that the HD 6990 has 320GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% of peak BW with a GPU. Doing some math again, we get 320GB/s * 0.2 / 256KB ~= 244 khash/s. Looks rather believable to me.
edit: corrected HD 6990 memory bandwidth (it is 320GB/s and not 350GB/s)
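The same kind of estimate for the GPU side, with the <20% bandwidth efficiency baked in (again just arithmetic on the quoted numbers, not a measurement):

```c
/* Bandwidth-bound estimate for an HD 6990, assuming only ~20% of the
 * 320GB/s peak is usable for scrypt's scattered 128-byte reads. */
#include <stdio.h>

int main(void)
{
    const double peak_bw    = 320e9;        /* HD 6990 peak memory bandwidth, bytes/s */
    const double efficiency = 0.20;         /* assumed <20% of peak for non-coalesced access */
    const double bytes_hash = 256 * 1024;   /* memory traffic per scrypt(1024,1,1) hash */

    printf("bandwidth-bound limit: %.0f khash/s\n",
           peak_bw * efficiency / bytes_hash / 1e3);    /* ~244 */
    return 0;
}
```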
What about SMix and the mul operations in scrypt? I thought the reason for the speed of Cell as implemented in the PS3 (~35 kh/s) was due to the 256KB of onboard local store per SPE... The slowdown in scrypt(1024,1,1) has little to do with the raw speed of the memory and everything to do with the speed of random accesses to that memory. Cache (or onboard local store in the case of Cell) is way, way faster in terms of random access to data (L1 and L2 are about 4 and 10 clock cycles respectively on an i7).
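For anyone who hasn't looked at it, a stripped-down sketch of scrypt's SMix shows where those random accesses come from. block_mix() below is only a placeholder for the real Salsa20/8 block mix and the seed is arbitrary, so this doesn't produce real scrypt output; it's just the memory access pattern:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N 1024            /* scrypt cost parameter */
#define BLOCK_WORDS 32    /* one 128-byte block = 32 x 32-bit words (r = 1) */

static uint32_t V[N][BLOCK_WORDS];   /* the 128KB scratchpad */

/* placeholder for the real Salsa20/8-based block mix */
static void block_mix(uint32_t X[BLOCK_WORDS])
{
    for (int i = 0; i < BLOCK_WORDS; i++)
        X[i] = ((X[i] << 13) | (X[i] >> 19)) ^ X[(i + 1) % BLOCK_WORDS];
}

static void smix(uint32_t X[BLOCK_WORDS])
{
    /* phase 1: 128KB of sequential writes while filling V */
    for (int i = 0; i < N; i++) {
        memcpy(V[i], X, sizeof V[i]);
        block_mix(X);
    }
    /* phase 2: 1024 reads of 128-byte blocks at data-dependent ("random")
     * indices -- this is what rewards a big local store/cache and punishes DRAM */
    for (int i = 0; i < N; i++) {
        uint32_t j = X[16] % N;   /* roughly scrypt's Integerify: index taken from X */
        for (int k = 0; k < BLOCK_WORDS; k++)
            X[k] ^= V[j][k];
        block_mix(X);
    }
}

int main(void)
{
    uint32_t X[BLOCK_WORDS];
    for (int i = 0; i < BLOCK_WORDS; i++)
        X[i] = 0x9E3779B9u * (uint32_t)(i + 1);   /* arbitrary seed, not real scrypt input */
    smix(X);
    printf("X[0] after SMix: %08x\n", (unsigned)X[0]);
    return 0;
}
```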
With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access - in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… - another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on “the degree of randomness”. This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it - similarly to any other processor.
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

GPUs generally have little onboard cache (16-32KB) because the data they process is intended to be sequential (and it usually is for 3D applications).
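A toy model of that coalescing point: count how many 128-byte memory transactions a group of 32 threads generates when their addresses are contiguous versus when each thread chases its own scrypt-style 128-byte block at a random offset. The 32-thread group size and the 128-byte segment granularity are assumptions for the illustration, not specs of any particular card:

```c
/* Toy coalescing model: 32 "threads" each want 4 bytes. Contiguous addresses
 * collapse into a single 128-byte transaction; random 128-byte-aligned scrypt
 * lookups need up to one transaction per thread. */
#include <stdio.h>
#include <stdlib.h>

#define THREADS 32
#define SEGMENT 128   /* assumed coalescing granularity in bytes */

/* count distinct 128-byte segments touched by the thread addresses */
static int transactions(const unsigned long addr[THREADS])
{
    unsigned long seen[THREADS];
    int n = 0;
    for (int t = 0; t < THREADS; t++) {
        unsigned long seg = addr[t] / SEGMENT;
        int dup = 0;
        for (int i = 0; i < n; i++)
            if (seen[i] == seg) { dup = 1; break; }
        if (!dup)
            seen[n++] = seg;
    }
    return n;
}

int main(void)
{
    unsigned long contiguous[THREADS], scattered[THREADS];
    for (int t = 0; t < THREADS; t++) {
        contiguous[t] = 4ul * t;                   /* thread t reads word t */
        scattered[t]  = 128ul * (rand() % 1024);   /* thread t reads V[j], j random */
    }
    printf("contiguous: %d transaction(s)\n", transactions(contiguous));   /* 1 */
    printf("scattered:  %d transaction(s) (up to %d)\n",
           transactions(scattered), THREADS);
    return 0;
}
```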