Topic: nVidia M2050 GPU optimization (current 70-85 Mhash/s)

full member
Activity: 184
Merit: 100
. Open source has produced some of the highest-quality software in the world: Linux, Apache, etc. A miner is also just a few hundreds lines of code. It is not the complex, hard-to-optimize software beast that you may imagine.

You are right, but it seems we do not have open source standards here for how miner code is produced. All I can see is individual efforts. Maybe I am wrong; correct me if so.
full member
Activity: 184
Merit: 100
They are wrong about the threads operating on the same data. A naive reader skimming over the code may think most of the data is the same, but the 32-bit nonce is unique for each thread, which leads to different SHA-256 intermediate hash values (A-H) being manipulated by each thread.

Re: enterprise prices - it is truly market segmentation, whether you want to believe it or not. The 2 largest and public disk reliability studies ever performed were made by Google and CMU. They reveal interesting findings. In particular, contrary to what you think, the CMU one reported no statistical differences between the failure rate of SCSI vs SATA drives on a population of 100k+ drives:
* Failure Trends in a Large Disk Drive Population - Google
* Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? - CMU
Which makes sense when you think about it. Commodity drives are produced in such a high volume that it is in the manufacturer's interest to make them as reliable as possible, because a small improvement in reliability drastically reduces the number of warranty claims.

This is just some food for thoughts... You should trust more of the people here. The Bitcoin community is full of smart folks. Open source has produced some of the highest-quality software in the world: Linux, Apache, etc. A miner is also just a few hundreds lines of code. It is not a complex, hard-to-optimize software beast that you may imagine.

What is your opinion on this?:
http://stackoverflow.com/questions/5932689/rotate-right-operation-on-integer-data-using-floating-point-operations
Theoretically possible or not?

Looks like I have to dig into it myself before moving on.
mrb
legendary
Activity: 1512
Merit: 1028
It seems impossible to me to use floating point to emulate an integer right rotate.
newbie
Activity: 16
Merit: 0
They are wrong about the threads operating on the same data. A naive reader skimming over the code may think most of the data is the same, but the 32-bit nonce is unique for each thread, which leads to different SHA-256 intermediate hash values (A-H) being manipulated by each thread.

Re: enterprise prices - it is truly market segmentation, whether you want to believe it or not. The 2 largest and public disk reliability studies ever performed were made by Google and CMU. They reveal interesting findings. In particular, contrary to what you think, the CMU one reported no statistical differences between the failure rate of SCSI vs SATA drives on a population of 100k+ drives:
* Failure Trends in a Large Disk Drive Population - Google
* Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? - CMU
Which makes sense when you think about it. Commodity drives are produced in such a high volume that it is in the manufacturer's interest to make them as reliable as possible, because a small improvement in reliability drastically reduces the number of warranty claims.

This is just some food for thoughts... You should trust more of the people here. The Bitcoin community is full of smart folks. Open source has produced some of the highest-quality software in the world: Linux, Apache, etc. A miner is also just a few hundreds lines of code. It is not a complex, hard-to-optimize software beast that you may imagine.

What is your opinion on this?:
http://stackoverflow.com/questions/5932689/rotate-right-operation-on-integer-data-using-floating-point-operations
Theoretically possible or not?
mrb
legendary
Activity: 1512
Merit: 1028
They are wrong about the threads operating on the same data. A naive reader skimming over the code may think most of the data is the same, but the 32-bit nonce is unique for each thread, which leads to different SHA-256 intermediate hash values (A-H) being manipulated by each thread.
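
To make the nonce point concrete, here is a minimal Python sketch, not taken from any miner in this thread. It assumes a placeholder 76-byte header prefix of zeros (not a real block) and a hypothetical helper `bitcoin_hash`; only the trailing 32-bit nonce changes between "threads", yet every resulting double SHA-256 digest is different.

```python
import hashlib
import struct

def bitcoin_hash(header76: bytes, nonce: int) -> bytes:
    """Double SHA-256 of an 80-byte header: 76 fixed bytes plus a 32-bit nonce."""
    header = header76 + struct.pack("<I", nonce)  # little-endian 32-bit nonce
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

# Assumption: a dummy all-zero prefix stands in for the real header fields.
header76 = bytes(76)

# Each "thread" gets its own nonce; everything else is shared.
digests = [bitcoin_hash(header76, nonce) for nonce in range(4)]
for nonce, digest in enumerate(digests):
    print(nonce, digest.hex()[:16])
```

The shared 76 bytes are what a quick skim of the code sees; the differing nonce is what makes each thread's intermediate state diverge.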

Re: enterprise prices - it is truly market segmentation, whether you want to believe it or not. The two largest public disk reliability studies ever performed were made by Google and CMU. They reveal interesting findings. In particular, contrary to what you think, the CMU one reported no statistical difference between the failure rates of SCSI and SATA drives on a population of 100k+ drives:
* Failure Trends in a Large Disk Drive Population - Google
* Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? - CMU
Which makes sense when you think about it. Commodity drives are produced in such a high volume that it is in the manufacturer's interest to make them as reliable as possible, because a small improvement in reliability drastically reduces the number of warranty claims.

This is just some food for thought... You should trust the people here more. The Bitcoin community is full of smart folks. Open source has produced some of the highest-quality software in the world: Linux, Apache, etc. A miner is also just a few hundred lines of code. It is not the complex, hard-to-optimize software beast that you may imagine.
full member
Activity: 184
Merit: 100
They are lying to you ("no parallelism", "+350%", ha!) You must have contacted sleazy consulting houses that will say anything to steal money from you :-)

It takes about 3200 operations to execute the SHA-256 compression function without a native integer right rotate instruction (ie. Nvidia).
One Bitcoin hash is 2 calls to the compression function.
Therefore a Tesla M2050 with 448 ALUs at 1150MHz can only execute 448*1150e6/3200/2 = 80.5 Mhash/s which is the theoretical number I gave earlier.
The best you can do is shave a few ops from these 3200 (such as precomputing some of the SHA-256 shuffling steps, which I am planning to implement in my ATI miner), but that would lead to only a few percentage points of perf improvement...


Well, here is their executive reply, in quotes:

"It also should be noted that the preliminary analysis of the CUDA code showed that parallelism has not been applied correctly. Most of the parallel threads are executing the same operation with same data. Therefore we assume that it will be possible to sufficiently optimize the code."

Again, I am not sure I will go for it, at least not alone, as they are asking for the equivalent of 7000 BTC for the project. On the other hand, how would it sound to rent Amazon EC2 for Bitcoin mining and still make a profit?

I think the extra RAM they have there must be of some use.

As for enterprise prices being 10 times the consumer ones: that is not because of marketing or anything like it. It is a justified expense for the performance you are getting. To cut it short, if you put a commodity end-user SATA hard disk in a web server, you will be changing it every 15 days; it will not even make it to the end of the month.

Has anyone checked ufasoft's new CPU miner?
mrb
legendary
Activity: 1512
Merit: 1028
They are lying to you ("no parallelism", "+350%", ha!) You must have contacted sleazy consulting houses that will say anything to steal money from you :-)

It takes about 3200 operations to execute the SHA-256 compression function without a native integer right rotate instruction (i.e. on Nvidia).
One Bitcoin hash is 2 calls to the compression function.
Therefore a Tesla M2050 with 448 ALUs at 1150MHz can only execute 448*1150e6/3200/2 = 80.5 Mhash/s which is the theoretical number I gave earlier.
The best you can do is shave a few ops from these 3200 (such as precomputing some of the SHA-256 shuffling steps, which I am planning to implement in my ATI miner), but that would lead to only a few percentage points of perf improvement...
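
The arithmetic above can be reproduced with a short Python sketch. It assumes, as the formula implies, one operation per ALU per clock; the function name `theoretical_mhash` and the 3100-op figure used for the "shaved" case are made up for illustration.

```python
def theoretical_mhash(alus, clock_hz, ops_per_compression=3200,
                      compressions_per_hash=2):
    """Upper bound on hash rate in Mhash/s, assuming one op per ALU per clock."""
    ops_per_second = alus * clock_hz
    ops_per_hash = ops_per_compression * compressions_per_hash
    return ops_per_second / ops_per_hash / 1e6

# Tesla M2050: 448 ALUs at 1150 MHz, figures from the post above.
m2050 = theoretical_mhash(448, 1150e6)
print(f"M2050 bound: {m2050:.1f} Mhash/s")

# Hypothetical: shaving 100 ops off the compression function.
shaved = theoretical_mhash(448, 1150e6, ops_per_compression=3100)
print(f"Gain from 3200 -> 3100 ops: {100 * (shaved / m2050 - 1):.1f}%")
```

Because the bound scales as 1/ops, trimming a small share of the operations can only ever give a gain of the same small size, which is why the consultants' "+350%" does not add up.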
full member
Activity: 184
Merit: 100
strange how expensive and ineffective this card is.

Why is it strange? It is the norm in the IT industry. The best performance/$ is almost always provided by mass-produced commodity hardware, as opposed to low volume high-end enterprise gear. A desktop GPU costs $200, but a HPC one with equal performance costs $2000 (both AMD and Nvidia segment the market the same way); a desktop HDD costs $50/TB, but a server hdd $500/TB; and so on. High-end enterprise gear tends to have poor perf/$ either because manufacturers cannot leverage economies of scale, or more often because of artificial market segmentation. That's why for example Google built their entire infrastructure on commodity hardware. They wouldn't be as profitable as they are if they run on, say, Dell PowerEdge servers.

so you advise I invest no more time looking code optimization?

Yep. The code is already performing at the theoretical maximum speed.

You would be amazed at what I am getting. After contacting a couple of software consulting companies, they said the CUDA code at least is a total mess and does not make use of any parallelization whatsoever! (They have checked almost every GPU miner available here in the forum.)

To be honest, I am not surprised, since almost all the miners here are developed by volunteers in their free time, with no professional development cycle, quality assurance, software architecture, etc.

They said they can get at least 80%+ in the first run and a total of 350% by final development!

And yes, I am thinking of starting a new thread to collect donations for their fees, to optimize at least the CUDA miner, since CUDA is the industry standard among GPU hosting providers out there (Amazon, Peer1, etc.).

Not to mention the ufasoft example, which has encouraged me, since people are getting about 100%+ on their Intel processors.

What do you think, guys? Should I start the thread to collect donations and go for optimized CUDA/OpenCL versions, to be released as open source?

mrb
legendary
Activity: 1512
Merit: 1028
strange how expensive and ineffective this card is.

Why is it strange? It is the norm in the IT industry. The best performance/$ is almost always provided by mass-produced commodity hardware, as opposed to low-volume high-end enterprise gear. A desktop GPU costs $200, but an HPC one with equal performance costs $2000 (both AMD and Nvidia segment the market the same way); a desktop HDD costs $50/TB, but a server HDD $500/TB; and so on. High-end enterprise gear tends to have poor perf/$ either because manufacturers cannot leverage economies of scale, or more often because of artificial market segmentation. That's why, for example, Google built their entire infrastructure on commodity hardware. They wouldn't be as profitable as they are if they ran on, say, Dell PowerEdge servers.

so you advise I invest no more time looking code optimization?

Yep. The code is already performing at the theoretical maximum speed.
full member
Activity: 184
Merit: 100
I hope he didn't buy that card just for mining.


Nope, I was just trying it on a free ride. Strange how expensive and ineffective this card is. So you advise I invest no more time looking into code optimization?
full member
Activity: 193
Merit: 100
I hope he didn't buy that card just for mining.
mrb
legendary
Activity: 1512
Merit: 1028
You are correct. I forgot about the artificial FP64 crippling.
full member
Activity: 238
Merit: 100
No, Tesla GPUs have the exact same FP64, FP32, and integer performance per clock than consumer GTX GPUs, because they are the same ASICs.

That is not true: consumer GTX GPUs are crippled so that they do not cannibalize nVidia's market for expensive GPGPUs, and their FP64 speed is artificially limited to 1/4 of the FP64 speed of Teslas. FP32 and integer performance per clock is the same, though, so the difference between GTX and Teslas is completely irrelevant for mining.
mrb
legendary
Activity: 1512
Merit: 1028
No, Tesla GPUs have the exact same FP64, FP32, and integer performance per clock as consumer GTX GPUs, because they are the same ASICs. (The GTX 470 has a slightly higher mining speed than the M2050 because of a slightly higher shader clock: 1215 vs. 1150 MHz).
newbie
Activity: 56
Merit: 0
Also, Teslas have high FP64 performance, but hashing doesn't use FP64 (or even FP32).  So, essentially a GTX 470 would perform a hair better than that Tesla part in mining.  You'd better stick with FP64 workloads with that hardware.
mrb
legendary
Activity: 1512
Merit: 1028
~80 Mhash/s is the theoretical maximum on this GPU. Nvidia GPUs are significantly slower than AMD GPUs for the reasons explained here (fewer ALUs, no rotate instruction):

https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs?
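
As a rough illustration of the "no rotate instruction" point, here is a minimal Python sketch, not code from any miner. Without a native rotate, a 32-bit right rotate has to be built from two shifts and an OR, so three integer operations instead of one. The helper names `rotr32` and `big_sigma1` are made up here; the Sigma1 rotation amounts (6, 11, 25) are the ones SHA-256 defines.

```python
MASK32 = 0xFFFFFFFF

def rotr32(x, n):
    """Rotate the 32-bit value x right by n bits using two shifts and an OR."""
    n &= 31
    # Python ints are unbounded, so mask back down to 32 bits.
    return ((x >> n) | (x << (32 - n))) & MASK32

def big_sigma1(e):
    """SHA-256 Sigma1: three rotates XORed together, used in every round."""
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)

print(hex(rotr32(0x12345678, 8)))
print(hex(big_sigma1(0x510E527F)))
```

SHA-256 performs rotates like these in every round, so the extra operations per rotate add up across the whole compression function.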
full member
Activity: 184
Merit: 100
Hi Miners,

I am starting this thread because I am using the very expensive nVidia M2050 GPU and I only get about 1 BTC daily on average, no matter which pool I am in.

I used the rpc and the moc-mod, and both give about the same performance.

Is that normal? Is there anything left to squeeze out? Any fine tuning?


Thanks in advance

BTC address

1Zu5HKax7pzDXE3azcNN7zFYaTb2LAgve