Topic: 3% faster mining with phoenix+phatk, diablo, or poclbm for everyone (Read 39162 times)

hero member
Activity: 686
Merit: 500
Shame on everything; regret nothing.
FYI this change is already found in both the phatk and poclbm kernels in LinuxCoin.

Ah, thank you -- just the confirmation I was looking for, without having to muddle with my OS
newbie
Activity: 15
Merit: 0
FYI this change is already found in both the phatk and poclbm kernels in LinuxCoin.
hero member
Activity: 770
Merit: 502
Does this tweak/mod apply to the latest GUIMiner? When I installed GUIMiner, I did not create a new miner; GUIMiner is using whatever it uses when you fire it up (the [Default] tab).

Could you add explicit instructions? I am guessing I am using POCLBM.

From what I see, there are instructions for phatk.

But then you say
Quote
Works also for POCLBM, you just need to edit bitcoinminer.cl and change the very same line.

I opened bitcoinminer.cl and did not find
Quote
#define Ma(x, y, z) amd_bytealign((y), (x | z), (z & x))
hero member
Activity: 772
Merit: 500
Diapolo, I have been following your thread. You have put a lot of work into it and I look forward to further testing. That being said, including these kernel mods makes it difficult to tell how successful your changes are (lack of a control). If your modifications are truly beneficial, they will be included in the mainline.

Also, I believe the original developers are not getting the attention they deserve. These kernel mods and experimentation are welcomed by the community, but let's not forget who put in the original effort!

I would say mods are "truly beneficial" if they lower the ALU ops needed to process the kernel. This can be checked via the AMD APP KernelAnalyzer, and that is what I try to do. You are absolutely right that we should not forget the ones who created the kernel in its basic version. I couldn't have done this, that's for sure! But in the end we are all interested in the same thing ... to calculate BTC faster or more efficiently, right?

Dia
sr. member
Activity: 378
Merit: 255
Diapolo, I have been following your thread. You have put a lot of work into it and I look forward to further testing. That being said, including these kernel mods makes it difficult to tell how successful your changes are (lack of a control). If your modifications are truly beneficial, they will be included in the mainline.

Also, I believe the original developers are not getting the attention they deserve. These kernel mods and experimentation are welcomed by the community, but let's not forget who put in the original effort!
hero member
Activity: 772
Merit: 500
As will DiabloMiner

I would just like to restate: if you would like to get these tweaks, your best bet is to just update your miner. The big 3 have all been updated at this point, and you are better off not editing source unless you have to. The developers are very responsive to modifications in the kernel that can be shown to improve efficiency.

That would suggest one should trust only "the big 3" when it comes to kernel updates, which I think is NOT true :).
You are right that it's harder to edit the OpenCL kernel yourself, but there is no reason not to try out modified / optimized kernels (like mine)!

Dia
sr. member
Activity: 378
Merit: 255
As will DiabloMiner

I would just like to restate: if you would like to get these tweaks, your best bet is to just update your miner. The big 3 have all been updated at this point, and you are better off not editing source unless you have to. The developers are very responsive to modifications in the kernel that can be shown to improve efficiency.
newbie
Activity: 46
Merit: 0
Win7, dual 6970s, Guiminer w/ poclbm.

Went from ~409 Mhash/s per card to 423 Mhash/s per card. ;D
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Question: if you have to use the BFI_INT flag for this tweak to work with the Phoenix phatk kernel, what flag do you use to make use of it with POCLBM? I'm running the latest version of GUIMiner with the default OpenCL POCLBM miner and extra flags "-v -w128 -f2". I applied this kernel tweak to the modified phatk.cl file that is in the main directory for GUIMiner and it seemed to give me a boost in hash rate, but is it actually working without an additional flag?


poclbm will automatically do BFI_INT if your hardware is capable.

As will DiabloMiner
sr. member
Activity: 378
Merit: 255
Question: if you have to use the BFI_INT flag for this tweak to work with the Phoenix phatk kernel, what flag do you use to make use of it with POCLBM? I'm running the latest version of GUIMiner with the default OpenCL POCLBM miner and extra flags "-v -w128 -f2". I applied this kernel tweak to the modified phatk.cl file that is in the main directory for GUIMiner and it seemed to give me a boost in hash rate, but is it actually working without an additional flag?


poclbm will automatically do BFI_INT if your hardware is capable.
newbie
Activity: 55
Merit: 0
Well, with this boost and the -f2 flag that I tried via burning's post, I have a boost of 10 Mhash, which is 3.3%.

You applied the tweak to the phatk.cl file in the main part of the GUIMiner folder, didn't you? Because I don't see the bitcoinminer.cl file included with the latest update to GUIMiner any more, so I'm guessing that is due to it being replaced with the modified phatk kernel.
newbie
Activity: 55
Merit: 0
Question: if you have to use the BFI_INT flag for this tweak to work with the Phoenix phatk kernel, what flag do you use to make use of it with POCLBM? I'm running the latest version of GUIMiner with the default OpenCL POCLBM miner and extra flags "-v -w128 -f2". I applied this kernel tweak to the modified phatk.cl file that is in the main directory for GUIMiner and it seemed to give me a boost in hash rate, but is it actually working without an additional flag?
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
This actually slows down my poor little 5670 :P ... 94 ==> 92

What is meant by "This will ONLY WORK if you're running with BFI_INT"? Maybe that's my problem.

If you're on one, you should be. OTOH, it shouldn't be slowing it down if you don't have BFI_INT enabled.
I thought only 57XX cards support BFI_INT.

lol? All 5xxx and 6xxx do.
full member
Activity: 126
Merit: 100
This actually slows down my poor little 5670 :P ... 94 ==> 92

What is meant by "This will ONLY WORK if you're running with BFI_INT"? Maybe that's my problem.

If you're on one, you should be. OTOH, it shouldn't be slowing it down if you don't have BFI_INT enabled.
I thought only 57XX cards support BFI_INT.
hero member
Activity: 772
Merit: 500
To all who tried this kernel: the Ma-function patch from this thread is included in my modified phatk kernel.
You are able to run it with SDK 2.1, too ... so give it a try :).

http://forum.bitcoin.org/index.php?topic=25860.0

YES, this seems to be an ad :D.

Dia
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
This actually slows down my poor little 5670 :P ... 94 ==> 92

What is meant by "This will ONLY WORK if you're running with BFI_INT"? Maybe that's my problem.

If you're on one, you should be. OTOH, it shouldn't be slowing it down if you don't have BFI_INT enabled.
full member
Activity: 210
Merit: 100
This actually slows down my poor little 5670 :P ... 94 ==> 92

What is meant by "This will ONLY WORK if you're running with BFI_INT"? Maybe that's my problem.
member
Activity: 68
Merit: 10
sr. member
Activity: 434
Merit: 250
I was getting 272 Mh/s with my 5830. After implementing this change, I'm getting 278 or 279. Not too shabby :)
sr. member
Activity: 714
Merit: 250
Got a few Mhash more on each of my systems, except on my dual-5870 system... didn't really see a change there.

This may just be me or a coincidence, but my single-card system showed bigger gains than my multiple-card systems. Also, my more heavily overclocked systems showed smaller gains than my less overclocked ones.

Thanks a bunch.
sr. member
Activity: 280
Merit: 250
Firstbits: 12pqwk
This patch successfully pushed the difficulty up by an extra 3% :D
member
Activity: 112
Merit: 10
382 to 395, awesome.
sr. member
Activity: 378
Merit: 255
This tweak has been implemented into GUIMiner.
So if you use GUIMiner,
just update.

Yes. As far as I know, this has been implemented into the OpenCL kernel of all the main miners. If you want this, just update your miner. You can verify the change has been made by opening the kernel file and seeing whether the Ma() line has been changed.
newbie
Activity: 37
Merit: 0
4850 no difference... pity... seems it's time to buy a new card...
newbie
Activity: 42
Merit: 0
280 to 289 on a stock speed 5850... will be using this on all my miners, thanks!

exact same here
legendary
Activity: 1428
Merit: 1001
Okey Dokey Lokey
This tweak has been implemented into GUIMiner.
So if you use GUIMiner,
just update.
member
Activity: 112
Merit: 10
Firstbits: 1yetiax
410 -> 420 (2.4%) on my HD5870 with the new DiabloMiner! Suh-weeet! Thanks a bunch!
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Normal Radeons at stock clocks and voltages, that are not overheating, have a known error rate (which is something like 1 error per several hundred million instructions)

One error per several hundred million instructions... where are you getting that information from? (an honest question, I'd like to learn more)

A 58xx series, from the information I can find, can do between one and four instructions per clock. If it is clocked at 775,000,000 cycles per second (775 MHz), that's up to 3.1 billion instructions per second. So according to your information there would be many errors per second.

I think I meant 1 per several hundred billion. It's in the chip specification somewhere; ask AMD.

Quote
a silicon chip is not supposed to have errors, and they will only occur if: the voltage is incorrect, the temperature is out of range, the silicon chip itself is faulty, or something extremely rare happens, such as a cosmic ray [in the case of RAM] 'flipping a bit', perhaps once every few weeks

Bzzt, wrong. GPU hardware does not use the same manufacturing process CPU hardware typically does. GPUs are not mission critical hardware, and rare calculation errors are considered acceptable.

You are correct about "incorrect" voltage; however, you assume that it is incorrect at all. The professional versions of these cards run at lower clock rates and lower voltages to reduce the error rate. Consumer cards are not run at incorrect settings, they are merely run at settings that lead to acceptable levels of errors.

This shows up quite well if you run something like Folding or SETI on your GPU: you will notice a lot more invalid work showing up than you do using the CPU variant, and as far as I am aware work for these two does not expire within any reasonable length of time, unlike bitcoin, so these are actually invalid results and not just "stale".

No, it's invalid on mining as well. My miner has an HW error counter; it only ticks up when the HW produces a nonce it thinks yields H == 0, but when double-checked it doesn't.
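
(For the curious, here is that double-check pattern as a minimal C sketch. This is not DiabloMiner's actual code, which is Java, and last_hash_word() is a hypothetical stand-in for recomputing the double SHA-256 of the header; only the counter logic is the point.)

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical stand-in for the last 32-bit word ("H") of the double
   * SHA-256 of the block header with the nonce spliced in; a real miner
   * recomputes the actual hash here. This placeholder is NOT SHA-256. */
  static uint32_t last_hash_word(uint32_t nonce)
  {
      return nonce * 2654435761u;
  }

  int main(void)
  {
      unsigned accepted = 0, hw_errors = 0;
      /* Nonces the GPU reported as candidates, i.e. it believes H == 0. */
      uint32_t candidates[] = { 0, 0x9e3779b9u, 42 };
      for (int i = 0; i < 3; i++) {
          if (last_hash_word(candidates[i]) == 0)
              accepted++;   /* CPU agrees: submit the share */
          else
              hw_errors++;  /* CPU disagrees: count an HW error, drop it */
      }
      printf("accepted=%u hw_errors=%u\n", accepted, hw_errors);
      return 0;
  }
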
legendary
Activity: 1246
Merit: 1011
359 -> 367 (2.2%) on my two 5850's (900/300 1.01V).

Thank you kindly.
member
Activity: 76
Merit: 10
Normal Radeons at stock clocks and voltages, that are not overheating, have a known error rate (which is something like 1 error per several hundred million instructions)

One error per several hundred million instructions... where are you getting that information from? (an honest question, I'd like to learn more)

A 58xx series, from the information I can find, can do between one and four instructions per clock. If it is clocked at 775,000,000 cycles per second (775 MHz), that's up to 3.1 billion instructions per second. So according to your information there would be many errors per second.

I think I meant 1 per several hundred billion. It's in the chip specification somewhere; ask AMD.

Quote
a silicon chip is not supposed to have errors, and they will only occur if: the voltage is incorrect, the temperature is out of range, the silicon chip itself is faulty, or something extremely rare happens, such as a cosmic ray [in the case of RAM] 'flipping a bit', perhaps once every few weeks

Bzzt, wrong. GPU hardware does not use the same manufacturing process CPU hardware typically does. GPUs are not mission critical hardware, and rare calculation errors are considered acceptable.

You are correct about "incorrect" voltage; however, you assume that it is incorrect at all. The professional versions of these cards run at lower clock rates and lower voltages to reduce the error rate. Consumer cards are not run at incorrect settings, they are merely run at settings that lead to acceptable levels of errors.

This shows up quite well if you run something like Folding or SETI on your GPU: you will notice a lot more invalid work showing up than you do using the CPU variant, and as far as I am aware work for these two does not expire within any reasonable length of time, unlike bitcoin, so these are actually invalid results and not just "stale".
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Normal Radeons at stock clocks and voltages, that are not overheating, have a known error rate (which is something like 1 error per several hundred million instructions)

One error per several hundred million instructions... where are you getting that information from? (an honest question, I'd like to learn more)

A 58xx series, from the information I can find, can do between one and four instructions per clock. If it is clocked at 775,000,000 cycles per second (775 MHz), that's up to 3.1 billion instructions per second. So according to your information there would be many errors per second.

I think I meant 1 per several hundred billion. It's in the chip specification somewhere; ask AMD.

Quote
a silicon chip is not supposed to have errors, and they will only occur if: the voltage is incorrect, the temperature is out of range, the silicon chip itself is faulty, or something extremely rare happens, such as a cosmic ray [in the case of RAM] 'flipping a bit', perhaps once every few weeks

Bzzt, wrong. GPU hardware does not use the same manufacturing process CPU hardware typically does. GPUs are not mission critical hardware, and rare calculation errors are considered acceptable.

You are correct about "incorrect" voltage; however, you assume that it is incorrect at all. The professional versions of these cards run at lower clock rates and lower voltages to reduce the error rate. Consumer cards are not run at incorrect settings, they are merely run at settings that lead to acceptable levels of errors.
newbie
Activity: 39
Merit: 0
I don't know what this did, but I switched from the OpenCL miner to Phoenix, changed a bunch of flags, and went from 275 Mh to 290 Mh on each of my 4 6870 cards.

thanks!
sr. member
Activity: 418
Merit: 250
Normal Radeons at stock clocks and voltages, that are not overheating, have a known error rate (which is something like 1 error per several hundred million instructions)

One error per several hundred million instructions... where are you getting that information from? (an honest question, I'd like to learn more)

A 58xx series, from the information I can find, can do between one and four instructions per clock. If it is clocked at 775,000,000 cycles per second (775 MHz), that's up to 3.1 billion instructions per second. So according to your information there would be many errors per second.

When you talk about hardware errors, are you sure you're not talking about rounding errors?

a silicon chip is not supposed to have errors, and they will only occur if: the voltage is incorrect, the temperature is out of range, the silicon chip itself is faulty, or something extremely rare happens, such as a cosmic ray [in the case of RAM] 'flipping a bit', perhaps once every few weeks

Hardware errors cause things like bluescreens, frozen machines, display driver crashes, and the like. You would have to get REALLY lucky for a hardware error to JUST SO HAPPEN to only affect something that isn't important, like what color this or that pixel is, or something mundane like that. Most likely, if there's a hardware error, the odds are it won't happen on a piece of data where it doesn't matter; it will probably crash the system.

What I can find on cosmic ray bit flips:
http://www.zdnet.com/blog/storage/dram-error-rates-nightmare-on-dimm-street/638
http://lambda-diode.com/opinion/ecc-memory
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Diablo showed very, very few hardware errors (according to the forums, it was low .. less than 3 in 1000 for any card, and in one case only 1 in ~2000)

Hardware errors?  Any hardware error means you've pushed the boundaries of either: voltage, clock speed, or heat...  Reduce one

No it doesn't.  Go read carefully.  Hardware checks are turned off intentionally by Diablo (and I believe all GPU mining software)

I'll cut you off there. You cannot shut off error correction via OpenCL, not that there is any to actually shut off.

The only thing I shut off is HW error message spamming when BFI_INT is on (because BFI_INT is a large hack and certain usages confuse the driver). HW error checking is still enabled in the miner, and the counter still ticks up.

legendary
Activity: 1162
Merit: 1000
DiabloMiner author
No it doesn't. Go read carefully. There are always going to be occasional hardware errors for this kind of thing

All this writing, and all I really said is that hardware errors are normal and do occur with ALL GPUs.

I'm sorry Veldy but you're incorrect

GDDR5 does have error correction, however, which is why you can push it past its boundaries and not crash, but you will get reduced performance from all the error correction.

Aside from GDDR5 and specific ECC RAM, any hardware error would cause huge problems, up to and including system lockup. Later operating systems (Win7 in my case) have gotten better at coping, though; if you're lucky you get a "Display Driver has Stopped Working" error and not a hard freeze.

Edit: I'll just edit this post in response to the one below, to avoid spamming this thread with off-topic posts, to say that we'll just agree to disagree.

Consumer GPUs do not use ECC-enabled GDDR5.

The HW errors, however, are caused by naturally unstable hardware. Normal Radeons at stock clocks and voltages, that are not overheating, have a known error rate (which is something like 1 error per several hundred million instructions), and do not have excessive (or really, any) internal checks.

This is not a defect in the hardware. GPUs are for playing video games. Scientific apps that are searching for a needle in a haystack (such as what we do) double check the result.
newbie
Activity: 28
Merit: 0
Wow this really worked!!

2 6950's both:

380 -> 395 Mhash/s

Thank you so much!!
member
Activity: 76
Merit: 10
My 5870 (950/315) went from 418 -> 428 Mhash, an increase of 10 Mhash or about 2.5%.
newbie
Activity: 32
Merit: 0
Anyone notice 'slower' performance? I am noticing 'slower' performance with SDK 2.1.
newbie
Activity: 39
Merit: 0
Oh, thanks for such a trick!
5770: 210 -> 214
5850: 364 -> 372
member
Activity: 98
Merit: 10
No it doesn't. Go read carefully. There are always going to be occasional hardware errors for this kind of thing

All this writing, and all I really said is that hardware errors are normal and do occur with ALL GPUs.

I'm sorry Veldy but you're incorrect

GDDR5 does have error correction, however, which is why you can push it past its boundaries and not crash, but you will get reduced performance from all the error correction.

Aside from GDDR5 and specific ECC RAM, any hardware error would cause huge problems, up to and including system lockup.

No, I am not wrong, and I wasn't referring to GDDR5 or any specific memory. As you know, memory itself must have error correction or a system simply could not run. Anything that does I/O will have errors. There are several types, and there are several cases where it is better to let them slide than to fix them [an odd pixel or triangle or hexagon somewhere may be better than the performance cost of using error correction to recover it]. Also, as I have mentioned, hardware error correction/checking is turned off by Diablo; with that in mind, one must expect errors [or there would be no need to have the error correction in the first place].

For the sake of this thread and forum, I will leave it at that. You can respond if you like; say what you need to say. Interested readers should do their own research [including myself]. One thing that I am sure of, though, is that DiabloD3 is no dummy, and if errors are expected according to what he wrote about the Diablo miner [see the thread], then I believe he is working off of reliable and true information.
sr. member
Activity: 418
Merit: 250
No it doesn't. Go read carefully. There are always going to be occasional hardware errors for this kind of thing

All this writing, and all I really said is that hardware errors are normal and do occur with ALL GPUs.

I'm sorry Veldy but you're incorrect

GDDR5 does have error correction, however, which is why you can push it past its boundaries and not crash, but you will get reduced performance from all the error correction.

Aside from GDDR5 and specific ECC RAM, any hardware error would cause huge problems, up to and including system lockup. Later operating systems (Win7 in my case) have gotten better at coping, though; if you're lucky you get a "Display Driver has Stopped Working" error and not a hard freeze.

Edit: I'll just edit this post in response to the one below, to avoid spamming this thread with off-topic posts, to say that we'll just agree to disagree.
member
Activity: 98
Merit: 10
Readout went from 1.3 GH/s to 1.4 GH/s. I know most of it is a rounding error, but still an improvement!

There is a large margin of error with only two digits, but that is roughly a 7% improvement. Nothing to sneeze at. More than likely, it is closer to 3% :)
hero member
Activity: 560
Merit: 500
Readout went from 1.3 GH/s to 1.4 GH/s. I know most of it is a rounding error, but still an improvement!
member
Activity: 98
Merit: 10
Reposted from the newbie forum, posted by bitless

Works also for POCLBM, you just need to edit bitcoinminer.cl and change the very same line.

Donate to 15igh5HkCXwvvan4aiPYSYZwJZbHxGBYwB

This is a repost from the newbie forum. https://forum.bitcoin.org/index.php?topic=22965.0;topicseen

Please excuse my snip, but I left the link to the original poster and the donation address [it is not mine; I believe it belongs to the person who discovered this].

If anybody still wants or needs to use the poclbm kernel with Phoenix, kernel.cl can be edited in Phoenix's kernels/poclbm directory. I haven't tried it, but it should have the same effect. I edited it locally for both kernels in Phoenix, and for the POCLBM miner ("momchil's miner") as was actually indicated above, by making the edit to bitcoinminer.cl. I am sure something similar could probably be found in Diablo, but you would have to find it in the Java source [unless it is in the native libraries per platform, in which case it is probably in the source code; I haven't looked, but I suspect it is written in C].
member
Activity: 98
Merit: 10
The accepted share count on each miner exactly matches what is shown on Deepbit, so it isn't a big deal; but I would like to know what is causing the extra rejects reported locally only.

Interesting, may be worthy of its own thread.

As for this modification, it is being included in the latest versions of the various miners as far as I know, so you will not have to do this in the future.

I saw that it was committed to the mainline in source control. Such simple things can make significant differences and all add up cumulatively ... the process of performance tuning at its best :)

EDIT: Of course, in a short time everybody picks it up, which drives up the network hash rate and thus the difficulty, so in its way it is not meaningful once broadly adopted, beyond getting a higher work/energy ratio, which makes mining more affordable ... keeping some miners online just a bit longer [assuming fixed BTC worth, which of course isn't true, but it is all you can use at any given point in time to determine ROI].
member
Activity: 98
Merit: 10
Diablo showed very, very few hardware errors (according to the forums, it was low .. less than 3 in 1000 for any card, and in one case only 1 in ~2000)

Hardware errors?  Any hardware error means you've pushed the boundaries of either: voltage, clock speed, or heat...  Reduce one

No it doesn't. Go read carefully. Hardware checks are turned off intentionally by Diablo (and I believe all GPU mining software), and there are always going to be occasional hardware errors for this kind of thing (there are for just about all hardware, which is why there are FEC, CRC, and other error control protocols for all devices depending upon what they do ... your CPU is a bit of a different beast, and I am not sure how error handling is done there, but I assume there must be occasional errors of some kind). Excessive hardware errors, on the other hand, do imply it is pushed beyond limits. I can tell you that all of my cards passed the Furmark hammer :) Not quite the same as mining, obviously. I have not pushed any card very far when it comes to core clock speed, and only the 5850s had memory speed reduced [900 MHz to 700 MHz], and they had the least errors anyway [all being low, but especially those]. The temperature on all of the cards is never allowed to exceed 80C as my limit, but in practice none have ever been higher than 78C for any amount of time, and normally my hottest card is 73C (two almost mirror each other ... different models, cases, rooms (office and cement basement floor), and one doesn't exceed 68C ... just better airflow and heat sink or something). MSI Afterburner is monitoring all of them. The highest errors [which, as I said, were very low] are on my 6970, and that is the card in my primary workstation that I am using right now; it has Aero running always, a lot of processes going on, windows being moved around often [which uses 3D hardware acceleration when Aero is on], and lots of other services [with priority] running on this machine [including Windows Media Center pulling OTA HD broadcasts down, although I don't typically watch them as I usually get what I need via TiVo].

It is worth determining what a hardware error is for any given device. Ethernet cards, for instance, are loaded with them, but they correct for it, and there is also correction in the Ethernet layer protocol itself to handle retransmits as needed [i.e. due to collisions]. In the case of a GPU using OpenCL, I do not know precisely what would cause a "hardware error" as determined by Diablo. I suspect bus communication errors [data moving in or out fails a checksum or parity check ... whatever], which could be due to voltage, speed, the PCIe bus, etc. It could be true hardware faults from too much heat, too-fast processing without enough supporting power, or memory errors due to similar. Many reasons. Hardware as complicated as a video card [especially the GPU itself] will always produce errors, but whether that is a problem depends on what type of errors and how often. For instance, if hardware error checking was turned on, it may self-correct or recompute .. whatever [I am speculating; I need to look it up to be sure]. Also, with video for instance, certain types of errors result in essentially unnoticeable changes (maybe a pixel is shaded slightly off, or a vector skewed ever so slightly .. again, hypothesizing), and thus are accepted by the manufacturer (AMD in this case). Different uses have different tolerances to different types of errors. Look at your cable modem or DSL modem, and if the information is there, you will see lots of error stats like correctable errors and uncorrectable errors. The latter, in that case, are errors it had no ability to correct, and they are usually caused externally. Correctable errors are errors that may be externally caused [they usually are], but error correction protocols were able to fix the error without a resend [meaning FEC, CRC, parity, and other methods which can essentially determine the bit that is incorrect and fix it]. Error correction in the case of communications means overhead in both bandwidth usage and computational analysis of the stream [and thus some latency]. My point is only that errors cannot be eliminated, only compensated for or dismissed as acceptable, and dealt with either by redoing the work or simply ignoring the results [with video and audio, it is usually fixing it via some error correction mode, and anything that escapes that is deemed acceptable, or rather allowed to pass with the assumption that within design limits the errors are not significant ... so bad hardware is often still usable with lots of errors, and you get to suffer the effects of, say, macro blocks or pixelation, or with audio, clicks or silences, etc.].

All this writing, and all I really said is that hardware errors are normal and do occur with ALL GPUs. Excessive hardware errors are not, and the definition of excessive depends on the hardware and its intended function; I am not sure what is excessive with GPUs in general, but with hardware correction turned off by Diablo, the author indicates 3 in 1000 [shares] is just fine. I saw less than that on all GPUs.
sr. member
Activity: 406
Merit: 250
Thanks!

5850 - 316 -> 325 (~3%)
sr. member
Activity: 378
Merit: 255
The accepted share count on each miner exactly matches what is shown on Deepbit, so it isn't a big deal; but I would like to know what is causing the extra rejects reported locally only.

Interesting, may be worthy of its own thread.

As for this modification, it is being included in the latest versions of the various miners as far as I know, so you will not have to do this in the future.
sr. member
Activity: 418
Merit: 250
Diablo showed very, very few hardware errors (according to the forums, it was low .. less than 3 in 1000 for any card, and in one case only 1 in ~2000)

Hardware errors?  Any hardware error means you've pushed the boundaries of either: voltage, clock speed, or heat...  Reduce one
member
Activity: 98
Merit: 10
I am running 5850s and a 6970 (the latter of which was always slower using the phatk kernel as opposed to the poclbm kernel, until 1.50). The latest Phoenix 1.50 seems to be significantly better. I applied the #define fix, which I think is a bug fix more than an enhancement [judging by the stale percentages over the short term ... so the jury is still out, I guess], and I am getting between 1.5-2.5% additional on my 5850s (nothing too aggressive: XFX boards of standard spec, factory voltages, and 850-875 MHz core and 700 MHz memory clocks between them) and about 2.5% on my 6970 (MSI, also a standard-spec board at factory voltages, with the core clock at 920 MHz and the memory clock at 1375 MHz). Note that I use the machine with the 6970 in it for day-to-day business and all-day work via VPN/remote desktop (32-bit color full screen locally ... XP on the other end, so only one monitor), and I leave Aero on. Win7 x64 on all machines except one, which is Vista x64. Aero is off on all except my workstation with the 6970.

THERE IS A "MINOR" BUG IN 1.50, HOWEVER. I don't know if the #define fix tickled it, or if it was there before and I didn't notice it [I have only been using 1.50 for a few days anyway]. Soon after starting the miner on my 6970, a long polling push came through and it shows 1 rejected in the Phoenix console; however, Deepbit did not show any rejects on its side (0.00%). With several previous versions, the accepted/rejected values ALWAYS matched, even after days; however, there were other issues, as many have experienced. I am curious: is the reject being shown because Phoenix is now finding the occasional hardware error and counting it as a reject [without submitting anything], as opposed to simply burying it in previous versions? Another reject came along after that, also following a long polling push, and the console shows 2 rejects while Deepbit shows 0.76% stale (131 accepted and 1 stale is 1/132 = 0.76%). So, clearly one was a standard reject [stale/invalid] from the pool and the other from Phoenix itself. I just verified on the others, and one of them also has one extra reject (in fact, that is the only reject it has). Phoenix is showing more rejects than Deepbit ... for the first time in many versions.

The reason I wonder if this might be because of hardware errors is that I was playing around with DiabloMiner yesterday. I achieved great hash rates as reported by Diablo; about the same as I would get with Phoenix 1.50, or slightly better [not as good as Phoenix 1.50 with the #define fix using phatk]. Diablo showed very, very few hardware errors (according to the forums, it was low .. less than 3 in 1000 for any card, and in one case only 1 in ~2000). However, Diablo currently has a TERRIBLE problem with rejected shares (4%-8% per miner instance for me ... -v 2 -w 128 on all instances, and I think -f 10 on the 5850s and -f 30 [the default, I believe] on my 6970, since it is the machine I use for other things, and often). I was so horrified that I ditched Diablo without even a report on their thread.

Having said that about the odd "extra" reject noticed in two miner instances, I have not seen another one since on any of my miners, but they have not been running all that long either (about 35 minutes). In fact, in this time there are only 3 rejects among all my miner instances, and two of them are reported by Phoenix and not by Deepbit. The accepted share count on each miner exactly matches what is shown on Deepbit, so it isn't a big deal; but I would like to know what is causing the extra rejects reported locally only.


full member
Activity: 154
Merit: 100
Visiontek 5870s are so awesome
newbie
Activity: 56
Merit: 0
I can confirm on two machines:

Ubuntu 11.04, 11.4 SDD-SDK, pyopencl + Phoenix 1.48: card is a Visiontek HD 5870, running at 855/900, stock voltage, fan at 50%, 73 C. With the new define line: increase from 365 Mh/s to 374 Mh/s; running at 870/900: increase from 377 Mh/s to 383 Mh/s.

Windows 7 x64, 11.4 SDD-SDK, GUIMiner (Phoenix 1.48): card is a Visiontek HD 5870, running at 948.7/300, stock voltage, fan 48%, 68 C. With the new define line: increase from 416 Mh/s to 425 Mh/s.

yeah!
will donate to original poster. THANKS!

full member
Activity: 154
Merit: 100

OP correctly included the original author's donation address, so clearly he isn't trying to steal credit for it

full member
Activity: 182
Merit: 100
Win7 64

GUIMiner / phoenix phatk

5850 @ 920/300

My mhash tends to fly all over the place but from what I can gather:


Before: 365-379
After: 360-388
sr. member
Activity: 418
Merit: 250
Everyone make sure to donate to Bitless and not the address in the OP's sig!
(The OP posted Bitless's donation address, I believe.)
newbie
Activity: 28
Merit: 0
The latest poclbm exe (20110627) also has the phatk patch applied.
sr. member
Activity: 378
Merit: 255
Works also for POCLBM, you just need to edit bitcoinminer.cl and change the very same line.

PS.
Add it to the first post plx.

Done.

It's also in the latest Diablo.
sr. member
Activity: 458
Merit: 250
beast at work
Sapphire 5830 @ 1000/300 @ 1.2V @ 48 C (AC)

309-310 Mh/s -> 315-317 Mh/s
full member
Activity: 154
Merit: 100
Sharing: from 386 to 396 Mhash/sec on a 5850 XFX + Icy Vision @ 975/300/1.174V/58C.
Great work! Thanks for the info.
member
Activity: 98
Merit: 10
5830 980/600 running at 70 deg, fans 55%

produces 290 Mh ...

No change under GUIMiner after updating the kernel file ...

going back to the original file ...

good luck


Did you edit bitcoinminer.cl in the GUIMiner folder?

NO! It has a strange format and is not editable with a "normal" MS editor .. just edited the phatk kernel.cl
...

The strange format is actually the same as phatk.cl but without 'enters'. Just press CTRL+F, type the first part of the string to change, and after you find it, replace it with what you have in the first post. Works well for me (values aren't much higher, but more constant: 387 instead of 377-385). If you want, I can post the file here.
member
Activity: 84
Merit: 10
thx!
send a donation
member
Activity: 103
Merit: 10
5830 980/600 running at 70 deg, fans 55%

produces 290 Mh ...

No change under GUIMiner after updating the kernel file ...

going back to the original file ...

good luck


Did you edit bitcoinminer.cl in the GUIMiner folder?

NO! It has a strange format and is not editable with a "normal" MS editor .. just edited the phatk kernel.cl
...
member
Activity: 98
Merit: 10
Works also for POCLBM, you just need to edit bitcoinminer.cl and change the very same line.

PS.
Add it to the first post plx.
member
Activity: 98
Merit: 10
Works with poclbm + phatk.

Someone whitelist that guy out of the newbies forum! ;)
member
Activity: 98
Merit: 10
5850 Vapor
392 -> 401 Mhash (995/300), POCLBM with phatk
newbie
Activity: 32
Merit: 0
Win7/x86

5830 @ 1000/300

Went from 308 to 317 Mhash (327 Mhash @ 1033/300).
full member
Activity: 196
Merit: 100
HD 5830 - 275 Mhash -> 279 Mhash
HD 6950 - 340 Mhash -> 359 Mhash

Very nice :D
sr. member
Activity: 378
Merit: 255
I'm glad this was helpful, but be sure to send donations to the address listed, not to me.
sr. member
Activity: 402
Merit: 250
From 338.9 to 352 Mhash/s on my 6950 (only @860), a ~3.86% gain :)

Will donate a bit as well if this is stable :)
full member
Activity: 235
Merit: 100
Awesome find. Gave me 6-7 more MH from 340 with my 5850 Extreme. I'll send a few bitcents your way if this is stable.
legendary
Activity: 1855
Merit: 1016
321 - 329 Mhash/s on 6870 @ 1056/366, Win 7
sr. member
Activity: 1204
Merit: 288
It's weird: while the hash rate shows more, it didn't actually produce any more shares over the same period of time...

legendary
Activity: 1218
Merit: 1019
Thanks!
350 -> 365 on my 6950 @ 900 MHz :D
hero member
Activity: 714
Merit: 500
Dogs bark, but the caravan moves on.
About +5 Mhash on a 5770. *tip sent* Thanks!
newbie
Activity: 42
Merit: 0
thx, sent a small donation
hero member
Activity: 658
Merit: 500
5830 980/600 running at 70 deg, fans 55%

produces 290 Mh ...

No change under GUIMiner after updating the kernel file ...

going back to the original file ...

good luck


Did you edit bitcoinminer.cl in the GUIMiner folder?
member
Activity: 103
Merit: 10
5830 980/600 running at 70 deg, fans 55%

produces 290 Mh ...

No change under GUIMiner after updating the kernel file ...

going back to the original file ...

good luck

full member
Activity: 157
Merit: 100
5870: 350 to 370 MHash/sec
newbie
Activity: 18
Merit: 0
Great, thanks, works perfectly fine!

5830 (970/300): 301 -> 307 Mhash/s
full member
Activity: 126
Merit: 100
359 -> 367 MH/s on 5850! Thank you very much (small donation will be sent this evening)!
full member
Activity: 140
Merit: 100
firstbits: 1kwc1p
379 -> 382 MH/sec on 5850 (966/180)
199 -> 202 MH/sec on 5770 (933/300)
441 -> 449 MH/sec on 5870 (1006/180)

All in all, pretty good!
legendary
Activity: 1400
Merit: 1005
280 to 289 on a stock speed 5850... will be using this on all my miners, thanks!
newbie
Activity: 42
Merit: 0
sr. member
Activity: 256
Merit: 250
Nice find, thanks. Will add this to hashkill as well, if you don't mind :)
newbie
Activity: 42
Merit: 0
Yes, from ~330 to ~340 Mh/sec on an o/c 5850. Great work.
newbie
Activity: 23
Merit: 0
Great!

Got an increase of about ~370 to ~380 on both of my 5870s.

Thank you :)
hero member
Activity: 1330
Merit: 502
Vave.com - Crypto Casino
Yeah, from 355.9 to 372.7 on an unlocked 6950 @ 840 MHz, and only from 364 to 367.8 on a 5850 @ 920.

279 to 285 on a 6870 @ 950.
sr. member
Activity: 378
Merit: 255
Looks like the higher-end cards are getting about 5%.
sr. member
Activity: 1204
Merit: 288
Confirmed on Windows 7. I got an increase from 387.4 MH/s to 401.7 MH/s per GPU.
hero member
Activity: 1330
Merit: 502
Vave.com - Crypto Casino
From 355.9 to 372.7 on my 6950 on Linux.
member
Activity: 266
Merit: 10
I saw a 9 MHash/sec (6.2%) increase on one of my miners after this fix; will do it on my other one later as well. Thank you! Will be sending a small donation your way soon.
sr. member
Activity: 378
Merit: 255
Reposted from the newbie forum, posted by bitless

I just tried this and got a >3% improvement in mining speed. On my 6870, I was getting 299 MHash/sec, and now I'm getting 308 or so. The change is simple enough for anyone to do it - you don't need to be a programmer to use it.

You can go to phatk's kernel.cl file (don't worry, it just sits there in the open, no need to recompile anything), find this line
  #define Ma(x, y, z) amd_bytealign((y), (x | z), (z & x))
and change it to this line
  #define Ma(x, y, z) amd_bytealign( (z^x), (y), (x) )
Once you've done it, restart the miner.

Technically, this is 1 less instruction for the Maj function in the hash, which is called ~128 times for each nonce value, so we get +3% to mining speed. This will ONLY WORK if you're running with BFI_INT. I'm using Phoenix with the phatk kernel on Ubuntu, so YMMV, but I see no reason for this not to work with other setups. As always, do play around with aggression and other settings after you've applied the change. Deepbit seems to be accepting my shares generated this way, but it comes AS IS, without any warranty whatsoever - if it doesn't work for you, or this has been posted already, please don't blame me :)

If this helps you mine faster, please share your MHash/sec results, before and after. You can also donate to 15igh5HkCXwvvan4aiPYSYZwJZbHxGBYwB . I hear people are getting 50 BTC for things like this, and it would be nice to get some.

If you want to verify the correctness of the change, here's the truth table for the new Ma() function (a C sketch that checks it exhaustively follows at the end of this post):

x y z   Ma
0 0 0   0
1 0 0   0
0 1 0   0
1 1 0   1
0 0 1   0
1 0 1   1
0 1 1   1
1 1 1   1

Works also for POCLBM, you just need to edit bitcoinminer.cl and change the very same line.

Donate to 15igh5HkCXwvvan4aiPYSYZwJZbHxGBYwB

This is a repost from the newbie forum. https://forum.bitcoin.org/index.php?topic=22965.0;topicseen
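
For anyone who wants to check the change beyond the truth table, here is a minimal C sketch. It assumes AMD's documented BFI_INT bit-select semantics, dst = (s0 & s1) | (~s0 & s2), which is what amd_bytealign calls get patched into when BFI_INT is enabled (without that patching, amd_bytealign really does a byte-wise alignment, which is why this only works with BFI_INT). The sketch verifies exhaustively that both the old and the new argument orders reproduce Maj(x, y, z) = (x & y) ^ (x & z) ^ (y & z); the saving is that the old form needs two preparatory ops (x | z and z & x) where the new form needs one (z ^ x).

  #include <stdio.h>

  /* BFI_INT bit-select: for each bit, picks s1 where s0 is 1, else s2. */
  static unsigned bfi(unsigned s0, unsigned s1, unsigned s2)
  {
      return (s0 & s1) | (~s0 & s2);
  }

  int main(void)
  {
      for (unsigned x = 0; x <= 1; x++)
          for (unsigned y = 0; y <= 1; y++)
              for (unsigned z = 0; z <= 1; z++) {
                  unsigned maj    = (x & y) ^ (x & z) ^ (y & z);
                  unsigned old_ma = bfi(y, x | z, z & x); /* old line */
                  unsigned new_ma = bfi(z ^ x, y, x);     /* new line */
                  if (old_ma != maj || new_ma != maj) {
                      printf("mismatch at x=%u y=%u z=%u\n", x, y, z);
                      return 1;
                  }
              }
      printf("both forms equal Maj for all inputs\n");
      return 0;
  }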