Antminer S3 batch 6 overclocking - page 2.

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 08:35:32 PM

Quote from: canford on December 21, 2014, 08:24:56 PM

On the one test unit, I will now try 250/0750, hardware reset then software reset. Will see what happens! One thing I am wondering is if we are triggering an internal chip overheat of some kind, that throttles it back but doesn't report a bunch of failures. So we try to push them harder, and they actually go slower. But I'm just guessing at this point.

It's your rig, but I'd try 275/0800 first. Simply enter the voltage, save and apply, then do a power cycle.
As for pushing them harder, whether they are going to throttle back without showing any outward sign ..... that is the question.

Sorry, missed this post. I will try 275/0800 after we see what happens with 250/0750. My procedure lately is to put in freq/volt, save & apply, power cycle, then system/reboot. I added on the last step after just a power cycle would result in bad stats in Miner Status. Have you tried OC on an S3+ with factory thermal paste? The six I am working with are untouched. So if it is a chip-level thermal problem that is not reported to the sw, then I guess that could explain the difference here.

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 08:40:55 PM

New test run started: 1024 firmware, 250/0750, queue left at stock, cgminer left at stock. Only change to stock is the additional options in cgminer.lua, which do not change the 250 values.

OK, I need to catch up on my sleep ..... it's getting to 2am where I am, so I'll pick up from where you got to in several hours' time.

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 08:30:04 PM

Quote from: canford on December 21, 2014, 08:24:56 PM

.... But I would also expect the symptoms you describe with an overheat, and I'm not seeing them. No increasing temps, no "x"s on the chips, HW errors still negligible. Fan speeds are dropping, now 1800-1900, down from 2200-2300. So the unit is getting cooler as it slows down. .... ...

There, my friend, is the sign you are looking for!

Fan speeds dropping ... unit is getting cooler

That means there are some chips that have gone offline! EDIT: Either that, or the chips are not getting work ..... now you see how the queue argument was fomented in my mind?

Does it? Why are all chips still reporting "o" and not "-" or "x"? That is why I'm wondering about the internal throttling. There should be plenty of work, I have tried leaving queue at default as well as deleting queue parameter completely. I have not tried --queue 1024 yet though. But I saw the same result with two different pools and with and without deletion of the parameter.

New test run started: 1024 firmware, 250/0750, queue left at stock, cgminer left at stock. Only change to stock is the additional options in cgminer.lua, which do not change the 250 values.

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 08:24:56 PM

On the one test unit, I will now try 250/0750, hardware reset then software reset. Will see what happens! One thing I am wondering is if we are triggering an internal chip overheat of some kind, that throttles it back but doesn't report a bunch of failures. So we try to push them harder, and they actually go slower. But I'm just guessing at this point.

It's your rig, but I'd try 275/0800 first. Simply enter the voltage, save and apply, then do a power cycle.
As for pushing them harder, whether they are going to throttle back without showing any outward sign ..... that is the question.

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 08:24:56 PM

.... But I would also expect the symptoms you describe with an overheat, and I'm not seeing them. No increasing temps, no "x"s on the chips, HW errors still negligible. Fan speeds are dropping, now 1800-1900, down from 2200-2300. So the unit is getting cooler as it slows down. .... ...

There, my friend, is the sign you are looking for!

Fan speeds dropping ... unit is getting cooler

That means there are some chips that have gone offline! EDIT: Either that, or the chips are not getting work ..... now you see how the queue argument was fomented in my mind?

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 08:11:33 PM

And yes, that is more like a death spiral, too damn right! I'd have expected it to be flashing x's all over + an increase in HW errors at this point, but then again the new binaries support that --bitmain-hwerror option (or such like) that I have never gotten my head around! I think it's time to try the 0800 setting .... I am convinced it is a heat problem your rigs are encountering, given how consistent the drop-off is, so reducing the voltage may help (but may need a power cycle; in fact I'd say do one, even though I do all my tests initially without one).

OK, I'm going to call it a death spiral now. 98 minutes into the run, 15m hashrate of 375 at the pool, miner reports average dropped to 495. All chips still "o", temps 41/41, and a total of 5 HW errors. But I would also expect the symptoms you describe with an overheat, and I'm not seeing them. No increasing temps, no "x"s on the chips, HW errors still negligible. Fan speeds are dropping, now 1800-1900, down from 2200-2300. So the unit is getting cooler as it slows down. I saw the same behavior on all six units, so we are missing something here.

On the one test unit, I will now try 250/0750, hardware reset then software reset. Will see what happens! One thing I am wondering is if we are triggering an internal chip overheat of some kind, that throttles it back but doesn't report a bunch of failures. So we try to push them harder, and they actually go slower. But I'm just guessing at this point.
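One way to see how steep the drop really is: if the average shown in Miner Status is a running average since the start of the run (an assumption, the thread does not say so), then two readings are enough to back out the rate over the interval between them. The sketch below does that for the 539 GH/s reported at the one-hour mark and the 495 GH/s at 98 minutes, assuming both readings come from the same run. `interval_rate` is a name made up for this example.

```python
def interval_rate(t1_min: float, avg1: float, t2_min: float, avg2: float) -> float:
    """Mean rate between t1 and t2, given cumulative averages since start.

    Assumes avg1 and avg2 are running averages over [0, t1] and [0, t2];
    that is an assumption about how the miner reports its average.
    """
    if t2_min <= t1_min:
        raise ValueError("t2_min must be later than t1_min")
    # Work done up to each reading is (average * elapsed time); the
    # difference is the work done in the interval between the readings.
    return (avg2 * t2_min - avg1 * t1_min) / (t2_min - t1_min)


# Readings quoted in the thread: 539 GH/s at 60 min, 495 GH/s at 98 min.
recent = interval_rate(60, 539, 98, 495)
print(f"{recent:.1f} GH/s over the last 38 minutes")
```

Under those assumptions the unit averaged roughly 425 GH/s over the last 38 minutes, well below the 495 it displays and moving in the direction of the 375 seen at the pool.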

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 08:01:22 PM

Here's where that came from, ckolivas's recommendation when he released the S3 binary I am currently using on the OC unit:

Quote from: -ck on October 19, 2014, 06:17:02 PM

Here's an updated S3 binary.

http://ck.kolivas.org/apps/cgminer/antminer/s3/4.6.1-141020/cgminer

Recommended if you're mining on p2pool for the default binary actually discards stale shares which you should never do, especially on p2pool. Also includes changes to queuing and memory usage that were necessary on S4 but probably only of minor benefit here. Recommend you edit the cgminer startup script to remove the --queue value entirely, and add --lowmem. Performance should be pretty much unchanged.

Death spiral on the OC unit definitive now, average down to 523 GH/s. All chips good with "o", still just the 3 HW errors, temps 42/38. This would bother me less if I had some explanation for what is happening.

On the queue, I'll take ckolivas' word over what I may recall ...

And yes, that is more like a death spiral, too damn right! I'd have expected it to be flashing x's all over + an increase in HW errors at this point, but then again the new binaries support that --bitmain-hwerror option (or such like) that I have never gotten my head around! I think it's time to try the 0800 setting .... I am convinced it is a heat problem your rigs are encountering, given how consistent the drop-off is, so reducing the voltage may help (but may need a power cycle; in fact I'd say do one, even though I do all my tests initially without one).

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 07:42:37 PM

Quote from: canford on December 21, 2014, 07:13:12 PM

4. If you've recently started the rig, could you post the cgminer startup string (usually logged in system log ...?)?
From the process tab:
cgminer --bitmain-options 115200:32:8:14:275:0a82 -o stratum+tcp://us1.ghash.io:3333 -O USER.WORKER:Any -o stratum+tcp://stratum.mining.eligius.st:3334 -O ADDR_WORKER:Any -o stratum+tcp://mint.bitminter.com:3333 -O USER_WORKER:PASS --bitmain-nobeeper --api-listen --api-network --bitmain-checkn2diff --bitmain-hwerror --version-file /usr/bin/compile_time --lowmem

I remember reading somewhere that it is not a good idea to run without a queue; it is better to use a shorter one than the 2048 that ships. I run mine with --queue 1024, so you could possibly try --queue 512 if you are averse to having a long one. (I am not conversant with the why's of this, so please don't ask!)

Here's where that came from, ckolivas's recommendation when he released the S3 binary I am currently using on the OC unit:

Quote from: -ck on October 19, 2014, 06:17:02 PM

Here's an updated S3 binary.

http://ck.kolivas.org/apps/cgminer/antminer/s3/4.6.1-141020/cgminer

Recommended if you're mining on p2pool for the default binary actually discards stale shares which you should never do, especially on p2pool. Also includes changes to queuing and memory usage that were necessary on S4 but probably only of minor benefit here. Recommend you edit the cgminer startup script to remove the --queue value entirely, and add --lowmem. Performance should be pretty much unchanged.

Death spiral on the OC unit definitive now, average down to 523 GH/s. All chips good with "o", still just the 3 HW errors, temps 42/38. This would bother me less if I had some explanation for what is happening.
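For reference, the change ckolivas recommends (drop the --queue value, add --lowmem) amounts to a small edit of the argument list. The Python sketch below shows only that transformation on a command string; `apply_ck_recommendation` is a made-up name, the sample command is shortened for illustration, and where the startup script lives or how it assembles its arguments is not covered here.

```python
import shlex


def apply_ck_recommendation(cmdline: str) -> str:
    """Remove any --queue setting and make sure --lowmem is present."""
    out = []
    skip_next = False
    for arg in shlex.split(cmdline):
        if skip_next:
            # This token is the value that followed a bare "--queue".
            skip_next = False
            continue
        if arg == "--queue":
            skip_next = True
            continue
        if arg.startswith("--queue="):
            # Handled defensively in case the value is attached with "=".
            continue
        out.append(arg)
    if "--lowmem" not in out:
        out.append("--lowmem")
    return shlex.join(out)


# Illustrative, shortened command line (not the full string from the thread).
before = "cgminer --bitmain-options 115200:32:8:14:275:0a82 --queue 2048 --api-listen"
print(apply_ck_recommendation(before))
```

Running it a second time on its own output leaves the string unchanged, which is a quick way to check the edit has already been applied.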

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 07:53:14 PM

Quote from: pekatete on December 21, 2014, 07:47:53 PM

Quote from: canford on December 21, 2014, 07:13:12 PM

2. What is the HW %
0 (zero). Last try it got an occasional HW error, but they were minor. One of the six units got bogged down in HW errors, but then performed well when I dropped the frequency to 268.

I assume that when your avg hash-rate falls you restart(ed) immediately, so seeing you have no HW errors, could you let it run for at least an hour after that and see whether you get any HW errors and / or x's on the chips?
I also noticed that the 5s hash-rate fluctuates whenever a new block is started / a block is about to end, but soon builds up, so that may be something you have to bear in mind (assuming your hash-rate recovers!).

Unit hashrate dropping sharply now after the 1 hour run, average down to 533. HW errors now 3 for 0.0093%.

If hashrate keeps falling, I will report the numbers and try again with 0800.

I'm only going to try this with the one unit, the improvement was so dramatic the last time that when it was solid for an hour, I upgraded all 6. But I have yet to have an OC last more than 90 minutes or so.

That's fair enough, if it is falling that quickly into its uptime, and now registering errors, I expect you'll soon enough see x's on chips too, and if that happened would explain the consistent hash-rate drop-off very well ...... temp build-up!

EDIT: You'll probably have a better run with the 0800 setting .... but see this out for now to get a definitive result.

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 07:47:53 PM

Quote from: canford on December 21, 2014, 07:13:12 PM

2. What is the HW %
0 (zero). Last try it got an occasional HW error, but they were minor. One of the six units got bogged down in HW errors, but then performed well when I dropped the frequency to 268.

I assume that when your avg hash-rate falls you restart(ed) immediately, so seeing you have no HW errors, could you let it run for at least an hour after that and see whether you get any HW errors and / or x's on the chips?
I also noticed that the 5s hash-rate fluctuates whenever a new block is started / a block is about to end, but soon builds up, so that may be something you have to bear in mind (assuming your hash-rate recovers!).

Unit hashrate dropping sharply now after the 1 hour run, average down to 533. HW errors now 3 for 0.0093%.

If hashrate keeps falling, I will report the numbers and try again with 0800.

I'm only going to try this with the one unit, the improvement was so dramatic the last time that when it was solid for an hour, I upgraded all 6. But I have yet to have an OC last more than 90 minutes or so.
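To put the "3 for 0.0093%" figure in context, the percentage can be turned back into the size of the count it was measured against. What exactly the miner counts in that denominator is not stated in the thread, so the sketch below treats it as an opaque total; both function names are made up for this example.

```python
def hw_error_percent(hw_errors: int, total_checked: float) -> float:
    """HW error rate as a percentage of whatever total the miner counts."""
    return 100.0 * hw_errors / total_checked


def implied_total(hw_errors: int, percent: float) -> float:
    """Back out the total that a reported error count and percentage imply."""
    return hw_errors / (percent / 100.0)


# Figures quoted in the post above: 3 HW errors reported as 0.0093%.
total = implied_total(3, 0.0093)
print(round(total))
print(round(hw_error_percent(3, total), 4))
```

So the reported rate corresponds to 3 errors against a little over 32,000 counted results, which is why a handful of errors barely moves the percentage this early in a run.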

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 07:21:37 PM

Did you try 275/0800 before 275/0815?

Not this time, just went with 0815. 1 hour in now, showing average 539 GH/s. You can see the slowdown starting though, average is decreasing, and temps now 42/38. I am consistently getting 1 hour of improved performance, then the slowdown starts.
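Since the slowdown only shows up after an hour or so, it may help to log the numbers instead of watching the web UI. The startup string in this thread includes --api-listen and --api-network, so cgminer's API should be reachable over TCP (4028 is cgminer's default API port). The sketch below is a minimal Python client for the `summary` command. The key names picked out of the reply ("Elapsed", "GHS av", "Hardware Errors") are assumptions about what this particular build reports and may differ, and the function names are made up for this example.

```python
import json
import socket


def decode_reply(raw: bytes) -> dict:
    """Decode a JSON API reply, dropping any trailing NUL terminator."""
    return json.loads(raw.rstrip(b"\x00").decode("ascii", errors="replace"))


def query_cgminer(host: str, command: str = "summary",
                  port: int = 4028, timeout: float = 5.0) -> dict:
    """Send one JSON command to the cgminer API and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(json.dumps({"command": command}).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the miner closes the connection after replying
                break
            chunks.append(data)
    return decode_reply(b"".join(chunks))


def pick_stats(reply: dict,
               keys=("Elapsed", "GHS av", "Hardware Errors")) -> dict:
    """Pull a few fields from the SUMMARY section; missing keys become None."""
    sections = reply.get("SUMMARY") or [{}]
    return {key: sections[0].get(key) for key in keys}
```

Calling `pick_stats(query_cgminer("192.168.1.50"))` every few minutes (the address is a placeholder) and appending the result to a file would give a record of when the average starts to fall.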

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 07:13:12 PM

2. What is the HW %
0 (zero). Last try it got an occasional HW error, but they were minor. One of the six units got bogged down in HW errors, but then performed well when I dropped the frequency to 268.

I assume that when your avg hash-rate falls you restart(ed) immediately, so seeing you have no HW errors, could you let it run for at least an hour after that and see whether you get any HW errors and / or x's on the chips?
I also noticed that the 5s hash-rate fluctuates whenever a new block is started / a block is about to end, but soon builds up, so that may be something you have to bear in mind (assuming your hash-rate recovers!).

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 07:13:12 PM

4. If you've recently started the rig, could you post the cgminer startup string (usually logged in system log ...?)?
From the process tab:
cgminer --bitmain-options 115200:32:8:14:275:0a82 -o stratum+tcp://us1.ghash.io:3333 -O USER.WORKER:Any -o stratum+tcp://stratum.mining.eligius.st:3334 -O ADDR_WORKER:Any -o stratum+tcp://mint.bitminter.com:3333 -O USER_WORKER:PASS --bitmain-nobeeper --api-listen --api-network --bitmain-checkn2diff --bitmain-hwerror --version-file /usr/bin/compile_time --lowmem

I remember reading somewhere that it is not a good idea to run without a queue; it is better to use a shorter one than the 2048 that ships. I run mine with --queue 1024, so you could possibly try --queue 512 if you are averse to having a long one. (I am not conversant with the why's of this, so please don't ask!)

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 07:13:12 PM

3. What freq are you running at?
275/0815 target, last night's test I ran five that way, then the one I dropped to 268/0815.

Did you try 275/0800 before 275/0815?
The datasheet states 2 voltages for that freq, 0800 and 0850. You should try 0800 first and increment if you get LOTS of HW errors or a very low hash-rate. For freq 275, you should have hit at least 550 GH/s (avg) in the hour IF your HW % is low and no x's showing. EDIT: That said, the 538 you are getting with no HW errors is not bad either, but I would expect your rig to give at least 550.

PS. I think you'll have to start one machine at a time .... also, I'll address the other bits as and when ...
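The "at least 550 at freq 275" expectation works out to 2 GH/s per MHz. As a rough sketch, assuming hashrate scales linearly with frequency at that ratio (an extrapolation from this one figure, not something the thread states), the shortfall of an observed average can be computed like this; the names are made up for this example.

```python
# Ratio implied by "freq 275 -> at least 550 GH/s"; treating it as a
# constant for other frequencies is an assumption.
GHS_PER_MHZ = 550 / 275


def expected_ghs(freq_mhz: float) -> float:
    """Expected average hashrate under the assumed linear scaling."""
    return freq_mhz * GHS_PER_MHZ


def shortfall_percent(observed_ghs: float, expected: float) -> float:
    """How far the observed average sits below the expected one, in percent."""
    return 100.0 * (expected - observed_ghs) / expected


target = expected_ghs(275)
print(target)
print(round(shortfall_percent(538, target), 2))
```

By that measure the 538 GH/s reading is only a couple of percent short of the 550 target, which fits the "not bad either" comment.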

canford

member

Activity: 89

Merit: 10

Quote from: pekatete on December 21, 2014, 06:57:12 PM

Quote from: canford on December 21, 2014, 06:37:07 PM

I am using the 1024 firmware, the only change on top of that is adding the new frequencies into cgminer.lua. All six units stable again at the lower frequencies with no voltage setting.

Trying again on one unit, this time with ckolivas's S3 cgminer 4.6.1 build, also deleting --queue and adding --lowmem. I can't think of anything else to try. The S3's are in a ventilated space, so the ambient temperature over the three hour period is constant. Why would I get great results for 1+ hours, but poor results over 3 hours?

Possibly easier to simply post a shot of your web UI, but I'll ask just in case:
1. What temps do you have on that rig?
Right now showing 43 & 42, 25 minutes in (and reporting 538 GH/s avg); I don't think it got above 44 the last time I tried it. Ambient temp in the room 18C.

2. What is the HW %
0 (zero). Last try it got an occasional HW error, but they were minor. One of the six units got bogged down in HW errors, but then performed well when I dropped the frequency to 268.

3. What freq are you running at?
275/0815 target, last night's test I ran five that way, then the one I dropped to 268/0815.

4. If you've recently started the rig, could you post the cgminer startup string (usually logged in system log ...?)?
From the process tab:
cgminer --bitmain-options 115200:32:8:14:275:0a82 -o stratum+tcp://us1.ghash.io:3333 -O USER.WORKER:Any -o stratum+tcp://stratum.mining.eligius.st:3334 -O ADDR_WORKER:Any -o stratum+tcp://mint.bitminter.com:3333 -O USER_WORKER:PASS --bitmain-nobeeper --api-listen --api-network --bitmain-checkn2diff --bitmain-hwerror --version-file /usr/bin/compile_time --lowmem

5. Finally, could you also post the freq setting line that you are running from the cgminer.lua file?
pb:value("14:275:0a82", translate("275M"))

Thanks for the help! Answers above in bold.
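Worth noting when comparing answers 4 and 5: the last three colon-separated fields of the --bitmain-options value are the same string as the cgminer.lua entry (14:275:0a82). A minimal Python sketch that splits the option string is below. The field names are assumed labels for illustration only; the thread confirms just the frequency (275) and the match with the lua value.

```python
# Assumed labels for the six fields; not confirmed anywhere in this thread.
ASSUMED_FIELDS = ("baud", "miner_count", "asic_count",
                  "timeout", "freq_mhz", "reg_data")


def parse_bitmain_options(value: str) -> dict:
    """Split a --bitmain-options value into labelled string fields."""
    parts = value.split(":")
    if len(parts) != len(ASSUMED_FIELDS):
        raise ValueError(
            f"expected {len(ASSUMED_FIELDS)} fields, got {len(parts)}")
    return dict(zip(ASSUMED_FIELDS, parts))


def lua_value(value: str) -> str:
    """Last three fields, in the form used by pb:value(...) in cgminer.lua."""
    return ":".join(value.split(":")[-3:])


opts = parse_bitmain_options("115200:32:8:14:275:0a82")
print(opts["freq_mhz"])
print(lua_value("115200:32:8:14:275:0a82"))
```

If the two ever disagree after a save & apply, the running cgminer was probably not restarted with the new setting.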

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 06:37:07 PM

I am using the 1024 firmware, the only change on top of that is adding the new frequencies into cgminer.lua. All six units stable again at the lower frequencies with no voltage setting.

Trying again on one unit, this time with ckolivas's S3 cgminer 4.6.1 build, also deleting --queue and adding --lowmem. I can't think of anything else to try. The S3's are in a ventilated space, so the ambient temperature over the three hour period is constant. Why would I get great results for 1+ hours, but poor results over 3 hours?

Possibly easier to simply post a shot of your web UI, but I'll ask just in case:
1. What temps do you have on that rig?
2. What is the HW %
3. What freq are you running at?
4. If you've recently started the rig, could you post the cgminer startup string (usually logged in system log ...?)?
5. Finally, could you also post the freq setting line that you are running from the cgminer.lua file?

canford

member

Activity: 89

Merit: 10

Quote from: canford on December 21, 2014, 03:55:29 PM

Quote from: IITravel01 on December 21, 2014, 12:33:39 PM

Quote from: canford on December 21, 2014, 05:18:31 AM

I continue to find this extremely frustrating. I am running 6 S3+ units, each supplied by a 750-800W power supply with a single 12V rail and all four PCI-E connectors connected. I tried the 275/0815 settings, and exactly the same thing happened as the last time. Hour 1: fantastic, all units over 500 GH/s, holy crap why did I not do this earlier. Hour 2: OK, but hashrates down both as reported by units and at the pool. Hour 3: terrible, all hashrates reported by units and by pool way below stock results. So I have yet again reverted everything to 225/231/243/237/225/237 with voltage left blank.

I do not understand the mechanism here. If it is heat, why does it take 2 hours to show up? With the OC settings, all chips show up as "o" and working. But the hashrate results speak for themselves. But why the very prominent jump in hashrate for the first hour, but then utter crap out in hour 3? Can anybody explain this?

Which firmware are you using?

I am using the 1024 firmware, the only change on top of that is adding the new frequencies into cgminer.lua. All six units stable again at the lower frequencies with no voltage setting.

Trying again on one unit, this time with ckolivas's S3 cgminer 4.6.1 build, also deleting --queue and adding --lowmem. I can't think of anything else to try. The S3's are in a ventilated space, so the ambient temperature over the three hour period is constant. Why would I get great results for 1+ hours, but poor results over 3 hours?

canford

member

Activity: 89

Merit: 10

Quote from: IITravel01 on December 21, 2014, 12:33:39 PM

Quote from: canford on December 21, 2014, 05:18:31 AM

I continue to find this extremely frustrating. I am running 6 S3+ units, each supplied by a 750-800W power supply with a single 12V rail and all four PCI-E connectors connected. I tried the 275/0815 settings, and exactly the same thing happened as the last time. Hour 1: fantastic, all units over 500 GH/s, holy crap why did I not do this earlier. Hour 2: OK, but hashrates down both as reported by units and at the pool. Hour 3: terrible, all hashrates reported by units and by pool way below stock results. So I have yet again reverted everything to 225/231/243/237/225/237 with voltage left blank.

I do not understand the mechanism here. If it is heat, why does it take 2 hours to show up? With the OC settings, all chips show up as "o" and working. But the hashrate results speak for themselves. But why the very prominent jump in hashrate for the first hour, but then utter crap out in hour 3? Can anybody explain this?

Which firmware are you using?

I am using the 1024 firmware, the only change on top of that is adding the new frequencies into cgminer.lua. All six units stable again at the lower frequencies with no voltage setting.

IITravel01

sr. member

Activity: 338

Merit: 250

Quote from: canford on December 21, 2014, 05:18:31 AM

I continue to find this extremely frustrating. I am running 6 S3+ units, each supplied by a 750-800W power supply with a single 12V rail and all four PCI-E connectors connected. I tried the 275/0815 settings, and exactly the same thing happened as the last time. Hour 1: fantastic, all units over 500 GH/s, holy crap why did I not do this earlier. Hour 2: OK, but hashrates down both as reported by units and at the pool. Hour 3: terrible, all hashrates reported by units and by pool way below stock results. So I have yet again reverted everything to 225/231/243/237/225/237 with voltage left blank.

I do not understand the mechanism here. If it is heat, why does it take 2 hours to show up? With the OC settings, all chips show up as "o" and working. But the hashrate results speak for themselves. But why the very prominent jump in hashrate for the first hour, but then utter crap out in hour 3? Can anybody explain this?

Which firmware are you using?

pekatete

hero member

Activity: 518

Merit: 500

Quote from: canford on December 21, 2014, 05:18:31 AM

I continue to find this extremely frustrating. I am running 6 S3+ units, each supplied by a 750-800W power supply with a single 12V rail and all four PCI-E connectors connected. I tried the 275/0815 settings, and exactly the same thing happened as the last time. Hour 1: fantastic, all units over 500 GH/s, holy crap why did I not do this earlier. Hour 2: OK, but hashrates down both as reported by units and at the pool. Hour 3: terrible, all hashrates reported by units and by pool way below stock results. So I have yet again reverted everything to 225/231/243/237/225/237 with voltage left blank.

I do not understand the mechanism here. If it is heat, why does it take 2 hours to show up? With the OC settings, all chips show up as "o" and working. But the hashrate results speak for themselves. But why the very prominent jump in hashrate for the first hour, but then utter crap out in hour 3? Can anybody explain this?

Just a few pointers that may help:
1. The datasheet lists two voltage settings for the freq of 275, 0800 and 0850, so you can try anything within / around that range.
2. Hash rate falling in a few hours is NOT the "end of the day". Overall hash-rate is what matters in the long run (I usually let it run for 12 - 24hrs). If you are not getting X's on chips when the hash-rate is falling, I'd let it run as long as this does NOT coincide with a sustained increase in HW errors.
3. Do not be spooked by a fairly high HW error % at the start, and here I mean if the rate is less than or around 0.01 in the first few minutes to an hour, let it run.

S3 variants are the same but handle differently, so you may have to try different voltage settings (even freq settings). For example, the 262.5 freq does not ship from bitmain, but it turns out to be a sweet spot for most S3 variants!
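The pointers above can be read as a simple keep-running check. The Python sketch below is one interpretation, not a rule from the post: it stops on any "x" in the chip status line, and otherwise stops only if the HW error % is above the tolerated 0.01 and has risen across several consecutive samples. What counts as "sustained" (three rises in a row here) is an assumption, as are the function names.

```python
def sustained_increase(samples: list, window: int = 3) -> bool:
    """True if the last `window` steps between samples were all increases."""
    recent = samples[-(window + 1):]
    if len(recent) < window + 1:
        return False  # not enough history to call it sustained
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def keep_running(chip_status: str, hw_percent_samples: list,
                 tolerated_percent: float = 0.01) -> bool:
    """Decide whether to let an overclock test continue."""
    if "x" in chip_status.lower():
        return False  # failed chips: stop and revisit the settings
    latest = hw_percent_samples[-1] if hw_percent_samples else 0.0
    if latest > tolerated_percent and sustained_increase(hw_percent_samples):
        return False  # HW errors climbing steadily beyond the tolerated level
    return True


# All chips "o" and HW % just under 0.01, as reported earlier in the thread.
print(keep_running("oooooooo", [0.0, 0.0093]))
```

A falling hashrate on its own does not trip this check, which matches the advice to judge the overall rate over 12 to 24 hours.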
