Author

Topic: OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.0 - page 670. (Read 5805728 times)

Vbs
hero member
Activity: 504
Merit: 500
I had two rigs today with "dead" status on the GPUs (5850s on Win x64), with error messages saying "Failed to reinit GPU thread" and "Thread <#> no longer exists" on 2.2.1. The strange thing is that it seems to have happened at the same time on both rigs (they were mining on the same pool with different workers, and both were reported dead for 10h). Could this be related to some pool communication bug?
hero member
Activity: 518
Merit: 500
Stability testing continuing. Will post some updates soon but I need to do the benchmarking and stability testing.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/

Note also that you are having it being disabled and reading OFF. There is only one place in the code where cgminer does this itself - when it hits the thermal cutoff limit. Now it is possible there is some code convolution issue going on somehow that makes the fan not rise when one of the GPUs is being restarted. Note that your bug report shows thread 2 being idle (which would be GPU 1) and then it proceeds to disable threads 4 and 5 (being GPU 2)... Interesting... I'll audit the code, but perhaps try it without auto-fan on, setting what you know to be a safe static fan speed and see if the problem persists.

ck, I'm having the same "OFF" problem described, in rigs with 5970 and 5870:

https://bitcointalksearch.org/topic/m.727748


I have a theory as to how this might be happening now, and it involves cards that may well be getting sick occasionally, and have committed some code to the git tree for it. It would be interesting for you to run your cards with the old version overnight that doesn't have this problem, and then after it has run for an extended period, go into the GPU menu and see if any of the GPUs have been re-initialised at any time or if the "Last initialised" time is close to the main "Started" time at the top.
hero member
Activity: 807
Merit: 500
Also, on a related note, maybe your processor can get 8MH/s and the 8MH/s extra isn't even coming from the video card.  If that is the case, cutting the CPU frequency in half would lower it to a 4 MH/s gain at a still much more expensive MH/W (although realistically, even if it is the GPU getting that, running the CPU 100% at half frequency will probably still draw enough power to make it a net loss).
Are you theorizing that my CPU is explicitly doing some of the work, as if I was CPU mining, or do you mean something more nuanced than that? I didn't think my CPU would be used for explicit mining unless I had it specifically enabled to do so. I'm carrying out the U experiments that you guys suggested as we speak, btw.
Only as one (very unlikely) possibility.  I have read that the 100% cpu bug is because AMD offloads some of the work to the processor to make a game run even faster (so why couldn't it do the same thing with hashing), but I have also read that it is how AMD was making sure the processor was instantly ready when the video card finishes its task (in which case it wouldn't be doing that).
legendary
Activity: 1320
Merit: 1001
I use these flags
Code:
-I 8 --gpu-engine 950,850,830,950 --gpu-memclock 850,180,180,850  --auto-fan --temp-target 75 --temp-overheat 82

How do I check it's detecting right?  It lists in the same order as 2.1.2. --gpu-reorder worth checking?
I can't see anything wrong with that. Without --gpu-reorder it uses the same order as 2.1.2 would have detected. I guess you'd know pretty quickly if it was setting the wrong speed on the wrong device and...

[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled

looks pretty convincing. So I'm at a loss for why it should be any worse.

Note also that you are having it being disabled and reading OFF. There is only one place in the code where cgminer does this itself - when it hits the thermal cutoff limit. Now it is possible there is some code convolution issue going on somehow that makes the fan not rise when one of the GPUs is being restarted. Note that your bug report shows thread 2 being idle (which would be GPU 1) and then it proceeds to disable threads 4 and 5 (being GPU 2)... Interesting... I'll audit the code, but perhaps try it without auto-fan on, setting what you know to be a safe static fan speed and see if the problem persists.

ck, I'm having the same "OFF" problem described, in rigs with 5970 and 5870:

https://bitcointalksearch.org/topic/m.727748

legendary
Activity: 1428
Merit: 1000
The pool where donations were going was hacked and I'm considering moving all my shares to p2pool as well now. I have grave concerns about centralising work to pools and increasingly see p2pool - or something like it - as the solution for bitcoin's future strength, going back to its decentralised nature as its strength. This means I won't realistically have a way of accepting small hashrate donations with --donation that I can reasonably support. So after much angst I have decided that I will be deprecating the donations feature in upcoming versions and go back to the previous donation model of as-and-when you feel like it.

I thank those who have used the --donation feature greatly till now. It averaged around 400Mh/s over that time and at least kept me "mining" while my own mining rig was dead for over a month.

I recommend people disable --donation now and restart their miners for I don't know what will happen to hashes going to the pool during this instability (it is going offline for 24+ hours likely).


+1 for choosing p2pool!
I never used --donation anyway ;) but I did send you some BTC a few months ago.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
The pool where donations were going was hacked and I'm considering moving all my shares to p2pool as well now. I have grave concerns about centralising work to pools and increasingly see p2pool - or something like it - as the solution for bitcoin's future strength, going back to its decentralised nature as its strength. This means I won't realistically have a way of accepting small hashrate donations with --donation that I can reasonably support. So after much angst I have decided that I will be deprecating the donations feature in upcoming versions and go back to the previous donation model of as-and-when you feel like it.

I thank those who have used the --donation feature greatly till now. It averaged around 400Mh/s over that time and at least kept me "mining" while my own mining rig was dead for over a month.

I recommend people disable --donation now and restart their miners for I don't know what will happen to hashes going to the pool during this instability (it is going offline for 24+ hours likely).
donator
Activity: 798
Merit: 500
That rig runs fairly cool so I'll try overnight without auto-fan.  Here is the rest of the errors from 2.2.1 sessions - I just lowered clock speeds for the different sessions.

Code:
[2012-02-02 18:22:19] Thread 1 idle for more than 60 seconds, GPU 1 declared SICK!
[2012-02-02 18:22:19] Attempting to restart GPU
[2012-02-02 18:22:19] Thread 2 still exists, killing it off
[2012-02-02 18:22:19] Thread 3 still exists, killing it off
[2012-02-02 18:22:19] Thread 2 restarted
[2012-02-02 18:22:20] Thread 3 restarted
[2012-02-02 18:22:21] Accepted 00000000.40372a1e.d75ae9ba GPU 0 thread 0 pool 0
[2012-02-02 18:22:21] Thread 2 being disabled
[2012-02-02 18:22:21] Thread 3 being disabled

[2012-02-02 17:58:17] Thread 3 idle for more than 60 seconds, GPU 3 declared SICK!
[2012-02-02 17:58:17] Attempting to restart GPU
[2012-02-02 17:58:17] Thread 6 still exists, killing it off
[2012-02-02 17:58:17] Thread 7 still exists, killing it off
[2012-02-02 17:58:18] Thread 6 restarted
[2012-02-02 17:58:19] Thread 7 restarted

[2012-02-03 21:19:13] Thread 4 still exists, killing it off
[2012-02-03 21:19:13] Thread 5 still exists, killing it off
[2012-02-03 21:19:14] Thread 4 restarted
[2012-02-03 21:19:14] Accepted 00000000.0cd03974.6e1c40e3 GPU 0 thread 1 pool 0
[2012-02-03 21:19:14] Thread 5 restarted
[2012-02-03 21:19:15] Thread 4 being disabled
[2012-02-03 21:19:16] Thread 5 being disabled

[2012-02-03 20:47:14] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-03 20:47:14] Attempting to restart GPU
[2012-02-03 20:47:14] Thread 4 still exists, killing it off
[2012-02-03 20:47:14] Thread 5 still exists, killing it off
[2012-02-03 20:47:14] Thread 4 restarted
[2012-02-03 20:47:15] Accepted 00000000.b422fb13.d18b69ee GPU 1 thread 3 pool 0
[2012-02-03 20:47:15] Thread 5 restarted
[2012-02-03 20:47:15] Accepted 00000000.8db73352.22b376e6 GPU 0 thread 1 pool 0
[2012-02-03 20:47:16] Thread 4 being disabled
[2012-02-03 20:47:16] Thread 5 being disabled

[2012-02-04 17:29:18] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-04 17:29:18] Attempting to restart GPU
[2012-02-04 17:29:18] Thread 4 still exists, killing it off
[2012-02-04 17:29:18] Thread 5 still exists, killing it off
[2012-02-04 17:29:19] Thread 4 restarted
[2012-02-04 17:29:19] Thread 5 restarted
[2012-02-04 17:29:20] Thread 4 being disabled
[2012-02-04 17:29:20] Thread 5 being disabled
legendary
Activity: 1762
Merit: 1011
Also, on a related note, maybe your processor can get 8MH/s and the 8MH/s extra isn't even coming from the video card.  If that is the case, cutting the CPU frequency in half would lower it to a 4 MH/s gain at a still much more expensive MH/W (although realistically, even if it is the GPU getting that, running the CPU 100% at half frequency will probably still draw enough power to make it a net loss).

Are you theorizing that my CPU is explicitly doing some of the work, as if I was CPU mining, or do you mean something more nuanced than that? I didn't think my CPU would be used for explicit mining unless I had it specifically enabled to do so. I'm carrying out the U experiments that you guys suggested as we speak, btw.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
I use these flags
Code:
-I 8 --gpu-engine 950,850,830,950 --gpu-memclock 850,180,180,850  --auto-fan --temp-target 75 --temp-overheat 82

How do I check it's detecting right?  It lists in the same order as 2.1.2. --gpu-reorder worth checking?
I can't see anything wrong with that. Without --gpu-reorder it uses the same order as 2.1.2 would have detected. I guess you'd know pretty quickly if it was setting the wrong speed on the wrong device and...

[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled

looks pretty convincing. So I'm at a loss for why it should be any worse.

Note also that you are having it being disabled and reading OFF. There is only one place in the code where cgminer does this itself - when it hits the thermal cutoff limit. Now it is possible there is some code convolution issue going on somehow that makes the fan not rise when one of the GPUs is being restarted. Note that your bug report shows thread 2 being idle (which would be GPU 1) and then it proceeds to disable threads 4 and 5 (being GPU 2)... Interesting... I'll audit the code, but perhaps try it without auto-fan on, setting what you know to be a safe static fan speed and see if the problem persists.
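The thread-to-GPU numbering discussed above can be sketched in C. This is only an illustration, assuming two mining threads per GPU numbered consecutively (which matches the logs in this thread, where restarting GPU 2 kills threads 4 and 5). `gpu_for_thread` is a hypothetical helper, not a cgminer function.

```c
#include <assert.h>

/*
 * Hypothetical helper: map a mining thread number to the GPU it belongs to.
 * Assumes every GPU runs the same number of threads and that threads are
 * numbered consecutively, GPU by GPU.
 */
int gpu_for_thread(int thread_id, int threads_per_gpu)
{
    assert(thread_id >= 0 && threads_per_gpu > 0);
    return thread_id / threads_per_gpu;
}
```

With two threads per GPU, `gpu_for_thread(2, 2)` gives 1, while `gpu_for_thread(4, 2)` and `gpu_for_thread(5, 2)` both give 2, which is the mismatch pointed out in the bug report.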
donator
Activity: 798
Merit: 500
Well 2.2.1 has the dual GPU linkage going on which 2.1.2 does not. Bear in mind the auto fanspeed control now looks at the temps of both GPUs, but overall this should run things cooler rather than hotter so unless you have some specific setup, or maybe have autofan off on one of the devices or different clock settings... Also, have you checked it's actually detecting the right card in the right device and linking the right devices?

I use these flags
Code:
-I 8 --gpu-engine 950,850,830,950 --gpu-memclock 850,180,180,850  --auto-fan --temp-target 75 --temp-overheat 82

How do I check it's detecting right?  It lists in the same order as 2.1.2. --gpu-reorder worth checking?
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Using 2.2.1 on a mixed card rig
Code:
[2012-02-05 00:01:17] CL Platform vendor: Advanced Micro Devices, Inc.
[2012-02-05 00:01:17] CL Platform name: AMD Accelerated Parallel Processing
[2012-02-05 00:01:17] CL Platform version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] Failed to ADL_Overdrive5_FanSpeedInfo_Get
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] Dual GPUs detected: 2 and 1
[2012-02-05 00:01:17] 4 GPU devices detected

And the second core of the 5970 keeps showing OFF after 12-18 hrs running with this error
Code:
[2012-02-04 03:37:42] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-04 03:37:42] Attempting to restart GPU
[2012-02-04 03:37:42] Thread 4 still exists, killing it off
[2012-02-04 03:37:42] Thread 5 still exists, killing it off
[2012-02-04 03:37:42] Thread 4 restarted
[2012-02-04 03:37:43] Thread 5 restarted
[2012-02-04 03:37:44] Thread 4 being disabled
[2012-02-04 03:37:44] Thread 5 being disabled

I first thought it was clocks, but it has been running for 5 months at the same clocks. I lowered them a few times anyway but still the same error. If I go back to 2.1.2 it runs fine, 24+ hours now. ???
Well 2.2.1 has the dual GPU linkage going on which 2.1.2 does not. Bear in mind the auto fanspeed control now looks at the temps of both GPUs, but overall this should run things cooler rather than hotter so unless you have some specific setup, or maybe have autofan off on one of the devices or different clock settings... Also, have you checked it's actually detecting the right card in the right device and linking the right devices?
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Oops, I might have tried that with 2.1.2 instead of 2.2.1. I have both on the machine currently. I'll try rerunning with 2.2.1

EDIT: Ok I am an idiot, this was caused by my failure to type
export DISPLAY=:0

in the shell each time before I ran it. I mistakenly thought that command applied to the entire login session. Hope that info helps anyone else running into the same problem. Carry on. :P
You know I was actually about to say that's what usually causes it, but I doubted it would happen between runs. Seems I was wrong ;)
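For anyone scripting around this, here is a minimal C sketch of a pre-flight check that DISPLAY is set in the current shell before launching the miner. `display_is` is a hypothetical helper and not part of cgminer, and ":0" is simply the value from the post above; other setups may need a different display.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the DISPLAY environment variable is set
 * to `expected` (for example ":0"), and 0 if it is unset or different.
 */
int display_is(const char *expected)
{
    const char *display;

    assert(expected != NULL);
    display = getenv("DISPLAY");
    return display != NULL && strcmp(display, expected) == 0;
}
```

A wrapper could call `display_is(":0")` and print a reminder to run `export DISPLAY=:0` when it returns 0, since the variable has to be exported again in every new shell.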
donator
Activity: 798
Merit: 500
Using 2.2.1 on a mixed card rig
Code:
[2012-02-05 00:01:17] CL Platform vendor: Advanced Micro Devices, Inc.
[2012-02-05 00:01:17] CL Platform name: AMD Accelerated Parallel Processing
[2012-02-05 00:01:17] CL Platform version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] Failed to ADL_Overdrive5_FanSpeedInfo_Get
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] Dual GPUs detected: 2 and 1
[2012-02-05 00:01:17] 4 GPU devices detected

And the second core of the 5970 keeps showing OFF after 12-18 hrs running with this error
Code:
[2012-02-04 03:37:42] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-04 03:37:42] Attempting to restart GPU
[2012-02-04 03:37:42] Thread 4 still exists, killing it off
[2012-02-04 03:37:42] Thread 5 still exists, killing it off
[2012-02-04 03:37:42] Thread 4 restarted
[2012-02-04 03:37:43] Thread 5 restarted
[2012-02-04 03:37:44] Thread 4 being disabled
[2012-02-04 03:37:44] Thread 5 being disabled

I first thought it was clocks, but it has been running for 5 months at the same clocks. I lowered them a few times anyway but still the same error. If I go back to 2.1.2 it runs fine, 24+ hours now. ???
full member
Activity: 200
Merit: 100
Oops, I might have tried that with 2.1.2 instead of 2.2.1. I have both on the machine currently. I'll try rerunning with 2.2.1

EDIT: Ok I am an idiot, this was caused by my failure to type
export DISPLAY=:0

in the shell each time before I ran it. I mistakenly thought that command applied to the entire login session. Hope that info helps anyone else running into the same problem. Carry on. :P
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
[2012-02-04 22:53:02] Pushing ping to longpoll thread
[2012-02-04 22:53:02] ADL Initialisation Error!
[2012-02-04 22:53:02] Pushing ping to thread 0
[2012-02-04 22:53:02] Init GPU thread 0
[2012-02-04 22:53:02] List of devices:
[2012-02-04 22:53:02]   0       Cypress
[2012-02-04 22:53:02]   1       Cypress
[2012-02-04 22:53:02]   2       Cypress
[2012-02-04 22:53:02]   3       Juniper
[2012-02-04 22:53:02]   4       Cypress
[2012-02-04 22:53:02] Selected 0: Cypress
[2012-02-04 22:53:02] Preferred vector width reported 4
[2012-02-04 22:53:02] Max work group size reported 256


Only place it has a line with ADL.
Code:
	result = ADL_Main_Control_Create(ADL_Main_Memory_Alloc, 1);
	if (result != ADL_OK) {
		applog(LOG_INFO, "ADL Initialisation Error! Error %d!", result);
		return;
	}

The newer version should also show what the error is, but the main initialisation is failing. What version of cgminer are you running?

It will be one of the following.
Code:
#define ADL_OK_WAIT                      4
  /* All OK, but need to wait. */
#define ADL_OK_RESTART                   3
  /* All OK, but need restart. */
#define ADL_OK_MODE_CHANGE               2
  /* All OK but need mode change. */
#define ADL_OK_WARNING                   1
  /* All OK, but with warning. */
#define ADL_OK                           0
  /* ADL function completed successfully. */
#define ADL_ERR                         -1
  /* Generic Error. Most likely one or more of the Escape calls to the driver failed! */
#define ADL_ERR_NOT_INIT                -2
  /* ADL not initialized. */
#define ADL_ERR_INVALID_PARAM           -3
  /* One of the parameter passed is invalid. */
#define ADL_ERR_INVALID_PARAM_SIZE      -4
  /* One of the parameter size is invalid. */
#define ADL_ERR_INVALID_ADL_IDX         -5
  /* Invalid ADL index passed. */
#define ADL_ERR_INVALID_CONTROLLER_IDX  -6
  /* Invalid controller index passed. */
#define ADL_ERR_INVALID_DIPLAY_IDX      -7
  /* Invalid display index passed. */
#define ADL_ERR_NOT_SUPPORTED           -8
  /* Function not supported by the driver. */
#define ADL_ERR_NULL_POINTER            -9
  /* Null Pointer error. */
#define ADL_ERR_DISABLED_ADAPTER        -10
  /* Call can't be made due to disabled adapter. */
#define ADL_ERR_INVALID_CALLBACK        -11
  /* Invalid Callback. */
#define ADL_ERR_RESOURCE_CONFLICT       -12
  /* Display Resource conflict. */
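To make the number in the "Error %d" log line easier to read, the list above can be turned into a small lookup table. This is a rough sketch only: the values and descriptions are copied from the list in this post, while `adl_lookup`, `adl_is_error`, and `adl_describe` are hypothetical helper names that exist in neither cgminer nor the ADL SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One entry per ADL return code quoted in the list above. */
struct adl_code {
    int value;
    const char *name;
    const char *meaning;
};

static const struct adl_code adl_codes[] = {
    {   4, "ADL_OK_WAIT",                    "All OK, but need to wait." },
    {   3, "ADL_OK_RESTART",                 "All OK, but need restart." },
    {   2, "ADL_OK_MODE_CHANGE",             "All OK but need mode change." },
    {   1, "ADL_OK_WARNING",                 "All OK, but with warning." },
    {   0, "ADL_OK",                         "ADL function completed successfully." },
    {  -1, "ADL_ERR",                        "Generic Error." },
    {  -2, "ADL_ERR_NOT_INIT",               "ADL not initialized." },
    {  -3, "ADL_ERR_INVALID_PARAM",          "One of the parameter passed is invalid." },
    {  -4, "ADL_ERR_INVALID_PARAM_SIZE",     "One of the parameter size is invalid." },
    {  -5, "ADL_ERR_INVALID_ADL_IDX",        "Invalid ADL index passed." },
    {  -6, "ADL_ERR_INVALID_CONTROLLER_IDX", "Invalid controller index passed." },
    {  -7, "ADL_ERR_INVALID_DIPLAY_IDX",     "Invalid display index passed." },
    {  -8, "ADL_ERR_NOT_SUPPORTED",          "Function not supported by the driver." },
    {  -9, "ADL_ERR_NULL_POINTER",           "Null Pointer error." },
    { -10, "ADL_ERR_DISABLED_ADAPTER",       "Call can't be made due to disabled adapter." },
    { -11, "ADL_ERR_INVALID_CALLBACK",       "Invalid Callback." },
    { -12, "ADL_ERR_RESOURCE_CONFLICT",      "Display Resource conflict." },
};

/* Hypothetical helper: find the table entry for a result, or NULL if unknown. */
const struct adl_code *adl_lookup(int result)
{
    size_t i;

    for (i = 0; i < sizeof(adl_codes) / sizeof(adl_codes[0]); i++) {
        if (adl_codes[i].value == result)
            return &adl_codes[i];
    }
    return NULL;
}

/* Negative values are the ADL_ERR_* codes; zero and positive are "OK" variants. */
int adl_is_error(int result)
{
    return result < 0;
}

/* Write a one-line description of result into buf; returns what snprintf returns. */
int adl_describe(int result, char *buf, size_t len)
{
    const struct adl_code *code = adl_lookup(result);

    assert(buf != NULL && len > 0);
    if (code == NULL)
        return snprintf(buf, len, "unknown ADL result %d", result);
    return snprintf(buf, len, "%s (%d): %s", code->name, result, code->meaning);
}
```

Note that the snippet quoted earlier in this post treats anything other than ADL_OK as a failure of the main initialisation, so `adl_is_error` is looser than that check and is only meant for reading logged values.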
full member
Activity: 200
Merit: 100
[2012-02-04 22:53:02] Pushing ping to longpoll thread
[2012-02-04 22:53:02] ADL Initialisation Error!
[2012-02-04 22:53:02] Pushing ping to thread 0
[2012-02-04 22:53:02] Init GPU thread 0
[2012-02-04 22:53:02] List of devices:
[2012-02-04 22:53:02]   0       Cypress
[2012-02-04 22:53:02]   1       Cypress
[2012-02-04 22:53:02]   2       Cypress
[2012-02-04 22:53:02]   3       Juniper
[2012-02-04 22:53:02]   4       Cypress
[2012-02-04 22:53:02] Selected 0: Cypress
[2012-02-04 22:53:02] Preferred vector width reported 4
[2012-02-04 22:53:02] Max work group size reported 256


Only place it has a line with ADL.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Allegedly 2.4 and 2.5 do need more memory speed for some strange reason.
BTW, any ideas where I'd start looking to see why cgminer stops detecting fan rpm speeds/gpu temps after it has been previously run? Only way to get it to detect them again is to reboot the machine.
It's at a driver level that this is failing. I have no idea why it would change its mind to not support it on future runs. When it fails, if you start it with debugging enabled you will get what the ADL error is, but that still won't give an answer as to how to fix it. Look for something like "ADL Initialisation Error!", or any error with ADL in it.
full member
Activity: 200
Merit: 100
Allegedly 2.4 and 2.5 do need more memory speed for some strange reason.
BTW, any ideas where I'd start looking to see why cgminer stops detecting fan rpm speeds/gpu temps after it has been previously run? Only way to get it to detect them again is to reboot the machine.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
what SDK are you using and what speeds are you getting with 950/300 ?

v2.4 sdk. That's probably the issue, thought I had 2.1 installed. I'll see if I can remove 2.4 and install 2.1, hopefully that will make a difference. Currently getting ~430mh at 950/300 setting, that drops to 425 when going to 950/180.
Allegedly 2.4 and 2.5 do need more memory speed for some strange reason.