OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.0 - page 670.

Vbs


I had two rigs today with "dead" status on the GPUs (5850s on Win x64), with error messages saying "Failed to reinit GPU thread" and "Thread <#> no longer exists" on 2.2.1. The strange thing is that it seems to have happened at the same time on both rigs (they were mining on the same pool with different workers, and both were reported dead for 10h). Could this be related to some pool communication bug?

bulanula


Stability testing continuing. Will post some updates soon but I need to do the benchmarking and stability testing.

-ck


Quote from: Peao on February 05, 2012, 06:45:12 AM

Quote from: -ck on February 05, 2012, 12:52:25 AM

Note also that you are having it being disabled and reading OFF. There is only one place in the code where cgminer does this itself - when it hits the thermal cutoff limit. Now it is possible there is some code convolution issue going on somehow that makes the fan not rise when one of the GPUs is being restarted. Note that your bug report shows thread 2 being idle (which would be GPU 1) and then it proceeds to disable threads 4 and 5 (being GPU 2)... Interesting... I'll audit the code, but perhaps try it without auto-fan on, setting what you know to be a safe static fan speed and see if the problem persists.

ck, I'm having the same "OFF" problem described, in rigs with 5970 and 5870:

https://bitcointalksearch.org/topic/m.727748

I have a theory as to how this might be happening now, and it involves cards that may well be getting sick occasionally, and have committed some code to the git tree for it. It would be interesting for you to run your cards with the old version overnight that doesn't have this problem, and then after it has run for an extended period, go into the GPU menu and see if any of the GPUs have been re-initialised at any time or if the "Last initialised" time is close to the main "Started" time at the top.

The00Dustin


Quote from: mmortal03 on February 05, 2012, 01:05:01 AM

Quote from: The00Dustin on February 02, 2012, 05:42:30 AM

Also, on a related note, maybe your processor can get 8MH/s and the 8MH/s extra isn't even coming from the video card. If that is the case, cutting the CPU frequency in half would lower it to a 4 MH/s gain at a still much more expensive MH/W (although realistically, even if it is the GPU getting that, running the CPU 100% at half frequency will probably still draw enough power to make it a net loss).

Are you theorizing that my CPU is explicitly doing some of the work, as if I was CPU mining, or do you mean something more nuanced than that? I didn't think my CPU would be used for explicit mining unless I had it specifically enabled to do so. I'm carrying out the U experiments that you guys suggested as we speak, btw.

Only as one (very unlikely) possibility. I have read that the 100% cpu bug is because AMD offloads some of the work to the processor to make a game run even faster (so why couldn't it do the same thing with hashing), but I have also read that it is how AMD was making sure the processor was instantly ready when the video card finishes its task (in which case it wouldn't be doing that).

Peao


Quote from: -ck on February 05, 2012, 12:52:25 AM

Quote from: gnar1ta$ on February 05, 2012, 12:35:17 AM

I use these flags

Code:

-I 8 --gpu-engine 950,850,830,950 --gpu-memclock 850,180,180,850  --auto-fan --temp-target 75 --temp-overheat 82

How do I check it's detecting right? It lists in the same order as 2.1.2. --gpu-reorder worth checking?

I can't see anything wrong with that. Without --gpu-reorder it uses the same order as 2.1.2 would have detected. I guess you'd know pretty quickly if it was setting the wrong speed on the wrong device and...

[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled

looks pretty convincing. So I'm at a loss for why it should be any worse.

Note also that you are having it being disabled and reading OFF. There is only one place in the code where cgminer does this itself - when it hits the thermal cutoff limit. Now it is possible there is some code convolution issue going on somehow that makes the fan not rise when one of the GPUs is being restarted. Note that your bug report shows thread 2 being idle (which would be GPU 1) and then it proceeds to disable threads 4 and 5 (being GPU 2)... Interesting... I'll audit the code, but perhaps try it without auto-fan on, setting what you know to be a safe static fan speed and see if the problem persists.

ck, I'm having the same "OFF" problem described, in rigs with 5970 and 5870:

https://bitcointalksearch.org/topic/m.727748

flower1024


Quote from: -ck on February 05, 2012, 04:00:30 AM

The pool where donations were going was hacked and I'm considering moving all my shares to p2pool as well now. I have grave concerns about centralising work to pools and increasingly see p2pool - or something like it - as the solution for bitcoin's future strength, going back to its decentralised nature. This means I won't realistically have a way of accepting small hashrate donations with --donation that I can reasonably support. So after much angst I have decided that I will be deprecating the donations feature in upcoming versions and go back to the previous donation model of as-and-when you feel like it.

I thank those who have used the --donation feature greatly till now. It averaged around 400Mh/s over that time and at least kept me "mining" while my own mining rig was dead for over a month.

I recommend people disable --donation now and restart their miners for I don't know what will happen to hashes going to the pool during this instability (it is going offline for 24+ hours likely).

+1 for choosing p2pool!
I never used --donation anyway ;)

But I did send you some BTC a few months ago.

-ck


The pool where donations were going was hacked and I'm considering moving all my shares to p2pool as well now. I have grave concerns about centralising work to pools and increasingly see p2pool - or something like it - as the solution for bitcoin's future strength, going back to its decentralised nature. This means I won't realistically have a way of accepting small hashrate donations with --donation that I can reasonably support. So after much angst I have decided that I will be deprecating the donations feature in upcoming versions and go back to the previous donation model of as-and-when you feel like it.

I thank those who have used the --donation feature greatly till now. It averaged around 400Mh/s over that time and at least kept me "mining" while my own mining rig was dead for over a month.

I recommend people disable --donation now and restart their miners for I don't know what will happen to hashes going to the pool during this instability (it is going offline for 24+ hours likely).

gnar1ta$


That rig runs fairly cool so I'll try overnight without auto-fan. Here are the rest of the errors from the 2.2.1 sessions - I just lowered the clock speeds between sessions.

Code:

[2012-02-02 18:22:19] Thread 1 idle for more than 60 seconds, GPU 1 declared SICK!
[2012-02-02 18:22:19] Attempting to restart GPU
[2012-02-02 18:22:19] Thread 2 still exists, killing it off
[2012-02-02 18:22:19] Thread 3 still exists, killing it off
[2012-02-02 18:22:19] Thread 2 restarted
[2012-02-02 18:22:20] Thread 3 restarted
[2012-02-02 18:22:21] Accepted 00000000.40372a1e.d75ae9ba GPU 0 thread 0 pool 0
[2012-02-02 18:22:21] Thread 2 being disabled
[2012-02-02 18:22:21] Thread 3 being disabled

[2012-02-02 17:58:17] Thread 3 idle for more than 60 seconds, GPU 3 declared SICK!
[2012-02-02 17:58:17] Attempting to restart GPU
[2012-02-02 17:58:17] Thread 6 still exists, killing it off
[2012-02-02 17:58:17] Thread 7 still exists, killing it off
[2012-02-02 17:58:18] Thread 6 restarted
[2012-02-02 17:58:19] Thread 7 restarted

[2012-02-03 21:19:13] Thread 4 still exists, killing it off
[2012-02-03 21:19:13] Thread 5 still exists, killing it off
[2012-02-03 21:19:14] Thread 4 restarted
[2012-02-03 21:19:14] Accepted 00000000.0cd03974.6e1c40e3 GPU 0 thread 1 pool 0
[2012-02-03 21:19:14] Thread 5 restarted
[2012-02-03 21:19:15] Thread 4 being disabled
[2012-02-03 21:19:16] Thread 5 being disabled

[2012-02-03 20:47:14] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-03 20:47:14] Attempting to restart GPU
[2012-02-03 20:47:14] Thread 4 still exists, killing it off
[2012-02-03 20:47:14] Thread 5 still exists, killing it off
[2012-02-03 20:47:14] Thread 4 restarted
[2012-02-03 20:47:15] Accepted 00000000.b422fb13.d18b69ee GPU 1 thread 3 pool 0
[2012-02-03 20:47:15] Thread 5 restarted
[2012-02-03 20:47:15] Accepted 00000000.8db73352.22b376e6 GPU 0 thread 1 pool 0
[2012-02-03 20:47:16] Thread 4 being disabled
[2012-02-03 20:47:16] Thread 5 being disabled

[2012-02-04 17:29:18] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-04 17:29:18] Attempting to restart GPU
[2012-02-04 17:29:18] Thread 4 still exists, killing it off
[2012-02-04 17:29:18] Thread 5 still exists, killing it off
[2012-02-04 17:29:19] Thread 4 restarted
[2012-02-04 17:29:19] Thread 5 restarted
[2012-02-04 17:29:20] Thread 4 being disabled
[2012-02-04 17:29:20] Thread 5 being disabled
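For anyone wanting to tally these events across several sessions, here is a minimal sketch in C that pulls the two numbers out of a "declared SICK!" line. It only assumes the line layout visible in the logs above; `struct sick_event` and `parse_sick_line` are hypothetical names made up for this example and are not part of cgminer.

```c
#include <stdio.h>

/* Hypothetical record for one "declared SICK!" log line. */
struct sick_event {
    int thread; /* number printed after "Thread" */
    int gpu;    /* number printed after "GPU" */
};

/* Returns 1 and fills *ev if the line matches the layout seen in the logs
 * above ("[timestamp] Thread N idle for more than S seconds, GPU M declared
 * SICK!"), otherwise returns 0 and leaves *ev untouched. */
static int parse_sick_line(const char *line, struct sick_event *ev)
{
    int thread = 0, gpu = 0;
    int end = -1; /* set by %n only if the whole pattern matched */

    int n = sscanf(line,
                   "[%*[^]]] Thread %d idle for more than %*d seconds,"
                   " GPU %d declared SICK!%n",
                   &thread, &gpu, &end);
    if (n != 2 || end < 0)
        return 0;

    ev->thread = thread;
    ev->gpu = gpu;
    return 1;
}
```

Feeding each log line through `parse_sick_line` and counting hits per `gpu` would show whether it is always the same device that goes sick.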

mmortal03


Quote from: The00Dustin on February 02, 2012, 05:42:30 AM

Also, on a related note, maybe your processor can get 8MH/s and the 8MH/s extra isn't even coming from the video card. If that is the case, cutting the CPU frequency in half would lower it to a 4 MH/s gain at a still much more expensive MH/W (although realistically, even if it is the GPU getting that, running the CPU 100% at half frequency will probably still draw enough power to make it a net loss).

Are you theorizing that my CPU is explicitly doing some of the work, as if I was CPU mining, or do you mean something more nuanced than that? I didn't think my CPU would be used for explicit mining unless I had it specifically enabled to do so. I'm carrying out the U experiments that you guys suggested as we speak, btw.

-ck


Quote from: gnar1ta$ on February 05, 2012, 12:35:17 AM

I use these flags

Code:

-I 8 --gpu-engine 950,850,830,950 --gpu-memclock 850,180,180,850  --auto-fan --temp-target 75 --temp-overheat 82

How do I check it's detecting right? It lists in the same order as 2.1.2. --gpu-reorder worth checking?

I can't see anything wrong with that. Without --gpu-reorder it uses the same order as 2.1.2 would have detected. I guess you'd know pretty quickly if it was setting the wrong speed on the wrong device and...

[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled

looks pretty convincing. So I'm at a loss for why it should be any worse.

Note also that you are having it being disabled and reading OFF. There is only one place in the code where cgminer does this itself - when it hits the thermal cutoff limit. Now it is possible there is some code convolution issue going on somehow that makes the fan not rise when one of the GPUs is being restarted. Note that your bug report shows thread 2 being idle (which would be GPU 1) and then it proceeds to disable threads 4 and 5 (being GPU 2)... Interesting... I'll audit the code, but perhaps try it without auto-fan on, setting what you know to be a safe static fan speed and see if the problem persists.
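To make the thread-to-GPU arithmetic above concrete, here is a small sketch. It assumes every GPU gets the same number of mining threads and that they are numbered consecutively, which is what the logs suggest (threads 4 and 5 on GPU 2, threads 6 and 7 on GPU 3). `gpu_for_thread` is a hypothetical helper for illustration, not a cgminer function.

```c
/* Assumption: threads are numbered consecutively and each GPU owns the same
 * number of them, e.g. with 2 threads per GPU: threads 0,1 -> GPU 0,
 * threads 2,3 -> GPU 1, threads 4,5 -> GPU 2.
 * Returns the GPU index, or -1 for invalid input. */
static int gpu_for_thread(int thread_id, int threads_per_gpu)
{
    if (thread_id < 0 || threads_per_gpu <= 0)
        return -1;
    return thread_id / threads_per_gpu;
}
```

Under that assumption, thread 2 belongs to GPU 1 while threads 4 and 5 belong to GPU 2, which is the mismatch being pointed out.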

gnar1ta$


Quote from: -ck on February 05, 2012, 12:17:54 AM

Well 2.2.1 has the dual GPU linkage going on which 2.1.2 does not. Bear in mind the auto fanspeed control now looks at the temps of both GPUs, but overall this should run things cooler rather than hotter so unless you have some specific setup, or maybe have autofan off on one of the devices or different clock settings... Also, have you checked it's actually detecting the right card in the right device and linking the right devices?

I use these flags

Code:

-I 8 --gpu-engine 950,850,830,950 --gpu-memclock 850,180,180,850  --auto-fan --temp-target 75 --temp-overheat 82

How do I check it's detecting right? It lists in the same order as 2.1.2. --gpu-reorder worth checking?
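One way to sanity-check which value lands on which device is to split the comma-separated lists by hand and line them up with the GPU numbers in the startup log. The sketch below does that split in C. It is not cgminer's option parser, and `parse_per_gpu_list` is a hypothetical name; it only illustrates that the Nth value is meant for GPU N-1.

```c
#include <stdlib.h>

/* Split a comma-separated list such as "950,850,830,950" into out[].
 * Returns the number of values parsed, or -1 if the list is malformed
 * (empty field, trailing comma, non-numeric text) or has more than max
 * entries. out[i] is the value intended for GPU i. */
static int parse_per_gpu_list(const char *arg, int *out, int max)
{
    const char *p = arg;
    int n = 0;

    while (*p != '\0') {
        char *end;
        long v = strtol(p, &end, 10);

        if (end == p || n >= max)
            return -1;          /* no digits here, or too many values */
        out[n++] = (int)v;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;      /* trailing comma */
        } else if (*end == '\0') {
            p = end;            /* reached the end of the list */
        } else {
            return -1;          /* unexpected character */
        }
    }
    return n;
}
```

With the flags above, the engine list gives 950 for GPUs 0 and 3 (the 6800 series cards in the log) and 850 and 830 for GPUs 1 and 2 (the two 5900 series cores), so the ordering looks consistent with the detection output.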

-ck


Quote from: gnar1ta$ on February 05, 2012, 12:09:50 AM

Using 2.2.1 on a mixed card rig

Code:

[2012-02-05 00:01:17] CL Platform vendor: Advanced Micro Devices, Inc.
[2012-02-05 00:01:17] CL Platform name: AMD Accelerated Parallel Processing
[2012-02-05 00:01:17] CL Platform version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] Failed to ADL_Overdrive5_FanSpeedInfo_Get
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] Dual GPUs detected: 2 and 1
[2012-02-05 00:01:17] 4 GPU devices detected

And the second core of the 5970 keeps showing OFF after 12-18 hrs running with this error

Code:

[2012-02-04 03:37:42] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-04 03:37:42] Attempting to restart GPU
[2012-02-04 03:37:42] Thread 4 still exists, killing it off
[2012-02-04 03:37:42] Thread 5 still exists, killing it off
[2012-02-04 03:37:42] Thread 4 restarted
[2012-02-04 03:37:43] Thread 5 restarted
[2012-02-04 03:37:44] Thread 4 being disabled
[2012-02-04 03:37:44] Thread 5 being disabled

I first thought it was clocks, but it has been running for 5 months at the same clocks. I lowered them a few times anyway but still the same error. If I go back to 2.1.2 it runs fine, 24+ hours now.

Well 2.2.1 has the dual GPU linkage going on which 2.1.2 does not. Bear in mind the auto fanspeed control now looks at the temps of both GPUs, but overall this should run things cooler rather than hotter so unless you have some specific setup, or maybe have autofan off on one of the devices or different clock settings... Also, have you checked it's actually detecting the right card in the right device and linking the right devices?

-ck


Quote from: QuantumFoam on February 04, 2012, 11:32:00 PM

Oops, I might have tried that with 2.1.2 instead of 2.2.1. I have both on the machine currently. I'll try rerunning with 2.2.1

EDIT: Ok I am an idiot, this was caused by my failure to type
export DISPLAY=:0

in the shell each time before I ran it. I mistakenly thought that command applied to the entire login session. Hope that info helps anyone else running into the same problem. Carry on. :P

You know I was actually about to say that's what usually causes it, but I doubted it would happen between runs. Seems I was wrong ;)

gnar1ta$


Using 2.2.1 on a mixed card rig

Code:

[2012-02-05 00:01:17] CL Platform vendor: Advanced Micro Devices, Inc.
[2012-02-05 00:01:17] CL Platform name: AMD Accelerated Parallel Processing
[2012-02-05 00:01:17] CL Platform version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
[2012-02-05 00:01:17] GPU 0 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 1 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] GPU 2 ATI Radeon HD 5900 Series hardware monitoring enabled
[2012-02-05 00:01:17] Failed to ADL_Overdrive5_FanSpeedInfo_Get
[2012-02-05 00:01:17] GPU 3 AMD Radeon HD 6800 Series hardware monitoring enabled
[2012-02-05 00:01:17] Dual GPUs detected: 2 and 1
[2012-02-05 00:01:17] 4 GPU devices detected

And the second core of the 5970 keeps showing OFF after 12-18 hrs running with this error

Code:

[2012-02-04 03:37:42] Thread 2 idle for more than 60 seconds, GPU 2 declared SICK!
[2012-02-04 03:37:42] Attempting to restart GPU
[2012-02-04 03:37:42] Thread 4 still exists, killing it off
[2012-02-04 03:37:42] Thread 5 still exists, killing it off
[2012-02-04 03:37:42] Thread 4 restarted
[2012-02-04 03:37:43] Thread 5 restarted
[2012-02-04 03:37:44] Thread 4 being disabled
[2012-02-04 03:37:44] Thread 5 being disabled

I first thought it was clocks, but it has been running for 5 months at the same clocks. I lowered them a few times anyway but still the same error. If I go back to 2.1.2 it runs fine, 24+ hours now.

QuantumFoam


Oops, I might have tried that with 2.1.2 instead of 2.2.1. I have both on the machine currently. I'll try rerunning with 2.2.1

EDIT: Ok I am an idiot, this was caused by my failure to type
export DISPLAY=:0

in the shell each time before I ran it. I mistakenly thought that command applied to the entire login session. Hope that info helps anyone else running into the same problem. Carry on. :P
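For anyone wrapping the miner in their own launcher, a pre-flight check along these lines would catch the missing variable before the ADL error appears. This is only a sketch of the idea: `display_is_set` is a hypothetical helper, and nothing in this thread says cgminer performs such a check itself.

```c
#include <stdlib.h>

/* Returns 1 if DISPLAY is present and non-empty in the current environment,
 * 0 otherwise. A launcher could warn and exit when this returns 0 instead of
 * starting the miner without fan/temperature monitoring. */
static int display_is_set(void)
{
    const char *d = getenv("DISPLAY");
    return d != NULL && d[0] != '\0';
}
```

Because an exported variable only lives in the shell where it was set, the check has to run in the same environment the miner is started from.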

-ck


Quote from: QuantumFoam on February 04, 2012, 10:57:10 PM

[2012-02-04 22:53:02] Pushing ping to longpoll thread
[2012-02-04 22:53:02] ADL Initialisation Error!
[2012-02-04 22:53:02] Pushing ping to thread 0
[2012-02-04 22:53:02] Init GPU thread 0
[2012-02-04 22:53:02] List of devices:
[2012-02-04 22:53:02] 0 Cypress
[2012-02-04 22:53:02] 1 Cypress
[2012-02-04 22:53:02] 2 Cypress
[2012-02-04 22:53:02] 3 Juniper
[2012-02-04 22:53:02] 4 Cypress
[2012-02-04 22:53:02] Selected 0: Cypress
[2012-02-04 22:53:02] Preferred vector width reported 4
[2012-02-04 22:53:02] Max work group size reported 256

Only place it has a line with ADL.

Code:

	result = ADL_Main_Control_Create (ADL_Main_Memory_Alloc, 1);
	if (result != ADL_OK) {
		applog(LOG_INFO, "ADL Initialisation Error! Error %d!", result);
		return ;
	}

The newer version should also show what the error is, but the main initialisation is failing. What version of cgminer are you running?

It will be one of the following.

Code:

#define ADL_OK_WAIT                      4    /* All OK, but need to wait. */
#define ADL_OK_RESTART                   3    /* All OK, but need restart. */
#define ADL_OK_MODE_CHANGE               2    /* All OK but need mode change. */
#define ADL_OK_WARNING                   1    /* All OK, but with warning. */
#define ADL_OK                           0    /* ADL function completed successfully. */
#define ADL_ERR                         -1    /* Generic Error. Most likely one or more of the Escape calls to the driver failed! */
#define ADL_ERR_NOT_INIT                -2    /* ADL not initialized. */
#define ADL_ERR_INVALID_PARAM           -3    /* One of the parameter passed is invalid. */
#define ADL_ERR_INVALID_PARAM_SIZE      -4    /* One of the parameter size is invalid. */
#define ADL_ERR_INVALID_ADL_IDX         -5    /* Invalid ADL index passed. */
#define ADL_ERR_INVALID_CONTROLLER_IDX  -6    /* Invalid controller index passed. */
#define ADL_ERR_INVALID_DIPLAY_IDX      -7    /* Invalid display index passed. */
#define ADL_ERR_NOT_SUPPORTED           -8    /* Function not supported by the driver. */
#define ADL_ERR_NULL_POINTER            -9    /* Null Pointer error. */
#define ADL_ERR_DISABLED_ADAPTER        -10   /* Call can't be made due to disabled adapter. */
#define ADL_ERR_INVALID_CALLBACK        -11   /* Invalid Callback. */
#define ADL_ERR_RESOURCE_CONFLICT       -12   /* Display Resource conflict. */
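If you want the log to show a name rather than a bare number, a lookup like the following could translate the result code. It is a sketch only: the numeric values are copied from the list above rather than taken from the ADL SDK headers, and `adl_result_name` and `adl_result_is_error` are hypothetical helpers, not functions in cgminer or ADL.

```c
/* Translate an ADL result code into the name from the list above.
 * Values are hard-coded from that list; the SDK headers are the authority. */
static const char *adl_result_name(int result)
{
    switch (result) {
    case 4:   return "ADL_OK_WAIT";
    case 3:   return "ADL_OK_RESTART";
    case 2:   return "ADL_OK_MODE_CHANGE";
    case 1:   return "ADL_OK_WARNING";
    case 0:   return "ADL_OK";
    case -1:  return "ADL_ERR";
    case -2:  return "ADL_ERR_NOT_INIT";
    case -3:  return "ADL_ERR_INVALID_PARAM";
    case -4:  return "ADL_ERR_INVALID_PARAM_SIZE";
    case -5:  return "ADL_ERR_INVALID_ADL_IDX";
    case -6:  return "ADL_ERR_INVALID_CONTROLLER_IDX";
    case -7:  return "ADL_ERR_INVALID_DIPLAY_IDX"; /* spelling as in the list */
    case -8:  return "ADL_ERR_NOT_SUPPORTED";
    case -9:  return "ADL_ERR_NULL_POINTER";
    case -10: return "ADL_ERR_DISABLED_ADAPTER";
    case -11: return "ADL_ERR_INVALID_CALLBACK";
    case -12: return "ADL_ERR_RESOURCE_CONFLICT";
    default:  return "unknown ADL result";
    }
}

/* In the list above, every error code is negative and every OK variant is
 * zero or positive. */
static int adl_result_is_error(int result)
{
    return result < 0;
}
```

Note that the check in the snippet above treats anything other than ADL_OK as a failure, so one of the positive "OK, but..." codes would also end up in that log line.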

QuantumFoam


[2012-02-04 22:53:02] Pushing ping to longpoll thread
[2012-02-04 22:53:02] ADL Initialisation Error!
[2012-02-04 22:53:02] Pushing ping to thread 0
[2012-02-04 22:53:02] Init GPU thread 0
[2012-02-04 22:53:02] List of devices:
[2012-02-04 22:53:02] 0 Cypress
[2012-02-04 22:53:02] 1 Cypress
[2012-02-04 22:53:02] 2 Cypress
[2012-02-04 22:53:02] 3 Juniper
[2012-02-04 22:53:02] 4 Cypress
[2012-02-04 22:53:02] Selected 0: Cypress
[2012-02-04 22:53:02] Preferred vector width reported 4
[2012-02-04 22:53:02] Max work group size reported 256

Only place it has a line with ADL.

-ck


Quote from: QuantumFoam on February 04, 2012, 10:10:56 PM

Quote from: -ck on February 04, 2012, 08:55:33 PM

Allegedly 2.4 and 2.5 do need more memory speed for some strange reason.

BTW, any ideas where I'd start looking to see why cgminer stops detecting fan rpm speeds/gpu temps after it has been previously run? Only way to get it to detect them again is to reboot the machine.

It's at a driver level that this is failing. I have no idea why it would change its mind to not support it on future runs. When it fails, if you start it with debugging enabled you will see what the ADL error is, but that still won't tell you how to fix it. Look for something like "ADL Initialisation Error!", or any error with ADL in it.

QuantumFoam


Quote from: -ck on February 04, 2012, 08:55:33 PM

Allegedly 2.4 and 2.5 do need more memory speed for some strange reason.

BTW, any ideas where I'd start looking to see why cgminer stops detecting fan rpm speeds/gpu temps after it has been previously run? Only way to get it to detect them again is to reboot the machine.

-ck


Quote from: QuantumFoam on February 04, 2012, 08:46:34 PM

Quote from: cablepair on February 04, 2012, 07:10:45 PM

what SDK are you using and what speeds are you getting with 950/300 ?

v2.4 sdk. That's probably the issue, thought I had 2.1 installed. I'll see if I can remove 2.4 and install 2.1, hopefully that will make a difference. Currently getting ~430mh at 950/300 setting, that drops to 425 when going to 950/180.

Allegedly 2.4 and 2.5 do need more memory speed for some strange reason.
