Hi, this is the fix.
I studied a computer that had stalled work units, and I noticed that the percentage shown in the GUI is detached from what you find in the log file. The log file shows the results actually returned from the GPU/WU, so that is the true progress. When a worker loses sync with a work unit, the percentage in the GUI keeps increasing at the same speed until it reaches 99.99% and stops. That is why you can't tell that something is wrong until the counter reaches 99.99%, and why the GUI is not the right place to look for stalled WUs.
If you look deeper, at the computer's list of processes, you will find the FahCore_17.exe processes, one for each GPU in the computer. While a WU is loaded and folding, the average CPU usage is around 1-4% and the memory size is 200-300 MB (rough numbers). When a WU stalls, the CPU activity drops to 0% and stays there until you pause and resume folding, at which point FAHControl.exe restarts the WU from the last saved file.
OK, that's nice: you can actually check whether the GPUs are folding or not. When the activity is 0%, you only need to terminate the stalled FahCore_17.exe and the WU restarts (from the last saved file). BUT it is hard work to run around checking WUs, so the end of the story is the automatic solution: download a program called Process Lasso and install it.
http://bitsum.com/processlasso/
Search for one of the FahCore_17.exe processes, right-click on it, and select the menu option "Set watchdog rules for this process". Then fill in the rule:
1: for: CPU
2: Less than
3: 1%
4: 300 sec
5: terminate the process
6: Push the button "Create new process watchdog rule".
Now the software terminates any WU process that has been inactive for 5 minutes, and the WU restarts. The rule is saved by the software, which starts again every time the computer starts. YES, it works: no more lost hours on stalled WUs.
//Aboy68
By the way: if you change the PCIe version in the motherboard BIOS settings to version 2, you get a more stable system. The timing is not as fast as version 3, and version 2 has the bandwidth we need for folding. I have this setting on all my motherboards. Version 1 is too slow (I have tried it).
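For anyone who would rather script the same rule than install a tool, the watchdog logic described above (CPU below 1% for 300 seconds, then terminate) can be sketched roughly as follows. This is only a minimal sketch of the decision logic, not how Process Lasso works internally. The class name `StallWatchdog` and its parameters are made up for illustration, and the code deliberately leaves out the reading of CPU usage and the killing of the process.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


class StallWatchdog:
    """Hypothetical helper: flags processes whose CPU usage stays below
    a threshold for a full hold time (defaults mirror the rule above)."""

    def __init__(self, cpu_threshold: float = 1.0, hold_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cpu_threshold = cpu_threshold
        self.hold_seconds = hold_seconds
        self.clock = clock
        # pid -> time at which the process first dropped below the threshold
        self._idle_since: Dict[int, float] = {}

    def update(self, samples: Iterable[Tuple[int, float]]) -> List[int]:
        """Feed (pid, cpu_percent) samples; return the PIDs that have been
        idle for at least hold_seconds and should be terminated."""
        now = self.clock()
        stalled: List[int] = []
        seen = set()
        for pid, cpu in samples:
            seen.add(pid)
            if cpu >= self.cpu_threshold:
                # The process is working again, so reset its idle timer.
                self._idle_since.pop(pid, None)
                continue
            started = self._idle_since.setdefault(pid, now)
            if now - started >= self.hold_seconds:
                stalled.append(pid)
                # The caller is expected to terminate it; stop tracking.
                del self._idle_since[pid]
        # Forget processes that no longer appear in the samples.
        for pid in list(self._idle_since):
            if pid not in seen:
                del self._idle_since[pid]
        return stalled
```

In a real script you would call `update()` every few seconds with one sample per FahCore_17.exe process and terminate whatever PIDs it returns. Getting the samples and terminating the process would need something like the third-party psutil package, which is an assumption here and not something used in this thread.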
Outstanding post Aboy68! I love automated solutions and this will save me the trouble of having to try to write one myself! I will give this a shot. A couple of my computers get extremely unstable when some event happens that I have not been able to identify yet so I'm not sure this will work in my case but I'm definitely going to try it. I will also try the PCI-E 2 settings.
Thanks again!
Update1: I've just set up the software exactly as you describe on my machines. Your instructions were very easy to follow. Again well done!
Update2: I immediately had an opportunity for Process Lasso to take some action on the troubled machine, and I got a message that I had been using it for 21,000,000 days so it was deactivated. lol UGH. I would have liked to try it before buying it, just to see if it would help, so I may have to pursue a homegrown approach if / when I ever get time to do it.
A display driver recovery happened = one of the folding GPUs stopped = no CPU activity. Waiting, waiting, and then the reset of the process happened.
This is the log file text for this event.
05:09:57:WU04:FS04:0x17:Completed 2550000 out of 5000000 steps (51%)
REM - Display driver recovery event
REM - 3min and reset of process.
05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)
05:16:20:WU04:FS04:Starting
05:16:20:WU04:FS04:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Admin/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 04 -suffix 01 -version 704 -lifeline 4756 -checkpoint 15 -gpu 3 -gpu-vendor ati
05:16:20:WU04:FS04:Started FahCore on PID 2068
05:16:20:WU04:FS04:Core PID:200
05:16:20:WU04:FS04:FahCore 0x17 started
05:16:21:WU04:FS04:0x17:*********************** Log Started 2014-06-04T05:16:21Z ***********************
05:16:21:WU04:FS04:0x17:Project: 9408 (Run 355, Clone 0, Gen 32)
05:16:21:WU04:FS04:0x17:Unit: 0x000000290a3b1e5c5342d6762c91b48a
05:16:21:WU04:FS04:0x17:CPU: 0x00000000000000000000000000000000
05:16:21:WU04:FS04:0x17:Machine: 4
05:16:21:WU04:FS04:0x17:Digital signatures verified
05:16:21:WU04:FS04:0x17:Folding@home GPU core17
05:16:21:WU04:FS04:0x17:Version 0.0.52
05:16:22:WU04:FS04:0x17: Found a checkpoint file
AND it is running again.
If you want to check whether there have been any reset events, just tick the "Warnings and errors" box and then look for rows like this:
- 05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)
//Aboy68
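If you save the log to a file, the same check can be scripted. Below is a minimal sketch that scans log lines for "FahCore returned" warnings and reports, for each one, the last completed percentage seen on that slot. The patterns are based only on the lines posted above, so other log formats may need adjusting. The names `find_resets` and `ResetEvent` are made up for illustration.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional

# Patterns modelled on the log lines quoted in this thread only.
PROGRESS = re.compile(
    r"(\d{2}:\d{2}:\d{2}):WU\d+:(FS\d+):0x\w+:"
    r"Completed \d+ out of \d+ steps \((\d+)%\)"
)
FAILURE = re.compile(
    r"(\d{2}:\d{2}:\d{2}):WARNING:WU\d+:(FS\d+):FahCore returned: (\w+)"
)


class ResetEvent(NamedTuple):
    time: str                    # timestamp of the WARNING line
    slot: str                    # folding slot, e.g. "FS04"
    code: str                    # return code, e.g. "FAILED_2"
    last_percent: Optional[int]  # last progress seen on that slot, if any


def find_resets(lines: Iterable[str]) -> List[ResetEvent]:
    """Return one ResetEvent per 'FahCore returned' warning in the log."""
    last_percent: Dict[str, int] = {}
    events: List[ResetEvent] = []
    for line in lines:
        progress = PROGRESS.search(line)
        if progress:
            # Remember the most recent progress for this slot.
            last_percent[progress.group(2)] = int(progress.group(3))
            continue
        failure = FAILURE.search(line)
        if failure:
            when, slot, code = failure.groups()
            events.append(ResetEvent(when, slot, code, last_percent.get(slot)))
    return events
```

Run on the excerpt above, this should report a single event for FS04 at 05:16:20 with code FAILED_2 and a last progress of 51%. Lines without a matching pattern, such as the REM notes, are simply skipped.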