I've been running into a lot of hung work units. The work unit will hit the end and just sit there forever at 99.99% with 3 seconds remaining. I've done a bit of looking around, and the usual advice seems to be to look for connection errors in the logs. So far I haven't seen any, but I have these hung work units on several computers. I'm not sure how to resolve them, and they are certainly hurting my production.
About the only thing I've been able to do so far is restart the computer without stopping the work unit first. That seems to knock it back to a previous checkpoint. It then has to redo the work, which can take hours, and it finishes properly only about 50% of the time.
If I stop the work unit prior to rebooting, it just sits at 99.99% again. There must be an easier way to get around these. Has anyone else been experiencing this?
After you complete a work unit you should see something like this in the log:
23:58:06:WU00:FS01:0x17:Completed 5000000 out of 5000000 steps (100%)
23:58:08:WU03:FS01:Connecting to 171.67.108.201:80
23:58:08:WU03:FS01:Assigned to work server 140.163.4.231
23:58:08:WU03:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:Pitcairn [] from 140.163.4.231
23:58:08:WU03:FS01:Connecting to 140.163.4.231:8080
23:58:09:WU03:FS01:Downloading 4.84MiB
23:58:11:WU03:FS01:Download complete
23:58:11:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:13001 run:534 clone:7 gen:14 core:0x17 unit:0x0000001d538b3db7532c93b7e185f3f6
23:58:26:WU00:FS01:0x17:Saving result file logfile_01.txt
23:58:27:WU00:FS01:0x17:Saving result file checkpointState.xml
23:58:29:WU00:FS01:0x17:Saving result file checkpt.crc
23:58:29:WU00:FS01:0x17:Saving result file log.txt
23:58:29:WU00:FS01:0x17:Saving result file positions.xtc
23:58:31:WU00:FS01:0x17:Folding@home Core Shutdown: FINISHED_UNIT
23:58:31:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
23:58:31:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13001 run:162 clone:2 gen:11 core:0x17 unit:0x00000021538b3db753287d9b0d4519a3
23:58:31:WU00:FS01:Uploading 12.86MiB to 140.163.4.231
23:58:31:WU00:FS01:Connecting to 140.163.4.231:8080
23:58:31:WU03:FS01:Starting
23:58:31:WU03:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/admin/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 03 -suffix 01 -version 704 -lifeline 4596 -checkpoint 15 -gpu 0 -gpu-vendor ati
23:58:32:WU03:FS01:Started FahCore on PID 848
23:58:32:WU03:FS01:Core PID:2244
23:58:32:WU03:FS01:FahCore 0x17 started
23:58:32:WU03:FS01:0x17:*********************** Log Started 2014-05-14T23:58:32Z ***********************
23:58:32:WU03:FS01:0x17:Project: 13001 (Run 534, Clone 7, Gen 14)
23:58:32:WU03:FS01:0x17:Unit: 0x0000001d538b3db7532c93b7e185f3f6
23:58:32:WU03:FS01:0x17:CPU: 0x00000000000000000000000000000000
23:58:32:WU03:FS01:0x17:Machine: 1
23:58:32:WU03:FS01:0x17:Reading tar file state.xml
23:58:33:WU03:FS01:0x17:Reading tar file system.xml
23:58:33:WU03:FS01:0x17:Reading tar file integrator.xml
23:58:33:WU03:FS01:0x17:Reading tar file core.xml
23:58:33:WU03:FS01:0x17:Digital signatures verified
23:58:33:WU03:FS01:0x17:Folding@home GPU core17
23:58:33:WU03:FS01:0x17:Version 0.0.52
23:58:37:WU00:FS01:Upload 7.78%
23:58:43:WU00:FS01:Upload 17.01%
23:58:49:WU00:FS01:Upload 25.76%
23:58:55:WU00:FS01:Upload 34.51%
23:59:01:WU00:FS01:Upload 43.75%
23:59:07:WU00:FS01:Upload 52.49%
23:59:13:WU00:FS01:Upload 61.24%
23:59:19:WU00:FS01:Upload 69.99%
23:59:25:WU00:FS01:Upload 78.74%
23:59:31:WU00:FS01:Upload 87.49%
23:59:37:WU00:FS01:Upload 96.24%
23:59:39:WU00:FS01:Upload complete
23:59:40:WU00:FS01:Server responded WORK_ACK (400)
23:59:40:WU00:FS01:Final credit estimate, 63963.00 points
23:59:40:WU00:FS01:Cleaning up
00:02:17:WU03:FS01:0x17:Completed 0 out of 5000000 steps (0%)
What does your log look like for these hung units?
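If the log is long, a rough Python sketch like the one below might help narrow it down. It assumes a hung unit logs the "Completed ... steps (100%)" line but never reaches "FahCore returned: FINISHED_UNIT" as in the log above; I haven't confirmed that is what a hung unit actually looks like. The function name find_stalled_units is made up for illustration and is not part of the client.

```python
import re

# Matches client log lines such as
# "23:58:06:WU00:FS01:0x17:Completed 5000000 out of 5000000 steps (100%)"
LINE = re.compile(r"^\d{2}:\d{2}:\d{2}:(WU\d+):(FS\d+):(.*)$")


def find_stalled_units(log_lines):
    """Return {(wu, slot): line} for units that logged 100% with no FINISHED_UNIT after it.

    Hypothetical helper: the two marker strings are taken from the healthy
    log above, and treating their absence as "hung" is an assumption.
    """
    pending = {}
    for raw in log_lines:
        line = raw.strip()
        match = LINE.match(line)
        if not match:
            continue
        wu, slot, message = match.groups()
        if "steps (100%)" in message:
            # The unit reached the end; remember it until the core reports in.
            pending[(wu, slot)] = line
        elif "FahCore returned: FINISHED_UNIT" in message:
            # Normal completion; WU ids get reused, so clear the entry.
            pending.pop((wu, slot), None)
    return pending


# Example use (the log path is whatever your install uses):
# with open("log.txt", encoding="utf-8", errors="replace") as handle:
#     for key, line in find_stalled_units(handle).items():
#         print(key, line)
```

Anything it prints is only a candidate; the lines around that timestamp in the log are still worth reading by hand.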
EDIT: Also, does anyone know how to do a spoiler in this forum? I'm new here, so sorry for taking up all the space with the log. I put it in a quote to make it a little smaller anyway lol
EDIT2: Thanks meelvanchris, that's better.