[ANN] CureCoin 2.0 is live - Mandatory Update is available now - DEC 2018 - page 123.

intrinsic coins

newbie

Activity: 5

Merit: 0

Quote from: bananahunter67 on June 04, 2014, 02:18:30 AM

Quote from: ivanlabrie on June 01, 2014, 11:24:05 PM

I'm all for Curecoin, but the implementation IS flawed...devs seem awfully silent and that is not good in the eyes of the public.

There's too much inflation and too few uses for Curecoins right now; SHA-256 miners will flock to BTC again and the coin won't be secure with PoS alone. Something must be done asap...

+1. This coin is dying. People are even leaving folding, as the number of coins earned per day is growing each day. Make a multipool and pay for advertising. You have money from the IPO. Contact health organisations and ask for sponsorship. Otherwise this good idea will die.

Well, start up an exchange for the secondary computation already. For example, reserve 60% of the network for protein folding and leave the remaining 40% up for bid. Then use the money for coin buybacks: 10% at the start, 30% in the middle, then 50% at the finish. Use an easy proof of concept, such as prime number crunching, for demos.

bananahunter67

sr. member

Activity: 392

Merit: 265

Quote from: ivanlabrie on June 01, 2014, 11:24:05 PM

I'm all for Curecoin, but the implementation IS flawed...devs seem awfully silent and that is not good in the eyes of the public.

There's too much inflation and too few uses for Curecoins right now; SHA-256 miners will flock to BTC again and the coin won't be secure with PoS alone. Something must be done asap...

+1. This coin is dying. People are even leaving folding, as the number of coins earned per day is growing each day. Make a multipool and pay for advertising. You have money from the IPO. Contact health organisations and ask for sponsorship. Otherwise this good idea will die.

ChasingTheDream

sr. member

Activity: 292

Merit: 250

Quote from: Aboy68 on June 04, 2014, 12:20:50 AM

Quote from: ChasingTheDream on June 03, 2014, 06:23:15 PM

Quote from: Aboy68 on June 03, 2014, 04:47:24 PM

Hi, this is the fix.
I studied a computer that had stalled work units and noticed that the percentage shown in the GUI is detached from what the log file shows. The log file holds the results actually returned from the GPU/WU, so it reflects the true progress. When a worker loses sync with a work unit, the GUI percentage keeps increasing at the same speed until it reaches 99.99% and stops, which is why you can't tell that something is wrong until the counter reaches 99.99%. So the GUI is not the right place to look for stalled WUs.

Look instead at the process list, in particular at FahCore_17.exe; there is one such process for each GPU in your computer. While a WU is loaded and folding, its average CPU usage is around 1-4% and its memory size is roughly 200-300 MB. When a WU stalls, CPU activity drops to 0% and stays there until you pause and resume folding, at which point FAHControl.exe restarts the WU from the last saved file. So you can check whether each GPU is folding: when the activity is 0%, you only need to terminate the stalled FahCore_17.exe and the WU restarts from the last saved file.

BUT it is hard work to run around checking WUs, so here is the automatic solution: download and install a program called processlasso. http://bitsum.com/processlasso/
Find one FahCore_17.exe process, right-click it, and select the menu option "Set watchdog rules for this process". Then set:
1: for -CPU
2: Less than
3: 1%
4: 300 sec
5: terminate the process
6: Push the button "Create new process watchdog rule".
Now the software terminates any WU process that has been inactive for 5 minutes, and the WU restarts. The rule is saved by the software and is active again every time the computer starts. - YES it works, no more lost hours of stalled WUs

//Aboy68, by the way - if you change the PCIe version to version 2 in the motherboard BIOS settings, you get a more stable system. The timing is not as fast as version 3, and version 2 has the bandwidth we need for folding. I have this setting on all my motherboards. Version 1 is too slow (I have tried it).
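
For anyone who wants a homegrown version of the watchdog rule described above (CPU below 1% for 300 seconds, then terminate), here is a minimal Python sketch of just the decision logic. The class name `StallWatchdog` and its fields are made up for illustration and are not part of processlasso or the F@H client. Reading the real CPU usage of FahCore_17.exe and killing the process are deliberately left out; that part would need something like the third-party psutil package or Windows tooling, which is an assumption on my side and not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallWatchdog:
    """Flags a process as stalled after sustained low CPU usage.

    Defaults mirror the rule in the post: below 1% CPU for 300 seconds.
    """
    cpu_threshold: float = 1.0       # percent
    stall_seconds: float = 300.0     # how long CPU must stay low
    _low_since: Optional[float] = None  # time of first consecutive low sample

    def update(self, now: float, cpu_percent: float) -> bool:
        """Record one sample; return True when the process should be terminated."""
        if cpu_percent >= self.cpu_threshold:
            # Process is working again, so forget any earlier low samples.
            self._low_since = None
            return False
        if self._low_since is None:
            self._low_since = now
        return now - self._low_since >= self.stall_seconds


if __name__ == "__main__":
    dog = StallWatchdog()
    # Hypothetical samples taken once a minute: (seconds, CPU %).
    samples = [(0, 3.0), (60, 0.0), (120, 0.0), (180, 0.0),
               (240, 0.0), (300, 0.0), (360, 0.0)]
    for t, cpu in samples:
        if dog.update(t, cpu):
            print(f"t={t}s: stalled, terminate FahCore_17.exe")
            break
```

A single busy sample resets the timer, so a WU that is only briefly idle is not killed. You would need one watchdog object per FahCore_17.exe process, since there is one process per GPU.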

Outstanding post Aboy68! I love automated solutions and this will save me the trouble of having to try to write one myself! I will give this a shot. A couple of my computers get extremely unstable when some event happens that I have not been able to identify yet so I'm not sure this will work in my case but I'm definitely going to try it. I will also try the PCI-E 2 settings.

Thanks again!

Update1: I've just set up the software exactly as you describe on my machines. Your instructions were very easy to follow. Again well done!

Update2: I immediately had an opportunity for processlasso to take some action on the troubled machine, and I got a message that I had been using it for 21,000,000 days, so it was deactivated. lol UGH. I would have liked to try it before buying it, just to see if it would help, so I may have to pursue a homegrown approach if / when I ever get time to do it.

A display driver recovery happened = one of the folding GPUs stopped = no CPU activity. Waiting, waiting, and then the reset of the process happened.
This is the log file text for this event.

05:09:57:WU04:FS04:0x17:Completed 2550000 out of 5000000 steps (51%)
   REM - Display driver recovery event
   REM - 3min and reset of process.
05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)
05:16:20:WU04:FS04:Starting
05:16:20:WU04:FS04:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Admin/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 04 -suffix 01 -version 704 -lifeline 4756 -checkpoint 15 -gpu 3 -gpu-vendor ati
05:16:20:WU04:FS04:Started FahCore on PID 2068
05:16:20:WU04:FS04:Core PID:200
05:16:20:WU04:FS04:FahCore 0x17 started
05:16:21:WU04:FS04:0x17:*********************** Log Started 2014-06-04T05:16:21Z ***********************
05:16:21:WU04:FS04:0x17:Project: 9408 (Run 355, Clone 0, Gen 32)
05:16:21:WU04:FS04:0x17:Unit: 0x000000290a3b1e5c5342d6762c91b48a
05:16:21:WU04:FS04:0x17:CPU: 0x00000000000000000000000000000000
05:16:21:WU04:FS04:0x17:Machine: 4
05:16:21:WU04:FS04:0x17:Digital signatures verified
05:16:21:WU04:FS04:0x17:Folding@home GPU core17
05:16:21:WU04:FS04:0x17:Version 0.0.52
05:16:22:WU04:FS04:0x17: Found a checkpoint file

AND it is running again.

If you want to check whether there have been any reset events, just tick the Warnings and errors box and look for rows like this:
   - 05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)

//Aboy68
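
If you would rather scan a saved log than tick the box in the GUI, the warning row above can be picked out with a small script. This is only a sketch: the regular expression is based solely on the one WARNING line shown in this post, so other log layouts may not match, and the function name `find_core_failures` is made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
# 05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)
FAILED_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WARNING:(WU\d+):(FS\d+):FahCore returned: (\S+)"
)


def find_core_failures(lines: Iterable[str]) -> List[Tuple[str, ...]]:
    """Return (time, work unit, slot, result code) for each failed-core warning."""
    hits = []
    for line in lines:
        match = FAILED_RE.match(line.strip())
        if match:
            hits.append(match.groups())
    return hits


if __name__ == "__main__":
    # Lines copied from the log excerpt in the post.
    sample = [
        "05:09:57:WU04:FS04:0x17:Completed 2550000 out of 5000000 steps (51%)",
        "05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)",
        "05:16:20:WU04:FS04:Starting",
    ]
    print(find_core_failures(sample))
    # [('05:16:20', 'WU04', 'FS04', 'FAILED_2')]
```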

Yeah when I do it mine looks like this....

*********************** Log Started 2014-06-04T04:17:08Z ***********************

There are literally no errors or warnings.

Aboy68

member

Activity: 96

Merit: 10

Quote from: ChasingTheDream on June 03, 2014, 06:23:15 PM

Outstanding post Aboy68! I love automated solutions and this will save me the trouble of having to try to write one myself! I will give this a shot. A couple of my computers get extremely unstable when some event happens that I have not been able to identify yet so I'm not sure this will work in my case but I'm definitely going to try it. I will also try the PCI-E 2 settings.

Thanks again!

Update1: I've just set up the software exactly as you describe on my machines. Your instructions were very easy to follow. Again well done!

Update2: I immediately had an opportunity for processlasso to take some action on the troubled machine, and I got a message that I had been using it for 21,000,000 days, so it was deactivated. lol UGH. I would have liked to try it before buying it, just to see if it would help, so I may have to pursue a homegrown approach if / when I ever get time to do it.

A display driver recovery happened = one of the folding GPUs stopped = no CPU activity. Waiting, waiting, and then the reset of the process happened.
This is the log file text for this event.

05:09:57:WU04:FS04:0x17:Completed 2550000 out of 5000000 steps (51%)
   REM - Display driver recovery event
   REM - 3min and reset of process.
05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)
05:16:20:WU04:FS04:Starting
05:16:20:WU04:FS04:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Admin/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 04 -suffix 01 -version 704 -lifeline 4756 -checkpoint 15 -gpu 3 -gpu-vendor ati
05:16:20:WU04:FS04:Started FahCore on PID 2068
05:16:20:WU04:FS04:Core PID:200
05:16:20:WU04:FS04:FahCore 0x17 started
05:16:21:WU04:FS04:0x17:*********************** Log Started 2014-06-04T05:16:21Z ***********************
05:16:21:WU04:FS04:0x17:Project: 9408 (Run 355, Clone 0, Gen 32)
05:16:21:WU04:FS04:0x17:Unit: 0x000000290a3b1e5c5342d6762c91b48a
05:16:21:WU04:FS04:0x17:CPU: 0x00000000000000000000000000000000
05:16:21:WU04:FS04:0x17:Machine: 4
05:16:21:WU04:FS04:0x17:Digital signatures verified
05:16:21:WU04:FS04:0x17:Folding@home GPU core17
05:16:21:WU04:FS04:0x17:Version 0.0.52
05:16:22:WU04:FS04:0x17: Found a checkpoint file

AND it is running again.

If you want to check whether there have been any reset events, just tick the Warnings and errors box and look for rows like this:
   - 05:16:20:WARNING:WU04:FS04:FahCore returned: FAILED_2 (1 = 0x1)

//Aboy68

ChasingTheDream

sr. member

Activity: 292

Merit: 250

Quote from: Aboy68 on June 03, 2014, 11:48:28 PM

Quote from: ChasingTheDream on June 03, 2014, 06:23:15 PM

Outstanding post Aboy68! I love automated solutions and this will save me the trouble of having to try to write one myself! I will give this a shot. A couple of my computers get extremely unstable when some event happens that I have not been able to identify yet so I'm not sure this will work in my case but I'm definitely going to try it. I will also try the PCI-E 2 settings.

Thanks again!

Update1: I've just set up the software exactly as you describe on my machines. Your instructions were very easy to follow. Again well done!

Update2: I immediately had an opportunity for processlasso to take some action on the troubled machine, and I got a message that I had been using it for 21,000,000 days, so it was deactivated. lol UGH. I would have liked to try it before buying it, just to see if it would help, so I may have to pursue a homegrown approach if / when I ever get time to do it.

Strange, the trial period is 7 days?

No worries. It was worth a shot. I can write something to do something similar if I get some free time. I'm actually quite concerned with the stability of the machines. I don't think it would work with how unstable the machines become and I have no idea why. Nothing in Windows Event Viewer and nothing in the F@H logs. No errors or warnings. I'm interacting with the F@H support forum now to see if they have any ideas.

I'll figure it out but I can see it is going to take quite a while. I suspect my sustainable PPD is going to be about 2 million instead of the 3 million I was doing earlier. Just no way to keep the machines running to sustain it. So you better pass me while you can. lol Sooner or later I might get the machines running right.

Aboy68

member

Activity: 96

Merit: 10

Quote from: ChasingTheDream on June 03, 2014, 06:23:15 PM

Outstanding post Aboy68! I love automated solutions and this will save me the trouble of having to try to write one myself! I will give this a shot. A couple of my computers get extremely unstable when some event happens that I have not been able to identify yet so I'm not sure this will work in my case but I'm definitely going to try it. I will also try the PCI-E 2 settings.

Thanks again!

Update1: I've just set up the software exactly as you describe on my machines. Your instructions were very easy to follow. Again well done!

Update2: I immediately had an opportunity for processlasso to take some action on the troubled machine, and I got a message that I had been using it for 21,000,000 days, so it was deactivated. lol UGH. I would have liked to try it before buying it, just to see if it would help, so I may have to pursue a homegrown approach if / when I ever get time to do it.

Strange, the trial period is 7 days?

ChasingTheDream

sr. member

Activity: 292

Merit: 250

Quote from: kingscrown on June 03, 2014, 09:34:27 PM

One of the coolest startup coins, I'm sure the price will rise. This stuff deserves it.

Join us if you haven't already.

kingscrown

hero member

Activity: 672

Merit: 500

One of the coolest startup coins, I'm sure the price will rise. This stuff deserves it.

Burninj

legendary

Activity: 1148

Merit: 1000

Quote from: Aboy68 on June 03, 2014, 04:47:24 PM

Hi, this is the fix.
I studied a computer that had stalled work units and noticed that the percentage shown in the GUI is detached from what the log file shows. The log file holds the results actually returned from the GPU/WU, so it reflects the true progress. When a worker loses sync with a work unit, the GUI percentage keeps increasing at the same speed until it reaches 99.99% and stops, which is why you can't tell that something is wrong until the counter reaches 99.99%. So the GUI is not the right place to look for stalled WUs.

Look instead at the process list, in particular at FahCore_17.exe; there is one such process for each GPU in your computer. While a WU is loaded and folding, its average CPU usage is around 1-4% and its memory size is roughly 200-300 MB. When a WU stalls, CPU activity drops to 0% and stays there until you pause and resume folding, at which point FAHControl.exe restarts the WU from the last saved file. So you can check whether each GPU is folding: when the activity is 0%, you only need to terminate the stalled FahCore_17.exe and the WU restarts from the last saved file.

BUT it is hard work to run around checking WUs, so here is the automatic solution: download and install a program called processlasso. http://bitsum.com/processlasso/
Find one FahCore_17.exe process, right-click it, and select the menu option "Set watchdog rules for this process". Then set:
1: for -CPU
2: Less than
3: 1%
4: 300 sec
5: terminate the process
6: Push the button "Create new process watchdog rule".
Now the software terminates any WU process that has been inactive for 5 minutes, and the WU restarts. The rule is saved by the software and is active again every time the computer starts. - YES it works, no more lost hours of stalled WUs

//Aboy68, by the way - if you change the PCIe version to version 2 in the motherboard BIOS settings, you get a more stable system. The timing is not as fast as version 3, and version 2 has the bandwidth we need for folding. I have this setting on all my motherboards. Version 1 is too slow (I have tried it).

Really really nice to share this!

ChasingTheDream

sr. member

Activity: 292

Merit: 250

Quote from: Aboy68 on June 03, 2014, 04:47:24 PM

Hi, this is the fix.
I studied a computer that had stalled work units and noticed that the percentage shown in the GUI is detached from what the log file shows. The log file holds the results actually returned from the GPU/WU, so it reflects the true progress. When a worker loses sync with a work unit, the GUI percentage keeps increasing at the same speed until it reaches 99.99% and stops, which is why you can't tell that something is wrong until the counter reaches 99.99%. So the GUI is not the right place to look for stalled WUs.

Look instead at the process list, in particular at FahCore_17.exe; there is one such process for each GPU in your computer. While a WU is loaded and folding, its average CPU usage is around 1-4% and its memory size is roughly 200-300 MB. When a WU stalls, CPU activity drops to 0% and stays there until you pause and resume folding, at which point FAHControl.exe restarts the WU from the last saved file. So you can check whether each GPU is folding: when the activity is 0%, you only need to terminate the stalled FahCore_17.exe and the WU restarts from the last saved file.

BUT it is hard work to run around checking WUs, so here is the automatic solution: download and install a program called processlasso. http://bitsum.com/processlasso/
Find one FahCore_17.exe process, right-click it, and select the menu option "Set watchdog rules for this process". Then set:
1: for -CPU
2: Less than
3: 1%
4: 300 sec
5: terminate the process
6: Push the button "Create new process watchdog rule".
Now the software terminates any WU process that has been inactive for 5 minutes, and the WU restarts. The rule is saved by the software and is active again every time the computer starts. - YES it works, no more lost hours of stalled WUs

//Aboy68, by the way - if you change the PCIe version to version 2 in the motherboard BIOS settings, you get a more stable system. The timing is not as fast as version 3, and version 2 has the bandwidth we need for folding. I have this setting on all my motherboards. Version 1 is too slow (I have tried it).

Outstanding post Aboy68! I love automated solutions and this will save me the trouble of having to try to write one myself! I will give this a shot. A couple of my computers get extremely unstable when some event happens that I have not been able to identify yet so I'm not sure this will work in my case but I'm definitely going to try it. I will also try the PCI-E 2 settings.

Thanks again!

Update1: I've just set up the software exactly as you describe on my machines. Your instructions were very easy to follow. Again well done!

Update2: I immediately had an opportunity for processlasso to take some action on the troubled machine, and I got a message that I had been using it for 21,000,000 days, so it was deactivated. lol UGH. I would have liked to try it before buying it, just to see if it would help, so I may have to pursue a homegrown approach if / when I ever get time to do it.

Aboy68

member

Activity: 96

Merit: 10

Quote from: ChasingTheDream on June 03, 2014, 02:00:20 PM

Quote from: scyth3 on June 03, 2014, 01:39:53 PM

Quote from: ChasingTheDream on June 03, 2014, 01:27:45 PM

Quote from: Vorksholk on June 03, 2014, 01:04:48 PM

Quote from: ChasingTheDream on June 03, 2014, 12:55:27 PM

Quote from: Aboy68 on June 03, 2014, 09:47:27 AM

Quote from: ChasingTheDream on June 02, 2014, 11:31:34 PM

Calling Aboy68

Our production is dropping off and I've been having a lot of hardware issues and instability despite under clocking the GPU's literally to their lowest possible settings (both core and speed). If you are having similar issues try under clocking your RAM. I got that tip directly from the F@H support and I think the guy may have nailed it! At least I hope so! I'll know more in about 24 hours. I may be able to spin up the GPU's again. They are running ridiculously under clocked right now.

Despite the bumps I'm still trying to beat you into the top 10.

Update: GAH. Still didn't make it three hours before two machines were down again. LOL. So the quest for stability continues...

I'm running stock values on all the hardware I have. Yes, sometimes the worker and the WU get lost in space, with the result stuck at 99.99%.
The solution is to resync, i.e. issue the pause and fold commands.
I have a 100% fix for this that is automatic, with no human hands on! I have tried out the fix for the last 3 days and it still works.

Do you want to know how?

//Aboy68

Yes any ideas are welcome. Virtually every morning at least two of my machines are down meaning I can not restart them folding without physically rebooting the machine. The machine will not respond to remote restarts or keyboard input. I actually have to press the reset button. That happens during the day as well but I'm not always available to do anything about it so they sit for hours like that. At this point I've got the GPU's under clocked to the maximum amount so it is not the GPU's. It is something in the systems themselves. Memory, CPU, something. I've removed the CPU slot on the troubled machines (after the WU finished of course) but it does not seem to have made any difference in terms of stability.

I'm going to gather my logs and present them to the F@H support group to see if they have any ideas to help speed up the process of getting these things running properly. As a short term fix I may write a program that reads the logs and if too much time goes by before the log is updated it could force a system restart. Unfortunately I don't think this will work because whatever is happening makes the system so unstable that I don't think it will be able to restart.
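
The log-watching idea in the paragraph above could start from a check like the one below. It is only a sketch under assumptions: it treats the log file's modification time as the sign of life, and the function name `log_is_stale` and the 600-second limit in the usage lines are made up for illustration. What to do when the log goes stale (for example forcing a restart) is left to the caller, and as noted above a badly hung machine may not be able to act on it anyway.

```python
import os
import time
from typing import Optional


def log_is_stale(path: str, max_age_seconds: float,
                 now: Optional[float] = None) -> bool:
    """Return True if the file at `path` was last modified more than
    `max_age_seconds` ago. `now` can be passed in to make testing easy."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) > max_age_seconds


if __name__ == "__main__":
    # Hypothetical log location and limit; adjust both to your own setup.
    log_path = "log.txt"
    if os.path.exists(log_path) and log_is_stale(log_path, 600):
        print("Log has not been updated for over 10 minutes")
```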

Ironically there are no hardware errors or application errors in Windows Event Viewer though. This has actually been plaguing me since I started but it was the same way with mining. It took a long time to get the systems to behave. This will eventually get worked out.

Unfortunately, as a result I'm only running at about 2/3 my expected output, but it is still better than nothing. lol

If you have a fix I would love to hear about it!

Are the fans working correctly? Might want to get a tool that lets you see VRM temps too, I had a 7970 experiencing a similar issue (back in the mining days) and VRMs were around 117C. Some more work showed that the fan speed in CCC/afterburner/trixx was incorrect, as the fan had a hardware issue and was spinning with a much higher resistance than it should have.

I used GPU-Z to take a peek and it looks like the highest VRM temp on any of the cards in the troubled machines is 58C. Also based on your recommendation some time ago I did swap the GPU's out of the most troubled machine with the machine next to it. The most troubled machine is still the machine having the most issues. Based on all that, I don't think it is the GPU's at this point but it was definitely worth a look. Thanks for the suggestion.

Ironically though, I ran GPU-Z on several of my machines that don't seem to want to run for very long. The most troubled machine ended up getting a video driver failure while I was watching. The machine was still stable afterwards and I was able to remotely reboot it so it was responding appropriately. Whatever else is happening makes it so unstable it is on a whole different level of ugly. Definitely not just a video driver failure.

Try underclocking your system RAM. This helped with one of my rigs.

I actually was talking about that in the first post in this sequence and thought it was going to help because the RAM speed in all the machines was at 2133. I brought it down to 1333 and unfortunately it didn't help. I think I'm going to run a memory test next and maybe even swap memory between a machine that behaves somewhat well and the least stable machine. Hard to believe the memory is bad in 2-3 different machines but I need to rule it out.

Another suggestion from the F@H folks was that I could be overloading a rail on the PSU but the computer that is having the most issues has a 1200 watt Corsair which is a single rail PSU. So the quest continues.