Glasswalker, please look at this:
https://bitcointalksearch.org/topic/m.1177532
And comment on my thoughts / sanity.
I am now up for 46 hrs continuous operation versus the first two failures within 24 hrs.
I don't think this is all related to USB cables and hubs. (But I'm sure they do play a part. Multiple possible points of failure here, after all.)
-- edit --
Although this is not gonna help the guy with his Ubuntu problem.
You are correct; I am pretty sure that a glitching worker FPGA can in fact make the FTDI appear unresponsive (for example, by asserting tx/rx). I'm not exactly sure how the FTDI reacts in that situation.
If that's the case, the bitstream again isn't "responsible" but the worker "locking up" (as it has in the past and still does occasionally) may cause the FTDI to freak out.
If you're experiencing this with the faster bitstreams, I suggest testing with the first HashVoodoo 175 release that I put out. It has proven extremely stable for most people (and it's what I'm running on the Linux host I mentioned, which has been up for about 26 days nonstop).
I know it's a 12% or more hit to your hashrate, but it could be a valuable test to confirm whether the problem is still a stability issue in the bitstream.
If it does work for you, I hope to have a new version out soon with dynamic overclocking and a faster overall clock speed (provided SmartXplorer cooperates; so far it's been having a hard time closing timing).
But yes there are SEVERAL points of failure in this chain, and unfortunately it's a tricky thing to really isolate them in a large cluster.
I've decided that once I get a "faster" release out with dynamic clocking, I am going to work on a "temp" protocol which will allow use of the up/down link. My hope is that I'll be able to chain 16+ FPGAs on a single UART. I'll need to crank the FTDI baud rate for this though, so there are a bunch of variables to consider.
The Icarus setup currently sends 512 bits of data in a "packet", so a 250K baud rate can handle about 488 packets per second. With my setup, you would send one packet per up to 16 FPGAs (they would "share" the packet), but they would send independent responses (32 bits each). So without adding more than 0.1 second of delay, you should be able to chain upwards of a couple hundred FPGAs on a single chain without major problems. Mind you, transmission latency might become a problem at that point. Anyway, even with 100 FPGAs in a chain, that would allow for 25 boards per single USB connection. That's why I'm bringing this up now: the feature may help alleviate the problem some are experiencing.
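For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. The baud rate, packet size, response size, delay budget and boards-per-chain figures come from the numbers above; the helper names and the optional 8N1 framing case (10 line bits per data byte) are my own assumptions for illustration, not part of the Icarus protocol or any planned firmware.

```python
# Back-of-envelope check of the UART figures quoted in the post.
# From the post: 250K baud, 512-bit work packet, 32-bit responses,
# 0.1 s delay budget, 100 FPGAs -> 25 boards (i.e. 4 FPGAs per board).
# Assumption (not from the post): 8N1 framing costs 10 line bits per byte.

BAUD = 250_000
WORK_BITS = 512
RESULT_BITS = 32
FPGAS_PER_BOARD = 4


def wire_bits(payload_bits, line_bits_per_byte=8):
    """Bits on the wire for a payload; pass 10 per byte to model 8N1."""
    return payload_bits // 8 * line_bits_per_byte


def packets_per_second(baud, payload_bits, line_bits_per_byte=8):
    """Work packets per second that fit on the downlink."""
    return baud / wire_bits(payload_bits, line_bits_per_byte)


def responses_in_window(baud, payload_bits, window_ms, line_bits_per_byte=8):
    """Whole responses that fit on the return line within window_ms."""
    return baud * window_ms // (1000 * wire_bits(payload_bits, line_bits_per_byte))


if __name__ == "__main__":
    print("raw packets/s:   ", packets_per_second(BAUD, WORK_BITS))
    print("8N1 packets/s:   ", packets_per_second(BAUD, WORK_BITS, 10))
    print("raw responses:   ", responses_in_window(BAUD, RESULT_BITS, 100))
    print("8N1 responses:   ", responses_in_window(BAUD, RESULT_BITS, 100, 10))
    print("boards per chain:", 100 // FPGAS_PER_BOARD)
```

Counting raw bits gives about 488 packets per second, matching the figure above; with the assumed 8N1 framing it drops to about 390. Either way, several hundred 32-bit responses fit inside a 0.1 second window, so a chain of a couple hundred FPGAs looks plausible on bandwidth alone, latency aside.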
The "final" solution will be a from-scratch protocol, using raw USB bandwidth to communicate, allowing for theoretically VERY large clusters. But that one is a ways away. I hope this "temporary" feature will be good enough for now. (This feature alone is still at least a few weeks away; I'm doing my best.)
Hope that helps.