I've had my Icarus + Ztex combo up and running with MPBM for 1D 12h now.
I'd like to understand some of the numbers.
For instance, why are cancelled jobs on the Ztex boards 11.50%, 14.58% and 13.61% vs. only between 1.11% and 1.16% on my Icarus boards?
A Ztex board takes about twice as long to process a job as an Icarus does, which means the probability that a job is hit by a long poll is roughly twice as high. The actual probability of a job being cancelled on an Icarus should be around 1.3% on average, and about twice that on a Ztex, so your Icarus numbers look perfectly fine to me.
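Just for intuition, here's a tiny back-of-the-envelope sketch (my own illustration, nothing from MPBM): if long polls arrive roughly at random, the chance that one lands inside a job grows about linearly with the job's length, so doubling the processing time roughly doubles the cancellation rate. The job times and long-poll interval below are placeholder assumptions, not measurements.

```python
from math import exp

def cancel_probability(job_seconds, mean_longpoll_interval_seconds):
    """P(a long poll arrives while a job of this length is running),
    assuming long polls arrive as a memoryless (Poisson) process."""
    return 1.0 - exp(-job_seconds / mean_longpoll_interval_seconds)

# Placeholder numbers, only to show the ratio, not to match the screenshot:
icarus = cancel_probability(job_seconds=11, mean_longpoll_interval_seconds=600)
ztex = cancel_probability(job_seconds=22, mean_longpoll_interval_seconds=600)
print(f"Icarus ~{icarus:.1%}, Ztex ~{ztex:.1%}, ratio ~{ztex / icarus:.2f}")
```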
The reason that your Ztex boards show higher numbers is a bit more complicated: MPBM fetches new work as soon as the number of jobs in the queue that expire in more than 10 seconds goes below some threshold. (To account for network latencies, a safety margin of about 5 seconds is subtracted from the expiry time reported by the pool before a job is inserted into the queue.)

However, if jobs are received in bursts due to X-Roll-NTime, it's rather likely that all jobs in the queue have the same expiry time, and that this is somewhere between 10 (fetcher threshold) and 15 (Ztex board job processing time) seconds in the future. Because of this, there are no jobs with a remaining life of >15 seconds in the queue, and the Ztex board is forced to work on a job that it can't possibly finish in time.

As soon as the remaining jobs cross the 10 second boundary, MPBM will fetch new jobs, and as soon as the remaining life of the accepted job hits zero, the board will behave as if a long poll came in: it cancels its current job and gets a new one (with a longer life).
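To make that sequence easier to follow, here's a heavily simplified sketch of the logic in Python (my own illustration, not MPBM's actual source; the function names, dict layout and queue-size threshold are made up, while the 10 second fetch threshold and the ~5 second safety margin are the values described above):

```python
import time

FETCH_THRESHOLD = 10  # seconds: only jobs expiring later than this count as "usable"
SAFETY_MARGIN = 5     # seconds: subtracted from the pool's reported expiry

def enqueue(queue, job, pool_expiry):
    # Apply the network-latency safety margin before the job enters the queue.
    job["expiry"] = pool_expiry - SAFETY_MARGIN
    queue.append(job)

def needs_refill(queue, min_usable=3):
    # Fetch new work as soon as too few jobs expire more than 10 seconds from now.
    now = time.time()
    usable = [j for j in queue if j["expiry"] - now > FETCH_THRESHOLD]
    return len(usable) < min_usable

def job_expired(job):
    # Once the job a board accepted runs out of life, the board behaves as if
    # a long poll came in: it cancels the job and takes a fresher one.
    return job["expiry"] <= time.time()
```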
TL;DR: This is caused by MPBM's inner workings, and nothing to worry about.

The 'GHashes total' on the Icarus seems lower per FPGA (i.e. Ztex 28290 vs. Icarus 25045) than on the Ztex, but the jobs per hour seem higher (Ztex 240-250, Icarus 308-318).
This is partially caused by the effect I described above, and partially by the fact that MPBM pushes new work to the boards about 1 second before the board actually exhausts the nonce range of the previous job, because none of the boards that we have today support real queueing of jobs (which means we need to do this to prevent the board from idling briefly between jobs). That 1 second is an absolute amount of time that doesn't depend on the hash rate, so it skews the number of jobs relative to the number of gigahashes, and more so for boards with shorter job times like the Icarus.
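If it helps, here's a quick sketch of that arithmetic (the hash rates and job times are placeholder assumptions, not the figures from your screenshot): because the ~1 second head start is a fixed amount of time, it eats a larger fraction of a short job than of a long one, so the board with the shorter jobs racks up more jobs per gigahash.

```python
PUSH_EARLY = 1.0  # seconds of each job's nonce range skipped by the early push

def jobs_per_gigahash(hashrate_ghps, full_job_seconds):
    """How many jobs get counted per GHash of work done.  Hashing never
    pauses, so the GHash total is unaffected; only the job count is inflated,
    and the fixed 1 s head start inflates it more for short jobs."""
    jobs_per_second = 1.0 / (full_job_seconds - PUSH_EARLY)
    return jobs_per_second / hashrate_ghps

# Placeholder figures, purely illustrative:
print(jobs_per_gigahash(hashrate_ghps=0.38, full_job_seconds=11))  # Icarus-like
print(jobs_per_gigahash(hashrate_ghps=0.19, full_job_seconds=22))  # Ztex-like
```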
Do any other numbers seem off in the screenshot? I notice that Slush has a lot of upload errors. The timeout is 5 now (the default); would increasing it possibly solve that problem? 5 seconds already seems very long to me.
5 seconds should be plenty; if a pool fails to respond within that timeframe, I'd avoid that pool. I wouldn't increase the timeout, because that would make MPBM less aggressive about delivering a share and could thereby cause more stales, especially if the timeout was caused not by the pool software being overloaded but by network packet loss (retrying is likely to be faster than waiting for the protocol layer to detect the packet loss, trigger retransmissions, and finally get a response). However, stales still seem to be within a reasonable margin on this instance, so I wouldn't worry too much about those retries.
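For what it's worth, the trade-off looks roughly like this in code (a generic Python sketch, not MPBM's actual upload path; the URL, payload format and retry count are placeholders): with a short timeout, a lost packet just triggers a quick retry instead of waiting for TCP to recover on its own.

```python
import json
import urllib.request

def submit_share(url, payload, timeout=5, retries=2):
    """Try to deliver a share quickly; on timeout or error, retry immediately
    rather than waiting out packet loss at the protocol layer."""
    data = json.dumps(payload).encode()
    last_error = None
    for _ in range(retries + 1):
        try:
            req = urllib.request.Request(
                url, data=data, headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return json.loads(resp.read())
        except Exception as exc:  # timeout, connection reset, HTTP error, ...
            last_error = exc      # a fast retry usually beats TCP retransmission
    raise last_error
```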