I feel the need to explain what happened yesterday in greater detail and also what I have done to make sure something like that doesn't happen again.
Also, I want to give you some insight into the pool's work-distribution operation.
First I would like to describe the fun - almost dramatic - situation that led to the discovery of the bug.
The bounty (https://blockchain.info/address/1pdSSfCx4QynTwXTtVDjEEavZ4dDnYdhP) was in a block that was about to be issued "any moment now".
"client X" finally asked the server for work which contained the block.
You must know that the whole client<->server interaction is a game of promise and deliver. When a client asks for work, it gives nothing but a promise to work on the assigned block(s). The server knows when this work was issued and it also knows the client's computation speed, so it knows when delivery of the promised work is due. The server waits for 200% of this time, because the client may be under higher load and unable to utilize 100% of its capacity to check blocks, the user might have halted the process (under Linux: Ctrl+Z) to resume it later, there may be some network lag, and so on. If after more than double the promised time the client still has not returned a POW for the checked interval (client-id, interval), the server considers that interval not to have been checked and will reissue it to the next client who asks for work (which may actually be the same client-id). In the meantime, however, the server will not issue blocks from the promised interval to other clients, to avoid double work (the sketch below shows the gist of this timeout rule). "Client X" did not deliver its interval back in time, so its interval was open for reissue.
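To make the timing rule concrete, here is a minimal sketch of how such a timeout check could look. The names, data layout and speed bookkeeping are my assumptions for illustration, not the actual pool server code:

```perl
# Minimal sketch of the timeout rule - names, data layout and speed
# bookkeeping are assumptions for illustration, not actual pool code.
use strict;
use warnings;

# one issued work item per client promise
my @issued = (
    { client => 'X', issued_at => time() - 1200,
      blocks => 1_000_000, speed => 2_000,        # blocks per second
      pow_received => 0 },
);

sub overdue {
    my ($w) = @_;
    my $promised = $w->{blocks} / $w->{speed};    # expected seconds of work
    # wait for 200% of the promised time (load, Ctrl+Z, network lag ...)
    return time() > $w->{issued_at} + 2 * $promised;
}

# overdue, unproven intervals are open for reissue to the next client
# who asks; until then they stay locked to avoid double work
my @reissue = grep { !$_->{pow_received} && overdue($_) } @issued;
printf "%d interval(s) open for reissue\n", scalar @reissue;
```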
Meanwhile, other clients asked for work and delivered it back, so the interval open for reissue became a "hole" in the work done. Have a look at this JSON representation:
[["0",653587],[653708,654825],[654848,657098],[657111,657122],[657135,660588],[660608,672324],[672392,674026],[674055,674193],[674222,676045],[677393,677444],[677775,678894],[69545491,69545512],.....
The "holes" are the intervals between the intervals. Like 653588-653707, then 654826-654847 etc. This happens all the time, is a natural process and expected. Clients promise and do not/cannot deliver, so work is done later. Naturally the holes have different sizes, depending on the computational strength of the client who promised and did not deliver. Now if a slow client promised and did not deliver, the resulting "hole" is small. Which actually means, that it becomes inaccessible for "fat" clients who ask for (promise to do) bigger chunks of work. These get new work at the current "forefront" of the pool.
Nevertheless, another "client Y" came along and asked for work within the hole that contained the bounty. And guess what? He didn't deliver either. Meanwhile the hole containing the bounty had become even smaller (like 3 blocks). That's when I was reminded of the Pulp Fiction scene of missed shots.
FINALLY, a "client Z" came along, claimed the blocks and ... delivered. AND ... nothing happened! I thought WTF? I gave my notebook (Linux) the specific block number and .. HIT. WTF? WTF? I went to my Windows machine, gave it the specific block number and ... NOTHING!
At this very moment there was a rapid change in the operation of my sweat glands.
Potentially, every Windows client was not getting hits despite them being there. Every second of operation added to the wasted work. And this at a time when we finally had stable Windows clients and the pool was evidently getting more and more clients. For a second, I thought:
"Maybe I can fix it without telling anyone and embarrassing myself?" Nope. I had to stop the pool operation to prevent the issuing of work, which would have been moot anyway. I stopped the pool, put out a quick message here and started a frantic bug hunt.
I compared the output of the Linux and Windows versions of the generator byte by byte. Same. Of course - I had done that test before.
I issued an "LBC -x" on Linux: correct, 3 hits. I issued "LBC -x" on Windows: bummer! 2 hits.
It's not "not working" - it's somehow a little bit not working? WTF 2?
Ok. I hacked some debug information into LBC and let it print out what the LBC client "thinks" it sees coming from the generator, right before the hash160 check. And there it was: some 50000 bytes down the stream, the Linux and Windows versions started to differ. What was going on? I pinned down the exact location of the first difference and there I saw it: Linux "0a0a6f", Windows "0a0a0d6f". Naturally, all subsequent hash160 checks for the current block in the Windows version were 1 byte off and could not match.
CRLF anyone? The Windows client considered the stream from the generator to be a text stream. At the time when it actually was a text stream (base58), that was perfectly ok. But since client 0.823, when the stream became binary, this was not ok. Oh, this was so badly not ok. On Linux it didn't matter, as the default mode of operation on incoming pipes is ":raw" - don't change anything. But on Windows, the default mode of operation on incoming pipes is ":crlf" - yeah, munge the data by default.
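To see the munging in action, here is a tiny stand-alone demo (not pool code). It pushes the :crlf layer explicitly, so it shows the effect on any platform, not just Windows:

```perl
# Tiny demo: show how the :crlf layer mangles binary data on read.
use strict;
use warnings;

# write 4 raw bytes, one of them a CR directly before an LF
open(my $out, '>', 'demo.bin') or die $!;
binmode($out, ':raw');
print $out "\x61\x0d\x0a\x62";
close($out);

# read them back through the :crlf layer (the Windows default on pipes)
open(my $in, '<', 'demo.bin') or die $!;
binmode($in, ':crlf');
read($in, my $data, 4);
close($in);
unlink 'demo.bin';

printf "got %d bytes: %s\n", length($data),
       join(' ', map { sprintf '%02x', ord } split //, $data);
# prints "got 3 bytes: 61 0a 62" - the 0d is silently gone and every
# byte after it is shifted, which is fatal for fixed-offset binary data
```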
And the problem is of course known and documented (http://perldoc.perl.org/functions/binmode.html): "On some systems (in general, DOS- and Windows-based systems) binmode is necessary when you're not working with a text file. For the sake of portability it is a good idea always to use it when appropriate, and never to use it when it isn't appropriate."
So the fix was a binmode ":raw" on the pipe filehandle. Then LBC -x on Windows had its 3 hits, and the Windows client showed a hit when processing the specific block containing the bounty.
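For the curious, the essence of the fix looks roughly like this. The generator invocation and the reader loop are assumptions for illustration; only the binmode call is the actual point:

```perl
# Sketch of the fix: force the pipe into binary mode before reading,
# so the Windows default :crlf layer cannot touch the stream.
use strict;
use warnings;

open(my $gen, '-|', './generator') or die "cannot start generator: $!";
binmode($gen, ':raw');              # <- the one-line fix

while (read($gen, my $chunk, 65536)) {
    # $chunk now arrives byte-identical on Linux and Windows,
    # so the hash160 offsets line up again
    process_chunk($chunk);
}
close($gen);

sub process_chunk { my ($bytes) = @_; }   # stand-in for the hash160 check
```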
I put some information in here, had the pool server accept only clients 0.831 or higher (unfortunately it cannot do that separately per OS, so Linux clients had to update too, although for them it was not necessary - collateral damage), put the new versions up for download, rolled back the "forefront" of the pool and started the pool up again.
Phew!
There were of course some more details to take care of. When checking the Windows client via LBC -x, I realized that there was no way a user could tell whether the check ran successfully if it produced only some of the hits. So now the check checks itself and will bail out if there are not exactly 3 hits (sketched below). The Windows client now refuses to be called with anything other than -c 1; or rather, it always sets the number of CPUs to 1. Unfortunately there is a small annoying message even if you set -c 1, but it doesn't break anything.
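The idea behind the self-checking -x run, as a rough sketch - the helper name is an assumption, not the actual LBC code:

```perl
# Sketch of the -x self-test: anything other than exactly 3 hits
# is treated as a failure, so a silently broken client cannot pass.
use strict;
use warnings;

my $hits = run_test_block();   # assumed: runs the known test vectors
die "self-test FAILED: expected exactly 3 hits, got $hits\n"
    unless $hits == 3;
print "self-test OK: 3/3 hits\n";

sub run_test_block { return 3 }   # stub so the sketch runs standalone
```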
----
All in all, I certainly hope this was the last incident of this type; hopefully the dust will settle now and everything will run smoothly. Good hunting everyone!
Rico