Anyone interested in a rundown of the KDB problem the other day ...
Note this is TL;DR for most.
The cause was related to the pool size, not a bug in my KDB code 'as such'
The sequence number code allows tracking 2^20 (~1 million) sequence numbers for shares.
This tracking is needed to handle out-of-order data, but the code of course doesn't keep that data forever: it reuses existing entries when it needs more space.
This reuse means that it can only track ~1 million different share sequence numbers at any one time.
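To illustrate the reuse described above, here is a minimal sketch of a fixed-size sequence tracking window. It is not the actual KDB code (which isn't shown here); the class name SeqWindow, the record() method and the slot layout are all made up for illustration, and the window is shrunk to 8 entries in the usage example so the effect is easy to see.

```python
class SeqWindow:
    """Hypothetical fixed-size tracker for share sequence numbers.

    Assumption: slots are reused by indexing with seq % size, so only the
    most recent `size` sequence numbers can be tracked at any one time.
    """

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # slot i holds the sequence number last stored there
        self.highest = -1           # highest sequence number seen so far

    def record(self, seq):
        """Classify seq as 'new', 'duplicate' or 'out_of_range'."""
        # Anything this far behind the newest entry has had its slot reused
        if seq <= self.highest - self.size:
            return "out_of_range"
        idx = seq % self.size
        if self.slots[idx] == seq:
            return "duplicate"
        # Overwrites whatever older sequence number was in this slot
        self.slots[idx] = seq
        self.highest = max(self.highest, seq)
        return "new"


if __name__ == "__main__":
    w = SeqWindow(8)            # the real tracker would use 2**20
    for seq in (0, 2, 1, 2, 20, 3):
        print(seq, w.record(seq))
```

Out-of-order arrivals (like 1 after 2) are fine as long as they stay inside the window. Once the newest sequence number has moved more than the window size ahead, older ones can no longer be tracked.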
This is way more than needed for normal running, but during a reload there's a sizeable time difference between the start of the reload and the data coming in from the pool code.
The catch comes once the reload has completed and KDB starts processing the queue of data from the pool that arrived while it was reloading.
The queued data can't be more than ~1 million shares apart from the reloaded data, which again is normally OK but can be a problem for a slow (long) reload done on the live server.
The extremely large log shows this is the cause of the problem, but there was so much data in the log file that it wasn't obvious at first why it happened.
The fix is twofold:
1) If I ever need to roll back the database to correct or generate missing shift data, I won't do it on the live server. Instead I'll use the backup server to generate it and copy it across, like I did every time except the first time during the problem.
2) I'll increase the share sequence tracking size by another 8x before I restart KDB again.
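As a rough idea of what the size change buys, the sketch below works out the window sizes and how long a reload could run before the incoming shares outrun the window. It assumes "another 8x" means multiplying the current 2^20 by 8. The share rate of 500 per second is a purely hypothetical number picked for the arithmetic, not a measured pool figure, and the function name is made up.

```python
CURRENT_WINDOW = 2 ** 20             # ~1 million, from the description above
BIGGER_WINDOW = CURRENT_WINDOW * 8   # assumption: "another 8x" means times 8, i.e. 2**23


def max_reload_seconds(window, shares_per_sec):
    """Seconds a reload can take before more than `window` shares have arrived."""
    return window / shares_per_sec


if __name__ == "__main__":
    rate = 500  # hypothetical shares per second, for illustration only
    for window in (CURRENT_WINDOW, BIGGER_WINDOW):
        minutes = max_reload_seconds(window, rate) / 60
        print(f"window {window}: about {minutes:.0f} minutes at {rate} shares/sec")
```

Whatever the real share rate is, the allowed reload time scales linearly with the window, so 8x the tracking size gives 8x the time.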
Mine on!