Analyze away.
I think it's easy to spot the following cut-off points in your sample dataset:
- a fee over ~5300 s/byte will guarantee inclusion in the next block
- a fee over ~3000 s/byte will guarantee inclusion in the next two blocks
- a fee over ~500 s/byte will guarantee inclusion in the next three blocks (*)
- etc.
- even txns with 0 fee are picked up in less than ~35 blocks
(*) The only exception to this is a single transaction with a ~2000 s/byte fee that took about 7 blocks; I suspect it didn't have all its inputs confirmed, which is why it's an outlier. So the correct block delay should be computed from the moment when all inputs are confirmed.
"Guaranteed" should of course be understood "in light of historical data". You can make no guarantee that a transaction will be picked up, but absent a major shift in the behavior of miners or transaction rate, historical data is our best chance to compute a "fair" fee that won't waste money for a given confirmation speed. If the client sets a fee that is too low for current conditions, the resulting delay will signal to other clients the new market conditions and they will increase the fee accordingly. A fee that is too high can also be detected by the clients, if they target say a 90-95% inclusion at a given block instead of 100% guarantee of inclusion.
Given the full set of transactions as you have provided, the algorithm to compute the correct fee seems to be a simple binary search:
start with a guesstimate fee F = maxfee/2
set search step S = F
set delay target T (e.g. 3 blocks)
set inclusion confidence L (e.g. 90%)
set fee precision epsilon (e.g. 1%)
loop:
    n = count(transactions with fee <= F AND confirmation delay <= T)
    N = count(transactions with fee > F AND confirmation delay > T)
    compare n/(n+N) with L
    F = F +/- S  (subtract S if n/(n+N) >= L, add S otherwise)
    S = S/2
    break if S < epsilon*F
return F
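The pseudocode above can be sketched in Python like this (a sketch under my own assumptions: `transactions` is a list of `(fee, delay)` pairs, fee in satoshi/byte and delay in blocks; the iteration cap is my addition to guard against degenerate data where the step never undershoots the precision target):

```python
def estimate_fee(transactions, max_fee, target_delay=3,
                 confidence=0.90, epsilon=0.01):
    F = max_fee / 2.0   # guesstimate starting fee
    S = F               # search step, halved each iteration
    for _ in range(64):  # safety cap; normally breaks out much earlier
        # quadrant counts: cheap-and-fast vs. expensive-and-slow
        n = sum(1 for fee, delay in transactions
                if fee <= F and delay <= target_delay)
        N = sum(1 for fee, delay in transactions
                if fee > F and delay > target_delay)
        total = n + N
        ratio = n / total if total else 0.0
        # if enough transactions at or below F confirmed in time,
        # try a lower fee; otherwise raise it
        F += -S if ratio >= confidence else S
        S /= 2.0
        if S < epsilon * F:
            break
    return F
```

On a toy dataset where every transaction paying at least 100 s/byte confirms in 1 block and everything cheaper takes 10 blocks, the search converges to just under the 100 s/byte threshold, as you'd expect.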
So I basically split the data into four quadrants depending on fee and delay, as compared to our delay target and the candidate fee. I discard the transactions with low fee and high delay, as well as those with high fee and low delay, since they bring us no information. Then I compare the data in the remaining quadrants with our target and keep searching for the optimal fee for achieving our goal. I don't have any functional code, but this should work; at least that's how the theory goes.
The trouble with this algorithm is that it doesn't scale well, because it needs to walk all transactions multiple times; at high transaction rates that could be a killer. So instead of walking the actual transactions, you could sample them into discrete bins just once, then run the algorithm on the bins. The bins would form a 2D table, and each bin would be a single integer recording how many transactions match the given fee interval and delay. For the fee axis I would go with an exponential pace, for example each bin representing a 10% increase in fee. Of course the precision of the fee determination drops, but you only need to build the table once, and after that it's just a few kB to download, from a central location for example.
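A rough sketch of the binned variant (same assumed `(fee, delay)` pairs as before; the bin sizes, `min_fee`, and helper names are my own choices, not anything from the dataset):

```python
import math
from collections import defaultdict

def fee_bin(fee, min_fee=1.0, growth=1.10):
    # exponential fee axis: each bin covers a 10% increase in fee
    return 0 if fee < min_fee else int(math.log(fee / min_fee, growth)) + 1

def build_table(transactions, max_delay=35):
    """table[(i, d)] = count of transactions in fee bin i confirmed after d blocks."""
    table = defaultdict(int)
    for fee, delay in transactions:
        table[(fee_bin(fee), min(delay, max_delay))] += 1
    return table

def quadrant_counts(table, fee, target_delay):
    """The same n/N quadrant counts as before, computed from bins, not raw data."""
    b = fee_bin(fee)
    n = sum(c for (i, d), c in table.items() if i <= b and d <= target_delay)
    N = sum(c for (i, d), c in table.items() if i > b and d > target_delay)
    return n, N
```

The binary search then calls `quadrant_counts` instead of scanning the transaction list, so each iteration costs time proportional to the (small, fixed) table size rather than the transaction volume.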
An even better algorithm could take the current unconfirmed transaction rate into consideration and adjust the fee accordingly (linearly? needs experimentation); this is especially important when the block size limit is reached and the fee threshold transitions from a fixed minimum set by the miner ("don't include transactions with less than x BTC fee") to a dynamic rate set by the market ("include the transactions with the largest fees that fit in a given block size").