Hi Mr. organofcorti,
Please don't call me that. "Mr. organofcorti" is my father. Just "ooc" will be fine.
The theoretical part can be computed with formulas, but that's just a guess.
No, it's not a guess. That's why it's called "Theory". Are you really so certain that SatoshiDice are not playing fair that you want to find the slightest variation from the expected values?
Here we have the actual historical data of what happened. The odds of winning are very close to the actual win:loss ratio. The formula for consecutive losses isn't as close to the actual losses, especially with that weird 65 losses in a row.
There is no "formula for consecutive losses". There is one for the expected number of consecutive losses in n trials, in the same way that there is an expected number of shares needed to solve a block.
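For what it's worth, a standard first-order approximation for that expected value is log base 1/q of n, where q is the per-bet probability of a loss (roughly log2 of n for a fair 50/50 game). A minimal sketch in Python, with the correction terms of constant order ignored, so treat it as a rough guide rather than an exact result:

```python
from math import log

def expected_longest_loss_run(n_trials, p_loss):
    # First-order approximation to the expected longest run of losses
    # in n_trials independent bets, each lost with probability p_loss:
    # log base (1/p_loss) of n_trials. Constant-order corrections are
    # ignored, so this is only a rough guide.
    return log(n_trials, 1.0 / p_loss)

print(expected_longest_loss_run(1000000, 0.5))  # ~19.9 for a 50/50 game
```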
Correct me if I'm mistaken, but statistics is based on actuarial data. Theoretical possibilities or probabilities are just that: it's possible, or it's probable, but not necessarily what already happened. It can still happen.
In any case, both the formula and the historical data are interesting. Why do some match and why do others not seem to match... or will these numbers change a few years from now, or will it be more of the same ... or it doesn't matter since the house always wins.
You're not understanding how to analyse this. It is a little complicated, and you have a bit of a learning curve ahead, but you're still confusing the expected or average number of consecutive wins/losses in n trials with a probabilistic confidence interval. The reason some results are closer to the expected value than others has to do with how likely a particular run of losses is. For this we use probability rather than statistics.
For example, if a run contains twice as many losses as expected, we can calculate the probability of that occurring, assuming that SatoshiDice is playing fair.
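To make that concrete: the probability of seeing at least one run of k or more losses somewhere in n trials can be computed exactly with a small recursion, conditioning on how many losses occur before the first win. A sketch, assuming independent trials with loss probability q (the function name is mine):

```python
def prob_run_at_least(n, k, q):
    # P(at least one run of k or more consecutive losses in n
    # independent trials, each lost with probability q).
    # f[i] = P(no such run in i trials); recurse on the number of
    # losses that occur before the first win.
    p = 1.0 - q
    f = [1.0] * (n + 1)
    for i in range(k, n + 1):
        f[i] = sum(q ** j * p * f[i - j - 1] for j in range(k))
    return 1.0 - f[n]

# The expected longest run in 10000 fair trials is ~13; a run twice
# that long is very unlikely if the game really is 50/50:
print(prob_run_at_least(10000, 27, 0.5))  # ~4e-05
```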
If you're trying to prove SatoshiDice is not playing fair, then you use statistics: state a null and an alternative hypothesis and support your point at a certain level of confidence (usually p < 0.05).
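As a sketch of that kind of test, here is an exact binomial test of an observed win count against the advertised win probability. The counts below are invented purely for illustration, and I'm using scipy's binomtest just for convenience:

```python
from scipy.stats import binomtest  # SciPy >= 1.7

# Hypothetical numbers for illustration only: a bet type advertises
# a 48.8% win chance, and you observe 4850 wins in 10000 bets.
# H0: the true win probability is 0.488 (i.e. SatoshiDice plays fair).
result = binomtest(k=4850, n=10000, p=0.488, alternative="two-sided")
print(result.pvalue)  # reject H0 only if this falls below 0.05
```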
The problem with trying to find these confidence intervals using real-life data is that you will need hundreds of repeats of the experiment to find out whether the data agree with theory.
So you pick a game, decide on the number of trials you want to investigate (say 100), and split the data into consecutive blocks of 100 trials. Find the longest run of consecutive losses in each block, and take the average. Only then can you compare the theoretical expected value with the actual average value.
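A sketch of that procedure, assuming the bet history is simply a list of booleans (True for a win):

```python
def longest_loss_run(outcomes):
    # Length of the longest run of consecutive losses (False values)
    # in a win/loss history.
    longest = current = 0
    for won in outcomes:
        current = 0 if won else current + 1
        longest = max(longest, current)
    return longest

def average_longest_run(outcomes, block=100):
    # Split the history into consecutive blocks of `block` trials and
    # average the longest losing streak found in each block.
    blocks = [outcomes[i:i + block]
              for i in range(0, len(outcomes) - block + 1, block)]
    return sum(longest_loss_run(b) for b in blocks) / len(blocks)
```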
If you want to find the probability of a run of a particular length occurring, use the same data you recorded above and divide the longest runs into percentile groups. Say you record 10000 blocks of 100 coin-flip trials each, and in 0.01% of the blocks there are more than, say, 20 consecutive losses. You can then say that for the sample group the probability of more than 20 losses is 0.01%, and that 99.99% of the time the longest run of losses in 100 trials will be shorter than that.
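Simulating that with fair coin flips (reusing longest_loss_run from the sketch above) gives the empirical tail probability directly:

```python
import random

random.seed(1)  # reproducible fair-coin simulation
blocks = [[random.random() < 0.5 for _ in range(100)]
          for _ in range(10000)]
longest = [longest_loss_run(b) for b in blocks]

threshold = 20
tail = sum(1 for m in longest if m > threshold) / len(longest)
print(f"P(longest run > {threshold}) is about {tail:.5f}")
```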
If you compare the quantile groups you calculated to theoretical quantiles, or the empirical probabilities to theoretical probabilities, you can then test how likely it is that the dataset was drawn from the theoretical probability distribution, i.e. a goodness-of-fit test.
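One way to run that comparison is a chi-square goodness-of-fit test of the simulated longest-run counts against the exact theoretical distribution, which can be derived from prob_run_at_least above. The bin edges below are my own choice, picked so the sparse tails get lumped together and no expected count is tiny:

```python
from collections import Counter
from scipy.stats import chisquare

n_trials, q = 100, 0.5

def cdf(k):
    # P(longest losing run in n_trials trials is shorter than k),
    # using the exact recursion from prob_run_at_least above.
    return 1.0 - prob_run_at_least(n_trials, k, q)

pmf = [cdf(k + 1) - cdf(k) for k in range(n_trials + 1)]

# Lump sparse tails into wider bins so expected counts aren't tiny.
edges = [0, 4, 5, 6, 7, 8, 9, 12, n_trials + 1]
counts = Counter(longest)  # `longest` from the simulation above
f_obs = [sum(counts.get(k, 0) for k in range(a, b))
         for a, b in zip(edges, edges[1:])]
f_exp = [sum(pmf[a:b]) * len(longest) for a, b in zip(edges, edges[1:])]
print(chisquare(f_obs, f_exp))  # a small p-value suggests a mismatch
```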
Edit: By all means carry on with your investigations. I find this sort of thing very interesting too, and it might provide a great opportunity for you to learn more about statistics and probability. Personally, I find it hard to learn things unless I need to solve a problem in which I'm interested. Perhaps the same will be true for you.