The set (S1, S2, S3, ...,Sn) is the cumulative sum of the geometrically distributed variables.The probability that X is to be found somewhere in that set is given by the sum of all probabilities that X = some S divided by n:
Pr(X∈(S1, S2, S3, … ,Sn))
= sum from 1 to n [Γ(X+n)/(Γ(n) * Γ(X+1)) * p^n * (1-p)^X]/n (1)
Here is where things start to get horribly, horribly wrong. Both in what you're trying to get, and how you get it.
First, I'll say again: If what you want is the balance after n rounds (which is what the text implies), it is simply (n - Sn/D)*B. That's it. Sn is the number of shares the pool had to pay. Any other derivation is nonsense. So I can only assume you really are trying to figure out the probability that a balance of X will be reached at some point within the first n rounds. Even if this is the case, the above quantity is misguided.
Let's begin with this formula. It is absolutely, and obviously, wrong.
The event X∈(S1, S2, S3, … ,Sn) contains the even X=Sj for any j. Therefore its probability must be at least the probability that X=Sj, which is the pmf of Sj at point X. So the probability must be greater than all the pmfs, so it can't be equal to their average.
The one place where the average of pmfs is meaningful is when you have a mixture variable. For example, if you have a variable Sr which is generated by first choosing a uniform random integer j between 1 and n and then setting it equal to Sj, its pmf will be given by this formula. But such a variable has no relevance whatsoever to the problem at hand.
If you do want to find the probability that X will be in the set, you need to treat this as a union (logic or) of the equalities X=Sj, which can be expanded with the inclusion-exclusion principle. I don't know if the expansion has a nice closed form in this case, I think it's extremely messy because of the dependencies between the Sj's.
I'm not following your train of thought because you're not explaining what, why and how you are trying to get.
Yes, my aim was to find the probability that a cumulative sum or balance of X or lower will be reached before n rounds. Firstly to define a pdf, compare with simulated results, and then sum the pdf to a cdf.
So now maybe you can understand why the quantity Pr(X∈(S1, S2, S3, … ,Sn)) is not meaningful. It is not a pmf of anything. The total area under this curve is more than 1, and you can't take areas under the curve for any insight. The events 0∈(S1, S2, S3, … ,Sn) and 1∈(S1, S2, S3, … ,Sn) are not mutually exclusive, so you can't disjunct them by simply adding the probabilities.
After more reading (and learning) I have to say I agree with you. So I really don't know how the simulation matches the 'mean negative binomial pdf' (which you show isn't meaningful). An error in concept maybe? I wrote a
simpler simulation with no special libraries required, commented, doesn't make shortcuts and
should be possible to follow if you have time.
The sim works as follows: Take
n geometrically distributed random variables with the same
p, subtracting
1/p from each variable so the expectation of each variable is 0. The take the cumulative sum of the variables. For example let
p = 10^-6 and
n = 10:
Geometrically distributed random variables:
[1] 1182417 3279310 482644 550187 404841 183692 1106328 1663533 1716093 5213483
Geometrically distributed random variables minus 10^6
[1] 182417 2279310 -517356 -449813 -595159 -816308 106328 663533 716093 4213483
Cumulative sum of (geometrically distributed random variables minus 10^6)
[1] 182417 2461727 1944371 1494558 899399 83091 189419 852952 1569045 5782528
Now make
m iterations of the cumulative sums to
n and plot the histogram or sample density for the entire
m*n samples. For example, if
m=10, you'd plot the density or histogram of:
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9 m=10
n=1 182417 -567342 -844019 -992762 -965404 -621776 -163689 34795 -790737 219609
n=2 2461727 -1328229 -1116118 -1352713 -1693465 -574916 -744127 -769836 -966007 -259975
n=3 1944371 -370917 -1951890 -1044311 -2358626 -1014358 -1149823 755121 125528 -1139511
n=4 1494558 1608634 -1645397 16903 -1972138 -1336607 -1674819 318716 -393838 1008137
n=5 899399 1333825 1530411 -96895 -2960904 -1578059 -2312004 299624 -1354225 897368
n=6 83091 414391 4826197 9793 -3182523 -2087554 -2619065 -374199 -1590228 91992
n=7 189419 568167 4014970 -681510 -372781 -2077967 -1813568 -159466 -1914558 -357282
n=8 852952 295560 3170795 -1481067 -239713 -2468778 -226886 -342762 -2101092 -1334599
n=9 1569045 683593 3718213 -2315456 1281720 -1220656 -684308 56256 -2411408 -913902
n=10 5782528 169347 2806338 -3215373 1084351 -1431807 -1526684 -254378 -3073730 -1742821
where n = the number of "fails" before each success, and m = the iteration number (the 'mth' time a sample of n was taken)
This gives a density plot or histogram which corresponds to the probability density of the distribution from which it came. If I then multiply the cumulative sums by B/D, I can then create a cdf and find, as you state, "the probability that the balance will reach X or lower at some point within the first n rounds".
So, at some point I somehow I came up with the supposed "mean negative binomial density" and wrote such a simulation which appeared to 'confirm' it. The sim I linked to plots the simulated cumulative sum densities against the "mean negative binomial pdf" for comparison, as below.
If you click on the graph for a closer view, you'll see that the densities match very well.
I'm aware that random variable based simulations have limitations, but such a clear correlation is unlikely to be random. Since the "mean negative binomial pdf" is not just wrong but not meaningful, have I misunderstood what I was finding or looking for in the simulation?
Secondly, and possibly more interestingly, after your comment about the clt I looked for and found that for large n (larger than ~ 100) the simulated densities match skewed generalized error distributions quite well. Since I don't think I'd be able to use the inclusion-exclusion principle easily (tbh a the method to achieve a sum to a non-specific 'n' for this principle is beyond me), I'd like to be able to replace the error on the blog with a skewed GED instead as a model. However I'm unwilling to do more work on it until I can get a consensus (well,
your advice
) as to whether I've simulated the cumulative sum (balance) at all times before n rounds.
I know I'm asking a lot for you to go through the sim and look at the results. I wish I could say "I'll owe you one" but I'll just settle for being grateful for either a confirmation that the simulation works or a reason why it doesn't. Right now, I'm just confused.
Thanks for your help.