
Topic: What is the probability of a 40 min 6 block streak? (Read 2354 times)

donator
Activity: 2058
Merit: 1007
Poor impulse control.
As is noted in the thread, the timestamps are inaccurate. Trying to make timestamps accurate requires the assumption that blocks are generated a particular way - but this is what you're testing, so you can't do that.

I have a source for more accurate "timestamps" (actually the first time a block has been recorded by a well connected monitor), but this doesn't fix the problem.

The problem is that blocks appear with respect to time as a non-homogeneous Poisson process rather than a homogeneous (usual type) Poisson process. They are only a homogeneous Poisson process with respect to hashes.

This is not usually an issue unless considered over many days, but if there are sudden changes in hashrate the block rate will be affected in a significantly non-homogeneous way. For example, I've noticed that block durations aren't actually exponentially distributed even if you try to normalise the data to account for the non-homogeneous nature of the process. I *think* this has something to do with miner hashrate changes at the start of a block, but it's hard to prove.

I don't doubt that some of what you see is the effect of the generation process being non-homogeneous. However, it might be that the relatively small sample you took, although it looks non-Poisson, is actually OK. You could use the R package dgof to do some discrete goodness-of-fit tests, or you could find the confidence intervals for the histogram bins and see whether the bins are underfilled, overfilled, or within the expected range (if the bins are the same size they should have a binomial distribution with p = 1 / number of bins).

Great post.  I found that quite interesting. 

What is your thinking on the hash rate at the "start of a block".  Do you mean the "orphaned hash rate" due to miners working on the old headers before learning of a new block?

Is it homogeneous w.r.t. the hashes?  I'm assuming the hash rate is continuously changing? 

I think yes to both questions, but that's opinion based on old data. It seemed to be the case before Stratum; not sure if it's a significant effect now. I hope to get time to look at that again soon.

I also had a look at your site.  Great work b.t.w.  If you don't mind answering here (or is there a thread on your site?): your CI for the forecast appears narrower than the CI of the hash rate estimate. 

Thanks for the kind words :)

They're about the same, but one is offset with respect to the other. It's annoying, and I think it's because the forecast method makes assumptions about the residuals that don't hold here. I'm not sure how to fix that.

sr. member
Activity: 362
Merit: 262
As is noted in the thread, the timestamps are inaccurate. Trying to make timestamps accurate requires the assumption that blocks are generated a particular way - but this is what you're testing, so you can't do that.

I have a source for more accurate "timestamps" (actually the first time a block has been recorded by a well connected monitor), but this doesn't fix the problem.

The problem is that blocks appear with respect to time as a non-homogeneous Poisson process rather than a homogeneous (usual type) Poisson process. They are only a homogeneous Poisson process with respect to hashes.

This is not usually an issue unless considered over many days, but if there are sudden changes in hashrate the block rate will be affected in a significantly non-homogeneous way. For example, I've noticed that block durations aren't actually exponentially distributed even if you try to normalise the data to account for the non-homogeneous nature of the process. I *think* this has something to do with miner hashrate changes at the start of a block, but it's hard to prove.

I don't doubt that some of what you see is the effect of the generation process being non-homogeneous. However, it might be that the relatively small sample you took, although it looks non-Poisson, is actually OK. You could use the R package dgof to do some discrete goodness-of-fit tests, or you could find the confidence intervals for the histogram bins and see whether the bins are underfilled, overfilled, or within the expected range (if the bins are the same size they should have a binomial distribution with p = 1 / number of bins).


Great post.  I found that quite interesting. 

What is your thinking on the hash rate at the "start of a block".  Do you mean the "orphaned hash rate" due to miners working on the old headers before learning of a new block?

Is it homogeneous w.r.t. the hashes?  I'm assuming the hash rate is continuously changing? 

I also had a look at your site.  Great work b.t.w.  If you don't mind answering here (or is there a thread on your site?): your CI for the forecast appears narrower than the CI of the hash rate estimate. 


donator
Activity: 2058
Merit: 1007
Poor impulse control.
As is noted in the thread, the timestamps are inaccurate. Trying to make timestamps accurate requires the assumption that blocks are generated a particular way - but this is what you're testing, so you can't do that.

I have a source for more accurate "timestamps" (actually the first time a block has been recorded by a well connected monitor), but this doesn't fix the problem.

The problem is that blocks appear with respect to time as a non-homogeneous Poisson process rather than a homogeneous (usual type) Poisson process. They are only a homogeneous Poisson process with respect to hashes.

This is not usually an issue unless considered over many days, but if there are sudden changes in hashrate the block rate will be affected in a significantly non-homogeneous way. For example, I've noticed that block durations aren't actually exponentially distributed even if you try to normalise the data to account for the non-homogeneous nature of the process. I *think* this has something to do with miner hashrate changes at the start of a block, but it's hard to prove.

I don't doubt that some of what you see is the effect of the generation process being non-homogeneous. However, it might be that the relatively small sample you took, although it looks non-Poisson, is actually OK. You could use the R package dgof to do some discrete goodness-of-fit tests, or you could find the confidence intervals for the histogram bins and see whether the bins are underfilled, overfilled, or within the expected range (if the bins are the same size they should have a binomial distribution with p = 1 / number of bins).
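To make the suggested bin check concrete, here is a minimal sketch (my own illustration with stand-in data, not code from the thread): cut the durations into k bins of equal probability under a fitted exponential, then compare each bin count with its binomial 95% range, p = 1/k.

Code:
set.seed(1)
durations <- rexp(143, rate = 1/600)   # stand-in data: 600 s mean block time
k <- 10
# bin edges with equal probability 1/k under the fitted exponential
breaks <- qexp(seq(0, 1, length.out = k + 1), rate = 1 / mean(durations))
counts <- as.vector(table(cut(durations, breaks, include.lowest = TRUE)))
# 95% binomial range for a single bin count under the null
ci <- qbinom(c(0.025, 0.975), size = length(durations), prob = 1 / k)
data.frame(bin = 1:k, count = counts, lower = ci[1], upper = ci[2],
           within = counts >= ci[1] & counts <= ci[2])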
legendary
Activity: 3472
Merit: 4801
I think that everyone will agree that two consecutive timestamps that show a negative interval have an incorrect timestamp somewhere.

The timestamps in the blocks are not intended to be completely accurate.  I believe they can vary by plus or minus a few hours.  I think I've read that some miners (and/or mining pools) will use the block timestamp as an extra nonce so that they don't need to rebuild the merkle root as often.  The timestamp is only intended to be used for calculating the new difficulty every 2016 blocks.  A variation of 7200 seconds (2 hours) over the course of 2016 blocks works out to only about 3.6 seconds per block.  That's relatively insignificant when compared to the natural variations that will occur due to the random nature of the proof-of-work process.

I'm not sure what you are investigating, or what you are trying to determine, but modifying unreliable data to make it fit some preconceived expectation is typically a bad idea.

legendary
Activity: 1246
Merit: 1002
I'm still looking at some of the blockchain timestamp data.

I typed in the block numbers, 370,944 to 371,087, and the block times from blockchain.info and saved it as a .csv file.  This is the start of the current difficulty epoch continuing for approximately 24 hours.  I can post the whole file somewhere if someone has a suggestion.

Code:
> temp[ c(1:5,140:144), ]
     block mon day year hr min sec
1   370944   8  22 2015  0  49  43
2   370945   8  22 2015  1   4  59
3   370946   8  22 2015  1  10  29
4   370947   8  22 2015  2   5   5
5   370948   8  22 2015  2  10  31
140 371083   8  22 2015 22   7  41
141 371084   8  22 2015 22  21  10
142 371085   8  22 2015 22  24  50
143 371086   8  22 2015 22  33   5
144 371087   8  22 2015 22  35   2

The blocks following these blocks show a negative time increment.
It might be interesting to see if these pairs of blocks are over-represented by any particular miner.  I don't know how to find who mined a particular block.

Code:
> blocktimes[ which(delta < 0), ]
     block mon day year hr min sec  time
7   370950   8  22 2015  2  38  33  9513
21  370964   8  22 2015  6  11  29 22289
34  370977   8  22 2015  7  18  53 26333
50  370993   8  22 2015  9  22  24 33744
114 371057   8  22 2015 19   7  28 68848
131 371074   8  22 2015 20  51  29 75089


I manipulated this data in R with commands similar to these (reconstructed from notes, not copied exactly from the log file).
I don't know yet if it is algorithmically possible to "fit" a Poisson to the distribution data; a sketch of one approach follows below the image link.

Code:
temp <- read.csv("Documents/blockchain calctimes.csv", header = TRUE)
# seconds since midnight for each block (all rows fall on 2015-08-22)
temp$time <- 60 * (60 * temp$hr + temp$min) + temp$sec
blocktimes <- temp
# inter-block intervals in seconds: 143 deltas for 144 blocks
delta <- diff(blocktimes$time)
# note: min(delta) is -711, so shift by +711 before binning
png(filename = "blockchain-poisson.png")
plot(tb1 <- table(cut(delta + 711, seq(0, 3276, 300), right = FALSE)),
     ylim = c(0, 50))
# overlay a scaled Poisson pmf (lambda = 2) as a rough eyeball fit
n <- 9; x <- 0:n; y <- dpois(x, 2.0)
points(2 + x, 136 * y, col = "red")
dev.off()

I wasn't able to link to the image.  I put it on Google+ as https://plus.google.com/u/0/photos/115426745065196075335/albums/6187408748966855121/6187408753859123554
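On the fitting question: for counts per fixed interval it is straightforward, because the maximum-likelihood estimate of a Poisson lambda is just the sample mean. A sketch using the blocktimes data frame built above (note the partial first and last hours bias lambda slightly low):

Code:
# count blocks per hour of the day, then fit lambda by MLE (the mean)
counts <- table(cut(blocktimes$time, seq(0, 86400, 3600)))
lambda <- mean(counts)
# compare how often each count occurs with the fitted Poisson pmf
observed <- table(factor(as.vector(counts), levels = 0:max(counts)))
expected <- length(counts) * dpois(0:max(counts), lambda)
rbind(observed = as.vector(observed), expected = round(expected, 1))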




I think that everyone will agree that two consecutive timestamps that show a negative interval have an incorrect timestamp somewhere.  I am pretty sure I can repair the data by modifying one of those two timestamps to give data that is much closer to a realistic Poisson distribution.  I try to be very conservative when I repair data.  I haven't explored that process yet.
legendary
Activity: 1512
Merit: 1036
Is there a tool that can take a block number and a block count, and return the number of minutes between successive blocks?  While I can build this by hand from blockchain.info, there is a certain tediousness to it.

For example, I would like to start at block 370944, the beginning of the current epoch, and continue for some small number, perhaps 18 or 24.



Each bitcoin block has a timestamp, but it is set by the miner (from the local clock on the bitcoind machine or on the pool server) when the block template is generated for hashing. There are many blocks that have a negative timestamp offset compared with the previous block due to differences in computer clocks.

It may be more reliable to have a listening node monitor the time that new blocks are published on the network (which propagate everywhere within seconds) if you are not doing historical analysis.


In Bitcoin Core, you can get the timestamps out of your local blockchain, but it requires chaining two RPC commands: one to get the block hash, and one to dump that block using the hash.

Here's a post I wrote describing a script to do this. Replace "bitcoind" with "bitcoin-cli" when using the latest Bitcoin software.

Here's a PM I wrote someone else with the details of what to do.

Quote from: deepceleron
I have a CSV of block times: https://bitcointalksearch.org/topic/m.1453722

I dumped them on Windows with this "dumptime.cmd" in the bitcoind directory (and then added some more spreadsheet columns to turn the epoch time into a readable time); here it dumps times for blocks 50000-99999:

Code:
@echo off 
setlocal enableextensions
set /a height=50000
rem echo --start > timeout.txt
:beg
rem first RPC: look up the block hash for this height
for /f "tokens=* delims=:" %%a in (
'bitcoind getblockhash %height%'
) do (
set hash=%%a
)

rem second RPC: dump the block and keep only its "time" line
for /f "tokens=*" %%a in (
'bitcoind getblock %hash% ^| find "time"'
) do (
set blktim=%%a
)
echo %height%: %blktim%
echo %height%: %blktim% >> timeout.txt

set /a height = height + 1
IF %height% LEQ 99999 goto beg

endlocal 



Code:
blocknum,epochtime,blocksec,datetime
0,1231006505,0,2009-01-03T18:15:05Z
1,1231469665,0,2009-01-09T02:54:25Z
2,1231469744,79,2009-01-09T02:55:44Z
3,1231470173,429,2009-01-09T03:02:53Z
4,1231470988,815,2009-01-09T03:16:28Z
5,1231471428,440,2009-01-09T03:23:48Z
6,1231471789,361,2009-01-09T03:29:49Z
7,1231472369,580,2009-01-09T03:39:29Z
8,1231472743,374,2009-01-09T03:45:43Z
9,1231473279,536,2009-01-09T03:54:39Z
10,1231473952,673,2009-01-09T04:05:52Z
11,1231474360,408,2009-01-09T04:12:40Z
12,1231474888,528,2009-01-09T04:21:28Z
13,1231475020,132,2009-01-09T04:23:40Z
14,1231475589,569,2009-01-09T04:33:09Z
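For what it's worth, a rough R equivalent of the script above (my sketch, with the same caveats: it assumes a synced node and bitcoin-cli on the PATH):

Code:
# chain the two RPC calls: height -> hash -> block JSON -> "time" field
block_time <- function(height) {
  hash <- system2("bitcoin-cli", c("getblockhash", height), stdout = TRUE)
  json <- paste(system2("bitcoin-cli", c("getblock", hash), stdout = TRUE),
                collapse = "")
  as.numeric(sub('.*"time": ?([0-9]+).*', "\\1", json))
}
epochtime <- sapply(370944:370967, block_time)  # epoch start + 24 blocks
diff(epochtime) / 60                            # minutes between blocks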
legendary
Activity: 1246
Merit: 1002
Is there a tool that can take a block number and a block count, and return the number of minutes between successive blocks?  While I can build this by hand from blockchain.info, there is a certain tediousness to it.

For example, I would like to start at block 370944, the beginning of the current epoch, and continue for some small number, perhaps 18 or 24.

hero member
Activity: 836
Merit: 1030
bits of proof
I think a more interesting question is: how large is the probability of not finding a block within a given time period, for a mining pool with an x% market share?

Knowing this enables you to audit pools.

Example:

Slush did not mine a single block for more than 2 days, between 19 and 21 Jun 2015; see
https://mining.bitcoin.cz/stats/blocks/?page=23

Slush's market share was 2.2% around that time, see http://organofcorti.blogspot.hu/2015/06/june-14th-2015-block-maker-statistics.html

Such bad luck has a probability of only 0.17%.
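The arithmetic behind that figure, sketched in R (assuming a 2.2% share, 144 network blocks per day, and a 2-day window):

Code:
s <- 0.022; d <- 2      # market share and window length in days
lambda <- s * 144 * d   # expected pool blocks in the window
dpois(0, lambda)        # P(no block) = exp(-lambda) ~= 0.0018, i.e. ~0.17-0.18%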

hero member
Activity: 836
Merit: 1030
bits of proof
- snip -
the probability of 5 blocks within the next 40 mins is 15%.

Do I understand it correctly then if I say that the probability of 5 or more blocks within the next 40 minutes (and therefore within the 40 minutes immediately following the broadcast of a block) is 26%?

The probability of 5 or more blocks within the next 40 minutes is 1 - N[CDF[PoissonDistribution[4], 4]] = 37%; I made a correction in my first reply, see the strikethrough.

Since events are independent, it does not matter if you are just after a block or not.

The probability of 5 or more blocks includes the probability of exactly 5, the probability of exactly 6, and so on.

The way you deal with this is to use the cumulative probability function I enumerated in the previous post. Using that table you can compute the probability of any block-count range within 3 hours.

Examples:

The probability of 15 or fewer blocks is 28.6%
The probability of more than 20 blocks in 3 hours is 1 - 0.73072 = 26.9%
The probability of 16-20 blocks is 0.73072 - 0.286653 = 44.4%
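For anyone without Mathematica, the same three numbers fall out of R's ppois (a quick cross-check with lambda = 18):

Code:
ppois(15, 18)                  # P(N <= 15)       ~= 0.2867
1 - ppois(20, 18)              # P(N > 20)        ~= 0.2693
ppois(20, 18) - ppois(15, 18)  # P(16 <= N <= 20) ~= 0.4441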
hero member
Activity: 836
Merit: 1030
bits of proof
Based on what I have seen, however, I wish I had another plot, this time of the Poisson distribution as you have presented it, only for lambda = 18 (the expected network block production over 3 hours), and running out to k = 36. 

While I am wishing, I also want a numeric table of the cumulative distribution function for the same distribution.

here you are:

1. probability of exactly n blocks within 3 hours:

[plot omitted: Poisson pmf for lambda = 18]

2. cumulative numeric values, i.e. the probability of <= n blocks within 3 hours:
Code:
{
 {0, 1.523E-8},
 {1, 2.8937E-7},
 {2, 2.75663E-6},
 {3, 0.0000175602},
 {4, 0.0000841761},
 {5, 0.000323993},
 {6, 0.00104345},
 {7, 0.00289347},
 {8, 0.00705601},
 {9, 0.0153811},
 {10, 0.0303663},
 {11, 0.0548874},
 {12, 0.0916692},
 {13, 0.142598},
 {14, 0.208077},
 {15, 0.286653},
 {16, 0.37505},
 {17, 0.468648},
 {18, 0.562245},
 {19, 0.650916},
 {20, 0.73072},
 {21, 0.799124},
 {22, 0.85509},
 {23, 0.89889},
 {24, 0.93174},
 {25, 0.955392},
 {26, 0.971766},
 {27, 0.982682},
 {28, 0.9897},
 {29, 0.994056},
 {30, 0.996669},
 {31, 0.998187},
 {32, 0.99904},
 {33, 0.999506},
 {34, 0.999752},
 {35, 0.999879},
 {36, 0.999942},
 {37, 0.999973},
 {38, 0.999988}
}
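The same plot and table can also be produced in base R (a quick equivalent of the Mathematica output above, no packages needed):

Code:
k <- 0:36
plot(k, dpois(k, 18), type = "h",
     xlab = "blocks in 3 hours", ylab = "probability")
round(cbind(k, cdf = ppois(k, 18)), 6)  # cf. the table posted above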
newbie
Activity: 4
Merit: 1
A small pedantic point to add here: the question in the OP and the thread title are slightly different.

OP: What is the probability that 6 blocks will be found in 40 minutes? - The Poisson distribution is appropriate here because block discoveries are independent events.

Thread title: Re: What is the probability of a 40 min 6 block streak? - I assume the word streak here means chain? 6 independently discovered blocks do not necessarily extend the block chain by 6, due to orphaned blocks.

So to answer the thread title: the probability of a 6-block chain in 40 minutes is slightly less than (1 - N[CDF[PoissonDistribution[4], 5]]), because one of those discovered blocks may become orphaned (due to block propagation speeds or other reasons).
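As a rough illustration of "slightly less" (the orphan rate here is an assumed figure, not a measurement): if each found block is orphaned independently with probability q, all six finds survive into one chain with probability roughly (1 - q)^6.

Code:
q <- 0.01                       # assumed per-block orphan probability
(1 - ppois(5, 4)) * (1 - q)^6   # ~= 0.202, versus ~0.215 unadjusted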
legendary
Activity: 1246
Merit: 1002
Thanks!  I have read some of the wiki page until my head got full.  I'll go read more later.

I don't have access to Mathematica, and the documentation in R is a bit more than I want to tackle in the next short while. 

Based on what I have seen, however, I wish I had another plot, this time of the Poisson distribution as you have presented it, only for lambda = 18 (the expected network block production over 3 hours), and running out to k = 36. 

While I am wishing, I also want a numeric table of the cumulative distribution function for the same distribution.

If anyone has the R to do this, I would really appreciate seeing it.  While I could code this myself in R from the wiki definitions, it will take some time.

hero member
Activity: 836
Merit: 1030
bits of proof
Yes, as you can see on the plot I added in parallel to my first reply, the probability of 5 blocks within the next 40 mins is 15%.
legendary
Activity: 3472
Merit: 4801
Of course, it is important to note that the numbers reported by grau assume that an arbitrary 40 minute period is chosen at random without selection bias.

If on the other hand you start with an already solved block and ask what the probability is that another 5 blocks will be solved within the 40 minutes immediately after seeing the first solved block, I believe the probability is higher.  In that case you are essentially asking what the odds are that 5 (or more) blocks will be solved in the next 39 minutes and 59.999... seconds.
hero member
Activity: 836
Merit: 1030
bits of proof
In probability theory and statistics, the Poisson distribution (French pronunciation [pwasɔ̃]; in English usually /ˈpwɑːsɒn/), named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.

From: https://en.wikipedia.org/wiki/Poisson_distribution

Mathematica says that N[PDF[PoissonDistribution[4], 6]] is 10.4%; that is the probability of exactly 6 blocks per 40 min.
The probability of 6 or more blocks in 40 minutes is 1 - N[CDF[PoissonDistribution[4], 5]], or 21% (the earlier 1 - N[CDF[PoissonDistribution[4], 6]], or 11%, is struck through as a correction).

Below is the plot of the probability of n blocks per 40 min:

[plot omitted: Poisson pmf for lambda = 4]

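The same figures can be cross-checked in R (equivalent to the Mathematica calls above):

Code:
dpois(6, 4)      # exactly 6 blocks in 40 min, ~10.4%
1 - ppois(5, 4)  # 6 or more blocks in 40 min, ~21%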
legendary
Activity: 1246
Merit: 1002
When the difficulty and network hash rate are in sync, what is the probability that 6 blocks will be found in 40 minutes?

Jump to: