Topic: SER_DISK vs SER_NETWORK (Read 1361 times)

legendary
Activity: 1526
Merit: 1134
September 29, 2012, 05:07:25 AM
#15
Sure, contributing back the blk parser would be welcome.

My concern with your heuristics is not that they will always be wrong (they won't), but that people will use whatever statistics you come up with to make judgements or even investment decisions, without understanding the quite serious caveats that go along with your methodologies. See: the Silk Road study, which is now being quoted as fact in various news sources despite being based on a VERY shaky set of assumptions.

hero member
Activity: 555
Merit: 654
September 28, 2012, 08:54:04 AM
#14
i would rather postpone the decision about "what was change" and merge addresses to entities. once you see a transaction that signs multiple inputs at once, you can "assume" that it was one entity and assign change status retroactively

Interesting. What would cover all spent outputs...
Also I can use that information to validate the naive method I suggested, and see the false positives/negatives ratio.

hero member
Activity: 668
Merit: 501
September 27, 2012, 07:58:20 PM
#13
i would rather postpone the decision about "what was change" and merge addresses to entities. once you see a transaction that signs multiple inputs at once, you can "assume" that it was one entity and assign change status retroactively
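
A minimal sketch of the "merge addresses to entities" idea, using a union-find structure. The class and method names are hypothetical, and the rule that co-signed inputs share one owner is the assumption stated above, not an established fact about any given transaction.

```python
class AddressClusters:
    """Groups addresses into entities via union-find (illustrative only)."""

    def __init__(self):
        self._parent = {}

    def find(self, addr):
        # Register unseen addresses as their own one-element cluster.
        self._parent.setdefault(addr, addr)
        root = addr
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[addr] != root:
            self._parent[addr], addr = root, self._parent[addr]
        return root

    def merge_inputs(self, input_addresses):
        # Assumption: all inputs signed in one transaction belong to one entity.
        addrs = list(input_addresses)
        if not addrs:
            return
        first_root = self.find(addrs[0])
        for other in addrs[1:]:
            self._parent[self.find(other)] = first_root

    def same_entity(self, a, b):
        return self.find(a) == self.find(b)


clusters = AddressClusters()
clusters.merge_inputs(["addrA", "addrB"])  # tx 1 spends from A and B
clusters.merge_inputs(["addrB", "addrC"])  # tx 2 spends from B and C
print(clusters.same_entity("addrA", "addrC"))
```

Once clusters exist, an output paying an address that later lands in the same cluster as the inputs can be labelled as change retroactively.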
legendary
Activity: 1400
Merit: 1005
September 27, 2012, 05:44:34 PM
#12

That assumption will be wrong 50% of the time...

Yes! I forgot the change position randomization!

But still it's generally possible to guess which output is the change, since :

1. The payment amount is always greater than the sum of the input amounts minus the smallest input amount.
3. The change amount is always smaller than any of the inputs.

The only case where this guessing fails is when there is a single input amount. In this case, generally the payment amount is an integer value, and the change is not, so you still can guess with some accuracy.

Best regards Pieter!
I'm glad someone finally is doing analysis based on these assumptions!  No, they aren't exact, but they would be generally pretty close.  I am excited to see what you come up with.  Smiley

Are you planning to contribute your blkdat parser back to bitcoinj? It sounds useful!

I believe your assumptions are still incorrect:

1) You cannot assume anything about the size of a change address, nothing says it has to be smaller than the payment and often it won't be

2) You cannot assume payments are round numbers as often they will have been converted through an exchange rate. For instance many payments I make look like essentially random numbers because they are some round figure of my local currency multiplied by the exchange rate at the time.

Block chain analysis is hard, I doubt there is an accurate way to calculate what you want.
He did say "guess".  And this is certainly a good way to get it right a vast majority of the time.
kjj
legendary
Activity: 1302
Merit: 1026
September 27, 2012, 04:53:39 PM
#11
But you still have to know which addresses belong to you, so there is the chicken and egg problem.

But that information, if ever found, then travels back in time and infects every transaction you've ever done, which is bad.
hero member
Activity: 555
Merit: 654
September 27, 2012, 04:31:47 PM
#10
But you still have to know which addresses belong to you, so there is the chicken and egg problem.
kjj
legendary
Activity: 1302
Merit: 1026
September 27, 2012, 04:24:57 PM
#9
Are you planning to contribute your blkdat parser back to bitcoinj? It sounds useful!
Yes, if anyone wants it

I believe your assumptions are still incorrect:

1) You cannot assume anything about the size of a change address, nothing says it has to be smaller than the payment and often it won't be

But no client would automatically generate a transaction where the change is greater than a transaction input? What for?

Do you mean something like (A):

Inputs: 10 , 20, 30
Outputs: 15 (change), 45 (payment)

Why not create the tx (B):

Input: 20 , 30
Output: 5 (change) ,45 (payment)

Is the client so dumb as to generate a transaction like A, which wastes space, instead of B?

In the future, I would like to see the client attempt to make outputs that are roughly equal in size, with equal probability of being higher or lower.  Just to make it harder to guess.  But that is hardly a promise of anonymity.  Which one was the change will quickly be revealed when it is merged with another address known to belong to you, or with the change you sent before, or with another transaction sent to the same address as one of the inputs, or...
hero member
Activity: 555
Merit: 654
September 27, 2012, 04:10:17 PM
#8
Are you planning to contribute your blkdat parser back to bitcoinj? It sounds useful!
Yes, if anyone wants it

I believe your assumptions are still incorrect:

1) You cannot assume anything about the size of a change address, nothing says it has to be smaller than the payment and often it won't be

But no client would automatically generate a transaction where the change is greater than a transaction input? What for?

Do you mean something like (A):

Inputs: 10 , 20, 30
Outputs: 15 (change), 45 (payment)

Why not create the tx (B):

Input: 20 , 30
Output: 5 (change) ,45 (payment)

Is the client so dumb as to generate a transaction like A, which wastes space, instead of B?
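
The A versus B comparison can be worked through in a few lines. This is only a sketch of the argument, not the client's actual coin selection: the function name is hypothetical and fees are ignored.

```python
def drop_redundant_inputs(inputs, payment):
    """Remove the smallest inputs while the rest still cover the payment.

    Illustrative only: ignores fees and is not the real client's algorithm.
    """
    kept = sorted(inputs)
    while len(kept) > 1 and sum(kept) - kept[0] >= payment:
        kept.pop(0)  # the smallest input is not needed, so drop it
    return kept


inputs_a = [10, 20, 30]
payment = 45
inputs_b = drop_redundant_inputs(inputs_a, payment)
print(inputs_b, "change:", sum(inputs_b) - payment)
```

For transaction A this drops the 10 input and leaves 20 and 30 with a change of 5, which is transaction B.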

legendary
Activity: 1526
Merit: 1134
September 27, 2012, 02:25:04 PM
#7
Are you planning to contribute your blkdat parser back to bitcoinj? It sounds useful!

I believe your assumptions are still incorrect:

1) You cannot assume anything about the size of a change address, nothing says it has to be smaller than the payment and often it won't be

2) You cannot assume payments are round numbers as often they will have been converted through an exchange rate. For instance many payments I make look like essentially random numbers because they are some round figure of my local currency multiplied by the exchange rate at the time.

Block chain analysis is hard, I doubt there is an accurate way to calculate what you want.
hero member
Activity: 555
Merit: 654
September 27, 2012, 01:44:08 PM
#6

That assumption will be wrong 50% of the time...

Yes! I forgot the change position randomization!

But still it's generally possible to guess which output is the change, since :

1. The payment amount is always greater than the sum of the input amounts minus the smallest input amount.
3. The change amount is always smaller than any of the inputs.

The only case where this guessing fails is when there is a single input amount. In this case, generally the payment amount is an integer value, and the change is not, so you still can guess with some accuracy.

Best regards Pieter!
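
The guessing rules above could be sketched roughly like this. It is a heuristic under the stated assumptions (two outputs, no redundant inputs, whole-BTC payments in the single-input case), and the function name and the use of satoshi integers are choices made for the example.

```python
COIN = 100_000_000  # satoshis per BTC


def guess_change_index(inputs, outputs):
    """Return the index of the output guessed to be change, or None.

    inputs and outputs are lists of amounts in satoshis. This is only a
    guess: it assumes the wallet never includes an input it does not need.
    """
    if len(outputs) != 2 or not inputs:
        return None
    if len(inputs) >= 2:
        smallest = min(inputs)
        needed = sum(inputs) - smallest
        # Change must be below the smallest input, and the payment must
        # exceed what the other inputs could have covered on their own.
        candidates = [i for i in (0, 1)
                      if outputs[i] < smallest and outputs[1 - i] > needed]
    else:
        # Single input: guess that the payment is a whole number of BTC
        # and the change is not.
        candidates = [i for i in (0, 1)
                      if outputs[i] % COIN != 0 and outputs[1 - i] % COIN == 0]
    return candidates[0] if len(candidates) == 1 else None


print(guess_change_index([20 * COIN, 30 * COIN], [5 * COIN, 45 * COIN]))
```

When neither output fits the rules, the function returns None rather than forcing a guess, so the undecidable cases can be counted separately.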
legendary
Activity: 1072
Merit: 1181
September 25, 2012, 05:17:08 PM
#5
(Note that I had to assume that the last output from a transaction is the change).

That assumption will be wrong 50% of the time...
hero member
Activity: 555
Merit: 654
September 25, 2012, 05:07:12 PM
#4
Thanks! I finished implementing the blk0001.dat parser for Bitcoinj.

For me, it was the simplest way to get statistics out of the blockchain. Tomorrow I will post a histogram of average volume transacted depending on the amount range (e.g. 0 to 1 BTC, 10 to 100 BTC, 100 to 1K BTC, etc.)
(Note that I had to assume that the last output from a transaction is the change).
This reveals interesting information regarding the average use.

If someone wants to experiment with it, send me a message.

Best regards,
 Sergio.
kjj
legendary
Activity: 1302
Merit: 1026
September 25, 2012, 01:57:50 PM
#3
I was just grepping through the source, and those enums get passed around a lot, but appear only to be consumed in the IMPLEMENT_SERIALIZE functions of various classes.

For example, CAddress::IMPLEMENT_SERIALIZE in protocol.h adds the nVersion and nTime if called with SER_DISK, but does not otherwise.  The others look mostly similar.

SER_DISK only seems to be consumed in protocol.h and wallet.cpp, neither of which involve the block chain, so there should be no differences in the block format there.

If you want to go looking for them yourself, don't forget to also look for SER_GETHASH.
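
To make the CAddress case concrete, here is a simplified Python model of what is described above. The flag values follow my reading of serialize.h, while the field widths and the function name are assumptions for illustration. The real macro may also have version-dependent conditions that this sketch ignores.

```python
import struct

# Serialization type flags (values as I recall them from serialize.h).
SER_NETWORK = 1 << 0
SER_DISK = 1 << 1
SER_GETHASH = 1 << 2


def serialize_address(n_type, version, timestamp, services, ip16, port):
    """Simplified model: version and time are written only for SER_DISK."""
    out = b""
    if n_type & SER_DISK:
        out += struct.pack("<i", version)    # nVersion (assumed 4 bytes)
        out += struct.pack("<I", timestamp)  # nTime (assumed 4 bytes)
    out += struct.pack("<Q", services)       # service bits
    out += ip16                              # 16-byte IP field
    out += struct.pack(">H", port)           # port in network byte order
    return out


ip = bytes(10) + b"\xff\xff" + bytes([127, 0, 0, 1])  # IPv4-mapped 127.0.0.1
on_disk = serialize_address(SER_DISK, 60002, 1348580000, 1, ip, 8333)
on_wire = serialize_address(SER_NETWORK, 60002, 1348580000, 1, ip, 8333)
print(len(on_disk) - len(on_wire))  # number of extra bytes in the disk form
```

In this model the disk form simply prepends the two extra fields, so the network form is a suffix of it.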
legendary
Activity: 1596
Merit: 1100
September 25, 2012, 01:32:30 PM
#2
In the Satoshi client every object can be serialized either to disk or to network.
Nevertheless I haven't found any difference between the serialization of the blockchain for SER_DISK compared to SER_NETWORK.

What classes are sensitive to SER_* serialization ?

I'm writing a Bitcoinj class to read and process Satoshi blockchain (blk*.dat) files and I want to know if I should care about SER_* flags.

The python implementation pynode does not have any notion of serialization differences between the two, either.  pynode successfully imports bitcoin-generated blk000?.dat files, as well as talking on the network.

Perhaps this was for future expansion?  I would love to know any differences, myself.
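
For anyone writing their own reader, a rough sketch of the blk*.dat framing as I understand it: each record is the 4-byte network magic, a 4-byte little-endian length, and then the block in the same format used on the network. The function name is made up for the example, and the magic shown is the main-network value.

```python
import io
import struct

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # main network message start bytes


def iter_raw_blocks(stream, magic=MAINNET_MAGIC):
    """Yield the raw serialized blocks stored in a blk*.dat style stream."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # clean end of file (or trailing partial header)
        if header[:4] != magic:
            raise ValueError("unexpected magic bytes: " + header[:4].hex())
        (size,) = struct.unpack("<I", header[4:])
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated block record")
        yield payload  # same bytes a peer would send in a block message


# Usage with a fake two-record file; real payloads would be full blocks.
fake = b"".join(MAINNET_MAGIC + struct.pack("<I", len(p)) + p
                for p in (b"block-one", b"block-two"))
for raw in iter_raw_blocks(io.BytesIO(fake)):
    print(len(raw), raw)
```

With a real file you would open it in binary mode and hand each payload to your block deserializer.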

hero member
Activity: 555
Merit: 654
September 25, 2012, 12:32:47 PM
#1
In the Satoshi client every object can be serialized either to disk or to network.
Nevertheless I haven't found any difference between the serialization of the blockchain for SER_DISK compared to SER_NETWORK.

What classes are sensitive to SER_* serialization ?

I'm writing a Bitcoinj class to read and process Satoshi blockchain (blk*.dat) files and I want to know if I should care about SER_* flags.

Thanks, Sergio.