Author

Topic: How to determine common ownership of addresses? (Inspired by Shamir's paper) (Read 1971 times)

hero member
Activity: 770
Merit: 500
I don't think it is possible to determine formally common ownership.
It is however possible to quantify its likelihood.

First, the notion of "common ownership" needs to be redefined.
Instead of trying to find if two addresses belong to the same owner, it seems a more generic approach to consider that they belong to different owners, and try instead to quantify the degree of interaction between these two owners, common ownership becoming a special case of tight relationship where owners happen to be the same person or entity.

A few ideas of metrics that could be used to quantify the likelihood of common ownership based on the degree of interation of owners for any pair of addresses:
- Number of overlapping or non-overlapping paths connecting these two addresses in a non-directed graph where vertices represent the addresses, and edges represent their relation as input in a same transaction.
- Same as above, but weigthed by path length (typically sanctionning longer paths with lower weight to represent looser relationship).
- Same as above, but including time between transaction in the notion of "distance".
- Number of cycles in a directed graph where vertices represent the addresses, and edges represent their relation as input and output respectively in the same transaction. This goes one step further from the original study as it would help detecting and quantifying bidirectional relationships between owners.
- Same as the above, but weighted by distance, time and or flow (min value transiting along the cycle).
- A combination of the above.

Addresses that appear to have a strong relationship both horizontally (direct or transitive co-occurence as inputs) and vertically (direct or transitive co-occurence as input and output in turns) will be particularly worthy of closer study.
 
Time analysis would also help giving faith in the results.
A single observation at a fixed point in time is only that: a single observation of a transitory state in the history of the network.
On the other hand, showing lasting (building) trends would give more predictive power to the conclusions (FWIW: cf induction problem).

These are just naive examples.
Please feel free to propose more elaborate metrics.
donator
Activity: 2058
Merit: 1054
Indistinguishable under which lens? "Yeah, these all look the same to me" doesn't cut it. Can you prove information-theoretically that no information can be obtained?
Under what model?  Under a model that allows you to sniff the keyboards of their creators you can distinguish them.  If I hand you a transaction in isolation, however, you can not distinguish the cases as there is no information to distinguish the cases— a transaction U generated from inputs X and Y paying Z and Q is bitwise identical no matter if was authored by one party or two.
I was talking about extracting information from the amounts involved.

If you propose a model conditioned on other history about X and Y— if there is any, e.g.  if users reused addresses— you might say that it's more or less likely to be one kind or another _but_ you could say that about X and Y no matter if U existed or not, the information did not come from U.
You use all your data. Using the history doesn't mean your conclusion will be the same whether U exists or not.


Adi added this:

Quote from: Adi Shamir
I looked at the threads you initiated and saw a very lively discussion of the points I raised. However, some of the participants take the extreme position that if something can not be measured with complete accuracy, or if all the possible sources of error can not be characterized with their precise statistical distribution, then one should not even try to carry out a scientific research or to draw any conclusions from available data. With such an approach, one would never try to study any sufficiently complex issue. To demonstrate this point, consider the crucial problem of estimating how many poor people live in the US, which can have major consequences on the social policy and budgets of local and federal authorities. As a researcher, you will always have to make some arbitrary choices about where is the poverty line, about your sources of data (if you use tax returns, do you know how many people cheat in their declarations and which types of wealth are not properly measured by these returns? If you use a questionnaire, do you know exactly how truthful are the answers?), about the family unit analyzed (in case someone has no personal income but his close relatives living with him have high salaries), etc etc. If you specify your exact research methodology, explain the possible sources of error, and use best efforts to extract the best possible conclusions from all the available data,  the research can be extremely valuable even when the conclusions do not perfectly match the ground truth.
staff
Activity: 4284
Merit: 8808
Indistinguishable under which lens? "Yeah, these all look the same to me" doesn't cut it. Can you prove information-theoretically that no information can be obtained?

Under what model?  Under a model that allows you to sniff the keyboards of their creators you can distinguish them.  If I hand you a transaction in isolation, however, you can not distinguish the cases as there is no information to distinguish the cases— a transaction U generated from inputs X and Y paying Z and Q is bitwise identical no matter if was authored by one party or two.

If you propose a model conditioned on other history about X and Y— if there is any, e.g.  if users reused addresses— you might say that it's more or less likely to be one kind or another _but_ you could say that about X and Y no matter if U existed or not, the information did not come from U.

And in general, I think you're still not catching my main point: Which is that you can't speak reliably about the errors from any estimates because the ground truth depends on usage behavior you can't observe.

Quote
You have a bunch of linked addresses which you believe are in a single wallet. You check if the transactions involving these addresses make sense with the coin selection algorithm. If they don't you check which addresses can be unlinked to end up with something that makes more sense.

You're waving your arms here. The algorithm is randomized. The most powerful indicator you can get from it is that it'll use an exact mach if it's available and if its confirmed.  ... Except this only applies to the reference client, not armory,  only applies if you know when the transaction was authored,  only applies if the user hasn't manually selected the inputs (which we support),  can only delink a false link at a later spend (so it may take a long time for the information to be revealed), and can only delink things in the presence of address reuse.  For whatever delinking feature you'd care to use you'll run into not knowing how often manual selection and different clients are falsely triggering it, or how often spending patterns are simply late in triggering it... so you can't put error bounds on it.  Including this feature in your estimate will probably increase the variance of your results.

donator
Activity: 2058
Merit: 1054
It does not. Decenteralized mixing and joint payments can produce transactions which are indistinguishable at the transaction level from regular common ownership transactions.
Indistinguishable under which lens? "Yeah, these all look the same to me" doesn't cut it. Can you prove information-theoretically that no information can be obtained?

You can't even say much about coin selection behavior without knowing in advance the full content of the wallet,
You have a bunch of linked addresses which you believe are in a single wallet. You check if the transactions involving these addresses make sense with the coin selection algorithm. If they don't you check which addresses can be unlinked to end up with something that makes more sense.

but the accuracy of this guess depends on access to the ground truth on both sides.
You can devise a theory and then test it on actual wallets.


Let's agree that you think it is impossible to derive any information from the data, and that I think that, while difficult, it is possible to analyze internal and external data in a sophisticated way to obtain useful estimates of statistical properties of interest.
staff
Activity: 4284
Merit: 8808
You can ask services what are their volumes. (Some publish them without being asked).

And some will not. It only takes one that you can't measure to give you enormous error bounds.

Quote
Decentralized mixing (of the kind I know anyway) has a characteristic format, you can try to detect it and filter these transactions for linking addresses.

It does not. Decenteralized mixing and joint payments can produce transactions which are indistinguishable at the transaction level from regular common ownership transactions.

You could try to wave your arms some more and say that some transaction is more likely to be one than the other because of the presence or absence of other linking transactions or the value involved,  but the accuracy of this guess depends on access to the ground truth on both sides.  You can't even say much about coin selection behavior without knowing in advance the full content of the wallet, and you can't say much about what unknown transaction look like.  Again, you're toying with systemic errors here.  You can basically pick whatever outcome you want based on what biases you assume they result in, and no one could disprove your results. That isn't science.
legendary
Activity: 1386
Merit: 1002
I think people who said they did it wrong aren't expecting them to revise the paper, just to let them know that their research is flawed and will always be.
[sarcasm]Let's all stop doing any research then.[/sarcasm]
We shouldn't take the research for more than what it is, but we want "what it is" to be as good as possible.

Heck, they can't even be sure they got the real blockchain instead of a fake one fed to them by whatever site they scraped it from.
They used blockexplorer.com; I don't think discrepancy between their data and the true blockchain is a problem in practice; and this is off-topic for this thread.

How do you or they know theymos didn't fed dummy data to those pesky IP's that were consuming his server resources? Also, for the last months I've seen many times blockexplorer.com not being able to open certain pages, so it's not that far fetched to think it may have happened.
How did they validated the data they scraped?
I see this(data validation) as their real flaw, and one most people seem not to care about, much to my surprise.

The other flaws aren't flaws per se, only limitations because of the way the bitcoin protocol works.
Yes, they made some wrong assumptions, but I don't think they'll find a better method.
It seems all the emails they got was only pointing their "flaws/wrong assumptions". Not one of those persons offered them a solution. Maybe because there is no better solution and you can only go so far on analysing the blockchain.
staff
Activity: 4284
Merit: 8808
[sarcasm]Let's all stop doing any research then.[/sarcasm]
We shouldn't take the research for more than what it is, but we want "what it is" to be as good as possible.
Right, because we all stopped doing all research when someone figured out that you couldn't predict the times of spontaneous particle emission from a radioactive sample by throwing chicken bones at a table…

They're asking to measure something (ownership) which is unmeasurable via the system. The techniques used can't give lower or upper bounds... and while it may have _some_ correlation to ownership there are a half dozen kinds distinct cofounders which are themselves unmeasurable. Result that have almost 100% error bounds are not that useful. The same data could arise from a single bitcoin owner or from a billion owners (even more owners than txouts! because bank like services completely conceal ownership and activity levels from the system).
donator
Activity: 2058
Merit: 1054
I think people who said they did it wrong aren't expecting them to revise the paper, just to let them know that their research is flawed and will always be.
[sarcasm]Let's all stop doing any research then.[/sarcasm]
We shouldn't take the research for more than what it is, but we want "what it is" to be as good as possible.

Heck, they can't even be sure they got the real blockchain instead of a fake one fed to them by whatever site they scraped it from.
They used blockexplorer.com; I don't think discrepancy between their data and the true blockchain is a problem in practice; and this is off-topic for this thread.

The errors from making these assumptions are systemic, not random. It would be a methodological error to treat them as random errors.
From a Bayesian perspective everything is "random". Can you clarify in what ways the errors are not random?

We are looking for assumptions that will have less errors.

You can't even give a distribution over the bias without knowing other difficult or impossible to measure cofounding factors (e.g. amount of decenteralized mixing and joint payment transactions, volume on various services, etc).
You can ask services what are their volumes. (Some publish them without being asked).
Decentralized mixing (of the kind I know anyway) has a characteristic format, you can try to detect it and filter these transactions for linking addresses.
You can analyze the coin selection algorithm to see what kind of patterns are expected with multiple addresses in a single wallet, as opposed to joint payment transactions.
You can assign a probability to the deduction that linked addresses have a common owner, and increase the confidence if there are multiple links.
staff
Activity: 4284
Merit: 8808
I'm not asking for a reliable identification. I'm asking for identification which is as good as possible for the purpose of statistical analysis of the data. This means it should have as little bias and as little variance as possible.
The errors from making these assumptions are systemic, not random. It would be a methodological error to treat them as random errors.

You can't even give a distribution over the bias without knowing other difficult or impossible to measure cofounding factors (e.g. amount of decenteralized mixing and joint payment transactions, volume on various services, etc).
legendary
Activity: 1386
Merit: 1002
I think people who said they did it wrong aren't expecting them to revise the paper, just to let them know that their research is flawed and will always be.

Heck, they can't even be sure they got the real blockchain instead of a fake one fed to them by whatever site they scraped it from.
donator
Activity: 2058
Merit: 1054
For the purpose of statistical analysis of the data, what would be the best way to determine common ownership of addresses?
It's weak for reasons other than the ones listed. Coin selection tries to minimize change, so it doesn't tend to link much— especially if you use addresses once and in some clients like armory linking is explicitly avoided.  So you can have a bunch of coins in a common wallet and still not that much linkage— witness their enormous underestimation of instawallet.
That's weakness #2. "e.g." means it's not necessary that the wallets are separate.

What you're asking for— reliable identification of common ownership— is not possible in Bitcoin _by design_ because the pseudoanonymity of the participants is how we achieve any privacy at all, and thats diluted by combining addresses.
Unless you get people to tell you, you can't determine common ownership of addresses.
I'm not asking for a reliable identification. I'm asking for identification which is as good as possible for the purpose of statistical analysis of the data. This means it should have as little bias and as little variance as possible.
staff
Activity: 4284
Merit: 8808
For the purpose of statistical analysis of the data, what would be the best way to determine common ownership of addresses?
It's weak for reasons other than the ones listed. Coin selection tries to minimize change, so it doesn't tend to link much— especially if you use addresses once and in some clients like armory linking is explicitly avoided.  So you can have a bunch of coins in a common wallet and still not that much linkage— witness their enormous underestimation of instawallet.

What you're asking for— reliable identification of common ownership— is not possible in Bitcoin _by design_ because the pseudoanonymity of the participants is how we achieve any privacy at all, and thats diluted by combining addresses.

I also expect techniques like the ones used here to become increasingly unreliable as time goes on— as payment protocols reduce address reuse, as client software makes a greater effort to avoid linking, and as decenteralized mixing and joint payment transactions become more common.

Even if everything works nicely and you ignore the uncommon cases, what the available techniques are measuring is point of time control— which is only weakly informative about the past and future and doesn't indicate ownership at all.

If you want to know about how people use Bitcoin— You're just going to have to ask them.

As an aside, we might also care to remind researchers that most institutions have ethical standards required for experimentation on human subjects which normally precludes experimentation without informed consent— in the US the existence these standards are mandated by law for institutions which receive public funding— and deanonymizing and monitoring businesses and individuals using Bitcoin is something with ethical considerations.
legendary
Activity: 1078
Merit: 1003
Unless you get people to tell you, you can't determine common ownership of addresses.
donator
Activity: 2058
Merit: 1054
In Quantitative Analysis of the Full Bitcoin Transaction Graph, Dorit Ron and Adi Shamir have analyzed statistics and interesting structures in the chains of ownership of bitcoins. To this end, they needed a way to determine when two addresses are owned by the same person.

The method they used is as follows: If there is any transaction which has inputs from both addresses, they have the same owner; this is extended transitively. Address pairs which are not in the transitive closure are assumed to have distinct owners.

This method has several weaknesses:

1. If two addresses by distinct owners are linked in an advanced transaction, they will be assumed to have the same owner.
2. If two addresses are owned by the same person but are never visibly linked (e.g. separate wallets), they will be assumed to have distinct owners.
3. If people are using a shared eWallet, they will all be assumed to be the same person - while technically the keys are indeed all held by the same entity, each person has a separate legal ownership of the coins.

They also had this to say:

Quote from: Adi Shamir
1. While the notions of a bitcoin and of an address are completely clear, the notion of an owner is quite fuzzy since it can not be derived in a precise way from the available data (this may be a feature ofthe scheme, rather than a bug!). There are several ways how to deal with this issue:

1.1 Ignore it completely, and derive from the graph only statistical information about the behavior of addresses. However, we believe that this will completely distort many types of statistical informationwe try to extract from the graph, e.g., what is the distribution of the number of bitcoins that users keep, how many bitcoins they receive and spend, how big are their typical transactions, etc. Since thescheme enables (and even encourages)  users to keep multiple addresses and to constantly shuffle bitcoins internally among their accounts, we believe that it is essential to find a way to distinguish between"internal" and "external" transactions, and thus to determine in some way what is the common ownership of different addresses.

1.2 Use our methodology, which is to assume that in most of the transactions, sending bitcoins from multiple addresses indicates that all these accounts are owned by a single entity. This classification istechnically easy to apply, but it creates two types of errors: We underestimate the common ownership of accounts just because we never saw it in the given transactions, and we overestimate it sinceoccasionally there may be multiple owners who send bitcoins in a single transaction. All the anecdotal evidence we saw so far indicates that overall we tend to underestimate the number of addresseswhich are associated with a single entity (as was demonstrated in the case of Instawallet, in which we found only about 1/3 ot the actual addresses associated with it), and that the errors in the otherdirection, while they exist, are not likely to distort our statistical conclusions in a major way.

1.3 Use a different methodology, which will be closer to the ground truth, and which can be derived either from the available data, or from reliable alternative sources. Here we need your help insuggesting such a methodology, which will be discussed by and accepted by most of the bitcoin community as better representing the issue of common ownership. It is always easier to complain about theshortcomings of one methodology than to suggest a better one!

For the purpose of statistical analysis of the data, what would be the best way to determine common ownership of addresses?
Jump to: