The "entropy" will depend upon the model of attacker. Start by enumerating those.
My attempt to collect a few random thoughts on the subject (long post).
1/ The first studies in this field modeled the blockchain as a transaction graph and/or an address graph. More generically, the wider blockchain ecosystem can be modeled as a graph in which the nodes are:
- Txo : associated to a given amount and controlled by a given script
- Entity (human / organizational / machine) : controls txos by controlling the associated scripts (alone or with other entities)
- Tx : acts like a "micro-mixer" of txos (amounts of input txos are mixed/split and forwarded to output txos)
Note: the description is purposefully simplified, but it should be enough for the discussion.
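To make the model concrete, here is a minimal sketch of the three node types in Python. The class and field names (Txo, Tx, Entity, controls, the example scripts) are my own assumptions for illustration, not an existing library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Txo:
    txid: str    # transaction that created this output
    index: int   # position among that transaction's outputs
    amount: int  # value in satoshis
    script: str  # script (or address) controlling the amount


@dataclass
class Tx:
    txid: str
    inputs: list   # txos consumed by the transaction
    outputs: list  # txos created by the transaction


@dataclass
class Entity:
    name: str
    scripts: set = field(default_factory=set)  # scripts this entity controls

    def controls(self, txo):
        # An entity controls a txo if it controls the associated script.
        return txo.script in self.scripts


# Hypothetical example: one input split into a payment and a change output.
funding = Txo("tx0", 0, 50_000, "script_alice")
payment = Txo("tx1", 0, 30_000, "script_bob")
change = Txo("tx1", 1, 20_000, "script_alice_2")
tx1 = Tx("tx1", [funding], [payment, change])
alice = Entity("alice", {"script_alice", "script_alice_2"})
```

The attacker's job is essentially to rebuild the Entity-to-script edges, which are not stored anywhere in the blockchain.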
2/ An attacker has 2 objectives:
- deanonymization of entities
- determination of the links between input and output of transactions
These 2 objectives are not orthogonal; they are mutually reinforcing. Any information gained for one can be used for the other.
Deanonymization of entities
3/ It often starts with side-channel attacks:
- information gathering (bitcoin addresses, ids, emails, ...) from various sources (forums, social networks, databases managed by exchanges, merchants, ...). This information allows the attacker to associate deanonymized entities with a subset of the txos. Solutions like stealth addresses help to address this issue.
- network eavesdropping (like the one described here). It's a process with at least 2 steps:
a - association of an ip address to a tx
b - association of an entity (person, ...) to the ip address
c - for more complex scenarios (mixed or coinjoin txs), the ip address has to be associated to a subset of the input txos of the tx.
4/ As stated by gmaxwell, starting from the txos associated to deanonymized entities, the attackers want to follow the deanonymized funds forwards or backwards and expand their knowledge recursively. It leads us to the second objective.
Determination of links between inputs and outputs of transactions
5/ I think the problem can (should ?) be addressed in a probabilistic way. An attacker doesn't need 100% certainty before deciding on an action. She just needs to be above a given threshold of confidence. If much of the analysis can be automated, the attacker can study several alternative hypotheses and decide which one seems the best.
6/ Taint analysis is the first tool usable for this kind of analysis. The result has 100% certainty (it's just a basic "read" of the blockchain) but produces limited insights.
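As an illustration, here is a minimal sketch of forward taint analysis in its simplest (binary reachability) form: starting from known txos, follow every spend and mark all resulting outputs. The data layout (a dict of txid to lists of outpoints) and the function name are assumptions made for this example.

```python
from collections import deque


def tainted_outputs(txs, start):
    """Return every outpoint reachable forwards from the `start` outpoints.

    txs: dict mapping txid -> {"inputs": [outpoint, ...], "outputs": [outpoint, ...]}
    where an outpoint is a (txid, index) tuple.
    """
    # Index: which transaction spends a given outpoint.
    spender = {}
    for txid, tx in txs.items():
        for outpoint in tx["inputs"]:
            spender[outpoint] = txid

    tainted = set(start)
    queue = deque(start)
    while queue:
        outpoint = queue.popleft()
        txid = spender.get(outpoint)
        if txid is None:
            continue  # still unspent, nothing to follow
        for out in txs[txid]["outputs"]:
            if out not in tainted:
                tainted.add(out)
                queue.append(out)
    return tainted


# Hypothetical mini-chain: t1 spends the known txo, t2 merges it with another one.
txs = {
    "t1": {"inputs": [("t0", 0)], "outputs": [("t1", 0), ("t1", 1)]},
    "t2": {"inputs": [("t1", 0), ("x", 0)], "outputs": [("t2", 0)]},
    "t3": {"inputs": [("y", 0)], "outputs": [("t3", 0)]},
}
result = tainted_outputs(txs, [("t0", 0)])
```

Note how ("t2", 0) becomes tainted although part of its value comes from an unrelated input: the result is certain, but it says nothing about who controls what.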
7/ You can use some heuristics to enrich the information provided by taint analysis. The first studies in the domain used very simple heuristics like multi-input transactions and shadow addresses (see the paper "Evaluating User Privacy in Bitcoin") but gave quite good results. This privacy issue was addressed by avoiding address reuse and by newer proposals like mixers or coinjoin txs. You can also use "best-guess" heuristics like the one used by blockchain.info to determine which output is the payment and which one is the change.
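The multi-input heuristic can be sketched in a few lines: addresses spent together in one transaction are assumed to belong to the same entity, and the resulting groups are merged transitively. This is only an illustration with a union-find structure; the function name and the input format (one list of input addresses per tx) are my assumptions.

```python
def cluster_addresses(txs_inputs):
    """Group addresses that appear together as inputs of the same tx.

    txs_inputs: iterable of lists, each list holding the input addresses of one tx.
    Returns a list of sets of addresses (one set per presumed entity).
    """
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in txs_inputs:
        for addr in inputs:
            find(addr)  # register every address, even single-input ones
        for addr in inputs[1:]:
            union(inputs[0], addr)

    clusters = {}
    for addr in list(parent):
        clusters.setdefault(find(addr), set()).add(addr)
    return list(clusters.values())


# Hypothetical input: a2 links the first two txs, b1 stays alone.
clusters = cluster_addresses([["a1", "a2"], ["a2", "a3"], ["b1"]])
```

A coinjoin tx deliberately violates the assumption behind this heuristic, since its inputs come from different entities, so a naive clustering like this one would wrongly merge all participants.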
8/ The more you know about entities, the more sophisticated the heuristics you can use. For example, an attacker can use specific knowledge (2 persons live in the same city and are occasional users of localbitcoin) to infer that an input of a coinjoin tx is linked to a specific output (if she already knows which input and output are controlled by these 2 persons). The attacker does not have 100% certainty about this inference, but it could be a reasonable hypothesis.
9/ Coinjoin has been proposed as an additional solution to strengthen privacy. As already stated by gmaxwell, it does not provide 100% anonymity but can provide better privacy. With proper design it increases the cost of retrieving a given quantity of information (links between inputs and outputs). But as stated above, coinjoin remains attackable if you have side-channel information.
Here's a "dumb" example. The attacker is an intel. agency with access to a huge amount of side-channel information. As an analyst at this agency, I am investigating a man (let's call him Charlie) suspected of financing a terrorist attack through repeated small bitcoin transactions. Our hypothesis is that the funds are received by another person (Mr. X), who is suspected of selling the coins on localbitcoin to obtain dollars that will be used to buy materials for the attack (I told you, it's a dumb scenario). Today, I want to analyze a given coinjoin transaction because I know that Charlie controls one of its inputs. Let's say that this coinjoin tx has 3 inputs and 3 outputs, all with the same amount. The agency has a mass surveillance program which tells me that:
- input A is controlled by a woman suffering from cancer (information retrieved from her facebook account)
- input B is controlled by Charlie
- input C is controlled by a teenager (boy - information retrieved from snapchat)
- output D is controlled by a small e-commerce website running on tor and selling weed
- output E is controlled by a porn website
- output F is controlled by an unknown entity, for now.
According to additional information that I can access, I will be able to build different hypotheses with more or less confidence:
Hyp1) The woman has bought some weed to ease her pain and the teenager has watched some porn. Charlie may have sent some coins to Mr. X and I should investigate output F more deeply.
Hyp2) If I know that Charlie smokes weed, maybe this tx is just a false positive for my investigation => Charlie has bought some weed. The woman has nothing against a porn movie from time to time and the teenager...has bought a video game.
Hyp3) [use your own fantasy here]
...
I think you get the idea. This thought experiment illustrates a few interesting facts:
- to be effective, this kind of attack requires side-channel information. The more information you have, the more effective you are.
- let's forget the "paranoid" scenario in which intel. agencies do massive surveillance just because they can or because they pursue some nasty goals. This experiment shows that to do their job (investigating potentially serious threats) intel. agencies "have to" break the privacy of all users (I don't argue it's good or bad, I just draw a logical conclusion. So please, do not yell at me)
- to strengthen his privacy Charlie could chain several coinjoin txs, but this solution has a major drawback: if coinjoin txs are rare among "normal" users, chaining several of them becomes a real red flag saying "Hey ! I'm preparing (doing) something illegal"
- as proposed by other people, systematic coinjoin txs for all users would produce a combinatorial explosion. It's not perfect anonymity but it would very significantly raise the cost of this kind of attack.
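The thought experiment can be put into numbers. With n equal-amount inputs and outputs there are n! possible input/output mappings, so an attacker with no other knowledge faces log2(n!) bits of uncertainty, and side-channel information lowers that figure. Below is a rough sketch of this reasoning; the likelihood weights are invented for the example (they only encode "A probably paid D" and "C probably paid E"), and treating the links as independent is a simplifying assumption.

```python
import math
from itertools import permutations


def mapping_posterior(inputs, outputs, likelihood):
    """Probability of each input -> output mapping of an equal-amount coinjoin.

    likelihood(i, o) is the weight the analyst gives to "input i paid output o".
    Links are treated as independent, which is a simplification.
    """
    weights = {}
    for perm in permutations(outputs):
        w = 1.0
        for i, o in zip(inputs, perm):
            w *= likelihood(i, o)
        weights[perm] = w
    total = sum(weights.values())
    return {perm: w / total for perm, w in weights.items()}


def entropy_bits(dist):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


inputs, outputs = ["A", "B", "C"], ["D", "E", "F"]

# No side-channel information: every mapping is equally likely.
uniform = mapping_posterior(inputs, outputs, lambda i, o: 1.0)

# Invented side-channel weights (assumptions, not real data).
hints = {("A", "D"): 5.0, ("C", "E"): 5.0}
informed = mapping_posterior(inputs, outputs, lambda i, o: hints.get((i, o), 1.0))

h_uniform = entropy_bits(uniform)    # equals log2(3!) = log2(6)
h_informed = entropy_bits(informed)  # lower: the hints removed some uncertainty
best_guess = max(informed, key=informed.get)  # outputs assigned to A, B, C
```

With these weights the most likely mapping is A->D, B->F, C->E, which is Hyp1. Since the number of mappings grows as n!, systematic coinjoin with many participants makes this enumeration expensive very quickly, unless side-channel information prunes most of it.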
Automated analysis (machine learning algorithms, ...)
10/ To my knowledge, few studies have used this kind of tool until now, but they could be very effective (see the paper "Unsupervised Approaches to Detecting Anomalous Behavior in the Bitcoin Transaction Network"). By trying to detect repeated patterns in the blockchain, this approach helps to infer additional information. The rationale is simple : some activities are associated with very specific patterns. For example, it's likely that transactions corresponding to mining pools paying the miners or gambling sites paying the players have very specific characteristics.
Another example is provided by the MTGox source code, which was leaked a few months ago. It was quickly identified that an automatic process was in place to split/merge amounts in their hot wallet. This knowledge was later used to detect that some funds were transferred according to this pattern (a few days before MTGox officially stated that some funds had been retrieved). In this case, the pattern was provided by a leak, but you get the idea of how this kind of information can be used to reach the 2 objectives of the attacker.
11/ Patterns corresponding to automated (and non-random) processes should be the easiest to detect, but it does not seem impossible that human operations could also be detected by studying temporal or behavioral patterns. For example, if you live in Europe it's likely that you send your txs at very different GMT hours compared to American or Asian users.
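A very rough sketch of such a temporal analysis: count the transactions of a suspected entity per UTC hour and look for the quietest stretch, which may (or may not) correspond to the owner's night. The function names and the 8-hour window are assumptions of this example, and the signal is weak on its own.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_activity(timestamps):
    """Number of transactions per UTC hour (list of 24 counts) from Unix timestamps."""
    hours = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t in timestamps
    )
    return [hours.get(h, 0) for h in range(24)]


def quietest_window(activity, length=8):
    """Start hour (UTC) of the `length`-hour window with the least activity."""
    def load(start):
        return sum(activity[(start + k) % 24] for k in range(length))
    return min(range(24), key=load)


# Synthetic data: 5 days of txs sent between 07:00 and 22:59 UTC.
timestamps = [day * 86400 + hour * 3600 for day in range(5) for hour in range(7, 23)]
activity = hourly_activity(timestamps)
quiet_start = quietest_window(activity)
```

For this synthetic data the quiet window starts at 23:00 UTC, which would be compatible with a user living around the European time zones, but bots, scheduled payments or night owls would easily fool it.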
Conclusion
12/ The current model provided by bitcoin (or by bitcoin with occasional coinjoin txs) is ultimately very close to the current situation of the interweb. Thus, it's quite straightforward to deduce the different levels of attacks and attackers:
- Group A : Intel. agencies
They have access to a massive amount of side-channel information already gathered from various sources. They will get the best results. Since not all side-channel information is obtained by legal (official) means, this information doesn't always allow an official (legal) reaction.
- Group B : Large corporations
They have access to some amount of side-channel information by providing paid services that require the user to provide personal information. They'll be able to extract additional information by merging this data with data from the blockchain, but at a lower level than group A. They'll have an incentive to monetize this information by selling it to other entities (marketing, ads, ...).
- Group C : Basically the rest of the world
We have occasional access to some side-channel information through our direct interactions with other bitcoiners. Individually, it'll be difficult to extract significant information by merging this data with data from the blockchain.
13/ A bitcoin bank located in an exotic island and providing offchain transactions could remain the best solution for those wanting to hide illegal activities like money laundering.
14/ In this matter, I would say that the current bitcoin model does not change much compared to the current status quo (even if bitcoin remains a great technology and a great innovation).
Disclosure : English is not my mother tongue