Topic: Dragnet: A Method for Tagging Bitcoin Addresses of Exchanges (Read 258 times)

newbie
Activity: 6
Merit: 49
the model being coinjoin-attacked by exchanges

As far as I can see, exchanges have not used CoinJoin. Most exchanges use the hot-cold-deposit wallet model because it is safe and reliable. The hot and cold wallets have obvious features. Cold wallets hold a lot of Bitcoin, so they always appear in the rich list (https://chain.info/richlist). A cold wallet only has one-to-one transactions with the hot wallet, which itself has a huge number of transactions.
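
A minimal sketch of those hot/cold features, in Python. The (sender, receiver) pair format, the `balances` mapping, and both thresholds are assumptions made for illustration; they are not taken from the paper or from chain.info.

```python
from collections import defaultdict


def wallet_candidates(transfers, balances, hot_min_txs=4, cold_min_balance=1000):
    """Split addresses into hot-wallet and cold-wallet candidates.

    transfers: iterable of (sender, receiver) address pairs (simplified, hypothetical).
    balances:  dict mapping address -> balance in BTC (hypothetical input).
    """
    tx_count = defaultdict(int)
    peers = defaultdict(set)
    for sender, receiver in transfers:
        tx_count[sender] += 1
        tx_count[receiver] += 1
        peers[sender].add(receiver)
        peers[receiver].add(sender)

    # Hot wallet feature: a huge number of transactions.
    hot = {addr for addr, n in tx_count.items() if n >= hot_min_txs}

    # Cold wallet feature: large balance, and its only counterparty is a hot wallet.
    cold = {
        addr
        for addr, others in peers.items()
        if addr not in hot
        and len(others) == 1
        and next(iter(others)) in hot
        and balances.get(addr, 0) >= cold_min_balance
    }
    return hot, cold
```

The balance check matters: without it, a user who only ever withdrew from the hot wallet would look the same as a cold wallet.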

I manually checked the gathered hot and cold wallets of every exchange. They all look fine.

But there is also an exception we found later. Some exchanges use changing addresses for the hot wallet: the hot wallet moves to a new address every time it generates a transaction.
legendary
Activity: 1456
Merit: 1175
Always remember the cause!
The point of all this is that a single transaction that connects two addresses together is not necessarily enough to link two businesses together, absent additional evidence.
The paper addresses this issue:
Quote
Although someone would use the CoinJoin method [9] to combine UTXOs from multiple senders into a single transaction to make it more challenging to determine the relationship between input and output addresses, we detect this method has not been adopted by the exchange so far.
I don't think this is a valid assumption. A CJ transaction can consist of two inputs, each from different entities. In 2013, many exchanges were not as professional as they are today, and were dealing with much less customer money.

The OP appears to be interested in weeding out exchanges with fake volume. An exchange with fake volume could possibly pay a whale to conduct a small number of Coin Join transactions to evade detection of their fake volume.
It is not that simple. Imagine that this model is approved and standardized, and many watchdogs use the basic idea. An exchange confident enough about its volume might decide to let analyzers do their job and provide the info that puts it in the top list. A shady exchange cannot change anything by using CoinJoin, because of what CoinJoin does: hiding assets. The incentive goes the opposite way.

Also, a classification model that is accurate 96% of the time (it is unclear how you are measuring accuracy) has very high accuracy. My first reaction to such a high claimed accuracy is that you might have data leakage. I can't point to the source without looking at the specific steps you used to train your model, which understandably may not be something you want to share.
I suppose they are presenting a model more than a piece of software. So far, the model seems to me to be as solid as a good heuristic-based data mining model could be. The implementation is not open, which is not good news, so the presented results are highly suspicious.

For example, suppose a conspiracy theory is true: a shady exchange (such as Bittrex) with very low liquidity and a strong incentive to put itself in the top 10 list by faking high trade volumes hires, as part of its scam, a team of technical writers who publish an acceptable analysis model and fake privately generated results in favor of the exchange.

I would recommend that you learn about machine learning.
Thank you for the recommendation and the Wikipedia page you linked. :D
It is not how it works in technical discussions, though. You've got deep knowledge of ML? Good for you! But for now, the only serious objection you've made to the article is about the possibility of the model being coinjoin-attacked by exchanges, making void one of the basic heuristic assumptions of the proposed model. Well, I'm not convinced, and nobody would be, because there is no sign of that and no incentive for it.
copper member
Activity: 1652
Merit: 1901
Amazon Prime Member #7
The point of all this is that a single transaction that connects two addresses together is not necessarily enough to link two businesses together, absent additional evidence.
The paper addresses this issue:
Quote
Although someone would use the CoinJoin method [9] to combine UTXOs from multiple senders into a single transaction to make it more challenging to determine the relationship between input and output addresses, we detect this method has not been adopted by the exchange so far.
I don't think this is a valid assumption. A CJ transaction can consist of two inputs, each from different entities. In 2013, many exchanges were not as professional as they are today, and were dealing with much less customer money.

The OP appears to be interested in weeding out exchanges with fake volume. An exchange with fake volume could possibly pay a whale to conduct a small number of Coin Join transactions to evade detection of their fake volume.

Also, a classification model that is accurate 96% of the time (it is unclear how you are measuring accuracy) has very high accuracy. My first reaction to such a high claimed accuracy is that you might have data leakage. I can't point to the source without looking at the specific steps you used to train your model, which understandably may not be something you want to share.
I suppose they are presenting a model more than a piece of software. So far, the model seems to me to be as solid as a good heuristic-based data mining model could be. The implementation is not open, which is not good news, so the presented results are highly suspicious.

For example, suppose a conspiracy theory is true: a shady exchange (such as Bittrex) with very low liquidity and a strong incentive to put itself in the top 10 list by faking high trade volumes hires, as part of its scam, a team of technical writers who publish an acceptable analysis model and fake privately generated results in favor of the exchange.

I would recommend that you learn about machine learning. The Wikipedia article will tell you about ML but will be insufficient for you to be able to speak to it coherently.

The CoinJoin problem is really hard to solve. But as we see, nowadays there are few or no CoinJoin transactions at exchanges, especially big exchanges.
This paper has been peer-reviewed by some experts. They also addressed the issue of CoinJoin. But I cannot find a way to solve this issue.
You are correct, the CJ problem is difficult to solve programmatically. You could rule out transactions whose number of unique input addresses and number of unique output addresses both exceed a threshold. This would not stop Bob and Alice's exchange (which is faking volume) from broadcasting a single CJ transaction with two inputs and two outputs.
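
A rough sketch of that threshold filter, in Python. The transaction layout (a dict with "inputs" and "outputs" address lists) and the default thresholds are made up for illustration and would need tuning against real data.

```python
def looks_like_coinjoin(tx, max_unique_inputs=5, max_unique_outputs=5):
    """Flag a transaction when both its unique input addresses and its
    unique output addresses exceed the (hypothetical) thresholds."""
    unique_inputs = len(set(tx["inputs"]))
    unique_outputs = len(set(tx["outputs"]))
    return unique_inputs > max_unique_inputs and unique_outputs > max_unique_outputs


def filter_for_clustering(txs, **thresholds):
    """Keep only the transactions that do not look like a CoinJoin."""
    return [tx for tx in txs if not looks_like_coinjoin(tx, **thresholds)]
```

A two-input, two-output CJ transaction stays under any sensible threshold, so it passes this filter untouched. That is the limitation described above.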

Once you find address clusters, you could remove a percentage of transactions associated with each address cluster in your data set, and re-run your cluster analysis. If a high enough percentage of addresses are no longer part of the cluster with transactions excluded, you can either flag the address cluster for closer analysis separately, or you can do something such as looping through each transaction that connects the cluster, and each loop removes a single transaction and adds the previous transaction back in (each loop assumes exactly one transaction is removed). If enough loops produce distinct, large clusters, then you may have a 'hidden' CJ transaction.

Obviously the above would be computationally very expensive, on the order of O(n^2) in the number of transactions. You might be able to make different assumptions that would make your function more efficient.
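
Here is a small Python sketch of the remove-one-transaction loop. It assumes plain common-input clustering (addresses spent together in one transaction are merged) and represents each transaction as just a list of its input addresses. Both are simplifications for illustration, not the paper's method, and `min_side` is a hypothetical cutoff for what counts as a "large" cluster.

```python
def cluster(txs):
    """Common-input clustering with union-find. txs: list of input-address lists."""
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    for inputs in txs:
        for addr in inputs:
            find(addr)  # register every address
        for addr in inputs[1:]:
            parent[find(addr)] = find(inputs[0])

    groups = {}
    for addr in list(parent):
        groups.setdefault(find(addr), set()).add(addr)
    return list(groups.values())


def bridge_transactions(txs, min_side=3):
    """Return indices of transactions whose removal leaves their inputs
    in two or more separate clusters of at least min_side addresses."""
    suspects = []
    for i, removed in enumerate(txs):
        rest = txs[:i] + txs[i + 1:]  # exactly one transaction left out per loop
        pieces = cluster(rest)
        large_sides = [
            p for p in pieces if p & set(removed) and len(p) >= min_side
        ]
        if len(large_sides) >= 2:
            suspects.append(i)
    return suspects
```

Each loop re-clusters everything from scratch, which is where the quadratic cost comes from. A flagged index is only a candidate for a 'hidden' CJ transaction and still needs a closer look.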
newbie
Activity: 6
Merit: 49
The CoinJoin problem is really hard to solve. But as we see, nowadays there are few or no CoinJoin transactions at exchanges, especially big exchanges.
This paper has been peer-reviewed by some experts. They also addressed the issue of CoinJoin. But I cannot find a way to solve this issue.
legendary
Activity: 1456
Merit: 1175
Always remember the cause!
The point of all this is that a single transaction that connects two addresses together is not necessarily enough to link two businesses together, absent additional evidence.
The paper addresses this issue:
Also, a classification model that is accurate 96% of the time (it is unclear how you are measuring accuracy) has very high accuracy. My first reaction to such a high claimed accuracy is that you might have data leakage. I can't point to the source without looking at the specific steps you used to train your model, which understandably may not be something you want to share.
I suppose they are presenting a model more than a piece of software. So far, the model seems to me to be as solid as a good heuristic-based data mining model could be. The implementation is not open, which is not good news, so the presented results are highly suspicious.

For example, suppose a conspiracy theory is true: a shady exchange (such as Bittrex) with very low liquidity and a strong incentive to put itself in the top 10 list by faking high trade volumes hires, as part of its scam, a team of technical writers who publish an acceptable analysis model and fake privately generated results in favor of the exchange.
copper member
Activity: 1652
Merit: 1901
Amazon Prime Member #7
Take a look at this thread. Some forum members were creating CoinJoin transactions with each other. Review how the Wasabi Wallet works; it combines inputs from various users to obfuscate the relationship between each user's inputs and outputs. At one point, the now-defunct exchange Mt Gox allowed users to 'import' their private keys into their accounts. Mt Gox would then use those keys to create transactions that spent any unspent outputs the keys could spend, and those same transactions would also spend unspent outputs belonging to other customers' private keys and some of Mt Gox's own coins. With the advent of the Lightning Network, many people will sign transactions that have other people's inputs.

The point of all this is that a single transaction that connects two addresses together is not necessarily enough to link two businesses together, absent additional evidence.

Also, a classification model that is accurate 96% of the time (it is unclear how you are measuring accuracy) has very high accuracy. My first reaction to such a high claimed accuracy is that you might have data leakage. I can't point to the source without looking at the specific steps you used to train your model, which understandably may not be something you want to share.
legendary
Activity: 2464
Merit: 3878
Hire Bitcointalk Camp. Manager @ r7promotions.com
Now you can choose the language manually, on the bottom of the page.
Good addition. It was very quick!

Was it there already and I missed it in the first place? :-P

The site really looks very nice now.
newbie
Activity: 6
Merit: 49
Only Putonghua on your website for me as well... Hope it doesn't depend on localization services (wouldn't be a fair trade-off IMHO, and in fact mine are disabled)

It's a bug. We'll fix it.
Now you can choose the language manually, on the bottom of the page.
member
Activity: 90
Merit: 91
Only Putonghua on your website for me as well... Hope it doesn't depend on localization services (wouldn't be a fair trade-off IMHO, and in fact mine are disabled)


Real Data.

The website changes language according to your browser settings. It doesn't work, does it?

I wonder if there is any suggestion for the algorithm we use.
member
Activity: 90
Merit: 91
Cool...
Just to note that the Vertical Mining Heuristic should be very effective with exchanges that have enabled batch transactions, like Coinbase recently...

legendary
Activity: 2464
Merit: 3878
Hire Bitcointalk Camp. Manager @ r7promotions.com
Real Data.

Awesome and worrying. This means anything that happens to these exchanges is going to create a big disaster in the market. We really need a culture of not relying on centralized exchanges. We need to focus more on P2P exchanges.

Quote
The website changes language according to your browser settings. It doesn't work, does it?
I had to change the browser language manually. It was not automated.

Quote
I wonder if there is any suggestion for the algorithm we use.
I'm not much of a tech guy, but I think if your data is accurate then you are doing a great job.
newbie
Activity: 6
Merit: 49
Real Data.

The website changes language according to your browser settings. It doesn't work, does it?

I wonder if there is any suggestion for the algorithm we use.
legendary
Activity: 2464
Merit: 3878
Hire Bitcointalk Camp. Manager @ r7promotions.com
Quote
To solve the problem of information asymmetry between users and exchanges, we propose a method for tagging Bitcoin addresses of exchanges. Through vertical, forward, and backward address mining, the method can utilize only one or several addresses of an exchange to find out all its addresses and distinguish different address types: deposit wallet, hot wallet, and cold wallet. Then the balance and transfers of the exchange can be further obtained through these addresses, helping users understand the real Bitcoin holdings of the exchange.
https://www.techrxiv.org/articles/Dragnet_A_Method_for_Tagging_Bitcoin_Addresses_of_Exchanges/11852739

Interesting idea, and I think this will be a realistic move. These exchanges are really faking their volumes and creating confusion for their users.


Are these real data or just some samples?

Quote
Maybe an English version will give it more exposure?

Anyway, what are you looking for?
newbie
Activity: 6
Merit: 49
https://www.techrxiv.org/articles/Dragnet_A_Method_for_Tagging_Bitcoin_Addresses_of_Exchanges/11852739

We are a group of developers and data analysts. This paper is a recent work by our team. It explains how to find the Bitcoin addresses of exchanges. The final results are shown on this website (https://chain.info/). We hope for suggestions.
