I haven't had a chance to look at your code yet, but do you use Tarjan's union-find to compute the "wallet-sets"? I've been meaning to write a tool that does that for a while.
Edit: oh, you're using boost's connected_components. I think they use union-find for incremental_components.
Yes I do because I build the entire graph in the first place.
Not sure what algorithm they use in there.
Also, given that I build the whole graph, not sure why using
TUFA has any advantage over a dumb [D|B]FS mark and sweep ...
Yeah, it doesn't. I wasn't planning on maintaining the entire graph, so can get away with it.
Turns out that the graph will grow so fat at some point that it might
very well be necessary to move to some sort of incremental data
structure ... that being said, since there is no way to know that two
disjoin subsets will or won't merge until the very last tx in the chain,
I don't think it's possible to get away with _not_ building the whole
graph in the first place.
Well, I haven't tried implementing it yet, but my thinking was that the union-find structure represents just as much of the graph as we need. We basically have an equivalence of "has been used as an input in the same transaction as" and we're building equivalence classes.
So basically, you make a pass over the data building up an associative map structure from each address to something your union-find algorithm knows how to deal with (I'm language-agnostic
). As you process transactions in the block building these things up, you make sure to call union on your union-find structure to join classes together if you see them as inputs to the same transaction. This is linear in the number of inputs, not quadratic, because the structure knows what's already in the class. Once you've made your pass over the blockchain, you're left with a big map containing all input addresses ever seen and your union-find structure. Now all you need to do is traverse your map of addresses and look up (using find to find a representative element of your equivalence class) which set each element is in, and group on that. The resulting sets should be your connected components. You could then make your map from addresses point to the set it belongs to for easy lookups, or do whatever.
If I'm not spouting nonsense, that would allow us to run the set-building in O(n * alpha(n)) time, where n is the total number of inputs to all transactions. You'd use O(total unique input addresses) space for it, and it should have fairly low constant factors.
I'll probably try implementing that in a bit, but I need to finish decoding the scripts first