s/\(
s//--\1--/g
# change
for a real line break
s/
/\n/g
s/
/\n/g
# do sentence breaking after . and ! and ? when space cap
s/\([?!\.]\)\s\+\([A-Z]\)/\1\n\2/g
# cleanup links, just use their href as if it was text
s:\|::g
# punctuation stuff
s/\([,\.?]\)\($\|\s\)/ \1 \2/g
s/'s/ 's/g
s/\([()]\)/ \1 /g
# cleanup any spurious space at the end of the lines
s/\s\+$/\n/g
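For anyone who doesn't read sed, here is a rough Python sketch of the sentence-breaking and punctuation rules from the script above. It only mirrors the rules that don't involve HTML tags, the function name `normalize` is my own, and the trailing-whitespace step is simplified to a plain strip, so treat it as an illustration rather than an exact port.

```python
import re


def normalize(line):
    """Apply the sentence-breaking and punctuation rules to one line of text."""
    # Break after . ! ? when followed by whitespace and a capital letter.
    line = re.sub(r"([?!.])\s+([A-Z])", r"\1\n\2", line)
    # Put spaces around , . ? when followed by whitespace or the end of the text.
    line = re.sub(r"([,.?])(\Z|\s)", r" \1 \2", line)
    # Split the possessive 's off into its own token.
    line = line.replace("'s", " 's")
    # Put spaces around parentheses.
    line = re.sub(r"([()])", r" \1 ", line)
    # Simplification: strip trailing whitespace from every resulting line.
    return "\n".join(part.rstrip() for part in line.split("\n"))


if __name__ == "__main__":
    print(normalize("Hi there, friend. It's QS's post (really)?"))
```

Each output line is then one sentence, with punctuation separated out so that it counts as its own token.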
So, the next step was to do n-gram counts over each of these corpora. To do this, you simply count all of the 1-, 2- and 3-grams in the corpus and create a counts file that you can use to create language models. Note, I'm quite happy to share these count files with anyone who wants to see them. The thing is that they're a little too large for most pastebin services. The quickseller counts file is approximately 8MB, for example. I can tar these up and email them to anyone who's interested. Or if anyone has a site they don't mind hosting them on, then I could send them to that person. Just let me know.
tspacepilot@computer:~/lm/counts$ ls -lah
total 43M
drwxr-xr-x 2 tspacepilot tspacepilot 4.0K Sep 4 12:05 .
drwxr-xr-x 8 tspacepilot tspacepilot 16K Sep 4 11:55 ..
-rw-r--r-- 1 tspacepilot tspacepilot 1.3M Sep 3 10:40 as.count
-rw-r--r-- 1 tspacepilot tspacepilot 16M Sep 4 08:21 d.count
-rw-r--r-- 1 tspacepilot tspacepilot 12M Sep 4 08:20 h.count
-rw-r--r-- 1 tspacepilot tspacepilot 617K Sep 3 10:41 pan.count
-rw-r--r-- 1 tspacepilot tspacepilot 8.2M Sep 3 10:38 qs.count
-rw-r--r-- 1 tspacepilot tspacepilot 5.8M Sep 3 10:40 tsp.count
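If you want to reproduce the counting step, a minimal Python sketch could look like the following. It is not the tool I used: the function names are made up for illustration, and the `<s>` / `</s>` sentence markers are an assumption (though one marker of each kind per sentence is consistent with the token totals in the model header shown further down). The output layout, the words of the n-gram followed by the count, is what the model-building script expects.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3, bos="<s>", eos="</s>"):
    """Count every 1..max_n gram over whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        # Assumed convention: one start and one end marker per sentence.
        tokens = [bos] + sentence.split() + [eos]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def format_counts(counts):
    """Render one n-gram per line: the words, then the count."""
    return [" ".join(ngram) + " " + str(c) for ngram, c in sorted(counts.items())]


counts = count_ngrams(["the cat sat", "the cat ran"])

if __name__ == "__main__":
    print("\n".join(format_counts(counts)))
```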
The next step is to generate language models from the count files. I used Good-Turing smoothing over an MLE parameter estimation in order to generate plain text files that include the models. These models are in the standard NIST format. Here's the top of the file from tsp:
tspacepilot@computer:~/lm/lms$ head tsp.lm
\data\
ngram 1: type=21218 token=294893
ngram 2: type=117148 token=287741
ngram 3: type=215034 token=280589
\1-grams:
9787 0.0331883089798673 -1.4790148753233 ,
9243 0.0313435720752951 -1.50385151060555 the
8592 0.0291359916986839 -1.53557019528667 to
7152 0.0242528645983458 -1.61523695785429 </s>
7152 0.0242528645983458 -1.61523695785429 <s>
What you're seeing there is the counts for each n-gram order. So the tspacepilot model has 294893 tokens/word instances, which fall into 21218 types. To be clear for those who don't have a background in this, if I say "the" twice, that's two tokens and one type. Then, you see the start of the 1-grams section. Each line has four columns: the count, the probability, the base-10 log of the probability, and the word itself. You can see that I used a comma "," 9787 times and that the comma represents 0.033... of the probability mass of the unigram model; the third column is that mass converted to a log value. Here I reused a perl script that I had made some time ago. It's short enough to show you the entirety here:
#!/usr/bin/perl
# Build ngram LM for given count file
# tspacepilot
use strict;
#setting up the input file handles
$#ARGV != 1 and die "Usage: $0 ngram_count_file lm_file_name\n";
my $ngram_count_file = $ARGV[0];
my $lm_file_name = $ARGV[1];
open(DATA, "<", $ngram_count_file) || die "cannot open $ngram_count_file.\n";
open(OUT, ">", $lm_file_name) || die "cannot open $lm_file_name for writing.\n";
#read the whole count file into memory, one line per array element
my @data = <DATA>;
my %unis;
my $uni_toks;
my %bis;
my %flat_bis;
my $bi_toks;
my %tris;
my %flat_tris;
my $tri_toks;
#here we build up the hash tables that we'll use to print the answer
foreach my $line (@data){
    my @tokens = split(/\s+/, $line);
    my $l = $#tokens;
    if($l<1){
        print "error on this line of count file:\n$line\n";
        print "l = $l";
    } elsif($l==1){
        #print "this is a unigram\n";
        $unis{$tokens[0]}=$tokens[1];
        $uni_toks += $tokens[1];
    } elsif($l==2){
        #print "this is a bigram\n";
        $bis{$tokens[0]}{$tokens[1]}=$tokens[2];
        $flat_bis{"$tokens[0] $tokens[1]"}=$tokens[2];
        $bi_toks += $tokens[2];
    } elsif($l==3){
        #print "this is a trigram\n";
        $tris{"$tokens[0] $tokens[1]"}{$tokens[2]}=$tokens[3];
        $flat_tris{"$tokens[0] $tokens[1] $tokens[2]"}=$tokens[3];
        $tri_toks += $tokens[3];
    } else {
        print "error on this line of count file:\n$line\n";
        print "l = $l";
    }
}
print OUT "\\data\\\n";
print OUT "ngram 1: type=",scalar keys %unis," token=$uni_toks\n";
print OUT "ngram 2: type=", scalar keys %flat_bis," token=$bi_toks\n";
print OUT "ngram 3: type=", scalar keys %flat_tris," token=$tri_toks\n";
print OUT "\\1-grams:\n";
foreach my $uni (sort {$unis{$b} <=> $unis{$a} or $a cmp $b } (keys %unis)){
    my $prob = $unis{$uni}/$uni_toks;
    my $lgprob;
    $lgprob = log10($prob);
    print OUT "$unis{$uni} $prob $lgprob $uni\n";
}
print OUT "\\2-grams:\n";
#compute output for two grams
my @two_gram_output;
foreach my $flat_bi(keys %flat_bis){
    my ($firstword) = $flat_bi =~ m/(\S+)/;
    my $denominator;
    foreach my $secondword (keys % {$bis{$firstword}}){
        $denominator += $bis{$firstword}{$secondword};
    }
    my $prob = $flat_bis{$flat_bi}/$denominator;
    my $lgprob = log10($prob);
    push(@two_gram_output, "$flat_bis{$flat_bi} $prob $lgprob $flat_bi\n");
}
my @sorted_two_grams = sort{(split /\s+/,$b)[0] <=> (split /\s+/,$a)[0]} @two_gram_output;
#print output for two grams
foreach (@sorted_two_grams){
    print OUT;
}
#compute output for 3grams
print OUT "\\3-grams:\n";
my @three_gram_output;
foreach my $flat_tri (keys %flat_tris){
    my ($first_two_words) = $flat_tri =~ m/(\S+\s+\S+)/;
    my $denominator;
    foreach my $thirdword (keys % {$tris{$first_two_words}}){
        $denominator += $tris{$first_two_words}{$thirdword};
    }
    my $prob = $flat_tris{$flat_tri}/$denominator;
    my $lgprob = log10($prob);
    push(@three_gram_output, "$flat_tris{$flat_tri} $prob $lgprob $flat_tri\n");
}
my @sorted_three_grams = sort{(split /\s+/,$b)[0] <=> (split /\s+/,$a)[0]} @three_gram_output;
#print output for 3grams
foreach(@sorted_three_grams){
    print OUT;
}
sub log10 {
    my $n = shift;
    return log($n)/log(10);
}
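To make the core of that script easier to follow: for 2-grams and 3-grams, each probability is the n-gram's count divided by the total count of all n-grams sharing the same history, and the log column is just log10 of that. Here is a small Python sketch of the bigram case. The function name and the toy counts are mine, and note that this is the plain relative-frequency estimate only, with no smoothing applied.

```python
import math
from collections import defaultdict


def mle_bigram_probs(bigram_counts):
    """Map each (w1, w2) to (P(w2 | w1), log10 P(w2 | w1)) by relative frequency."""
    # Denominator: total count of all bigrams that start with the same first word.
    totals = defaultdict(int)
    for (first, _second), count in bigram_counts.items():
        totals[first] += count
    probs = {}
    for bigram, count in bigram_counts.items():
        p = count / totals[bigram[0]]
        probs[bigram] = (p, math.log10(p))
    return probs


# Toy counts, invented for illustration.
probs = mle_bigram_probs({("the", "cat"): 3, ("the", "dog"): 1, ("a", "cat"): 2})

if __name__ == "__main__":
    for bigram, (p, lp) in sorted(probs.items()):
        print(" ".join(bigram), p, lp)
```

So "the cat" gets 3/4 of the mass after "the", and "a cat" gets all of the mass after "a".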
Okay, with the language models all built (again, email me or PM me if you want to see the models themselves, I don't mind sharing them) we can start to get to the fun stuff. The goal of the experiment is to use the language models as predictors of the other accounts' texts. The typical measure for this is called "perplexity" (https://en.wikipedia.org/wiki/Perplexity); the lower the perplexity, the better the model predicts the text. One nitty-gritty detail about this is what sort of weighting to give to the 1-, 2- and 3-gram portions of the model when calculating perplexity. Intuitively, putting more weight on the 1-grams puts more value on shared single words, i.e., the basic vocabulary of the person. Putting more weight on the 3-grams puts more value on how that person puts words together, what three-word phrases they tend to use. I ended up using weights 0.3 0.4 0.3 (uni-, bi-, trigrams) in calculating perplexity. For each language model, I calculated the perplexity it assigns to each of the corpora of the accounts in the experiment. Here comes the fun stuff, then, the results:
As plain text, checking the QS language model against every corpus:
==> qstest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=1393
logprob=-119405.183085554 ave_logprob=-2.02254828472914 ppl=105.329078517105
==> qstest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=108735
logprob=-1963318.24588274 ave_logprob=-2.55783608776103 ppl=361.273484388214
==> qstest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=53676
logprob=-1514039.01569095 ave_logprob=-2.42022420176373 ppl=263.162620156841
==> qstest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=1093
logprob=-53775.973489288 ave_logprob=-2.07397020669089 ppl=118.568740528906
==> qstest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=29664
logprob=-666393.992923604 ave_logprob=-2.5821718218487 ppl=382.09541103913
Well, as you can see, qs' model predicts my corpus with a perplexity of 382, predicts hilarious with 263, and predicts dooglus with 361. But crucially, it predicts the posts of ACCTSeller and Panthers52 at 105 and 118!!!!
What this means is that QS's posting style, when measured quantitatively, shows through his attempts to hide what he was doing. This isn't too surprising for anyone who knows how language works, but it may be to others. For fun, I also ran each model as a predictor against each of the other corpora.
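In case the fields in those summary lines aren't obvious, here is how they appear to fit together. This relationship is inferred from the numbers themselves rather than taken from the scoring code: the average log probability looks like the total log10 probability divided by the number of scored events (words plus one end-of-sentence per sentence, minus the out-of-vocabulary words), and the perplexity is 10 raised to the negative of that average. The function name below is made up.

```python
def perplexity_from_summary(logprob, num, sent_num, oov_num):
    """Recompute ave_logprob and ppl from the fields of a .ppl summary."""
    # Scored events: every word, plus one sentence end each, minus skipped OOV words.
    scored = num + sent_num - oov_num
    ave_logprob = logprob / scored
    # logprob is base 10, so perplexity is 10 to the minus average log probability.
    return ave_logprob, 10 ** (-ave_logprob)


# Fields copied from the qstest-acctseller summary.
ave, ppl = perplexity_from_summary(-119405.183085554, 57708, 2722, 1393)

if __name__ == "__main__":
    print(ave, ppl)
```

Plugging in the qstest-acctseller numbers gives back the reported ave_logprob and ppl to several decimal places.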
hilariousandco against all:
==> htest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=2260
logprob=-136595.372784586 ave_logprob=-2.34820994988114 ppl=222.951269646594
==> htest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=109662
logprob=-1934327.44440288 ave_logprob=-2.52311368446967 ppl=333.513704608138
==> htest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=1828
logprob=-60634.1796607556 ave_logprob=-2.40669126223528 ppl=255.088724501193
==> htest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=25750
logprob=-1193959.69530073 ave_logprob=-2.37727869117974 ppl=238.384871857193
==> htest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=26006
logprob=-662995.55023098 ave_logprob=-2.5330988076818 ppl=341.270546308425
So, we can see that hilarious doesn't really have a style that predicts any of the rest of us better than another. At least not significantly. However, it is interesting that hilarious' model assigns perplexities to all three of quickseller's accounts which are in the same range. This provides an oblique suggestion as to the similarities of those corpora. Here is dooglus' model predicting each of the other accounts:
==> dtest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=2518
logprob=-141009.183781008 ave_logprob=-2.43488713532615 ppl=272.199382299313
==> dtest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=44764
logprob=-1532563.94318701 ave_logprob=-2.4154264735252 ppl=260.271415205445
==> dtest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=1752
logprob=-61358.7835651667 ave_logprob=-2.42812756490569 ppl=267.995538997277
==> dtest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=26384
logprob=-1223316.26268869 ave_logprob=-2.43880882666145 ppl=274.668481585288
==> dtest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=20198
logprob=-680500.394458114 ave_logprob=-2.5435368577456 ppl=349.572175864552
Here's my model predicting all the other corpora:
==> ttest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=2850
logprob=-139530.390079984 ave_logprob=-2.42324400972532 ppl=264.998862488461
==> ttest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=99717
logprob=-1946265.50900313 ave_logprob=-2.50617510057216 ppl=320.756230152803
==> ttest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=50287
logprob=-1518909.27782387 ave_logprob=-2.41492682099994 ppl=259.972147091511
==> ttest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=2043
logprob=-61310.1514410114 ave_logprob=-2.45446781060136 ppl=284.752673700336
==> ttest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=30864
logprob=-1209678.28851218 ave_logprob=-2.43335322477326 ppl=271.239680896164
Finally, we can also use the acctseller model and the panthers model to predict the other corpora. These models are a bit smaller than the qs model, so I think the results are not as impressive as the ones from the QS model. But they do demonstrate the same pattern. Here is the acctseller model:
==> atest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=158655
logprob=-1864342.35403158 ave_logprob=-2.59784345298067 ppl=396.135216494324
==> atest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=87812
logprob=-1444217.53179264 ave_logprob=-2.44185825794015 ppl=276.603873729012
==> atest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=2433
logprob=-54938.2415881704 ave_logprob=-2.23426091293548 ppl=171.498731827101
==> atest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=36302
logprob=-1072293.35965131 ave_logprob=-2.18084989129508 ppl=151.652610771117
==> atest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=47163
logprob=-623320.832692272 ave_logprob=-2.59095185177354 ppl=389.898758003026
Again, dooglus, me and hilarious are all above 270, whereas the other known quickseller account is at 151 and the "suspected" alt is at 171. And with the panthers model:
tspacepilot@computer:~/quickseller/ppls/ptest$ tail -n 3 *
==> ptest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=5835
logprob=-126943.515020739 ave_logprob=-2.32518573167395 ppl=211.439309416701
==> ptest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=200298
logprob=-1733046.66220228 ave_logprob=-2.56365194769031 ppl=366.144021870075
==> ptest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=110187
logprob=-1420281.45120892 ave_logprob=-2.49580708635173 ppl=313.18942275869
==> ptest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=55974
logprob=-1089757.40317691 ave_logprob=-2.30873957801444 ppl=203.582094424962
==> ptest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=56725
logprob=-602993.466557261 ave_logprob=-2.61020313295844 ppl=407.570866725746
Again, the panthers model is actually the smallest in terms of input data, so you can see how it's a little less robust for that reason. Nevertheless, the similarities with the acctseller corpus and the quickseller corpus really stand out when compared to the values assigned to the dooglus, hilarious and tspacepilot corpora.
Let's summarize this in a table (each row is a model, each column is the corpus it was tested on, values rounded to one decimal):
model \ corpus | qs | accts | pan52 | doog | hilarious | tsp |
qs | X | 105.3 | 118.6 | 361.3 | 263.2 | 382.1 |
accts | 151.7 | X | 171.5 | 396.1 | 276.6 | 389.9 |
pan52 | 203.6 | 211.4 | X | 366.1 | 313.2 | 407.6 |
doog | 274.7 | 272.2 | 268.0 | X | 260.3 | 349.6 |
hilarious | 238.4 | 223.0 | 255.1 | 333.5 | X | 341.3 |
tsp | 271.2 | 265.0 | 284.8 | 320.8 | 260.0 | X |
So, one thing I want to be clear on: perplexity measures how well a model predicts a certain corpus, and lower is better. The first row shows us that the QS model predicts the acctseller and panthers52 corpora approximately equally well, and far better than it predicts any of the others. Most of the other rows here are just providing perspective. You can see that the dooglus, hilarious and tsp models don't predict any of the other corpora very well (nothing below 220, and mostly above 250).
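If you'd rather check the pattern mechanically than eyeball it, a short Python sketch like this one ranks, for each model, the corpora it predicts best. The perplexities are the ppl values from the outputs above rounded to one decimal; the dictionary layout and the function name `closest` are just my choices for illustration.

```python
# Perplexities from the .ppl outputs, rounded to one decimal.
# Outer key: model. Inner key: corpus it was tested on.
PPL = {
    "qs": {"accts": 105.3, "pan52": 118.6, "doog": 361.3, "hilarious": 263.2, "tsp": 382.1},
    "accts": {"qs": 151.7, "pan52": 171.5, "doog": 396.1, "hilarious": 276.6, "tsp": 389.9},
    "pan52": {"qs": 203.6, "accts": 211.4, "doog": 366.1, "hilarious": 313.2, "tsp": 407.6},
    "doog": {"qs": 274.7, "accts": 272.2, "pan52": 268.0, "hilarious": 260.3, "tsp": 349.6},
    "hilarious": {"qs": 238.4, "accts": 223.0, "pan52": 255.1, "doog": 333.5, "tsp": 341.3},
    "tsp": {"qs": 271.2, "accts": 265.0, "pan52": 284.8, "doog": 320.8, "hilarious": 260.0},
}


def closest(model, k=2):
    """Return the k corpora that the given model predicts best (lowest perplexity)."""
    row = PPL[model]
    return sorted(row, key=row.get)[:k]


if __name__ == "__main__":
    for model in PPL:
        print(model, closest(model))
```

For the qs, accts and pan52 models, the two best-predicted corpora are always the other two accounts of that trio.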
For completeness, here's the script I used to calculate perplexity:
#!/usr/bin/perl
#Build ngram LM for given count file
#
use strict;
use Try::Tiny;
#setting up the input file handles
$#ARGV != 5 and die "Usage: $0
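As a rough illustration of the calculation (this is not the script itself), here is a Python sketch of perplexity under linear interpolation with the 0.3 / 0.4 / 0.3 weights mentioned earlier. Everything about the details is an assumption on my part for the sake of the example: the function name, the dictionaries of probabilities, the `<s>` / `</s>` markers, skipping out-of-vocabulary words, and giving unseen bigrams and trigrams zero mass.

```python
import math


def interpolated_perplexity(sentences, uni, bi, tri, weights=(0.3, 0.4, 0.3)):
    """Perplexity of sentences under a linearly interpolated 1/2/3-gram model.

    uni, bi and tri map word tuples' probabilities: uni[w], bi[(w1, w2)],
    tri[(w1, w2, w3)]. Missing bigrams and trigrams contribute zero.
    """
    l1, l2, l3 = weights
    logprob, scored, oov = 0.0, 0, 0
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens)):
            word = tokens[i]
            if word not in uni:
                oov += 1  # assumed: unknown words are skipped, not scored
                continue
            p = l1 * uni[word] + l2 * bi.get((tokens[i - 1], word), 0.0)
            if i >= 2:
                p += l3 * tri.get((tokens[i - 2], tokens[i - 1], word), 0.0)
            logprob += math.log10(p)
            scored += 1
    ave_logprob = logprob / scored
    return {"logprob": logprob, "oov_num": oov, "ppl": 10 ** (-ave_logprob)}


# Toy model, invented for illustration.
UNI = {"a": 0.5, "b": 0.25, "</s>": 0.25}
BI = {("<s>", "a"): 1.0, ("a", "b"): 1.0, ("b", "</s>"): 1.0}
TRI = {("<s>", "a", "b"): 1.0, ("a", "b", "</s>"): 1.0}
result = interpolated_perplexity(["a b"], UNI, BI, TRI)

if __name__ == "__main__":
    print(result)
```

Shifting the weights toward the first entry rewards shared vocabulary; shifting them toward the last rewards shared three-word phrasing.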
In sum, we know that Quickseller is adept at checking the blockchain to reveal transactions signed by particular accounts and to link them. So it makes sense that he knows how to cover his tracks there and to use mixers and whatnot to make it difficult to detect his alts in that way. He is an expert in this, so while I haven't tried, I suspect it would be difficult to link any of his accounts on the blockchain. However, presumably, he's not an expert in forensic linguistics and statistical NLP so he didn't realize that providing a corpus of 552365 word tokens would actually give someone who wanted to detect his alts a reasonably reliable way to find the statistical fingerprint which is right there in the statistics of how he writes.
There's plenty of other circumstantial evidence that Panthers52 is an alt of Quickseller, but I'll leave that for others to talk about and discuss. Also, I'm not a trader here so I'm not really affected by QS giving escrow for himself, but perhaps others who are will have more to say about whether this practice is truly a scam. I opened this thread here because it seemed like scammy behavior to me, and I wanted others to be aware of it.
Here is a screenshot of QS feedback taken today:
Again, if anyone has any questions about this experiment or wants access to the particular data I ended up using, just let me know. I believe I've provided all the tools in this post in order to replicate these results for yourself, but if something's missing, let me know about it.