I think this is a really interesting analysis that could help with the issue of sock puppets in the forums if it achieved legitimacy here.
It might. But it only becomes useful when you have a lot of text to start out with. In QS' case, he's written a whole lotta words; .5 million words is not a small amount of text. And it's especially useful here because we have a known alt to compare the model's accuracy with. That is, it's quite interesting that the model predicts acctseller's text with the same metric as panthers52. I imagine that there are few cases where the person suspected of scamming has written as much as QS. But you're right, it could be useful.
I have a remedial question.
For the index numbers, why is lower better? It seems like you are using one user's posts to predict the other's, right? Or the amount of commonality? I tend to conceptualize this as a percent or ratio; how does that translate into the index numbers you've calculated?
I think the Wikipedia article on perplexity is a reasonably good place to start:
https://en.wikipedia.org/wiki/Perplexity
I'm not a very good teacher, and much better prose than I can produce has been written to explain the metric. I think the simple way to think of it is as the exponentiated cross-entropy of the model<->test-corpus relationship: the lower the number, the less "surprised" the model is by the test text, which is why lower is better.
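To make the "lower is better" part concrete, here's a minimal sketch of a perplexity calculation. It assumes a toy unigram (single-word) model with add-one smoothing, which is almost certainly simpler than whatever model produced the numbers in this thread; the function names and the sample sentences are made up for illustration.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Count word frequencies in the training text (hypothetical helper)."""
    return Counter(tokens)

def perplexity(counts, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on test_tokens."""
    total = sum(counts.values())
    # The vocabulary covers both texts, so words the model never saw
    # still get a small nonzero probability instead of zero.
    vocab_size = len(set(counts) | set(test_tokens))
    log_prob = 0.0
    for word in test_tokens:
        p = (counts[word] + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    # Cross-entropy in bits per word; perplexity is 2 raised to that.
    cross_entropy = -log_prob / len(test_tokens)
    return 2 ** cross_entropy

# Toy texts, made up for illustration only.
author_a = "the quick brown fox jumps over the lazy dog".split()
similar = "the lazy dog jumps over the quick fox".split()
different = "colorless green ideas sleep furiously tonight".split()

model = train_unigram(author_a)
ppl_similar = perplexity(model, similar)
ppl_different = perplexity(model, different)
print(ppl_similar, ppl_different)
```

The text that shares vocabulary with the training text gets the lower number, because the model assigns its words higher probability. With real posts you'd want far more text and a better model than word counts.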
Here's another web-page which talks about it and how it's used to predict text.
http://itl.nist.gov/iad/mig/publications/proceedings/darpa98/html/lm30/lm30.htm
What do the numbers mean in relation to each other, are they a linear index?
It should be more or less linear.
Yamato has a point, right? All your analysis does is say their posts are written in a similar manner. Can you do some statistical test to show how likely it is that this is just random chance? e.g. a t-test?
You could certainly compute Pearson's r on the matrix I provided.
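For anyone who wants to try that, here's a rough sketch of Pearson's r between two rows of a perplexity matrix. The numbers below are placeholders I made up for the example, not the values from the matrix, and the row names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations, each without the 1/n factor,
    # since that factor cancels in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Placeholder rows: perplexity of two users' models against the same
# set of test corpora. These values are invented for illustration.
row_user_a = [110.0, 240.0, 255.0, 230.0]
row_user_b = [115.0, 235.0, 260.0, 228.0]

r = pearson_r(row_user_a, row_user_b)
print(r)
```

A value near 1 would mean the two models rank the test corpora in much the same way. It says nothing by itself about how likely that is by chance.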
I also have the intuition that a t-test could be relevant. But it's not clear to me at the moment how to set up the parameters. Perhaps someone smarter than me will pick this up and/or describe how the t-test would work in this particular case.
If this type of analysis does become a useful tool on the forum I can see lots of applications, for example in detecting account sales.
As I said above, I doubt it would become generally useful unless the people in question have a large number of posts. Having a lot of data to start from makes a better model. You couldn't use this on newb accounts.