Topic: "Multiple Accounts" / Copy-pasta detection scripts/bots - page 2. (Read 920 times)

legendary
Activity: 2814
Merit: 2472
https://JetCash.com
If it helps you guys to know about declared alts, here are mine.

Talk Merit
JetAid

hero member
Activity: 1582
Merit: 759
Once you get it running to some meaningful extent, I would suggest posting the scope you're working on (set of users, threads) in iasenko's thread here:

https://bitcointalksearch.org/topic/clubthe-spambuster3-cases-openover-3000-accounts-reviewed1800-0908-4720640

So that we don't duplicate the effort.

I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).

Perhaps it's better not to publicize too many specific details on how the scripts work - might inadvertently help bot-farmers. I wish there was a section of the forum designated for spam-busting efforts, I believe hilarious has suggested this.

Good point, I'll post within that thread once completed. I'm also hoping to hook it into BitcoinTalk & have it automatically update threads, but we'll see. Very much in the planning stage TBH

That's the thing, I've been debating closed source vs open source and the perks of both. What I might end up doing is creating a repo for these scripts, but keeping it private (I have an account on Github I can do this with), and then just inviting users who wish to contribute. Might just leave this to "if there's interest", but thanks for the flag on the potentials of abuse if open sourcing it. I didn't clue into that until now.

+1 for the forum section for spam busting, it'd be easier to keep lists of reported within.

If you're working on plagiarism detection already, I'll probably work on multiple account detection first. Granted, multiple bots running from different developers with different sets of algorithms probably isn't a bad idea (will make it harder for bots to avoid)
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Once you get it running to some meaningful extent, I would suggest posting the scope you're working on (set of users, threads) in iasenko's thread here:

https://bitcointalksearch.org/topic/clubthe-spambuster3-cases-openover-3000-accounts-reviewed1800-0908-4720640

So that we don't duplicate the effort.

I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).

Perhaps it's better not to publicize too many specific details on how the scripts work - might inadvertently help bot-farmers. I wish there was a section of the forum designated for spam-busting efforts, I believe hilarious has suggested this.
hero member
Activity: 1582
Merit: 759
If your script catches multi-accounts that don't hesitate to post proof of authentication in bounty threads one after another with the same error, as in the screenshot, it will be an excellent trap for them!



I probably wouldn't worry too much about misspellings of "address", considering for Ethereum addresses I would just look for strings starting with 0x (unless I'm wrong on this, I'm more familiar with Bitcoin) and then just gather the entire address until the next space.

Not to mention, not all people start off with "Ethereum Address:", some threads may require other formats, so it's better to go off the string itself.
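
A minimal sketch of matching on the string itself, assuming an Ethereum address is "0x" followed by exactly 40 hexadecimal characters. The function name is made up for illustration, and this only spots candidate strings; it does not validate checksums.

```python
import re

# Assumption: an Ethereum address is "0x" followed by exactly 40 hex characters.
ETH_ADDRESS = re.compile(r"\b0x[0-9a-fA-F]{40}\b")


def find_eth_addresses(post_text):
    """Return the distinct addresses found in a post, lower-cased, in order of appearance."""
    seen = []
    for match in ETH_ADDRESS.findall(post_text):
        address = match.lower()
        if address not in seen:
            seen.append(address)
    return seen
```

Because the pattern ignores whatever label precedes the address, a misspelled "adress" or a different bounty form layout makes no difference.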

Also, and this is unrelated to the above quote: I posted the following response to Theymos' announcement the other day: https://bitcointalksearch.org/topic/m.45889515

If the merit requirement is raised above 1 merit, I'll probably add a feature to my script that looks for random merit sends, whatever the amount may be. Unfortunately, with the requirement at only 1 merit, abuse of this would be much more difficult to detect from a programming perspective.
copper member
Activity: 350
Merit: 1
If your script catches multi-accounts that don't hesitate to post proof of authentication in bounty threads one after another with the same error, as in the screenshot, it will be an excellent trap for them!




copper member
Activity: 168
Merit: 0
That's a nice idea. You should also run the script for copy-pasted ETH accounts on the registry forum in the bounty thread. They have a Google sheet with all the ETH addresses.
hero member
Activity: 1582
Merit: 759
P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only send a certain amount of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know :)
I've recently started scraping recent. It saves the first unedited version of the post in raw HTML, excluding quotes. Your post for example looks like this:
Code:
Initscri
186520
45883661
Other / Meta / "Multiple Accounts" / Copy-pasta detection scripts/bots

Hey all,

I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic in order to base this.

I have a few things in mind:

Index/scrape posts &:

For multiple account detection:

- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, etc)
- [other ideas here]

For copy-pasta detection:
- write a script to determine copy-pasta from accounts by matching the text of posts to similar text of other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis)
- [other ideas here]

Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic

I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance.

Thanks!
The first line is your Username, then userID, post number, some raw headers, and the last line is the post itself.
In compressed format, it takes about 10 MB per day. Instead of scraping the same data again, I could easily send it to you, and a few days' worth of data should be enough for you to start testing. If interested, let me know.

Not a bad idea. I'll take that into account. I probably won't be starting for a little while, but I'll send you a message in a little while if I need it.

I like the idea of scraping recent & just grabbing raw HTML to compare.

In order to minimize requests but allow multiple filtering scripts to parse the data separately, I'll probably end up scraping recent with 1 bot, caching that for a set time period (sort of like a mirror), and then using multiple other scripts to parse the data on the caching server/mirror.

What I might end up doing is creating the server that stores the cache & keeping it closed source. But I'll release the scripts that parse the data / determining abusers as open source. These scripts would connect back to the mirror server/site instead of BitcoinTalk. That way if others wish to volunteer by using some computational power to run those scripts, they can do so and it allows for others to contribute code without slamming BitcoinTalk with a massive amount of requests by testing.
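
To picture the mirror idea, here is a small sketch of a cache that sits in front of a fetch function and only goes back upstream once an entry is older than a set age. The class name, the `fetch` callable and the `clock` parameter are all assumptions for illustration; nothing here is specific to BitcoinTalk or to the scripts discussed in this thread.

```python
import time


class CachedFetcher:
    """Serve pages from a local cache and only call `fetch` when an entry is stale."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch            # hypothetical function: url -> page content
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the behaviour can be tested
        self._cache = {}               # url -> (time fetched, content)

    def get(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]            # fresh enough: no request to the upstream site
        content = self._fetch(url)
        self._cache[url] = (now, content)
        return content
```

However many parsing scripts call `get`, the upstream site sees at most one request per URL per `max_age_seconds`.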

I will closely follow this project. We've been waiting for such a thing for a very long time. Most of the bots now use word spinners to hide the copy-pasting; it's not easy to detect them, but it's not impossible either.

Thanks! I'll try to keep this thread updated as much as I can.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only send a certain amount of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know :)
I've recently started scraping recent. My script saves the first unedited version of the post in raw HTML, excluding quotes. Your post for example looks like this:
Code:
Initscri
186520
45883661
Other / Meta / "Multiple Accounts" / Copy-pasta detection scripts/bots

Hey all,

I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic in order to base this.

I have a few things in mind:

Index/scrape posts &:

For multiple account detection:

- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, etc)
- [other ideas here]

For copy-pasta detection:
- write a script to determine copy-pasta from accounts by matching the text of posts to similar text of other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis)
- [other ideas here]

Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic

I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance.

Thanks!
The first line is your Username, then userID, post number, some raw headers, and the last line is the post itself.
In compressed format, it takes about 10 MB per day. Instead of scraping the same data again, I could easily send it to you, and a few days' worth of data should be enough for you to start testing. If interested, let me know.
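
For anyone wanting to read records like the one above, a rough parser could look like this. It assumes each record is newline-separated with the username, user ID and post number on the first three lines and the post as a single final line of raw HTML, as described; the actual file layout isn't specified in this thread, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScrapedPost:
    username: str
    user_id: int
    post_id: int
    headers: list
    body: str


def parse_record(record):
    """Parse one record: username, user ID, post number, header line(s), then the post."""
    lines = record.strip("\n").split("\n")
    if len(lines) < 4:
        raise ValueError("expected at least username, user ID, post number and post body")
    return ScrapedPost(
        username=lines[0].strip(),
        user_id=int(lines[1]),
        post_id=int(lines[2]),
        headers=lines[3:-1],   # everything between the post number and the last line
        body=lines[-1],        # the post itself, as a single line of raw HTML
    )
```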


You'll be in for a surprise if you start looking for plagiarism! I sometimes sort a day's worth of posts and search for exact duplicates. This typically gives a few dozen posts that are posted a few dozen times. Most of them are spam, many of them are just spammers posting the same useless "proof of authentication" and more crap like that.
Detecting the text spinners will be a whole different level!
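
Exact duplicates are the easy case. For lightly edited copies, one common starting point (not something anyone in this thread says they use) is to compare sets of overlapping word n-grams, or "shingles", with Jaccard similarity. The sketch below is only that; the function names and the shingle size of 3 are arbitrary choices, and a heavy spinner that swaps most words would still slip through.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Pairs scoring above some cut-off would still need a manual look, since short or formulaic posts can overlap by chance.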
legendary
Activity: 2240
Merit: 3150
₿uy / $ell ..oeleo ;(
I will closely follow this project. We've been waiting for such a thing for a very long time. Most of the bots now use word spinners to hide the copy-pasting; it's not easy to detect them, but it's not impossible either.
hero member
Activity: 1582
Merit: 759
Good idea, I have been thinking about doing something like that myself, but at the moment I'm busy with other things. Automated checks are the way to go for spam problems, plagiarism and so on.

A trivial check I was experimenting with is getting the hash of each message posted, saving it in a dictionary, and seeing if the same hash comes up again.

Another simple check could monitor activity on threads: using the global average of posts per thread and then calculating the variance for a thread, you should be able to spot a spam spree. The same can be applied to user posting.

There are other, more complex techniques out there, but it's better to start with something simple.


Not bad ideas. I like the idea of monitoring threads for abnormal posting frequencies/amount of posts. OFC these threads would have to be manually checked through (as there may be extenuating circumstances where a thread may require a higher post frequency).

TBH, If I do create this, I may just create a repo so others can contribute.
My only fear is that others will run the script, which is okay unless many users run it; I don't want to unintentionally add unnecessary load to BitcoinTalk's servers.
hero member
Activity: 784
Merit: 1416
Good idea, I have been thinking about doing something like that myself, but at the moment I'm busy with other things. Automated checks are the way to go for spam problems, plagiarism and so on.

A trivial check I was experimenting with is getting the hash of each message posted, saving it in a dictionary, and seeing if the same hash comes up again.
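
A rough sketch of that hash-and-dictionary check. SHA-256 and the case/whitespace normalisation are assumptions added here, not details from the post, and the function name is made up.

```python
import hashlib


def find_repeated_posts(posts):
    """Return {digest: [post ids]} for every message body that occurs more than once.

    `posts` is an iterable of (post_id, text) pairs.
    """
    seen = {}
    for post_id, text in posts:
        # Normalise case and whitespace so trivial differences hash the same.
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        seen.setdefault(digest, []).append(post_id)
    return {digest: ids for digest, ids in seen.items() if len(ids) > 1}
```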

Another simple check could monitor activity on threads: using the global average of posts per thread and then calculating the variance for a thread, you should be able to spot a spam spree. The same can be applied to user posting.
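
One way to read that, sketched under assumptions: take the post count of every thread over some window, compute the mean and standard deviation, and flag threads that sit well above the mean. The threshold of 2 standard deviations and the function name are arbitrary choices for illustration.

```python
from statistics import mean, pstdev


def flag_busy_threads(posts_per_thread, threshold=2.0):
    """Return thread ids whose post count is more than `threshold` standard
    deviations above the average across all threads."""
    counts = list(posts_per_thread.values())
    if len(counts) < 2:
        return []
    average = mean(counts)
    spread = pstdev(counts)
    if spread == 0:
        return []          # every thread has the same count, nothing stands out
    return sorted(
        thread_id
        for thread_id, count in posts_per_thread.items()
        if (count - average) / spread > threshold
    )
```

Passing posts per user instead of posts per thread gives the user-level version of the same check.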

There are other, more complex techniques out there, but it's better to start with something simple.
hero member
Activity: 1582
Merit: 759
Hey all,

I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic to base this on.

Content within HR tags will be updated as the thread goes along.



I have a few things in mind (and I'll be updating this as the thread goes along - adding new ideas & such):
When and if topics are created for the data (which will either be by me, or others): I'll post the links here under the respective categories.

For account quality detection:
- Looking at the number of words, paragraphs, sentences, etc., and averaging each per user in order to determine an account quality number. This number can be used in tandem with other scripts when deciding whether to report an account. Obviously this isn't enough to report on by itself, but usernames w/ low quality could be sent into a spreadsheet of some sort for manual lookup.
- [your/others ideas here]
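
A sketch of the per-user averages mentioned above, using deliberately crude heuristics: words are whitespace-separated tokens, sentences end at ".", "!" or "?", and paragraphs are separated by blank lines. How these averages would turn into a single quality number is left open, and the function names are made up.

```python
import re
from collections import defaultdict


def post_stats(text):
    """Count words, sentences and paragraphs in one post (rough heuristics)."""
    words = len(text.split())
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    paragraphs = len([p for p in re.split(r"\n\s*\n", text) if p.strip()])
    return words, sentences, paragraphs


def average_stats_by_user(posts):
    """`posts` is an iterable of (username, text); returns per-user averages."""
    totals = defaultdict(lambda: [0, 0, 0, 0])   # words, sentences, paragraphs, post count
    for username, text in posts:
        words, sentences, paragraphs = post_stats(text)
        entry = totals[username]
        entry[0] += words
        entry[1] += sentences
        entry[2] += paragraphs
        entry[3] += 1
    return {
        user: {
            "avg_words": t[0] / t[3],
            "avg_sentences": t[1] / t[3],
            "avg_paragraphs": t[2] / t[3],
        }
        for user, t in totals.items()
    }
```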

For multiple account detection:

- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, emails, etc)
- [your/others ideas here]
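
The matching step for these two bullets could be as simple as an inverted index from identifier to the accounts that posted it. This sketch assumes the identifiers (addresses, Telegram handles, emails) have already been extracted from each post; the function name is hypothetical. Identifiers are compared exactly as given, since some formats, such as Base58 Bitcoin addresses, are case-sensitive.

```python
from collections import defaultdict


def accounts_sharing_identifiers(posts):
    """`posts` is an iterable of (username, identifiers found in that post).

    Returns {identifier: sorted usernames} for identifiers used by 2+ accounts.
    """
    users_by_identifier = defaultdict(set)
    for username, identifiers in posts:
        for identifier in identifiers:
            # Only surrounding whitespace is stripped; case is left untouched.
            users_by_identifier[identifier.strip()].add(username)
    return {
        identifier: sorted(users)
        for identifier, users in users_by_identifier.items()
        if len(users) > 1
    }
```

A shared identifier is a lead rather than proof (a quoted post would match too), so the output is best treated as a list for manual review.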

For copy-pasta detection:
- Write a script to determine copy-pasta from accounts by matching the text of posts to similar text of other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis). Users w/ percentage points above a certain number will be put into a list & potentially reported to threads/mods. IE: external plagiarism detection
- Write a script to determine copy-pasta by matching post content against other users' post content. High similarities will raise red flags. IE: internal plagiarism detection *note: suchmoon mentioned working on something similar, so other scripts may take precedence*
- The original script may want to ignore quote tags. If so, depending on how it's built (full text or word by word), another side-script would be needed to stop users from simply wrapping their messages in quote tags.
- [your/others ideas here]

For trust abuse/merit abuse:
- Detecting trust abuse (users who send out a large amount of negative trust using the same text). This would obviously skip trusted members (as some good campaign managers send out trust w/ the same text). This is mostly targeted towards members w/ no trust or negative trust (ie: newer members, no trade history, etc). Results would be posted in a thread in a list format using tildes "~" so people can copy/paste the list of abusers into their trust lists. Users could request removal from this list by public poll within the thread (this should probably be handled manually).
- [your/others ideas here]

General ideas for all scripts
- Automatic posting to anti-spam threads w/ results (in such a way as to not create more spam though)
- Platform where users which have been reported by scripts can be documented, with automatic ban detection. That way scripts aren't looking into users if they have already been reported/banned.
- [your/others ideas here]



Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic and maybe load it up on a website of mine.

I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance. I'll update the above list with approved suggestions that I plan to work on.

Thanks!

P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only send a certain amount of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know :)



Change log:

Code:
Edit (September 19th, 2018): I'll be updating this thread (see under bolds) with new ideas as this thread progresses. Also, if anyone else wishes to contribute to my scripts (or even build their own one-offs targeting the ideas above), just let me know that you're working on it, and I'll mark it in the thread. While I agree different scripts/algorithms would be harder to avoid/abuse, obviously I'd want all of the scripts to be developed in a timely manner, so duplicating work probably isn't a good idea as of this moment.
Edit (September 20th, 2018): Adding trust/merit abuse columns - automatic detection of users abusing the trust/merit system.