Topic: "Multiple Accounts" / Copy-pasta detection scripts/bots (Read 910 times)

hero member
Activity: 1582
Merit: 759
Not quite like that, but that sounds useful too for giving some more insight into the merited post.

I meant running different kinds of automated checks on merited comments written by low-rank members since, as we can see, those may have a higher chance of being faked.
Running a few different automated techniques (even if they take several seconds per post) on a few thousand messages is going to be easier than doing the same on every single unmerited post.

So, more or less, targeting newbie/low-ranking members with 1+ merit.

We could add a script where it does the above, and then looks at top senders of merit to newbies for abuse. Although there may be legitimate senders within that list.
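
A rough sketch of that second step, assuming the merit data is already available as simple (sender, receiver rank, amount) records. The record layout, the `LOW_RANKS` set and the function name are made up for illustration. The output is only a list to review by hand, since legitimate senders will show up in it too.

```python
from collections import Counter

# Assumption: which ranks count as "low" is a judgment call.
LOW_RANKS = {"Newbie", "Jr. Member"}

def top_senders_to_low_ranks(merit_events, limit=10):
    """merit_events: iterable of (sender, receiver_rank, amount) tuples.

    Returns the senders who gave the most merit to low-rank accounts,
    as a list of (sender, total_amount), highest first.
    """
    totals = Counter()
    for sender, receiver_rank, amount in merit_events:
        if receiver_rank in LOW_RANKS:
            totals[sender] += amount
    return totals.most_common(limit)
```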
hero member
Activity: 784
Merit: 1416
Not quite like that, but that sounds useful too for giving some more insight into the merited post.

I meant running different kinds of automated checks on merited comments written by low-rank members since, as we can see, those may have a higher chance of being faked.
Running a few different automated techniques (even if they take several seconds per post) on a few thousand messages is going to be easier than doing the same on every single unmerited post.
hero member
Activity: 1582
Merit: 759
Another thought I had about plagiarism.

As far as I can see, the main goal of faking content is just obtaining merits. Wouldn't it save a lot of time and resources to directly check only the messages that receive merits, say every Friday?
If you also remove messages from higher ranks, who are unlikely to risk their accounts, this reduces the total number to be checked to a very small fraction.
That is more manageable and leaves room to run deeper, lengthier methods to verify the content is authentic.

So like contrasting the # of merits against the quality of post, generating a list of users to be looked into? Excluding merits sent/received by HQ members?

Let me know if I got that right.

It's not a bad idea. Would still require some manual labor & not be completely automated. I'll throw it on the list if it's cool with you?
hero member
Activity: 784
Merit: 1416
Another thought I had about plagiarism.

As far as I can see, the main goal of faking content is just obtaining merits. Wouldn't it save a lot of time and resources to directly check only the messages that receive merits, say every Friday?
If you also remove messages from higher ranks, who are unlikely to risk their accounts, this reduces the total number to be checked to a very small fraction.
That is more manageable and leaves room to run deeper, lengthier methods to verify the content is authentic.
hero member
Activity: 1582
Merit: 759
I'd just have to write a side-script to prevent users from just wrapping their messages in ["quote"] tags.
Quoted text isn't counted for payment for signature spammers, so they're unlikely to hide their plagiarism that way.

Never knew that. But I can't see why someone would copy-paste. Good to know it's checked. At first glance BTT looked complicated to me, but I now understand it's complicated for a lot of reasons.

Copy/pasting is rampant on this forum. For bounty sig scammers it's the easiest way to get a high post count in a short amount of time while looking like you're spending the time to write out a post.

Because most campaign managers have to manage many participants, plagiarism (copy/pasting) can get overlooked.

TBH, by building these scripts, campaign managers should have an easier time (in theory).

Update: added ideas sent from a user in PM: account quality detection

Update 2: adding idea for detecting trust abuse
jr. member
Activity: 448
Merit: 3
I'd just have to write a side-script to prevent users from just wrapping their messages in ["quote"] tags.
Quoted text isn't counted for payment for signature spammers, so they're unlikely to hide their plagiarism that way.

Never knew that. But I can't see why someone would copy-paste. Good to know it's checked. At first glance BTT looked complicated to me, but I now understand it's complicated for a lot of reasons.
hero member
Activity: 1582
Merit: 759
Good idea, I think we should report the whole bounty board as plagiarism :)

Code:
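// Tongue-in-cheek: BitcoinTalkAPI is not a real API, and $forum_categories is assumed to map
// category name => ['posts' => [post ID => post content]].
foreach ($forum_categories as $category_name => $category_values) {
    if ($category_name === 'Bounties (Altcoins)') {
        // Report every post on the bounty board.
        foreach ($category_values['posts'] as $post_id => $post_content) {
            BitcoinTalkAPI::report($post_id);
        }
    }
}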

Well that's done ;) (was gonna write it in Python, but I've been coding with PHP all day so)

There are a lot of users who actively do bounty work, and the majority of bounty posts are almost the same. The script will detect a similarity percentage, and bounty posts are roughly 60-70% similar to each other. So how would the script deal with that?

TBH, that's kind of the point. We'd have to determine a percentage of similarity that we agree is "report-worthy"; but I wouldn't be surprised if these scripts report a large amount of bounty users.

I'd just have to write a side-script to prevent users from just wrapping their messages in ["quote"] tags.
Quoted text isn't counted for payment for signature spammers, so they're unlikely to hide their plagiarism that way.

I guess, is that a standardized thing among all campaign managers though? I'm guessing eventually it would become rather obvious to them though. Definitely not a top priority script if required.
legendary
Activity: 3654
Merit: 8909
https://bpip.org
If it is using synonyms, it becomes quite complicated: you need to be able to identify that two different words actually mean the same thing. Perhaps, as you read the sentences, you could substitute each word with a code that corresponds to a set of synonyms, then use these cleaned sentences to run the checks.
Maybe there are dictionaries ready for this sort of thing. In any case, comparing one message with all previous messages, perhaps running multiple checks, can be quite expensive to perform.

There are dictionaries and other methods to deal with synonyms but they don't work well for crypto-themed texts without a serious ML effort. Worse yet, Bitcointalk text spinning bots don't really care much if the text makes sense so they'll replace "cryptocurrency" with "financial encoding" or some bullshit like that. Semantic comparison seemed quite useless to me so far in this context though I'm not an expert by any means - just learning as I go.

There are a lot of users who actively do bounty work, and the majority of bounty posts are almost the same. The script will detect a similarity percentage, and bounty posts are roughly 60-70% similar to each other. So how would the script deal with that?

Good idea, I think we should report the whole bounty board as plagiarism :)
copper member
Activity: 728
Merit: 250
There are a lot of users who actively do bounty work, and the majority of bounty posts are almost the same. The script will detect a similarity percentage, and bounty posts are roughly 60-70% similar to each other. So how would the script deal with that?
hero member
Activity: 784
Merit: 1416
A few thoughts about the spun texts:

If the spun text is not using synonyms, it may help to prepare the data before running any check, for example by reordering all the words of each sentence in alphabetical order.

If it is using synonyms, it becomes quite complicated: you need to be able to identify that two different words actually mean the same thing. Perhaps, as you read the sentences, you could substitute each word with a code that corresponds to a set of synonyms, then use these cleaned sentences to run the checks.
Maybe there are dictionaries ready for this sort of thing. In any case, comparing one message with all previous messages, perhaps running multiple checks, can be quite expensive to perform.
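
To make both preparation steps concrete, here is a minimal sketch. `SYNONYM_GROUPS` is a tiny made-up table standing in for a real synonym dictionary, and the `SYN0`, `SYN1` codes are arbitrary. Two sentences that differ only in word order or in swapped synonyms end up with the same normalized form, which can then be fed to whatever comparison is run afterwards.

```python
import re

# Hypothetical synonym table; a real one would come from a dictionary.
SYNONYM_GROUPS = [
    {"cryptocurrency", "crypto", "coin"},
    {"big", "large", "huge"},
]

# Map every word in a group to one shared code, e.g. "crypto" -> "SYN0".
CANONICAL = {
    word: f"SYN{i}"
    for i, group in enumerate(SYNONYM_GROUPS)
    for word in group
}

def normalize(sentence):
    """Lowercase, replace known synonyms with their group code, sort the words."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    coded = [CANONICAL.get(word, word) for word in words]
    return " ".join(sorted(coded))
```

Sorting throws away word order entirely, so this is only useful as a pre-filter before a stricter check.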
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I'd just have to write a side-script to prevent users from just wrapping their messages in ["quote"] tags.
Quoted text isn't counted for payment for signature spammers*, so they're unlikely to hide their plagiarism that way.

* Assuming the campaign has a campaign manager that does at least some of his job.
hero member
Activity: 1582
Merit: 759
The difficulty will be to find sources to match against (unsure if scraping Google will be permitted, we'll see).

Google has a search API. Not sure if there is a free tier though.

Considering their pricing change on maps, I'm going to assume not. I'll look into it though, thanks :)

@LoyceV: I'll make a note that if comparing messages for plagiarism, we should probably be ignoring ["quote"] tags within our scripts. I know it would probably make plagiarism detection more reliable, I'd just have to write a side-script to prevent users from just wrapping their messages in ["quote"] tags.
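
A minimal sketch of the "ignore quotes" part, assuming the post text is available as BBCode with `[quote ...]...[/quote]` blocks (if the scraper only has HTML, the same idea applies to the quote markup there). It removes innermost quotes first and repeats until nothing changes, so nested quotes are handled too. The function name is just for illustration.

```python
import re

# Matches a quote block that contains no further opening [quote tag inside it.
INNERMOST_QUOTE = re.compile(
    r"\[quote[^\]]*\](?:(?!\[quote).)*?\[/quote\]",
    re.IGNORECASE | re.DOTALL,
)

def strip_quotes(bbcode):
    """Return the post text with all (possibly nested) quote blocks removed."""
    previous = None
    while previous != bbcode:
        previous = bbcode
        bbcode = INNERMOST_QUOTE.sub("", bbcode)
    return bbcode.strip()
```

A post that comes back empty after this is quote-only, which is what the side-script would need to look at separately.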
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Tinker a little with the number of words and the threshold for detection of duplicates, and you're probably almost there for a large share of the copy-pasta spam.
I'm more worried about the very high number of positive results. Let me play around a bit with yesterday's data, from post 45850092 up to post 45893434. My scraper caught 43184 out of 43343 posts (it misses some burst posts). This is after the new Merit requirements, so there's less spam already.

I'll show the 50 most used posts (raw HTML excluding quotes; the number at the start of each line shows how often they appear). Those posts are exactly the same each time they were posted:
Code:
   288 (post was empty or only a quote)
    162 Do you have a telegram channel?
     91 Proof of Authentication:
Joined Telegram Campaign
     45 Bump
     25 bump
     24 microguy talks to himself just like he trades himself just like he lol himself Grin Grin Grin

sounds like the shytcoin is showing its age like microguy is 🤔

sounds like a igotspots shytcoin scam checkpoint dysfunction  still better than btc right

Whats in your wallet

https://imgur.com/rPLBZVM
     23

For a more general context on our seed round, and the reasons for this funding round please read our /i3ufCd]medium article

     20
Hello Everyone, GOeureka are live now with Bounty Campaign.
 Please follow given link to participate

     19 hi
i noticed you deleted you telegram account recently
why?
i am still waiting the letter and when it arrives how can i contact you?
please contact me at @AmbrogioOrfeu on telegram
     18 IMPORTANT ANNOUNCEMENTS ABOUT INBOT FUTURE :

1. Our revenue for first 6 months was more than whole 2017!
2. We are hiring Partner Managers and Business Operations Managers.
3. We are moving InToken from Ethereum to Stellar blockchain.
4. We will list InToken without an ICO.
     17 hello everyone
here im talking about a new cryptocurrency which is THUNDERSTAKE (TSC) .TSC PoS staking rewards: 900% APR fixed, every block number dividable by 10 is a superblock with double APR (1800 %) .
we have made products with TSC logo which you can buy from our website with TSC coin as payment.TSC is live on CMC and 5 exchanges, Cyptobridge,mercatox,Stokes.exchange, bitrex and escodex .
here is our website link https://thunderstake.com and discord link :   https://discord.gg/wmu9Zcx you can get everything from here have a look
     16 Up
     16 Proof of Authentication:
Joined Telegram Campaign

     15 up
     14 week #1
Reddit Campaign
Reddit name:
Reddit user Url:
Like any post on Subreddit (list with links to post):
1.

     14 #proof:
Twitter username:@cryptonerdd
Telegram username:@cryptonerdd
ERC20 address:0x51494b94939D2C8353d069206887687C40eD92B9

     13 microguy talks to himself just like he trades himself just like he lol himself Grin Grin Grin

sounds like the shytcoin is showing its age like microguy is 🤔

sounds like a igotspots shytcoin scam checkpoint dysfunction  still better than btc right

Whats in your wallet

https://imgur.com/rPLBZVM
     12 Bitcointalk username: aloha0001
Forum rank: member
Posts count:  255
ETH address: 0x04ddhA7Bb8b08af5E6866C1efc3rehe54a2859E6

     12
Update
     11 reserved
     11 Twitter

Retweets
1.https://mobile.twitter.com/MaestroProject1/status/1003536243370545152
2.https://mobile.twitter.com/MaestroProject1/status/1003824945430843393
3.https://mobile.twitter.com/MaestroProject1/status/1004547290063765508
4.
5.

Tweets
1.https://mobile.twitter.com/amanda_septiasa/status/1003681341915869184
2.

     10 Week #1
Facebook

Shares + Likes

1. https://www.facebook.com/amro.trikid/posts/10212205125466225
2. https://www.facebook.com/amro.trikid/posts/10212210015988485
3. https://www.facebook.com/amro.trikid/posts/10212219066614745
4. https://www.facebook.com/amro.trikid/posts/10212228785097701
5. https://www.facebook.com/amro.trikid/posts/10212228787657765

     10 WEEK#1
Facebook Campaign
Facebook Link: https://facebook.com/deerey.area
Friends: 1100

Post:

Shared:

     10 Twitter Campaign     
Twitter user Url:   https://twitter.com/4LUtr1qGRLB   
Repost and Like any post on Twitter (list with links):     
https://twitter.com/bitflipcc/status/10101578403

     10 Bitcointalk account URL :
TELEGRAM username: @zlo2323
language: Korean
Rank: Jr.Member
Eth address: 0xaE0304fd2b399c790170aA6Ea6A1d6E78713f96

     10
test
     10

     10 #PROOF OF AUTHENTICATION POST
Joined Twitter Campaign
Bitcointalk Username: Dollar1980
Telegram Username: @TahsibGhurair
Twitter Username: @Tahsib_Ghurair
Twitter Account Url: https://twitter.com/Tahsib_Ghurair

      9 Native language: Russian                                                                 
Bitcointalk username: Sabergas1w7                                                                   
Profile link: https://bitcointalk.org/index.php?action=profile;u=161465763                                                                     
Part of the bounty you apply for: ANN                                                               
Experience: NO                                                                 
Telegram: https://t.me/Sadbis1g7                                                                 
Email: [email protected]                                                               
Ethereum address: 0x91D8f2e4hjdEC122568f4c2cd5D14a362glk561F                                                               
Please PM me if you accept.

      9 #Proof of Authentication

Campaign : Telegram & Twitter
Bitcointalk Username: notnotok
Telegram Username : @khalidalbudoor
Twitter Account Link: https://twitter.com/khalidalbudoor7
Twitter Username: @khalidalbudoor7

      9 #PROOF OF AUTHENTICATION POST
Joined Twitter Campaign
Bitcointalk Username: ExcellentOffer86
Twitter Account Url: https://twitter.com/Saeed_Imtiaz1
Telegram Username: @Saeed_Imtiaz1

      8 Week #1
Twitter

Retweets
1. https://twitter.com/MaestroProject1/status/998832211800412160
2. https://twitter.com/MaestroProject1/status/998839809895350272
3. https://twitter.com/MaestroProject1/status/999005931881906176
4. https://twitter.com/MaestroProject1/status/999036079238868992
5. https://twitter.com/MaestroProject1/status/999043596345950208

Tweets
1. https://twitter.com/hellofancydei/status/1004721044937105413
2. https://twitter.com/hellofancydei/status/1004721411192049667 
      8 Facebook
Week #1

Twitter Profile Link: https://twitter.com/CREoday_ru
Like and Retweet:
1. https://twitter.com/medXe1/status/961630808724459520
2. https://twitter.com/medXe1/status/962393102601412608
3. https://twitter.com/medXe1/status/962767627113455616
4. https://twitter.com/medXe1/status/962768328770146309
5. https://twitter.com/medXe1/status/975583417281712128

Facebook Profile Link: https://www.facebook.com/ar.amur.ru
Like and Share:
1. https://www.facebook.com/ar.amur.ru/posts/597475630588609
2. https://www.facebook.com/ar.amur.ru/posts/597613640574808
3. https://www.facebook.com/ar.amur.ru/posts/597994343870071
4. https://www.facebook.com/ar.amur.ru/posts/598519390484233
5. https://www.facebook.com/ar.amur.ru/posts/599002237102615

      8 https://i.imgur.com/QBgno2y.png

We invite you to bring your project to Altmarkets.cc,


Add your coin to our exchange by requesting Here


(OPTIONAL) Join us on Discord to speak directly to us about your listing request : https://discord.gg/ZhQzy5f

Our Fees - https://altmarkets.cc/fees
Listing Policy: https://altmarkets.cc/add_coin
      7 week 1

Tweet link :
1.
2.
3.

Retweet link :
1. https://twitter.com/MaestroProject1/status/10016030348670208
2.
3.
4.
5.

LIke & share link :
1. https://web.facebook.com/coinhunt1/posts/28284343478955285
2.
3.
4.
5.

      7 Proof of joined post
Campaign in which you participate: Linkedin campaign
ETH address: 0x02Aft679fd80E9dD51cac1dc5se45f42578fhj64

      7 I want to reserve a signature campaign.
BitcoinTalk name: jordarheje89
BitcoinTalk profile link: https://bitcointalk.org/index.php?action=profile;u=1866560678;sa=summary
Eth Address: 0xCd332c24rhehBfa3A9d658D2F33Aheh2eF5689

      7 Bump.
      7
RainCheck | Update
      7 +12000 subcribers on Telegram
Come and chat with the Team
https://t.me/brodweyrealteam

      7 #proof:
Twitter username:@cryptonerdd
Telegram username:@cryptonerdd
ERC20 address:0x51494b94939D2C8353d069206887687C40eD92B9
      7 #Proof of Authentication Post Link

Twitter Campaign
Twitter Account : https://twitter.com/DarinaBovsiktak
Facebook Campaign
Facebook: https://www.facebook.com/DorianTopz
      7 #PROOF OF AUTHENTICATION POST
Joined Twitter Campaign
Bitcointalk Username: ExcellentOffer86
Twitter Username: @Saeed_Imtiaz1
Twitter Account Url: https://twitter.com/Saeed_Imtiaz1
Telegram Username: @Saeed_Imtiaz1

      7 ##PROOF OF AUTHENTICATION##
Bitcointalk Username: trishaanywhite


Joined Campaigns: Twitter
Twitter User Name: trishaanywhite
Twitter Account Url  : https://twitter.com/trishaanywhite


Joined Campaigns: Telegram
Telegram user Name: @trishaany
Telegram Url: https://t.me/trishaany


      6 TRANSLATION IN INDONESIAN
Bitcointalk username: adelaisav
Native language: indonesia
Email: [email protected]
Telegram: @filarisdianto
Part of bounty you apply for : ALL
Translation/moderation experience: https://docs.google.com/spreadsheets/d/1Ltym_vuCnAvpGD7F7KnldJtm7wYP8S3sdZ7pdRaK8Jg/htmlview
ETH address: 0xb02518F08daeb2Ef11a50edB152C59507D0EB2F5
Pm me if you need sir
      6 Reserve
      6 Project looks great but there are tons of projects like this and my question is, how can you be a bit defirrent than other payment system?
      6 IMPORTANT ANNOUNCEMENTS ABOUT INBOT FUTURE :

1. Our revenue for first 6 months was more than whole 2017!
2. We are hiring Partner Managers and Business Operations Managers.
3. We are moving InToken from Ethereum to Stellar blockchain.
4. We will list InToken without an ICO.

      6 Hi dev,
I'm writing to you with an offer of listing at one of the major masternodes monitoring website - http://masternodes.plus (MasterNodesPlus).
You have been selected and approved for listing as recommended masternode coin.
To be listed at the website, you can use one of the three offers:

Normal listing-up to 24 hours: 0.1BTC
Listing an ICO (coin not available on any exchange) up to 6 hours: 0,3BTC

You can make your request for the lisitng here:
https://masternodes.plus/contact.html


Regards,
Timothy James-Quill
"MNP"

      6 A request to prospective clients, please post a message on the forum thread first to keep the thread alive and then make a contact using above mentioned contacts for prompt response.

--------------------------------------------------------------
For users in China/Hong, they can also contact via QQ.

QQ: 256447418
The first line is my own description. It's mainly caused by bounty spammers: they quote their own old post, then edit it to add their latest bounty report spam. My scraper catches the posts before they're edited.

This doesn't really catch plagiarism, but it catches spam. When you're looking for word phrases to detect plagiarism, you're likely to get even more hits than this.
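
For reference, a tally like the one above can be produced with very little code. This is a sketch, not the scraper's actual code: it assumes the posts are already available as a list of strings with quotes removed, and it compares them exactly (no lowercasing or whitespace cleanup), which is why "Bump" and "bump" are counted separately in the list.

```python
from collections import Counter

def most_repeated(posts, top=50):
    """Return (text, count) for the most frequently repeated identical posts."""
    counts = Counter(posts)
    return [(text, n) for text, n in counts.most_common(top) if n > 1]
```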

The second entry came from Cidonar, who bumped this thread 162 times. That board shouldn't allow deleting posts within 24 hours, but it does.
The user isn't banned, as he deleted the evidence.

The third entry ("Proof of Authentication") came from many different users in this thread. I've just reported a few asking to check the thread.

The sixth entry ("microguy talks to himself") came from BitCoin ranger, who had 24 posts deleted by moderators.

Manually going through this list is a lot of work, while there aren't many posts to report. It's not very effective to do.
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Tinker a little with the number of words and the threshold for detection of duplicates, and you're probably almost there for a large share of the copy-pasta spam.

I experimented with n-grams a little bit and couldn't find a good value. Low n yields too many false positives, high n doesn't detect spinners, etc. So I'm using a mixture of algorithms and base the decision on the pattern of the results of those algorithms - e.g. if the similarity of two texts using algorithm A is 70%, then union/intersect/otherwise manipulate the texts, run algorithm B, if it scores 90% then run algorithm C to eliminate false positives - made up numbers but you get the idea. Works ok-ish, but as I mentioned it doesn't scale well and I need to do more testing on larger samples.
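
To illustrate the staged idea only (not the actual algorithms, which aren't described here), a sketch could look like this. The three measures are stand-ins picked for simplicity: word-set overlap, character trigram overlap, and Python's `difflib.SequenceMatcher` ratio. The thresholds are made up, same as the numbers above.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Share of items two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def word_set(text):
    return set(text.lower().split())

def char_ngrams(text, n=3):
    s = " ".join(text.lower().split())  # collapse whitespace
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def looks_copied(a, b, t_words=0.7, t_chars=0.9, t_seq=0.8):
    """Run cheap checks first and only continue while each stage agrees."""
    # Stage A: cheap word overlap; most pairs are rejected here.
    if jaccard(word_set(a), word_set(b)) < t_words:
        return False
    # Stage B: character trigrams, less forgiving of reshuffled words.
    if jaccard(char_ngrams(a), char_ngrams(b)) < t_chars:
        return False
    # Stage C: slower sequence comparison to weed out false positives.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= t_seq
```

The point of the ordering is cost: the expensive comparison only runs on the few pairs that survive the cheap ones.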

The difficulty will be to find sources to match against (unsure if scraping Google will be permitted, we'll see).

Google has a search API. Not sure if there is a free tier though.
hero member
Activity: 1582
Merit: 759
I don't know how it works, but I think there is a bot on Steemit, "@cheetah", that detects plagiarism, so developing a similar bot won't be a problem (there are many senior developers on this forum).

It would be great if you succeeded in writing a script that detects members sending Merits to each other.

I don't think it is going to be hard to code such a script, but you will need access to the Merit database.

There's plenty of paid APIs to support plagiarism detection externally, so if I was lazy and rich I'd use those lol. Although, I'm uncertain of their reliability.

But realistically, external plagiarism detection isn't super difficult; although it may be more difficult than internal detection. I won't go too far into details (hashing, storage methods, etc), but essentially you're taking the copy of the text (or portions of it) & matching it against search engine results / meta descriptions.
I'm sure there's plenty of other methods as well.

The difficulty will be to find sources to match against (unsure if scraping Google will be permitted, we'll see).

Point is though: if 3 different developers develop it 3 different ways (using different sources) it will be far more difficult for bots/spammers to reverse engineer/abuse.

If you're working on plagiarism detection already, I'll probably work on multiple account detection first. Granted, multiple bots running from different developers with different sets of algorithms probably isn't a bad idea (will make it harder for bots to avoid)

I think we can certainly run multiple attacks on plagiarism as long as we coordinate to reduce overlap in which users we've reported etc, e.g. using the thread I mentioned and also https://bpip.org to check for bans.

With the little time I have available I'm still probably weeks away from a reasonably usable product and even then it would cover only a relatively small set of potential plagiarism. LoyceV mentioned that forum gets ~50k posts a day - many of which can be ignored or whitelisted but still that's a lot of garbage to sift through.


Maybe we can create some sort of central location for defining which users have been reported by bots.
If I have time, maybe I'll create something web-based, and just give out API keys to users who can prove they have an operating script.

Would just sort of be a web-based platform to set which users are reported by scripts/bots, and then it would track if those users actually have a ban through the use of BPIP (If Vod permits)

Dumping the info into a thread probably isn't ideal, but worst comes to worst we can rely on that until a more advanced system is produced.

If it helps you guys to know about declared alts, here are mine.

Talk Merit
JetAid



Thanks Jet Cash, if I do implement an alt detection system, I'd make the reporting of users more manual than automated.
I'm sure there's many users (such as yourself) who have alts for various reasons and aren't being nefarious and don't deserve a report.

If anyone has any further ideas for methods, keep em comin' :)
qwk
donator
Activity: 3542
Merit: 3413
Shitcoin Minimalist
Detecting the text spinners will be a whole different level!
I guess a quick and dirty approach could be something like this:
1. take samples of all occurrences of 4 consecutive words
2. create their md5 (or whatever you prefer) hashes
3. store those hashes in a database
4. count number of hash collisions with other posts

So, a simple text like:
The quick brown fox jumps over the lazy dog

would result in 6 individual hashes:
The quick brown fox
quick brown fox jumps
brown fox jumps over
fox jumps over the
jumps over the lazy
over the lazy dog

Tinker a little with the number of words and the threshold for detection of duplicates, and you're probably almost there for a large share of the copy-pasta spam.
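
The four steps above could be sketched like this. It keeps everything in memory instead of a database, lowercases the text before hashing, and the function names are made up, so treat it as a starting point rather than a finished tool.

```python
import hashlib
from collections import defaultdict

def shingle_hashes(text, n=4):
    """Return the set of MD5 hashes for every run of n consecutive words."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(words) - n + 1)
    }

def count_collisions(posts, n=4):
    """posts: dict of post_id -> text.

    Returns a dict mapping (id_a, id_b) to the number of hashes both posts share.
    """
    index = defaultdict(set)  # hash -> ids of the posts that contain it
    for post_id, text in posts.items():
        for h in shingle_hashes(text, n):
            index[h].add(post_id)

    shared = defaultdict(int)
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    return dict(shared)
```

The number of words per sample (`n`) and the collision count that should trigger a report are the two knobs to tinker with.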
member
Activity: 518
Merit: 21
We already have several tools for this purpose, you can see one here done by @DdmrDdmr

Code:
https://public.tableau.com/profile/ddmrddmr#!/vizhome/BitcointalkMeritDashboard/GlobalSummary
This forum is full of enthusiastic people working together for the betterment of the forum. I believe this can be achieved with help from other members collaborating with each other; collaboration will get the job done more easily. If only I had this kind of expertise, I would be more than willing to help you guys. Sadly, I am just following along and noting important details for future implementations and updates to this forum. GO! GO! GO!
legendary
Activity: 2506
Merit: 1517
#1 VIP Crypto Casino
It would be great if you succeeded in writing a script that detects members sending Merits to each other.

I don't think it is going to be hard to code such a script, but you will need access to the Merit database.

We already have several tools for this purpose, you can see one here done by @DdmrDdmr

Code:
https://public.tableau.com/profile/ddmrddmr#!/vizhome/BitcointalkMeritDashboard/GlobalSummary
legendary
Activity: 2660
Merit: 3012
Top Crypto Casino
I don't know how it works, but I think there is a bot on Steemit, "@cheetah", that detects plagiarism, so developing a similar bot won't be a problem (there are many senior developers on this forum).

It would be great if you succeeded in writing a script that detects members sending Merits to each other.

I don't think it is going to be hard to code such a script, but you will need access to the Merit database.
legendary
Activity: 3654
Merit: 8909
https://bpip.org
If you're working on plagiarism detection already, I'll probably work on multiple account detection first. Granted, multiple bots running from different developers with different sets of algorithms probably isn't a bad idea (will make it harder for bots to avoid)

I think we can certainly run multiple attacks on plagiarism as long as we coordinate to reduce overlap in which users we've reported etc, e.g. using the thread I mentioned and also https://bpip.org to check for bans.

With the little time I have available I'm still probably weeks away from a reasonably usable product and even then it would cover only a relatively small set of potential plagiarism. LoyceV mentioned that forum gets ~50k posts a day - many of which can be ignored or whitelisted but still that's a lot of garbage to sift through.
legendary
Activity: 2800
Merit: 2472
https://JetCash.com
If it helps you guys to know about declared alts, here are mine.

Talk Merit
JetAid

hero member
Activity: 1582
Merit: 759
Once you get it running to some meaningful extent I would suggest to post the scope you're working on (set of users, threads) in iasenko's thread here:

https://bitcointalksearch.org/topic/clubthe-spambuster3-cases-openover-3000-accounts-reviewed1800-0908-4720640

So that we don't duplicate the effort.

I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).

Perhaps it's better not to publicize too many specific details on how the scripts work - might inadvertently help bot-farmers. I wish there was a section of the forum designated for spam-busting efforts, I believe hilarious has suggested this.

Good point, I'll post within that thread once completed. I'm also hoping to hook it into BitcoinTalk & have it automatically update threads, but we'll see. Very much in the planning stage TBH

That's the thing, I've been debating closed source vs open source and the perks of both. What I might end up doing is creating a repo for these scripts, but keeping it private (I have an account on Github I can do this with), and then just inviting users who wish to contribute. Might just leave this to "if there's interest", but thanks for the flag on the potentials of abuse if open sourcing it. I didn't clue into that until now.

+1 for the forum section for spam busting, it'd be easier to keep lists of reported within.

If you're working on plagiarism detection already, I'll probably work on multiple account detection first. Granted, multiple bots running from different developers with different sets of algorithms probably isn't a bad idea (will make it harder for bots to avoid)
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Once you get it running to some meaningful extent I would suggest to post the scope you're working on (set of users, threads) in iasenko's thread here:

https://bitcointalksearch.org/topic/clubthe-spambuster3-cases-openover-3000-accounts-reviewed1800-0908-4720640

So that we don't duplicate the effort.

I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).

Perhaps it's better not to publicize too many specific details on how the scripts work - might inadvertently help bot-farmers. I wish there was a section of the forum designated for spam-busting efforts, I believe hilarious has suggested this.
hero member
Activity: 1582
Merit: 759
If your script catches multi-accounts that don't hesitate to post proof of authentication in the bounty threads one after another with the same error, like in the screenshot, it will be an excellent trap for them!



I probably wouldn't worry too much about misspellings of "address", considering for Ethereum addresses I would just look for strings starting with 0x (unless I'm wrong on this, I'm more familiar with Bitcoin) and then just gather the entire address until the next space.

Not to mention, not all people start off with "Ethereum Address:", some threads may require other formats, so it's better to go off the string itself.
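
Going off the string itself might look roughly like this. One assumption on my part: instead of reading until the next space, the pattern requires exactly 40 hex characters after `0x`, which is the length of an Ethereum address, so malformed strings are skipped. The `(username, text)` input format and the function name are made up for the sketch.

```python
import re
from collections import defaultdict

# "0x" followed by exactly 40 hex characters.
ETH_ADDRESS = re.compile(r"\b0x[0-9a-fA-F]{40}\b")

def shared_addresses(posts):
    """posts: iterable of (username, text).

    Returns {address: sorted usernames} for addresses posted by 2+ users.
    """
    users_by_address = defaultdict(set)
    for username, text in posts:
        for address in ETH_ADDRESS.findall(text):
            # Compare case-insensitively; the same address is often posted
            # with different capitalization.
            users_by_address[address.lower()].add(username)
    return {
        address: sorted(users)
        for address, users in users_by_address.items()
        if len(users) > 1
    }
```

Matches are only leads for manual review, since one address shared by several accounts can have innocent explanations.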

Also, unrelated to the above quote: I did post the following response on Theymos' announcement the other day: https://bitcointalksearch.org/topic/m.45889515

If merit requirements are raised above 1 merit, I'll probably introduce a feature into my script that looks for random merit sending, whatever the amount may be. Unfortunately, with the merit requirement being only 1, it would be much more difficult to detect abuse of this from a programming perspective.
copper member
Activity: 350
Merit: 1
If your script catches multi-accounts that don't hesitate to post proof of authentication in the bounty threads one after another with the same error, like in the screenshot, it will be an excellent trap for them!




copper member
Activity: 168
Merit: 0
That's a nice idea. And you should run the script for copy-paste eth accounts on the registry forum on the bounty thread. They have a Google sheet with all the ETH addresses
hero member
Activity: 1582
Merit: 759
P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only send a certain amount of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know :)
I've recently started scraping recent. It saves the first unedited version of the post in raw HTML, excluding quotes. Your post for example looks like this:
Code:
Initscri
186520
45883661
Other / Meta / "Multiple Accounts" / Copy-pasta detection scripts/bots

Hey all,

I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic in order to base this.

I have a few things in mind:

Index/scrape posts &:

For multiple account detection:

- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, etc)
- [other ideas here]

For copy-pasta detection:
- write a script to determine copy-pasta from accounts by matching the text of posts to similar text of other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis)
- [other ideas here]

Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic

I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance.

Thanks!
The first line is your Username, then userID, post number, some raw headers, and the last line is the post itself.
In compressed format, it takes about 10 MB per day. Instead of scraping the same data again, I could easily send it to you, and a few days' worth of data should be enough for you to start testing. If interested, let me know.

Not a bad idea. I'll take that into account. I probably won't be starting for a little while, but I'll send you a message later if I need it.

I like the idea of scraping recent & just grabbing raw HTML to compare.

In order to minimize requests but allow multiple filtering scripts to parse the data separately, I'll probably end up scraping recent with 1 bot, caching that for a set time period (sort of like a mirror), and then using multiple other scripts to parse the data on the caching server/mirror.

What I might end up doing is creating the server that stores the cache & keeping it closed source. But I'll release the scripts that parse the data and determine abusers as open source. These scripts would connect back to the mirror server/site instead of BitcoinTalk. That way, if others wish to volunteer some computational power to run those scripts, they can do so, and others can contribute code without slamming BitcoinTalk with a massive amount of requests by testing.
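A rough sketch of the caching idea, assuming a single process holds the mirror in memory. The `fetch` callable stands in for whatever actually downloads a page and is hypothetical, as are the class name and the 60-second default.

```python
import time


class CachedMirror:
    """Serve pages from a local cache and re-fetch only after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable: url -> page content
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # url -> (fetched_at, content)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh, no request goes to the forum
        content = self._fetch(url)
        self._entries[url] = (now, content)
        return content
```

However many parsing scripts call `get()`, the forum sees at most one request per page per time window.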

I will closely follow this project. We've been waiting for such a thing for a very long time. Most of the bots now use word spinners to hide the copy-pasting; it's not easy to detect them, but it's not impossible either.

Thanks! I'll try to keep this thread updated as much as I can.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only send a certain amount of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know :)
I've recently started scraping recent. My script saves the first unedited version of the post in raw HTML, excluding quotes. Your post for example looks like this:
Code:
Initscri
186520
45883661
Other / Meta / "Multiple Accounts" / Copy-pasta detection scripts/bots

Hey all,

I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic in order to base this.

I have a few things in mind:

Index/scrape posts &:

For multiple account detection:

- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, etc)
- [other ideas here]

For copy-pasta detection:
- write a script to determine copy-pasta from accounts by matching the text of posts to similar text of other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis)
- [other ideas here]

Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic

I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance.

Thanks!
The first line is your Username, then userID, post number, some raw headers, and the last line is the post itself.
In compressed format, it takes about 10 MB per day. Instead of scraping the same data again, I could easily send it to you, and a few days' worth of data should be enough for you to start testing. If interested, let me know.


You'll be in for a surprise if you start looking for plagiarism! I sometimes sort a day's worth of posts and search for exact duplicates. This typically gives a few dozen posts that are posted a few dozen times. Most of them are spam, many of them are just spammers posting the same useless "proof of authentication" and more crap like that.
Detecting the text spinners will be a whole different level!
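A rough sketch of an exact-duplicate search like the one mentioned above, assuming a day's posts are already loaded as a list of strings. It counts instead of sorting, and the whitespace and case normalisation is an added assumption, not a description of any existing script.

```python
from collections import Counter


def find_exact_duplicates(posts, min_count=2):
    """posts: iterable of post texts. Return (text, count) pairs, most repeated first."""
    # Collapse whitespace and case so trivially reformatted copies still match.
    normalized = (" ".join(text.split()).lower() for text in posts)
    counts = Counter(normalized)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]
```

Spun text will slip past this, since changing a single word produces a different string.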
legendary
Activity: 2240
Merit: 3150
₿uy / $ell ..oeleo ;(
I will closely follow this project. We've been waiting for such a thing for a very long time. Most of the bots now use word spinners to hide the copy-pasting; it's not easy to detect them, but it's not impossible either.
hero member
Activity: 1582
Merit: 759
Good idea, I have been thinking about doing something like that myself, but at the moment I got busy with other things. Automated checks are the way to go for the spam problems, plagiarism and so on.

A trivial check I was experimenting with is getting the hash of each message posted, saving it in a dictionary, and seeing if the same hash comes up again.

Another simple check could be done for monitoring activity on threads. Using the global average of posts per thread and then calculating the variance for a thread, you should be able to spot a spam spree. The same can be applied to user posting.

There are other more complex techniques out there, but it's better to start with something simple.


Not bad ideas. I like the idea of monitoring threads for abnormal posting frequencies/amount of posts. OFC these threads would have to be manually checked through (as there may be extenuating circumstances where a thread may require a higher post frequency).
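A minimal sketch of the frequency check, assuming the number of posts per thread for some period (say one day) has already been counted. It flags threads whose count lies far above the average across all threads, measured in standard deviations. The threshold of 3 is an arbitrary starting point, and flagged threads would still need a manual look.

```python
from statistics import mean, pstdev


def flag_busy_threads(posts_per_thread, threshold=3.0):
    """posts_per_thread: dict of thread id -> number of posts in the period.

    Return the thread ids whose count is more than `threshold` standard
    deviations above the average across all threads.
    """
    counts = list(posts_per_thread.values())
    if len(counts) < 2:
        return []
    average = mean(counts)
    spread = pstdev(counts)
    if spread == 0:
        return []  # every thread has the same count, nothing stands out
    return [
        thread
        for thread, count in posts_per_thread.items()
        if (count - average) / spread > threshold
    ]
```

The same function can be fed posts per user instead of posts per thread.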

TBH, If I do create this, I may just create a repo so others can contribute.
My only fear is that others will run the script (which is okay, unless many users run it. I don't want to add unnecessary load to BitcoinTalk servers unintentionally)
hero member
Activity: 784
Merit: 1416
Good idea, I have been thinking about doing something like that myself, but at the moment I got busy with other things. Automated checks are the way to go for the spam problems, plagiarism and so on.

A trivial check I was experimenting with is getting the hash of each message posted, saving it in a dictionary, and seeing if the same hash comes up again.

Another simple check could be done for monitoring activity on threads. Using the global average of posts per thread and then calculating the variance for a thread, you should be able to spot a spam spree. The same can be applied to user posting.

There are other more complex techniques out there, but it's better to start with something simple.
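A minimal sketch of the hash check, assuming posts arrive as (post id, text) pairs. SHA-256 and the stripping of surrounding whitespace are choices made for this example only.

```python
import hashlib


def find_repeated_posts(posts):
    """posts: iterable of (post_id, text).

    Return (post_id, first_post_id) pairs for every post whose text hash
    was already seen on an earlier post.
    """
    seen = {}       # hash -> id of the first post with that text
    repeats = []
    for post_id, text in posts:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            repeats.append((post_id, seen[digest]))
        else:
            seen[digest] = post_id
    return repeats
```

Storing only hashes keeps the dictionary small, at the cost of catching nothing but exact copies.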
hero member
Activity: 1582
Merit: 759
Hey all,

I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic to base this on.

Content within HR tags will be updated as the thread goes along.



I have a few things in mind (and I'll be updating this as the thread goes along - adding new ideas & such):
When and if topics are created for the data (which will either be by me, or others): I'll post the links here under the respective categories.

For account quality detection:
- Looking for # of words, paragraphs, sentences, etc., and gathering the average for each user in order to determine an account quality number. This number can be used in tandem with other scripts when deciding whether to report an account. Obviously this isn't enough to report by itself, but usernames w/ low quality could be sent into a spreadsheet of some sort for manual lookup.
- [your/others ideas here]
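A rough sketch of the per-user averages, assuming posts are plain text grouped by username. The sentence and paragraph splitting is deliberately crude, and how the three averages would be combined into a single quality number is left open here.

```python
import re
from statistics import mean


def post_stats(text):
    """Count words, sentences and paragraphs in one post (rough heuristics)."""
    words = len(text.split())
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    paragraphs = len([p for p in re.split(r"\n\s*\n", text) if p.strip()])
    return words, sentences, paragraphs


def account_averages(posts_by_user):
    """posts_by_user: dict of username -> list of post texts.

    Return username -> (avg words, avg sentences, avg paragraphs) per post.
    """
    averages = {}
    for user, posts in posts_by_user.items():
        if not posts:
            continue  # nothing to average for this user
        stats = [post_stats(text) for text in posts]
        averages[user] = tuple(mean(column) for column in zip(*stats))
    return averages
```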

For multiple account detection:

- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, emails, etc)
- [your/others ideas here]
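A minimal sketch of the matching step, assuming addresses and handles have already been extracted from each post by some other script. The input format and function name are assumptions made for this example.

```python
from collections import defaultdict


def shared_identifiers(posts, min_users=2):
    """posts: iterable of (username, identifiers), where identifiers are strings
    already extracted from a post (addresses, Telegram handles, emails, ...).

    Return identifier -> sorted usernames, for identifiers used by several users.
    """
    users_by_identifier = defaultdict(set)
    for username, identifiers in posts:
        for identifier in identifiers:
            # Case is kept as-is: Base58 Bitcoin addresses are case-sensitive.
            users_by_identifier[identifier.strip()].add(username)
    return {
        identifier: sorted(users)
        for identifier, users in users_by_identifier.items()
        if len(users) >= min_users
    }
```

A shared identifier is only a lead for manual review. Exchange deposit addresses or a campaign manager's contact details, for example, can legitimately show up in posts from unrelated users.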

For copy-pasta detection:
- Write a script to determine copy-pasta from accounts by matching the text of posts to similar text on other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis). Users w/ percentages above a certain threshold will be put into a list & potentially reported to threads/mods. I.e. external plagiarism detection.
- Write a script to determine copy-pasta by matching post content against other users' post content. High similarities will raise red flags. I.e. internal plagiarism detection. *note: suchmoon mentioned working on something similar, so other scripts may take precedence*
- The original script may want to ignore quote tags. If so, depending on how it is built (comparing full text, or word by word), another side-script would have to be built to prevent users from just wrapping their messages in quote tags.
- [your/others ideas here]
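One possible way to score similarity between two posts, sketched under the assumption that quotes have already been stripped: split each post into overlapping three-word sequences ("shingles") and compare the two sets with the Jaccard index. This is a swapped-in textbook technique, not something any script in this thread is confirmed to use, and the result is a similarity score rather than a true probability.

```python
import re


def shingles(text, size=3):
    """Return the set of `size`-word sequences in a post, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Similarity in [0, 1]: shared shingles divided by all distinct shingles."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Comparing every post against every other post grows quadratically, so a real run would need some way of narrowing down candidate pairs first.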

For trust abuse/merit abuse:
- Detecting trust abuse (users who send out a large amount of negative trusts, using the same text). This would obviously avoid trusted members (as some good campaign managers send out trusts w/ same text). This is mostly targeted towards members w/ no trust, or negative trust (ie: newer members, no trade history, etc). Results would be posted in a thread in a list format using tildes "~" so people can copy/paste the list of abusers into their trust lists. Users could request to be removed from this list by public poll within the thread (this should probably be handled manually).
- [your/others ideas here]
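A rough sketch of the trust check, assuming feedback has been scraped into (sender, rating, comment) tuples and that a set of trusted usernames is supplied by hand. The tuple layout and the threshold of 5 repeats are assumptions made for this example.

```python
from collections import Counter


def repeated_negative_trust(feedback, trusted=(), min_repeats=5):
    """feedback: iterable of (sender, rating, comment), rating < 0 meaning negative.

    Return senders outside `trusted` who left the same negative comment
    at least `min_repeats` times, as "~username" lines.
    """
    counts = Counter(
        (sender, " ".join(comment.split()).lower())
        for sender, rating, comment in feedback
        if rating < 0 and sender not in trusted
    )
    flagged = sorted({sender for (sender, _), n in counts.items() if n >= min_repeats})
    return ["~" + sender for sender in flagged]
```

The output is meant as a starting point for manual review, in line with handling removals manually.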

General ideas for all scripts
- Automatic posting to anti-spam threads w/ results (in such a way as to not create more spam though)
- Platform where users which have been reported by scripts can be documented, with automatic ban detection. That way scripts aren't looking into users if they have already been reported/banned.
- [your/others ideas here]



Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic and maybe load it up on a website of mine.

I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance. I'll update the above list with approved suggestions that I plan to work on.

Thanks!

P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only send a certain amount of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know :)
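A minimal sketch of the request throttling, assuming a single-threaded scraper. The class name and the one-second default are placeholders; the actual limit would be whatever the admins consider acceptable.

```python
import time


class RateLimiter:
    """Allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self._min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

The scraper would call `limiter.wait()` right before each request, so the pace stays fixed no matter how fast the parsing runs.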



Change log:

Code:
Edit (September 19th, 2018): I'll be updating this thread (see under bolds) with new ideas as this thread progresses. Also, if anyone else wishes to contribute to my scripts (or even build their own one-offs targeting the ideas above), just let me know that you're working on it, and I'll mark it in the thread. While I agree different scripts/algorithms would be harder to avoid/abuse, obviously I'd want all of the scripts to be developed in a timely manner, so duplicating work probably isn't a good idea as of this moment.
Edit (September 20th, 2018): Adding trust/merit abuse columns - automatic detection of users abusing the trust/merit system.
© 2020, Bitcointalksearch.org