Topic: Need data from the forum (Read 513 times)

staff
Activity: 3374
Merit: 6530
Just writing some code
June 16, 2016, 09:58:02 AM
#8
Google Prediction API is the biggest failure when it comes to NLP. Understanding its heuristics is next to impossible. Perhaps if you write your own genetic algorithms, it might get the closest to detecting spam and hopefully filtering it out.
It's not as easy as it seems, since you would have to write your own model which analyzes the topic first and then filters out the spam.
If Google's Prediction API doesn't work, then I will find some other Machine Learning platform and see how well they do. If all else fails, I can attempt to figure out TensorFlow. Either way, I'm going to need the same data.

Practically speaking, it's impossible, because the model will fail most of the time, since a post could be off-topic yet still constructive to the discussion.
That's also why I'm looking for the Topic URLs. I may be able to use the rest of the topic as context and train it to analyze the post in context.


You know there are a few things bots can't beat humans at..
Some AIs are very intelligent, especially ones on cloud platforms where there is a ton of computing power backing them.
hero member
Activity: 910
Merit: 1000
「きみはこれ
June 16, 2016, 09:40:24 AM
#7
Google Prediction API is the biggest failure when it comes to NLP. Understanding its heuristics is next to impossible. Perhaps if you write your own genetic algorithms, it might get the closest to detecting spam and hopefully filtering it out.
It's not as easy as it seems, since you would have to write your own model which analyzes the topic first and then filters out the spam.

Practically speaking, it's impossible, because the model will fail most of the time, since a post could be off-topic yet still constructive to the discussion. You know there are a few things bots can't beat humans at...
staff
Activity: 3374
Merit: 6530
Just writing some code
June 16, 2016, 09:29:33 AM
#6
The 'URL' would not work for trashed and deleted threads. We could only manually copy the body. We have a few patterns that we look out for. You can just stick around in the speculation section and you will encounter various spam.
The URL is probably not necessary. I'm thinking that I might need it to provide context to the posts for training the model, but it probably isn't that useful anyways.

Additionally, our own system might be less effective if we make patterns public.
You could just PM me.

I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Why not just strip the quotes (if you end up with nothing other than, say, "+1", it was a pointless post anyway)?

When I feed data to the model for spam detection after it's done training, I will strip out the quotes, as that part will be automated. But for gathering the data, I noticed that stripping out the quotes by hand can be a hassle, especially for the posts of people who respond to things line by line (like myself) and use quotes a lot. And it is pretty much impossible to strip out the quotes when the entire post is copied, since nothing indicates where the quote stops.
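
A minimal sketch of what the automated quote stripping could look like. It assumes the raw post body is available with BBCode-style [quote ...] ... [/quote] markers; that format and the function name strip_quotes are assumptions for illustration, not something confirmed in this thread. It does not help with text copied from the rendered page, where nothing marks the end of a quote.

```python
import re

# Assumption: raw posts mark quotes as [quote ...] ... [/quote] (BBCode style).
# The pattern matches only an innermost quote: its body may not contain
# another opening [quote tag.
_INNERMOST_QUOTE = re.compile(
    r"\[quote[^\]]*\](?:(?!\[quote)[\s\S])*?\[/quote\]",
    re.IGNORECASE,
)


def strip_quotes(raw_post: str) -> str:
    """Remove quoted blocks, including nested ones, and tidy whitespace."""
    text = raw_post
    while True:
        # Peel off one nesting level per pass until no quote is left.
        text, removed = _INNERMOST_QUOTE.subn("", text)
        if removed == 0:
            break
    # Collapse the runs of blank lines left behind by removed quotes.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```

Whatever is left after stripping (for example a bare "+1") is the text that would be handed to the model.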
legendary
Activity: 1890
Merit: 1078
Ian Knowles - CIYAM Lead Developer
June 16, 2016, 09:21:02 AM
#5
I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Why not just strip the quotes (if you end up with nothing other than, say, "+1", it was a pointless post anyway)?
legendary
Activity: 2674
Merit: 2965
Terminated.
June 16, 2016, 09:19:36 AM
#4
Would it be possible for the mods or admins to give me a file with all of the posts that were deleted for being spam and the topic they were posted in?
I doubt that. Not all of the posts/threads that are deleted/trashed are spam.

If that is not possible, could the mods, as you delete spam posts, put them into this spreadsheet: https://docs.google.com/spreadsheets/d/16frPDZkHcg-WYuWtj_Qqkc0fzPtoj4kBKrjpCrlU9h4/edit?usp=sharing on the sheet labeled SPAM.
The 'URL' would not work for trashed and deleted threads. We could only manually copy the body. We have a few patterns that we look out for. You can just stick around in the speculation section and you will encounter various spam. Additionally, our own system might be less effective if we make patterns public.
staff
Activity: 3374
Merit: 6530
Just writing some code
June 16, 2016, 09:18:38 AM
#3
As much as you probably won't like it - if you just got rid of every single ad-sig post then you would get rid of at least 99% of the spam (and my guess is that any AI type analysis will probably end up identifying the ad-sig as being the key indicator).

(to keep posts made by people such as yourself would require a few exceptions to the ad-sig rule but I haven't seen more than a dozen accounts with ad-sigs that are not spammers)

It's probably a good place to start; just going through and adding the posts of users in 777coin or yobit to the spam part. But I'm not going to train it with people's sigs in the posts. I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Right now my method is to choose some users who I know have good post quality (e.g. DannyHamilton) and put their posts into the NOT SPAM sheet. Then I will do the same with users who I know have terrible post quality, putting their posts into the SPAM sheet.
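
The labelling-by-author method described above can be sketched roughly as follows. The (author, text) input shape, the function names, and the two-column CSV layout are all assumptions made for illustration; the placeholder author names in the test data are hypothetical.

```python
import csv
import io


def label_by_author(posts, good_authors, bad_authors):
    """Turn (author, text) pairs into (label, text) rows.

    Posts by authors on neither list are skipped, because this
    approach says nothing about them.
    """
    rows = []
    for author, text in posts:
        if author in good_authors:
            rows.append(("NOT SPAM", text))
        elif author in bad_authors:
            rows.append(("SPAM", text))
    return rows


def to_csv(rows):
    """Serialize the labelled rows as CSV text, label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

One caveat with this shortcut: every post inherits its author's label, so a good post from a usually poor poster is still recorded as SPAM.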
legendary
Activity: 1890
Merit: 1078
Ian Knowles - CIYAM Lead Developer
June 16, 2016, 09:13:03 AM
#2
As much as you probably won't like it - if you just got rid of every single ad-sig post then you would get rid of at least 99% of the spam (and my guess is that any AI type analysis will probably end up identifying the ad-sig as being the key indicator for spammers).

(to keep posts made by people such as yourself would require a few exceptions to the ad-sig rule but I haven't seen more than a dozen accounts with ad-sigs that are not spammers)
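
As a rough illustration of the ad-sig rule with exceptions, here is a minimal sketch. The Post type, the has_ad_sig flag, and the exempt_authors parameter are hypothetical names introduced only for this example; how an ad signature would actually be detected is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    author: str
    text: str
    has_ad_sig: bool  # hypothetical flag; detecting an ad-sig is out of scope


def flag_by_ad_sig(posts, exempt_authors=frozenset()):
    """Return the posts whose author carries an ad-sig and is not exempt."""
    return [
        post
        for post in posts
        if post.has_ad_sig and post.author not in exempt_authors
    ]
```

A rule like this could also serve as a baseline to compare any trained model against.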
staff
Activity: 3374
Merit: 6530
Just writing some code
June 16, 2016, 09:09:41 AM
#1
I'm thinking about (going to try) using the Google Prediction API: https://cloud.google.com/prediction/ to detect spam. However, in order to do so, it needs to be trained to know what spam and not spam looks like. Would it be possible for the mods or admins to give me a file with all of the posts that were deleted for being spam and the topic they were posted in?

If that is not possible, could the mods, as you delete spam posts, put them into this spreadsheet: https://docs.google.com/spreadsheets/d/16frPDZkHcg-WYuWtj_Qqkc0fzPtoj4kBKrjpCrlU9h4/edit?usp=sharing on the sheet labeled SPAM.

Users can also help. If you see a post that you think is spam, you can put it into the above spreadsheet. You can also put posts that you think are good and not spam into the sheet labeled NOT SPAM. If you put things into the spreadsheet, it would be best not to include quoted stuff.

I know that this sounds like a lot of work for users and mods, but I also think that having a prediction model for spam would be a beneficial thing for this forum.

If anyone wants to help me here, feel free and please do so. If anyone has any suggestions for me for any part of this project, please let me know.
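
To make the idea of training on labelled examples concrete, here is a small local stand-in. It is not the Google Prediction API and makes no claim about how that service works internally; it is a plain multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. The class name, the label strings, and the tokenizer are assumptions chosen for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamSketch:
    """Multinomial naive Bayes over word counts, two fixed labels."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "not_spam": Counter()}
        self.doc_counts = Counter()

    def train(self, examples):
        """examples: iterable of (label, text), label in {'spam', 'not_spam'}."""
        for label, text in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        """Return the label with the higher log-probability score."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["not_spam"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label, counts in self.word_counts.items():
            # Smoothed log prior over the two classes.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            # Add-one smoothing; the extra +1 reserves a slot for unseen words
            # and keeps the denominator non-zero before any training.
            denominator = sum(counts.values()) + len(vocab) + 1
            for word in tokens:
                score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)
```

With only a handful of examples per label a model like this just memorizes vocabulary, which is why a large set of labelled posts matters.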