
Topic: Turning your post history into word clouds! (Read 400 times)

sr. member
Activity: 602
Merit: 295
Hail Eris!
December 10, 2017, 02:45:31 PM
#12
I am all about open source.  Note the end of the first post!  Smiley
I mean, putting the code on Github, but anyway, I get your point.


Not sure we will ever know the ultimate goal, but the direction I am going in involves the use of data mining and visualization (visual data mining) to model and detect shilling within product review sets, forum threads, and so on.  Visualization is a means of doing this, and the techniques I hope to use go well beyond word clouds.  Word clouds visualize surface-level information and can be rather insightful, but there are many great techniques that work at the level of meaning.  I'm currently working with sentiment analysis and emotion detection; when shilling occurs, we see certain patterns of sentiment.  (I posted a toy project a few weeks ago somewhere visualizing shilling.)

So yeah, shitpost modeling. Smiley
Doesn't ML form an integral part of it, along with data mining?  Honestly, this isn't as easy as it looks.  I'm pretty sure the complexity will keep increasing as the project progresses.


Sure, I will open a Github repo for this project.

Increasing complexity is our friend, as long as we recenter often: continual improvement, growing complexity, and every now and then a shakedown and a paradigm change.
Machine learning and data mining go hand in hand.  I like decision tree ensembles!  And don't forget, I am doing this for fun.  And I do have a graduate degree in this stuff (data mining with a focus on visualization and natural language processing).  Tongue
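
For anyone curious what I mean by that, here is a minimal sketch of a decision tree ensemble over post text using scikit-learn (TF-IDF features feeding a random forest).  The posts and labels below are made-up placeholders, not real shill data, and the real features would go well beyond raw TF-IDF.

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

#toy posts with made-up labels: 1 for shill-like, 0 for not
posts = [
    "This coin is the best investment ever, guaranteed moon, buy now!",
    "I compared the whitepapers and the consensus mechanisms differ in a few ways.",
    "Amazing team, amazing product, 100x incoming, don't miss out!!!",
    "The fee changes in the latest release seem reasonable to me.",
]
labels = [1, 0, 1, 0]

#TF-IDF text features feeding a small forest of decision trees
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(posts, labels)

#score a new post
print(model.predict(["guaranteed profits, buy buy buy"]))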
legendary
Activity: 1988
Merit: 1317
Get your game girl
December 10, 2017, 02:28:14 PM
#11
I am all about open source.  Note the end of the first post!  Smiley
I mean, putting the code on Github, but anyway, I get your point.


Not sure we will ever know the ultimate goal, but the direction I am going in involves the use of data mining and visualization (visual data mining) to model and detect shilling within product review sets, forum threads, and so on.  Visualization is a means of doing this, and the techniques I hope to use go well beyond word clouds.  Word clouds visualize surface-level information and can be rather insightful, but there are many great techniques that work at the level of meaning.  I'm currently working with sentiment analysis and emotion detection; when shilling occurs, we see certain patterns of sentiment.  (I posted a toy project a few weeks ago somewhere visualizing shilling.)

So yeah, shitpost modeling. Smiley
Doesn't ML form an integral part of it, along with data mining?  Honestly, this isn't as easy as it looks.  I'm pretty sure the complexity will keep increasing as the project progresses.
sr. member
Activity: 602
Merit: 295
Hail Eris!
December 10, 2017, 02:22:33 PM
#10
Interesting !

The first thing that comes to my mind is: why don't you open-source such projects?  We could all work on it together!  Having said that, this could be turned into a useful tool that helps us identify shitposters.  Looking forward to your ultimate goal.  Cheers.

I am all about open source.  Note the end of the first post!  Smiley

Not sure we will ever know the ultimate goal, but the direction I am going in involves the use of data mining and visualization (visual data mining) to model and detect shilling within product review sets, forum threads, and so on.  Visualization is a means of doing this, and the techniques I hope to use go well beyond word clouds.  Word clouds visualize surface-level information and can be rather insightful, but there are many great techniques that work at the level of meaning.  I'm currently working with sentiment analysis and emotion detection; when shilling occurs, we see certain patterns of sentiment.  (I posted a toy project a few weeks ago somewhere visualizing shilling.)

So yeah, shitpost modeling. Smiley
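
To give a flavor of the sentiment side, here is a minimal sketch that scores a handful of placeholder posts with NLTK's VADER analyzer and plots the compound score post by post; the real pipeline would feed in scraped post text instead.

Code:
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#VADER needs its lexicon downloaded once
nltk.download("vader_lexicon")

#placeholder posts standing in for a user's scraped post history
posts = [
    "This project looks interesting, reading the whitepaper now.",
    "Best coin ever, everyone should buy, guaranteed gains!",
    "Amazing amazing amazing, to the moon!",
    "Not sure about the tokenomics, the supply schedule worries me.",
]

#score each post and keep the compound sentiment (-1 very negative, +1 very positive)
sia = SentimentIntensityAnalyzer()
scores = [sia.polarity_scores(post)["compound"] for post in posts]

#plot the shift in sentiment from post to post
plt.plot(range(1, len(scores) + 1), scores, marker="o")
plt.xlabel("Post number")
plt.ylabel("Compound sentiment")
plt.show()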

Other than that, my goal is to find constructive ways of meeting signature campaign requirements.  I like to meet the requirements but feel bad when I'm not being productive or contributing.  Things like this help me contribute constructively while meeting my needs.
legendary
Activity: 1988
Merit: 1317
Get your game girl
December 10, 2017, 02:03:49 PM
#9
Interesting !

The first thing that comes to my mind is: why don't you open-source such projects?  We could all work on it together!  Having said that, this could be turned into a useful tool that helps us identify shitposters.  Looking forward to your ultimate goal.  Cheers.
sr. member
Activity: 602
Merit: 295
Hail Eris!
December 10, 2017, 01:43:21 PM
#8
This is for user apoorvlathey.  Please let me know one thing you would change about the word clouds! 



legendary
Activity: 2814
Merit: 2472
https://JetCash.com
December 10, 2017, 05:43:11 AM
#7
Thanks for posting this - it's really useful on two counts.  It will help me get my head around Python, and it will help me check on the quality of my posting.

It's a great idea.
sr. member
Activity: 602
Merit: 295
Hail Eris!
December 09, 2017, 02:15:57 PM
#6
It's kinda cool to see my own work like this!

Now, some comments:
  • Post 1 = Post 21, Post 2 = Post 22, and so on. I think your scraper fails to reach the second page of the post history.
  • My Post 25 is missing entirely.
  • You should exclude quotes; Post 2, for instance, shows _Arnold_, which is the name of the person I quoted.
  • Some of the small fonts are impossible to read. If you limit it to a 4x5 grid of images, they can be a bit bigger without getting wider, and it fixes my first comment at the same time.
  • Post 12 shows "AMSo"; I can't see that in my original post.

I am totally on it.  I was excited to see the initial pieces come together.  It should also order them in the reverse direction.

I will look into the problems; mine did 25, but yeah, I think my posts-per-page counter is off.

Thanks for the input.  In my book, the fastest way to develop neat software is iteratively, with constant user feedback.
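
For anyone following along, the paging fix will probably look something like this: a sketch of buildPlot that actually advances the page and lets the last post through, reusing processPostsPage and the imports from the script in the first post.  Untested, and it doesn't yet handle users with fewer posts than requested.

Code:
#reuses processPostsPage, plt, and time from the script in the first post
def buildPlot(userId, numberOfPosts):
    fig = plt.figure(figsize=(25, 15))

    count = 0
    currentPage = 0  #start=0 is the first page of post history, start=20 the second

    while count < numberOfPosts:
        #polite pause between page grabs
        time.sleep(1)

        clouds = processPostsPage(currentPage, userId)
        currentPage = currentPage + 1

        for cloud in clouds:
            count = count + 1
            if count > numberOfPosts:
                break

            a = fig.add_subplot(5, 5, count)
            a.set_xticks([])
            a.set_yticks([])
            a.set_title("Post " + str(count))
            plt.imshow(cloud)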
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
December 09, 2017, 02:09:11 PM
#5
It's kinda cool to see my own work like this!

Now, some comments:
  • Post 1 = Post 21, Post 2 = Post 22, and so on. I think your scraper fails to reach the second page of the post history.
  • My Post 25 is missing entirely.
  • You should exclude quotes; Post 2, for instance, shows _Arnold_, which is the name of the person I quoted.
  • Some of the small fonts are impossible to read. If you limit it to a 4x5 grid of images, they can be a bit bigger without getting wider, and it fixes my first comment at the same time.
  • Post 12 shows "AMSo"; I can't see that in my original post.
sr. member
Activity: 602
Merit: 295
Hail Eris!
December 09, 2017, 01:19:25 PM
#4
I'll bite, do me please: ID 459836.

Yes! 

Rewriting it a bit to make it look nicer.  Also parsing dates so I can show word clouds for given date ranges, such as a weekly word cloud.  I will play around and have yours sometime today.
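
Roughly, the date handling should look like the sketch below: parse timestamps in the format the forum displays (e.g. "December 09, 2017, 01:19:25 PM"), bucket post text by ISO week, and build one cloud per bucket.  The groupPostsByWeek helper and its (dateString, postText) input are just a sketch, not the final version.

Code:
from datetime import datetime
from collections import defaultdict
from wordcloud import WordCloud

def groupPostsByWeek(datedPosts):
    #datedPosts: list of (dateString, postText) pairs,
    #with dates formatted the way the forum displays them
    weeklyText = defaultdict(str)
    for dateString, text in datedPosts:
        when = datetime.strptime(dateString, "%B %d, %Y, %I:%M:%S %p")
        #bucket by ISO year and week number
        year, week, _ = when.isocalendar()
        weeklyText[(year, week)] += " " + text
    return weeklyText

#one word cloud per week instead of per post
datedPosts = [("December 09, 2017, 01:19:25 PM", "example post text")]
weeklyClouds = {week: WordCloud().generate(text)
                for week, text in groupPostsByWeek(datedPosts).items()}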

Just wrote the darn script, so there is a lot of room for improvement.  Let's see what we can find out about you using text visualization.

Got another program that detects emotions in text, which we used to process novels.  I might color-code by emotion, or add emoticons indicating the emotion!  I'm unemployed right now, so pardon the excitement over little things.  Projects keep me going.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
December 09, 2017, 01:04:29 PM
#3
I'll bite, do me please: ID 459836.
sr. member
Activity: 602
Merit: 295
Hail Eris!
December 09, 2017, 11:46:27 AM
#2
No bites?  *shrugs*

Probably shouldn't post at 1 in the AM.

I would love some examples of known shills and non-shills that I can visualize.  These are just word clouds and not very useful on their own; there are a number of other techniques which are more useful, like plotting shifts of sentiment.

BTW, there is a really neat Kaggle contest going on where you are given a collection of story snippets and their authors for training (it was a Halloween contest and focuses on spooky authors like Poe), and the goal is to build a classifier that can correctly match unlabeled snippets with their authors.
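
If anyone wants to play with that kind of authorship problem, the usual baseline is TF-IDF features plus a Naive Bayes classifier; here is a minimal scikit-learn sketch with placeholder snippets and author labels standing in for the real training data.

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

#placeholder snippets and author labels standing in for the contest training data
snippets = [
    "The raven croaked upon the chamber door at midnight.",
    "The creature's anatomy defied every law I had learned at the academy.",
    "A dreadful silence settled over the ancient, cyclopean ruins.",
]
authors = ["EAP", "MWS", "HPL"]

#TF-IDF features plus multinomial Naive Bayes is a common text-classification baseline
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(snippets, authors)

#predict the author of an unlabeled snippet
print(model.predict(["A shadow moved beyond the chamber door."]))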
sr. member
Activity: 602
Merit: 295
Hail Eris!
December 09, 2017, 04:49:12 AM
#1
Wrote a little script that generates word clouds for a user's last 25 posts.

Each post gets a word cloud, but I could easily change it to generate one for each week, month, whatever.

This is a tiny part of a hopefully larger project approaching shilling analysis and detection.  Later versions may calculate and make use of sentiment.

Here are my last 25!  If you want me to do one, send your user ID.  Any requests for special formatting or additional features may or may not be granted, so feel free to ask.  The Python script is included at the end of this post so you can do it yourself as well, though I just wrote it, so like most code it could use some refactoring.



Code:
import requests
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from bs4 import BeautifulSoup
import time

def processPostsPage(pageNumber, userId):
    
    #grab the next page of posts
    pageUrl = "https://bitcointalk.org/index.php?action=profile;u=" + str(userId) + ";sa=showPosts;start="+str(pageNumber*20)
    response = requests.get(pageUrl)
    data = response.content  
    
    #grab divs with the 'post' attribute
    soup = BeautifulSoup(data, "html.parser")
    postsOnPage = soup.findAll("div", { "class" : "post" })
    
    #return an array of word clouds
    wordCloudsForPageOfPosts = []
    
    stopwords = ["post","posts"]
    
    #convert posts to wordcloud images
    for div in postsOnPage:
    
        text = div.get_text()
        
        #remove the stopwords with a little magic
        text = " ".join([word for word in text.split() if word not in stopwords])
        
        nextCloud = WordCloud().generate(text)
        
        wordCloudsForPageOfPosts.append(nextCloud)
        
    return wordCloudsForPageOfPosts
 

def buildPlot(userId, numberOfPosts):
    width = 25
    height = 15
    fig = plt.figure(figsize=(width,height))

    
    currentPage = 1
    
    count = 0
    
    while(count <= numberOfPosts):
        
        #sleep in between grabbing pages of posts for a stress free bitcointalk crawl
        time.sleep(1)
        
        #grab the clouds for that page
        clouds = processPostsPage(currentPage,userId)
        
        #display the clouds until we reach the right number
        for cloud in clouds:
            count = count+1
            if(count < numberOfPosts):
                a = fig.add_subplot(5,5,count)
                
                #hack to delete tick marks
                for a in fig.get_axes():
                    a.set_xticks([])
                    a.set_yticks([])
                a.set_title("Post " + str(count))
                plt.imshow(cloud)
            
            
userID = 249526
numberOfPostToDisplay = 25
buildPlot(userID, numberOfPostToDisplay)
plt.show()