Topic: Please help me to get details on boards, sub-boards of posts/topics (Read 402 times)

legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
I have 60611 pieces of data for you. It's too much to post, too large for pastebin, and loyce.club is currently offline. If you PM me an email address I'll send it.
Thank you so much. I released Merit distributions over boards (24/1/2018 - 19/12/2019).

This topic is now locked. :D
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I have 60611 pieces of data for you. It's too much to post, too large for pastebin, and loyce.club is currently offline. If you PM me an email address I'll send it.
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
How about you ask me for an update every six months or so?
I agree with the six-month updates. Thank you. :)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
It depends on you. If you can dump the data weekly, I will definitely make weekly updates too. :D

I just think such big stats won't change much from week to week.
How about you ask me for an update every six months or so?
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
It's running now. Results tomorrow :)
Thanks.
Quote
Wait.. You want weekly updates on this too?
It depends on you. If you can dump the data weekly, I will definitely make weekly updates too. :D

I just think such big stats won't change much from week to week.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
It's running now. Results tomorrow :)

If only the first scraping round (next Friday) takes a long time (one day), and the second and later rounds take less time (as I guess), please do it. :D
Wait.. You want weekly updates on this too?
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
It's a $2/year VPS :) Up to you if you want the data :)
Cool.

If only the first scraping round (next Friday) takes a long time (one day), and the second and later rounds take less time (as I guess), please do it. :D

I already have the List of boards, subboards (some local subboards are not listed)

For this analysis, I think I am going to make updates monthly or quarterly.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
It works, but honestly I don't think I should ask you to run your computer for a whole day just to do this. It sounds crazy.
It's a $2/year VPS that I can't use for anything else because it has only 128 MB RAM :) Up to you if you want the data :)
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
It works, but honestly I don't think I should ask you to run your computer for a whole day just to do this. It sounds crazy, but now I understand why you sometimes rejected my requests for help.

I don't need the off-limits boards.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Yes, I just need at least one variable shared by the two datasets to merge them, but what did you mean by 'the board'?
Is it the board's name or the board's ID number? I can use either, but the ID number is more convenient for me.
How's this?
topicID:boardID
Code:
1:offlimits
5:1
6:1
7:1
8:1
9:1
12:6
13:1
15:223
16:1
20:5
22:6
30:5
34:1
41:1
I'm currently not scraping Investigations as it's a hidden board. If you really want data on that board too, I'll have to scrape with an account that logs in, but I prefer not to.
If the above list works for you, I'll run it. It'll take a day to complete.
I could also wait until next Friday's Merit data dump is included.
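
For anyone who wants to try the merge: a minimal sketch of how a topicID:boardID list like the one above could be joined to the Merit data. It assumes each line of merit.all.txt holds five whitespace-separated columns (time, amount, msgID, user_from, user_to), which is an assumption and not confirmed in this topic. The function names are made up for illustration.

```python
def load_topic_boards(lines):
    """Parse 'topicID:boardID' lines into a dict.

    boardID is kept as a string because some values
    (e.g. 'offlimits') are not numeric.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        topic_id, board_id = line.split(":", 1)
        mapping[int(topic_id)] = board_id
    return mapping


def topic_id_from_msg(msg_id):
    """'5209104.msg53329921' -> 5209104"""
    return int(msg_id.split(".", 1)[0])


def add_board_column(merit_lines, topic_boards):
    """Yield merit rows with the board ID appended as the last column.

    Assumed input columns: time, amount, msgID, user_from, user_to.
    """
    for line in merit_lines:
        fields = line.split()
        # Skip header lines and anything that does not have five columns.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        board = topic_boards.get(topic_id_from_msg(fields[2]), "unknown")
        yield fields + [board]
```

Topics missing from the list (for example hidden boards that were not scraped) end up with board "unknown" instead of being dropped, so the Merit totals still add up.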
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
Those two lines have very different formats (different hierarchical formats, I meant).
It's just taken from the HTML links on top of each page. Different child boards have different "depths", that's what makes it "messy".

Quote
This one is what I need.
Code:
time        amount  board/subboard_idnumber  user_from  user_to
1576204400  2       24                       18321      307884
How about just the topicID (say: "5209104" taken from the msgID: "5209104.msg53329921") and the board? Is that enough to combine with my merit.all.txt ?
Yes, I just need at least one variable shared by the two datasets to merge them, but what did you mean by 'the board'?
Is it the board's name or the board's ID number? I can use either, but the ID number is more convenient for me.

If you don't have it available, I will do it myself; it's my turn. And if you only have the board's name, please put it in the last column (last variable).
Quote
See Ignore Boards Preferences, then view the page source.
It is helpful.
Quote
This can literally be done in one (long) line of code :)
It is easy for those who know how, but for others who don't, it is a challenge. :D

It is another thing I don't know. I mean, I use nearly the same tools for my statistical work, but with computer programming I have to learn from scratch. Thanks.
copper member
Activity: 1652
Merit: 1901
Amazon Prime Member #7
You will need to run a for loop through the links where the boards are to get the board number.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Those two lines have very different formats (different hierarchical formats, I meant).
It's just taken from the HTML links on top of each page. Different child boards have different "depths", that's what makes it "messy".

Quote
This one is what I need.
Code:
time        amount  board/subboard_idnumber  user_from  user_to
1576204400  2       24                       18321      307884
How about just the topicID (say: "5209104" taken from the msgID: "5209104.msg53329921") and the board? Is that enough to combine with my merit.all.txt ?
That shouldn't be too hard to scrape.

Quote
If you have id numbers of boards/ subboards, please give me a dump too. According to https://bitcointalk.org/index.php?action=stats, there are 252 boards at the moment.
If you don't have that list, I am going to get it myself. :)
See Ignore Boards Preferences, then view the page source.

Quote
Thanks for your informative explanations, but I do not yet have the knowledge and skills to scrape data (by any method). I will try to learn. :)
This can literally be done in one (long) line of code :)
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
Without much new work (only a lot of scraping), I can get OP a list like this one (but with the topic/msg-ID instead of a Merit count).
@tranthidung: can you work with that?
It is not perfect for me, but it looks cool and you already have the data available. That is a plus, sure.
But it looks a little messy and I cannot use it.

Example:
< ... >
Thanks for your informative explanations, but I do not yet have the knowledge and skills to scrape data (by any method). I will try to learn. :)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
The path needs to be parsed, and constitutes a text based solution (not an Id based one). On top of that, you need to do some cleansing when the moderators are included in the path
Without much new work (only a lot of scraping), I can get OP a list like this one (but with the topic/msg-ID instead of a Merit count).
@tranthidung: can you work with that?
legendary
Activity: 2338
Merit: 10802
There are lies, damned lies and statistics. MTwain
<...>
Basically you need to derive that information not from the message Id information (https://bitcointalksearch.org/topic/m.53358425), but from the page itself where the message is displayed (Bitcoin Forum > Other > Meta > Merit & new rank requirements).

The path needs to be parsed, and constitutes a text based solution (not an Id based one). On top of that, you need to do some cleansing when the moderators are included in the path (i.e. https://bitcointalksearch.org/topic/m.49500922 has as a path Bitcoin Forum > Economy > Marketplace > Goods > Collectibles (Moderators: malevolent, Cyrus, hilariousandco) > [WTS] Old peseta coins and few 1800 coins. Whole lot 500 eur now.), or when a title includes a “>” character (which is not a sublevel).
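
To make the cleansing concrete, here is a rough sketch of this kind of text based parsing. It is not the code behind the Dashboard; the function name, the regular expression, and the choice to drop the leading "Bitcoin Forum" entry are assumptions made for illustration.

```python
import re

# Matches a moderator list such as " (Moderators: malevolent, Cyrus)".
MODERATORS = re.compile(r"\s*\(Moderators?:[^)]*\)")


def board_levels(path, max_levels=3):
    """Split a breadcrumb path into at most max_levels cleaned levels."""
    cleaned = MODERATORS.sub("", path)
    parts = [part.strip() for part in cleaned.split(" > ")]
    # Drop the forum root so that level 1 is the first real section.
    if parts and parts[0] == "Bitcoin Forum":
        parts = parts[1:]
    return parts[:max_levels]


path = ("Bitcoin Forum > Economy > Marketplace > Goods > Collectibles "
        "(Moderators: malevolent, Cyrus, hilariousandco) > [WTS] Old peseta "
        "coins and few 1800 coins. Whole lot 500 eur now.")
print(board_levels(path))
print(board_levels(path, max_levels=4))
```

Keeping only a fixed number of levels sidesteps most titles that contain a ">", but a text based split cannot tell where the boards end and the topic title begins: for a topic sitting directly in a second-level board, the third entry returned is the title itself.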

It's what I do for the Dashboard, but there are some issues and I do not work with all path levels.

As I said, the output I create is text based, not Id based.

Something like this:
https://docs.google.com/spreadsheets/d/1hnuC0EadNbxm4gcK7GOTobCg-IWsOCAw1UCjunuIQlY/edit?usp=sharing

I only cleanse three levels in the path. The fourth is useful for reaching child boards, but I only cleanse it for the Spanish local board (I've omitted levels 4..10 since they do not fit and are not cleansed homogeneously).

Every now and then, the data should be regenerated retrospectively to cover posts being deleted and moved. It's a bit of a p.i.t.a. so I only do it every few months or so.

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I asked this because, if it is easy to do, I want to get statistics on merit distributions over boards and sub-boards from the raw merit data dumped by theymos.
Doesn't DdmrDdmr have this data available?
I never kept track of all topic locations.

Last Friday, there were 60612 different topics that received Merit. That means (with 1 second delay) you can scrape the data you need within 24 hours.
I do have all titles for 177716 merited posts, but that's not going to help you here.
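
A quick check of that estimate, as a sketch: the function name and the fetch_seconds parameter are made up here, and the per-request download time is an assumption that depends on the connection.

```python
def scrape_hours(n_pages, delay_seconds=1.0, fetch_seconds=0.0):
    """Total hours needed to visit n_pages one at a time."""
    return n_pages * (delay_seconds + fetch_seconds) / 3600


# One page per merited topic, 1 second delay, download time ignored.
print(round(scrape_hours(60612), 1))  # 16.8
```

With the delay alone that is just under 17 hours, so the job stays within 24 hours as long as each request itself takes less than about 0.4 seconds.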
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
Thanks, it looks like a good guide, but I have to learn programming to do this. It will surely take time. :)
copper member
Activity: 1652
Merit: 1901
Amazon Prime Member #7
I believe the topic number of this thread, 5210219, was assigned because the previously created thread was topic number 5210218. I don't believe it has anything to do with the thread being created in the Meta sub-board, and if it were moved to another sub-board, I don't believe the topic number would change.

If you visit a single post in a thread, you know which board every post in that thread is located in. So if you need to check for 100 posts in the same thread, you only need to visit one page in that thread. If you have a relational database, you can create a new table that has a list of each thread you are looking at, and record the board ID, and the topic number (along with any other information you need about the thread).
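
A small sketch of such a table using SQLite from Python. The table and column names are only suggestions, and the rows are made-up sample data, not real Merit transactions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the data

# One row per thread: visit the thread once, record its board.
conn.execute(
    "CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, board_id INTEGER)"
)
# One row per Merit transaction, keyed by the topic it was sent in.
conn.execute(
    "CREATE TABLE merits (time INTEGER, amount INTEGER, topic_id INTEGER,"
    " user_from INTEGER, user_to INTEGER)"
)

# Made-up sample rows for illustration only.
conn.executemany("INSERT INTO topics VALUES (?, ?)", [(5, 1), (15, 223)])
conn.executemany(
    "INSERT INTO merits VALUES (?, ?, ?, ?, ?)",
    [
        (1576204400, 2, 15, 18321, 307884),
        (1576204500, 1, 15, 1, 2),
        (1576204600, 3, 5, 3, 4),
    ],
)

# Merit received per board, via the topic each transaction belongs to.
rows = conn.execute(
    "SELECT t.board_id, SUM(m.amount) FROM merits m"
    " JOIN topics t ON t.topic_id = m.topic_id"
    " GROUP BY t.board_id ORDER BY t.board_id"
).fetchall()
print(rows)
```

Because the board is stored once per thread, 100 merited posts in the same thread still need only one page visit and one row in the topics table.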

If you are visiting a thread, you can tell your program to look at the following to get the board number the thread is on:
Find all the "div" tags with the class "nav" ("1st Query")
From the 1st Query, search for all the links ("2nd Query")
From the 2nd Query, you can search for all the URLs that contain "php?board=" ("3rd Query").
From the 3rd Query, you can isolate the board number for each link ("4th Query")
From the 4th Query, convert the result to an integer (if you haven't already), and find the highest result. This is the board number the thread is located in.
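
The steps above can be sketched with Python's standard library alone. This is only a sketch: the class name, the function name, and the sample HTML are made up, and the real page markup may differ from what is assumed here.

```python
import re
from html.parser import HTMLParser


class NavBoardParser(HTMLParser):
    """Collect board numbers from links inside <div class="nav">."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0   # > 0 while inside a nav div
        self.board_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # 1st Query: div tags with the class "nav" (nested divs count too).
            if self.nav_depth or "nav" in (attrs.get("class") or "").split():
                self.nav_depth += 1
        elif tag == "a" and self.nav_depth:
            # 2nd to 4th Query: links whose URL contains "php?board=".
            match = re.search(r"php\?board=(\d+)", attrs.get("href") or "")
            if match:
                self.board_ids.append(int(match.group(1)))

    def handle_endtag(self, tag):
        if tag == "div" and self.nav_depth:
            self.nav_depth -= 1


def board_of_thread(html):
    """Return the highest board number found in the nav links, or None."""
    parser = NavBoardParser()
    parser.feed(html)
    return max(parser.board_ids) if parser.board_ids else None


# Made-up sample markup, not copied from the forum.
sample = (
    '<div class="nav"><b><a href="index.php?board=7.0">Parent board</a></b>'
    ' &gt; <b><a href="index.php?board=24.0">Child board</a></b>'
    ' &gt; <b><a href="index.php?topic=5210219.0">Topic title</a></b></div>'
)
print(board_of_thread(sample))  # 24
```

Taking the highest number follows the last step as written above. If a child board ever has a lower number than its parent, taking the last board link in the path (parser.board_ids[-1]) would be the safer choice.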
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
Are you trying to figure out what board a particular thread is in from visiting the thread? Or do you want to know based on other information?
It is easy to visit one post to see which board / sub-board it was posted in, but it is a serious issue if you have hundreds or thousands of posts to check. I have to check it automatically by machine, not by hand.

I would like to check it using the available topic or post numbers (as I described in the OP).