Topic: Additional data dumps? (Read 950 times)

legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
January 10, 2020, 10:36:40 PM
#28
Given the admin's reply yesterday, I think now is a very good time to settle on a consistent format for the forum's data dumps. Each dataset contains different variables, but I think all of them should share at least one common variable: userid.

Usernames, whether the login name, the display name, or both, will cause mismatches when joining the different datasets dumped by the forum.

Additional data dumps are not the priority, and I am not in a position to ask for much, but for the current data formats a small adjustment, from username to userid, would be good.

LoyceV asked for this change too: https://bitcointalksearch.org/topic/request-theymos-can-you-show-userids-instead-of-user-names-in-trusttxtxz-5104467
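
As a minimal sketch of why userid is the safer join key, assume two dumps that are keyed once by name and once by userid. All names, IDs, and values below are invented for illustration and do not come from any real dump.

```python
# Hypothetical sample data: every name, ID and value here is invented.
trust_by_name = {"alice": 5, "bob, the builder": -1}    # dump A, keyed by username
merit_by_name = {"Alice": 120, "bob, the builder": 30}  # dump B, keyed by display name

trust_by_uid = {101: 5, 102: -1}    # the same data keyed by userid
merit_by_uid = {101: 120, 102: 30}


def join(left, right):
    """Inner join of two dicts on their shared keys."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


by_name = join(trust_by_name, merit_by_name)  # "alice" vs "Alice" does not match
by_uid = join(trust_by_uid, merit_by_uid)     # both users match

print(len(by_name), len(by_uid))
```

In this toy case the name-based join silently drops one user, while the userid-based join keeps both.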
legendary
Activity: 2310
Merit: 4085
Farewell o_e_l_e_o
January 08, 2020, 10:48:01 PM
#27
UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name
It's the new year, so I am bumping this to ask about the additional data dumps proposed by theymos.

Besides the formats above, I would like to ask for this one (for merit data):
Code:
time amount msg user_from user_to boardid
1516831941  1 2818066.msg28853325 35 877396 24
I have already collected the boardid myself, so if the merit data had just one additional variable for the board's ID (boardid), it would eliminate the need to scrape data (with LoyceV's help) every 6 months. I don't know whether others need this variable in the data dumps, though.
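
To illustrate how the proposed format could be used, here is a rough sketch that parses the sample row above and totals merit per board. It assumes the fields are whitespace-separated and that the msg field has the form topicid.msgpostid; the function names are hypothetical.

```python
from collections import Counter

# The header and row from the proposed format above.
SAMPLE = """time amount msg user_from user_to boardid
1516831941  1 2818066.msg28853325 35 877396 24
"""


def parse_merit(text):
    """Parse whitespace-separated merit rows into dicts of integers."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = dict(zip(header, line.split()))
        # Assumption: msg looks like "<topic id>.msg<post id>".
        topic_id, _, post_id = fields["msg"].partition(".msg")
        rows.append({
            "time": int(fields["time"]),
            "amount": int(fields["amount"]),
            "topic_id": int(topic_id),
            "post_id": int(post_id),
            "user_from": int(fields["user_from"]),
            "user_to": int(fields["user_to"]),
            "boardid": int(fields["boardid"]),
        })
    return rows


def merit_per_board(rows):
    """Sum the merit amount sent within each board."""
    totals = Counter()
    for row in rows:
        totals[row["boardid"]] += row["amount"]
    return totals


rows = parse_merit(SAMPLE)
totals = merit_per_board(rows)
print(totals)
```

With boardid in the dump itself, the per-board total needs no scraping step at all.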

For some sorts of analyses like these:
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
February 08, 2019, 12:02:00 PM
#26
Is there any place to have the modlog in raw format available? Even for the limited time. I want to check some things :)
Do you mean older versions? I used archive.li and archive.org.
legendary
Activity: 2240
Merit: 3150
₿uy / $ell ..oeleo ;(
February 08, 2019, 07:44:52 AM
#25
Scraping works, but the current modlog covers only a limited time. It's not possible to get a complete overview of all banned users. From various sources, I have a list of 170k banned users now, but it's far from complete.

Is there any place to have the modlog in raw format available? Even for the limited time. I want to check some things :)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
February 08, 2019, 07:27:18 AM
#24
Scraping works, but the current modlog covers only a limited time. It's not possible to get a complete overview of all banned users. From various sources, I have a list of 170k banned users now, but it's far from complete.
legendary
Activity: 2240
Merit: 3150
₿uy / $ell ..oeleo ;(
February 08, 2019, 07:23:34 AM
#23
Is one still able to scrape data from the server? I thought theymos prohibited it since bitcointalk.to started to scrape the whole forum.

Both LoyceV and Vod are doing it, and I've seen other users doing it too, so I don't think there is any prohibition yet.
I think if those dumps are available for download directly from the forum, more people can benefit out of it and there will be less traffic to the server.
sr. member
Activity: 860
Merit: 423
February 08, 2019, 07:15:11 AM
#22
It's not a necrobump.
Can we have modlog and seclog dumps instead of everyone scraping the data from the server?


Is one still able to scrape data from the server? I thought theymos prohibited it since bitcointalk.to started to scrape the whole forum.
legendary
Activity: 2240
Merit: 3150
₿uy / $ell ..oeleo ;(
February 08, 2019, 03:37:25 AM
#21
It's not a necrobump.
Can we have modlog and seclog dumps instead of everyone scraping the data from the server?
member
Activity: 308
Merit: 22
April 16, 2018, 11:21:36 AM
#20
I wish it were in CSV format, as that is the easiest one to work with. I'd love to practice the Seaborn skills I learned from a short Udemy course.

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
April 16, 2018, 06:50:34 AM
#19
Any follow up on this?
legendary
Activity: 2814
Merit: 2472
https://JetCash.com
March 24, 2018, 02:31:22 PM
#18
Deleted posts that have been awarded merit.
legendary
Activity: 2338
Merit: 10802
There are lies, damned lies and statistics. MTwain
March 24, 2018, 02:11:54 PM
#17
Is there a possibility of including the Rank in the merit.txt file, or of having another file to complement it, so as to perform rank analysis tied to the data in the merit.txt file?
I've seen that Zentdex managed to cross-reference this information, but as far as I can see it's not in the public raw data files for general use.

It's true that Rank will vary for some users within the timeframe covered by the merit.txt file, but it would be a helpful way to break down the data and understand it better.
legendary
Activity: 2240
Merit: 3150
₿uy / $ell ..oeleo ;(
March 19, 2018, 05:28:34 AM
#16
For me it would be useful to get data on all the user IDs posting in a specific topic, together with the time, as in the ANN section.
If we can get a UID and Time on a topic, I can easily check for ICO pumpers.
legendary
Activity: 1988
Merit: 1317
Get your game girl
March 18, 2018, 11:44:00 AM
#15
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:
How big an operation would it be to auto-update the data on a daily basis? I was thinking I could set up endpoints that download the file daily and keep the source for my charts (whatever I choose to represent) updated every day.

It would be great if you can send a status flag along with the account details like "active/inactive/banned".
hero member
Activity: 504
Merit: 732
March 18, 2018, 07:17:08 AM
#14
Recently someone asked about their account, which was hacked in December, and I didn't even have a way to look up the date it happened (since it's gone from the page). So a security log dump would indeed be helpful.
sr. member
Activity: 812
Merit: 270
March 18, 2018, 05:25:06 AM
#13
I was asking for this kind of information here today and have only just seen this thread.

I think all dumps related to the forum architecture would be great for computing local board stats.

I am especially interested in analyses of this data which could point to sub-communities where the initial sMerit is exhausted and new sources are necessary, and people who might be good merit sources.

This kind of request would be easier to implement.

And what about some automatic dump archiving, so that several people don't have to do the same thing?
member
Activity: 308
Merit: 22
March 18, 2018, 03:45:12 AM
#12
I hope user zentdex will come up with some beautiful and informative charts. I'd love to see his posts.

Meanwhile, I will try to come up with something decent myself. That will be the perfect way to study data analysis.
full member
Activity: 350
Merit: 106
Telegram Moderator, Hire me
March 17, 2018, 08:58:26 PM
#11
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.
I think UID -> name, merit, potential activity, posts is useful. With it you could easily compile a user's post contents and create an outbox for each user, making it easy to look up or search a user's activity and recent posts. Some useful ideas have also been suggested, like this:
UID -> name, merit, potential activity, posts
I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?
You can also monitor the giving and receiving of merit by each user. This is what I understand from this thread; please feel free to correct me if I'm wrong.
legendary
Activity: 3654
Merit: 8909
https://bpip.org
March 17, 2018, 04:30:36 PM
#10
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.

All of the above, plus

Starting merit, starting sMerit, activity, rank for each user

This should allow us to see who's doing well (or not) at sending merits. Ideally we would also want merit source info but you didn't seem to want to publish that.

UID -> name, merit, potential activity, posts
I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?

Usernames should be double-quoted then, and double quotes should be doubled inside double quotes... Yes, CSV format sucks but there is an RFC document for it and most modern tools should be able to handle that.
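
As a small illustration of that quoting rule, here is a sketch using Python's standard csv module, whose default dialect quotes fields that contain commas or double quotes and doubles any embedded double quotes. The UIDs and usernames are made up.

```python
import csv
import io

# Hypothetical rows: one plain name, one with a comma, one with double quotes.
rows = [(1, "plain"), (2, "comma, inside"), (3, 'say "hi"')]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["uid", "name"])
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the original names unchanged.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```

Any reader that follows the same convention will round-trip the awkward usernames without a separate UID-to-name file.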



legendary
Activity: 2338
Merit: 10802
There are lies, damned lies and statistics. MTwain
March 17, 2018, 12:01:52 PM
#9
It all really comes down to what needs to be found out. That is, building a set of questions that need to be answered and deriving the raw data that enables an aggregated or derived dataset to be queried for the answers.

Some questions are answerable from a snapshot of the data, whilst others require the inclusion of a timeframe and datestamps to resolve.

For example, in order to see how long it takes members to rank up, we would need the whole history of rank changes per UserId, where a record would only need to be created on user creation or on a change in Rank, with Date being the associated timestamp.
If we wanted to see this in relation to Merit, we would need to build a registry in the shape of .

The other key factor is related to the current way in which data is stored. The raw data layout and capture process is part of the process to reach our solution goal.
For example, if there is a trigger in the database that currently logs changes on the User Table for the record structure, the underlying table is direct and all that has to be done, once exported, is to select the records that relate to a change in a user’s rank (and ignore those that are a mere activity change).

If, alas, the underlying user table does not hold a historical record of changes (i.e. no timestamped change history), then the question of how long it takes to rank up would not be answerable, or would need to be cross-referenced with raw data from another table.

Questions that I would boldly put on the list, given the introduction of sMerit, would be ones such as:

- What is the average time per Rank to rank up, before and after the introduction of the Merit system? (This is not entirely comparable yet, since the merit system is only a few months old, so the top Ranks cannot be compared yet.)

- How much sMerit is assigned per rank (from/to), per forum section, per forum subsection, in relation to the number of posts in the topic, in relation to how hot the topic is, in relation to the post's position in the topic (quartiles, for example), in relation to the size of the merited post, etc.

- How much sMerit is being withheld and for how long (averages).

- Round merit assignment candidate (from User A to User B and back -> That is derivable from current Merit.txt file as I’ve posted previously – it is not necessarily a cheat, but a source of study for such cases).

In my opinion, matching a closed set of key questions against the potential raw data structure should tell us which additional files are required.
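
The round merit question above (User A to User B and back) could be sketched roughly as follows. The rows are invented and use a simplified (time, amount, user_from, user_to) layout, and reciprocal_pairs is a hypothetical helper name. As noted, a hit is a candidate for study, not proof of cheating.

```python
from collections import defaultdict

# Hypothetical merit transactions: (time, amount, user_from, user_to).
transactions = [
    (1516831941, 1, 35, 877396),
    (1516900000, 2, 877396, 35),
    (1517000000, 1, 35, 12345),
]


def reciprocal_pairs(rows):
    """Return {(a, b): (merit a->b, merit b->a)} for pairs that merited each other."""
    sent = defaultdict(int)
    for _time, amount, user_from, user_to in rows:
        sent[(user_from, user_to)] += amount

    pairs = {}
    for (a, b), a_to_b in sent.items():
        # Report each pair once (a < b) and only if merit also flowed back.
        if a < b and (b, a) in sent:
            pairs[(a, b)] = (a_to_b, sent[(b, a)])
    return pairs


pairs = reciprocal_pairs(transactions)
print(pairs)
```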
hero member
Activity: 536
Merit: 513
March 17, 2018, 09:37:35 AM
#8
It seems to me that some local boards do not have sufficient sMerit distribution, and it would be good to confirm that directly from a data dump, which would help in designing an appropriate distribution of merit sources.  It would be useful to have

post ID, topic ID, board ID, merit

and check how active each local board is and whether sufficient sMerit is being distributed.  Of course spam and low-quality posts will be counted too, but I assume they are roughly proportional to the total number of posts.
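
A rough sketch of that check, assuming rows in the suggested post ID, topic ID, board ID, merit layout. The sample rows and the board_stats helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (post_id, topic_id, board_id, merit received by the post).
posts = [
    (1001, 10, 24, 0),
    (1002, 10, 24, 3),
    (1003, 11, 24, 1),
    (2001, 20, 201, 0),
    (2002, 20, 201, 0),
]


def board_stats(rows):
    """Count posts and total merit per board, plus merit per post."""
    stats = defaultdict(lambda: {"posts": 0, "merit": 0})
    for _post_id, _topic_id, board_id, merit in rows:
        stats[board_id]["posts"] += 1
        stats[board_id]["merit"] += merit
    return {
        board: {**s, "merit_per_post": s["merit"] / s["posts"]}
        for board, s in stats.items()
    }


stats = board_stats(posts)
for board, s in sorted(stats.items()):
    print(board, s)
```

A board with many posts but a merit-per-post value near zero would be a candidate for an additional merit source.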
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
March 17, 2018, 04:01:20 AM
#7
UID -> name, merit, potential activity, posts
I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?
jr. member
Activity: 40
Merit: 5
PM me to buy my sig space.
March 17, 2018, 12:17:39 AM
#6
Just fyi,

You can see and gauge how much sMerit someone has simply through the transparency of the system. So that's effectively a hidden field of the data dump.

You can calculate how much they've received versus how much they've sent... and from there you'll know how much sMerit they have left :/
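
A rough sketch of that calculation, under loud assumptions: the rows are invented, the layout is a simplified (time, amount, user_from, user_to), and the ratio of 0.5 sMerit per merit received is an assumption rather than something the dump states. Initial sMerit and merit source allocations are not visible in the transactions, so this is only an estimate.

```python
# Hypothetical merit transactions: (time, amount, user_from, user_to).
transactions = [
    (1516831941, 1, 35, 877396),
    (1516900000, 4, 877396, 35),
    (1517000000, 2, 12345, 35),
]


def estimated_smerit(rows, uid, smerit_per_merit=0.5):
    """Estimate spendable sMerit as (received * ratio) - sent.

    smerit_per_merit is an assumed conversion ratio, not a documented field.
    """
    received = sum(amount for _t, amount, _frm, to in rows if to == uid)
    sent = sum(amount for _t, amount, frm, _to in rows if frm == uid)
    return received * smerit_per_merit - sent


print(estimated_smerit(transactions, 35))
print(estimated_smerit(transactions, 877396))
```

A negative estimate means the user spent more than the visible transactions can explain, which points to sMerit from a source that the dump does not show.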
copper member
Activity: 2996
Merit: 2374
March 16, 2018, 11:41:21 PM
#5
I might suggest dumping the post history of individual users/accounts. This could be restricted by rank and otherwise be rate limited. I think it would be difficult to recreate any meaningful mirror site with this information.

As others have mentioned, the security log would be beneficial. The mod log, not so much because of its limited information.

It would be helpful if users' outboxes (and other folders) could be downloaded, since they cannot be easily searched. Obviously downloading this information would be restricted to users who are logged into their own account.
legendary
Activity: 1582
Merit: 1064
March 16, 2018, 12:33:12 AM
#4
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.

Modlog definitely.
hero member
Activity: 908
Merit: 657
March 16, 2018, 12:11:17 AM
#3
It might be helpful to have a continuous version of the seclog without having to rely on archived pages.
legendary
Activity: 2968
Merit: 3406
Crypto Swap Exchange
March 16, 2018, 12:00:58 AM
#2
What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
This (the rest aren't that important). I hope accounts with 0 posts/activity are excluded (to avoid having a massive file of information that's not needed).

Can we get another weekly dump that tracks positive/negative ratings (e.g. who sent them and who received them) and also shows ratings that someone has removed? (Credit goes to Vod, based on this thread).
administrator
Activity: 5222
Merit: 13032
March 15, 2018, 11:13:52 PM
#1
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.