Additional data dumps? | Bitcointalksearch.org

tranthidung

legendary

Activity: 2520

Merit: 4355

Farewell o_e_l_e_o

From the reply of admin yesterday, I think now it is a very good time to think of a consistent format for the forum's data dumps. Each dataset has different variables inside, but I think all of them should be connected with only common variable (at least one variable) - userid.

Username, no matter it is username or display name or both will result in differences when connecting different datasets dumped by the forum.

For additional data dumps, it is not the priority and I am not in a position to ask for it too much, but for current data formats, a small adjustment: from username to userid will be good.

LoyceV asked for this change too: https://bitcointalksearch.org/topic/request-theymos-can-you-show-userids-instead-of-user-names-in-trusttxtxz-5104467

tranthidung

legendary

Activity: 2520

Merit: 4355

Farewell o_e_l_e_o

Quote from: theymos on March 15, 2018, 11:13:52 PM

UID -> name, merit, potential activity, posts
post ID -> topic ID, time, UID
topic ID -> board ID, first post ID
board ID -> board name

The new year, so I bump it to ask for additional data dump granted by theymos.

Besides these formats above, I ask for this one (for merit data):

Code:

time amount msg user_from user_to boardid
1516831941  1 2818066.msg28853325 35 877396 24

I already collected the boardid, so if the merit data has only one additional variable for board's ID (boardid), it will eliminate the need to scrap data (from LoyceV's help) each 6 months. Although I don't know the others need such variable in data dumps or not.

For some sorts of analyses like these:

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: TheBeardedBaby on February 08, 2019, 07:44:52 AM

Is there any place to have the modlog in raw format available? Even for the limited time. I want to check some things

Do you mean older versions? I used archive.li and archive.org.

TheBeardedBaby

legendary

Activity: 2240

Merit: 3150

₿uy / $ell ..oeleo ;(

Quote from: LoyceV on February 08, 2019, 07:27:18 AM

Scraping works, but the current modlog covers only a limited time. It's not possible to get a complete overview of all banned users. From various sources, I have a list of 170k banned users now, but it's far from complete.

Is there any place to have the modlog in raw format available? Even for the limited time. I want to check some things

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Scraping works, but the current modlog covers only a limited time. It's not possible to get a complete overview of all banned users. From various sources, I have a list of 170k banned users now, but it's far from complete.

TheBeardedBaby

legendary

Activity: 2240

Merit: 3150

₿uy / $ell ..oeleo ;(

Quote from: 100bitcoin on February 08, 2019, 07:15:11 AM

Can one still able to scrape data from the server? I thought theymos prohibited it since bitcointalk.to started to scrape the whole forum.

Both LoyceV and Vod are doing it, also i've seen other users too, so I think there is no any prohibition, yet.
I think if those dumps are available for download directly from the forum, more people can benefit out of it and there will be less traffic to the server.

100bitcoin

sr. member

Activity: 861

Merit: 423

Quote from: TheBeardedBaby on February 08, 2019, 03:37:25 AM

It's not a necrobump.
Can we have the modlog and seclog dumps instead everyone to scrape the data from the server?

Can one still able to scrape data from the server? I thought theymos prohibited it since bitcointalk.to started to scrape the whole forum.

TheBeardedBaby

legendary

Activity: 2240

Merit: 3150

₿uy / $ell ..oeleo ;(

It's not a necrobump.
Can we have the modlog and seclog dumps instead everyone to scrape the data from the server?

mobilazy

member

Activity: 308

Merit: 22

I wish it was in csv format as easiest one to work with. I'd love to practice my Seaborn skills what I learned from short Udemy course.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Any follow up on this?

Jet Cash

legendary

Activity: 2870

Merit: 2474

https://JetCash.com

Deleted posts that have been awarded merit.

DdmrDdmr

legendary

Activity: 2338

Merit: 10802

There are lies, damned lies and statistics. MTwain

Is there a possibility of including the Rank in the merit.txt file or having another file to complement it so as to perform rank analysis tied to data in the merit.txt file?
I've seen Zentdex managed to cross this information, but it's not in the public raw data files for general usage as far as I can see.

It's True that Rank will vary for some user's within the timeframe of data within the merit.txt file, but is would be a helpful source to breakdown data and comprehend it better.

TheBeardedBaby

legendary

Activity: 2240

Merit: 3150

₿uy / $ell ..oeleo ;(

For me it will be useful to get a data of all the users IDs posting in a specific topic and time, like in the ANN section.
If we can get a UID and Time on a topic, I can easily check for ICO pumpers.

Joel_Jantsen

legendary

Activity: 1988

Merit: 1317

Get your game girl

Quote from: theymos on March 15, 2018, 11:13:52 PM

Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

How big of a operation is to auto-update the data on a daily basis ? I was thinking I could set-up end points which downloads the file daily and keep my source for the charts (whatever I choose to represent) updated every day.

It would be great if you can send a status flag along with the account details like "active/inactive/banned".

esmanthra

hero member

Activity: 504

Merit: 732

Recently someone asked about their account which was hacked in december, and I even didn't have a possibility to look at the date it happened (since it's gone from the page). So the security log dump would be indeed helpful.

JeremyB

sr. member

Activity: 812

Merit: 270

I was asking for such information here today and just see this thread now.

I think all dumps related to forum architecture will be great to compute local boards stats.

Quote from: theymos on March 01, 2018, 06:37:16 AM

I am especially interested in analyses of this data which could point to sub-communities where the initial sMerit is exhausted and new sources are necessary, and people who might be good merit sources.

This kind of requests would be easier to implement.

And what about some automatic dump archiving to avoid several people to do the same?

mobilazy

member

Activity: 308

Merit: 22

I hope user zentdex will come up with some beautiful and informative charts. I'd love to see his posts.

Meanwhile, I will try to come up with something decent myself. That will be the perfect way to study data analyze.

1020kingz

full member

Activity: 350

Merit: 106

Telegram Moderator, Hire me

Quote from: theymos on March 15, 2018, 11:13:52 PM

Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

UID -> name, merit, potential activity, posts
post ID -> topic ID, time, UID
topic ID -> board ID, first post ID
board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.

i think the UID -> name, merit, potential activity, post is useful. in this you can easily compile the post contents of a user and create an outbox for each user to be compiled into it and easy to look or search the users activity and recent post, also some useful ideas are suggested like this

Quote from: LoyceV on March 17, 2018, 04:01:20 AM

Quote from: theymos on March 15, 2018, 11:13:52 PM

UID -> name, merit, potential activity, posts

I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?

you can also monitor the give and take of merits by each user. this is what i understand by this thread please feel free to correct me if im wrong.

suchmoon

legendary

Activity: 3654

Merit: 8909

https://bpip.org

Quote from: theymos on March 15, 2018, 11:13:52 PM

Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

UID -> name, merit, potential activity, posts
post ID -> topic ID, time, UID
topic ID -> board ID, first post ID
board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.

All of the above, plus

Starting merit, starting sMerit, activity, rank for each user

This should allow us to see who's doing well (or not) at sending merits. Ideally we would also want merit source info but you didn't seem to want to publish that.

Quote from: LoyceV on March 17, 2018, 04:01:20 AM

Quote from: theymos on March 15, 2018, 11:13:52 PM

UID -> name, merit, potential activity, posts

I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?

Usernames should be double-quoted then, and double quotes should be doubled inside double quotes... Yes, CSV format sucks but there is an RFC document for it and most modern tools should be able to handle that.

DdmrDdmr

legendary

Activity: 2338

Merit: 10802

There are lies, damned lies and statistics. MTwain

It all comes down really to what needs to be found out. That is, building a set of questions that need to be answered and derive the raw data information that enables an aggregated or derived dataset to be queried for the answers.

Some questions are answerable by a snapshot of the data, whilst others require the inclusion on a timeframe and datestamps to resolve.

For example, in order to see how long it takes to rank up for members, we would need the whole history per UserId of rank changes , where the registry would only be necessary to be created when there is a user creation or a change in the Rank, being Date the associated timestamp.
If we wanted to see this in relation to Merit, we would need to build a registry in the shape of .

The other key factor is related to the current way in which data is stored. The raw data layout and capture process is part of the process to reach our solution goal.
For example, if there is a trigger in the database that currently logs changes on the User Table for the record structure, the underlying table is direct and all that has to be done, once exported, is to select records that relate to a change in user’s rank (and ignore those that are a mere activity change).

If alas the underlying user table does not hold a historical record of changes (i.e. no logged timestamp historical), then the question of how long it takes to rank up would not be answerable or need to be crossed with other raw data from another table.

Questions that I would boldly put on the list due to sMerit introduction would be such as:

- What is the average time per Rank to rank-up?: before and after the introduction of the Merit system (this is not entirely comparable yet, since merit system is only a few months old so top Ranks are not comparable yet).

- How much sMerit is assigned per rank (from/to), per forum section, per forum subsection, in relation to number of posts in topic, in relation to topic heatness, in relation to post position in topic (quartiles for example), in relation to size of merited post, etc.

- How much sMerit is being withheld and for how long (averages).

- Round merit assignment candidate (from User A to User B and back -> That is derivable from current Merit.txt file as I’ve posted previously – it is not necessarily a cheat, but a source of study for such cases).

The match between a closed set of key questions to answer, and potential raw data structure should give us what additional files are required in my opinion.

Topic: Additional data dumps? (Read 971 times)