Author

Topic: Standalone Database with Merit data + Full user details (Trust included) (Read 318 times)

Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
Where does Vod get active/inactive data about users?

By their Last Active date.  90 days and I mark them as inactive.  Unless they have no posts and no trust, then I move them to Archive.
hero member
Activity: 784
Merit: 1416
If i remember correctly, i may have read somewhere the archived accounts are on the order of some x00k, but still i would need an offline list with the userid to exclude directly.  Undecided

I think Vod is probably looking at the last time a user had logged to define if he/she is active/inactive.

I was thinking, if exists somewhere a list of public proxies which actually work, i could cut down the time quite significantly. I imagine the server can handle many request per second, but i would not want to make a mess  Smiley
sr. member
Activity: 616
Merit: 279
That i'm aware of i don't think there is any other way to get the data other than scraping the content from the web page.

The space is not an issue, the only problem in doing this is that will take several days to complete from 1 machine, if we consider a round number of 2M user and hitting the forum at 1 user/s it should take ~556 hours, we can round up at 23 days  Huh, unless i'm missing something
Yeah, it's going to take some time to get that. I'll work on something more efficient if I have the time and get the data for all the users.

[...]
Maybe you could group the user profiles in inactive, active and archived profiles like Vod did it with bpip.org.

Should reduce the scraping time a lot.
You would still need to get the data first to know if the user is active or inactive so, it's not that straight forward as it seems.
Where does Vod get active/inactive data about users?
sr. member
Activity: 588
Merit: 422
Thanks for sharing your database!

I thought about scraping BCT myself. Currently, I'm learning a bit about Python and web scraping.

~~~
The space is not an issue, the only problem in doing this is that will take several days to complete from 1 machine, if we consider a round number of 2M user and hitting the forum at 1 user/s it should take ~556 hours, we can round up at 23 days  Huh, unless i'm missing something

Maybe you could group the user profiles in inactive, active and archived profiles like Vod did it with bpip.org.

Should reduce the scraping time a lot.
hero member
Activity: 784
Merit: 1416
That i'm aware of i don't think there is any other way to get the data other than scraping the content from the web page.

The space is not an issue, the only problem in doing this is that will take several days to complete from 1 machine, if we consider a round number of 2M user and hitting the forum at 1 user/s it should take ~556 hours, we can round up at 23 days  Huh, unless i'm missing something
sr. member
Activity: 616
Merit: 279
Userdata contains all the users that either sent or received any merit and these are the column on the table:
[...]
Looking good! Smiley So, you've added all the data from the profile.
How are you getting this data? Are you crawling every profile or processing some SMF APIs that give you this data in a more friendlier format like xml or json? Feel free to send me a PM if you like.

Any chance the approach you're using to get all of this data for these users could be applied to get the data for all of the users on BitCoinTalk?
I know this may increase the DB size exponentially, but I think we could reduce the DB size by only including these fields:

Code:
UserId Integer PRIMARY KEY,
UserName TEXT,
Trust TEXT,
Merit Integer,
Posts Integer,
Activity Integer,
Rank TEXT,
DateRegistered Integer,
LastActive Integer

I think these are the most relevant fields for any user, that could help us in creating nice stats.

Maybe I could help out with this process, just give me a heads up if this can be done with your approach.
hero member
Activity: 784
Merit: 1416
Thanks a lot for providing the database with extended user data Smiley Will probably use it to create some elaborate queries  Wink

May I ask what extra information have you added to the UserData table? I see Trust & Activity in your queries, anything else?

Also, does the UserData table only contain users that have touched merit at some point (have sent/received merit)?


Userdata contains all the users that either sent or received any merit and these are the column on the table:

Code:
UserData(
UserId Integer PRIMARY KEY,
UserName TEXT,
Trust TEXT,
Merit Integer,
Posts Integer,
Activity Integer,
Rank TEXT,
DateRegistered Integer,
LastActive Integer,
ICQ TEXT,
AIM TEXT,
MSN TEXT,
YIM TEXT,
Email TEXT,
Website TEXT,
Bitcoinaddress TEXT,
Gender TEXT,
Age TEXT,
Location TEXT,
LastUpdatedData Integer
);
sr. member
Activity: 616
Merit: 279
Thanks a lot for providing the database with extended user data Smiley Will probably use it to create some elaborate queries  Wink

May I ask what extra information have you added to the UserData table? I see Trust & Activity in your queries, anything else?

Also, does the UserData table only contain users that have touched merit at some point (have sent/received merit)?
hero member
Activity: 784
Merit: 1416
Since there is interest in having all the data available on users in relation with the merit system to do all sort of analysis and investigations, i'm releasing in here a full version of the database used for Tool to run online SQL queries over Full Merit Data(which will continue to use a limited subset of user information in order to have a light solution and avoid the database to grow too much)

The data will be updated possibly every one or two weeks, since it's 29k users at the moment and bound to grow.

You can download the latest version of the full database (SQLite database), with few scripts that will help you to extract the data here: https://drive.google.com/open?id=1l0QQJl2NcAVwbXz49sSK6X5D1xmXxHlw

Here is a brief description of the necessary step to use the database:

Download the Precompiled Binaries for your favourite platform here (we just need the sqlite3):
https://www.sqlite.org/download.html
Quote
"A bundle of command-line tools for managing SQLite database files, including the command-line shell program, the sqldiff program, and the sqlite3_analyzer program."

If you want to know more in detail how to use Sqlite3 directly: https://www.sqlite.org/cli.html

Example for usage under Windows platform, from command line:

Put the database, scripts and the sqlite3.exe in the same folder and you have everything you need to start:

Query.bat file:
Code:
sqlite3 btctalk_full.db < CommandsQuery.txt

CommandsQuery.txt:
Code:
.mode csv
.separator "\t"
.output ResultQuery.csv
SELECT * FROM UserData WHERE Activity > 1000;
.output stdout
.quit

Now from command line call Query.bat and after a bit of crunching you will get the data in the ResultQuery.csv tab separated.

If you just need to update your existing data by adding the trust you can simply substitute the query in CommandsQuery.txt with:
Code:
SELECT Username,Trust FROM UserData;

Each user data row specify when the data was read the last time through the column LastUpdatedData

If you need help or some clarification let me know.
Jump to: