
Topic: Open scraped data of all the users - SQL Lite DB - 2.481.270 users - page 2. (Read 946 times)

legendary
Activity: 2338
Merit: 10802
There are lies, damned lies and statistics. MTwain
Thanks for the raw files @Piggy.

I managed to load them onto SQL Server using MS Access as a bridge, since direct import was giving me a hassle. Once I merged all the data, the raw table has 2.563.660 records, of which 116.071 are the ones you mentioned as bearing no associated real record (16K real missing IDs on Bitcointalk, and the rest are 100K IDs greater than the largest UserId –> excess scraped IDs).

When I have some time I’ll cleanse the data and see if there’s anything interesting to derive.
hero member
Activity: 784
Merit: 1416
Quote
I love the raw data, I found myself an imposter!

Will you update this once in a while? How did you scrape this many profiles in just 5 days? Did you use 6 different IP-addresses? I can only download one page per second from Bitcointalk.

Yes, you need to use different IPs; with multiple requests per second you can cut down the time quite significantly. For this kind of job, just one call per second takes too long to finish.

It should not be a problem to re-run it once in a while.
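To see why the parallelism matters, here is a quick back-of-the-envelope estimate using the user count from this thread (the one-request-per-second rate is the figure discussed above; the helper function is just for illustration):

```python
# Rough scrape-time estimate for ~2.437.064 profile pages,
# assuming one request per second per IP, as discussed above.
TOTAL_PROFILES = 2_437_064

def days_to_scrape(workers: int, requests_per_second: float = 1.0) -> float:
    """Days needed when each worker sustains the given request rate."""
    seconds = TOTAL_PROFILES / (workers * requests_per_second)
    return seconds / 86_400  # seconds per day

print(round(days_to_scrape(1), 1))  # single IP  -> 28.2
print(round(days_to_scrape(6), 1))  # six IPs    -> 4.7
```

A single IP at one call per second needs about 28 days; six IPs in parallel bring it down to under five, which lines up with the 5-day scrape mentioned above.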
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I love the raw data, I found myself an imposter!

Will you update this once in a while? How did you scrape this many profiles in just 5 days? Did you use 6 different IP-addresses? I can only download one page per second from Bitcointalk.
hero member
Activity: 784
Merit: 1416
The dates are originally in UTC format. I just uploaded the raw files (raw.zip) where I saved all the data while scraping; I figure it’s going to be easier to import the data into SQL Server with these.

Remember to clean the table after you import the data: you will end up with some records that are NULL for all fields besides the user id, as the raw files contain empty lines (some user ids do not correspond to any user account).
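That cleanup step can be sketched with Python’s built-in sqlite3 module (the table and column names follow the UserData schema posted in this thread, pared down to three columns; deleting on a NULL UserName is an assumption — any always-populated field besides the id would do):

```python
import sqlite3

# In-memory demo: a pared-down UserData table with one real user
# and one empty row, as produced by the blank lines in the raw files.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE UserData (UserId INTEGER PRIMARY KEY, UserName TEXT, Posts INTEGER)")
con.execute("INSERT INTO UserData VALUES (450001, 'RubenDitte', 0)")
con.execute("INSERT INTO UserData (UserId) VALUES (999999)")  # empty scrape row

# Remove rows where every field besides the user id is NULL.
con.execute("DELETE FROM UserData WHERE UserName IS NULL")
con.commit()

print(con.execute("SELECT COUNT(*) FROM UserData").fetchone()[0])  # -> 1
```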

The raw data looks like this:
Code:
450001	RubenDitte	0: -0 / +0	0	0	0	Brand new	2015-02-18T16:33:21	2015-02-18T18:33:06					hidden	How Any Girl Can Develop into Abundant	88 Woodwark Crescent		N/A	Australia, Kirrama	2018-11-03T09:55:25.575278
400001 JetbuyingMoneypak 0: -0 / +0 0 2 2 Newbie 2014-12-03T13:43:16 2016-03-15T04:49:08 jetgogoing hidden Male N/A 2018-11-03T09:55:25.575182
250001 Noobi3 0: -0 / +0 0 1 1 Newbie 2014-02-14T14:47:17 2014-02-14T18:47:00 hidden N/A 2018-11-03T09:55:25.574999
100001 jukee 0: -0 / +0 0 0 0 Brand new 2013-04-14T23:59:11 2017-07-28T18:02:10 hidden N/A 2018-11-03T09:55:25.646278
350001 Kekinos 0: -0 / +0 0 0 0 Brand new 2014-06-30T14:02:44 2014-06-30T14:02:44 hidden N/A 2018-11-03T09:55:25.67539
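The raw files are tab-separated, with empty columns for hidden or unset fields. A hedged parsing sketch in Python (the field order follows the UserData schema posted in this thread, shortened to the first eight columns for illustration):

```python
# Field order taken from the UserData schema in this thread,
# shortened to the first eight columns for this example.
COLUMNS = ["UserId", "UserName", "Trust", "Merit", "Posts",
           "Activity", "Rank", "DateRegistered"]

def parse_raw_line(line: str) -> dict:
    """Split a raw scrape line on tabs; empty columns become None."""
    fields = [f if f != "" else None for f in line.rstrip("\n").split("\t")]
    return dict(zip(COLUMNS, fields))

# A shortened sample row in the same shape as the raw dump above.
sample = "450001\tRubenDitte\t0: -0 / +0\t0\t0\t0\tBrand new\t2015-02-18T16:33:21"
row = parse_raw_line(sample)
print(row["UserName"], row["Rank"])  # -> RubenDitte Brand new
```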
legendary
Activity: 2338
Merit: 10802
There are lies, damned lies and statistics. MTwain
<...>
Nice initiative Piggy. My environment is normally SQL Server, so I’ve tried to get the data loaded there. In order to do so, I have installed a browser that lets me see the data in the btctalk_full.db file in an MS SQL Server Studio manner. The DB contains 2 tables:

MeritData -> 112.345 records
UserData -> 2.437.064 records.

The only thing I can’t seem to figure out is why the dates on the UserData table show only as years (i.e. 2018, 2017, etc.), and not as full dates. The fields are defined as bigints, so I assumed they contained the Unix timestamp. Theoretically, I’m looking at the original database, and the installed environment lets me see the content, so it should not have changed the date fields.

Are they full Unix dates or just the year on the UserData table?
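One quick way to check: SQLite’s type affinity lets a column declared INTEGER store text that cannot be converted to a number, and typeof() reveals what is actually stored. A sketch against an in-memory stand-in (point it at btctalk_full.db to check the real file; the sample value mirrors the raw dump above):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # substitute "btctalk_full.db" for the real check
con.execute("CREATE TABLE UserData (UserId INTEGER PRIMARY KEY, DateRegistered INTEGER)")
# SQLite's type affinity: an ISO date string can't be converted to an
# integer, so it is stored as text despite the INTEGER declaration.
con.execute("INSERT INTO UserData VALUES (450001, '2015-02-18T16:33:21')")

stored_type, value = con.execute(
    "SELECT typeof(DateRegistered), DateRegistered FROM UserData"
).fetchone()
print(stored_type, value)  # -> text 2015-02-18T16:33:21
```

If the dates are stored as ISO text like this, a viewer coercing them to integers would show only the leading year (2015, 2018, etc.), which would explain the symptom described above.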


P.S. Great scraping speed there (I figure at least 5 processes running 24/7)
hero member
Activity: 784
Merit: 1416
December data is available here: https://drive.google.com/open?id=1mGEk6V3c_D-IhSYbuJvPrGGVEWLb0V8L

There are both the raw data and the SQLite DB. There are now 2.481.270 users in it, and about 44.206 new users since the last run.


Since from time to time people get interested in having all user data available for different purposes, but don’t know how to scrape it or scrape their own copy each time (which can take a really long time), I thought I would do it and share the data with everybody.
We have already had some great threads showing different kinds of stories and aspects of users over time; I’m sure somebody will come up with something new we haven’t seen so far (this also contains a fresh snapshot of the trust details of all users), or it can simply be useful for updating your own user data.

The data was taken between the 3rd and 8th of November; there are 2.437.064 users.

This is the table and the data available for each user in the database, which you can use to run your queries, as in the example below:
Code:
TABLE UserData(
UserId Integer PRIMARY KEY,
UserName TEXT,
Trust TEXT,
Merit Integer,
Posts Integer,
Activity Integer,
Rank TEXT,
DateRegistered Integer,
LastActive Integer,
ICQ TEXT,
AIM TEXT,
MSN TEXT,
YIM TEXT,
Email TEXT,
Website TEXT,
Bitcoinaddress TEXT,
Gender TEXT,
Age TEXT,
Location TEXT,
LastUpdatedData Integer
);
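Besides the sqlite3 shell workflow shown below, the same table can be queried from Python’s built-in sqlite3 module. A sketch against an in-memory copy of the schema (swap ":memory:" for btctalk_full.db to query the actual dump; the demo row is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use "btctalk_full.db" for the actual dump
con.execute("""CREATE TABLE UserData(
    UserId INTEGER PRIMARY KEY, UserName TEXT, Trust TEXT, Merit INTEGER,
    Posts INTEGER, Activity INTEGER, Rank TEXT, DateRegistered INTEGER,
    LastActive INTEGER, ICQ TEXT, AIM TEXT, MSN TEXT, YIM TEXT, Email TEXT,
    Website TEXT, Bitcoinaddress TEXT, Gender TEXT, Age TEXT, Location TEXT,
    LastUpdatedData INTEGER)""")
con.execute("INSERT INTO UserData (UserId, UserName, Activity) VALUES (1, 'demo', 250)")

rows = con.execute("SELECT UserName FROM UserData WHERE Activity > 100").fetchall()
print(rows)  # -> [('demo',)]
```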


You can download the data here, both db and raw (the db is about 366MB): https://drive.google.com/open?id=1l0xz1OC4mc3FvXzX18scSbSTzsnUkD4f


Quote
Here is a brief description of the necessary steps to use the database:

Download the Precompiled Binaries for your favourite platform here (we just need the sqlite3):
https://www.sqlite.org/download.html
Quote
"A bundle of command-line tools for managing SQLite database files, including the command-line shell program, the sqldiff program, and the sqlite3_analyzer program."

If you want to know more in detail how to use Sqlite3 directly: https://www.sqlite.org/cli.html

Example usage under Windows, from the command line:

Put the database, scripts and the sqlite3.exe in the same folder and you have everything you need to start:

Query.bat file:
Code:
sqlite3 btctalk_full.db < CommandsQuery.txt

CommandsQuery.txt:
Code:
.mode csv
.separator "\t"
.output ResultQuery.csv
SELECT * FROM UserData WHERE Activity > 100 AND LastActive <= '2015-01-01';
.output stdout
.quit

Now, from the command line, call Query.bat; after a bit of crunching you will get the data in ResultQuery.csv, tab-separated.
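If you’d rather post-process ResultQuery.csv programmatically, Python’s csv module handles the tab separator directly (the filename matches the .output line above; the two-row in-memory sample here is just a stand-in for the real file):

```python
import csv
import io

# Stand-in for open("ResultQuery.csv") -- same tab-separated shape.
sample = "450001\tRubenDitte\tBrand new\n400001\tJetbuyingMoneypak\tNewbie\n"
reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
print(len(rows), rows[0][1])  # -> 2 RubenDitte
```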

