Author

Topic: Fast seaching approachs (Read 592 times)

full member
Activity: 399
Merit: 105
May 15, 2016, 11:26:30 PM
#3
Like everything else in life, you're going to have a million options.  I had a similar situation a few years ago (non-bitcoin related), results were in ms per query for about 15m records.  I'd do the reverse, i'd use addresses in the blockchain to search against what you have patterned. Here's one possibility...

Setup...
- Create a database to hold your bitcoin addresses and other data, tableX, columnY, columnZ, columnZa,etc
- Index Y, Z
- Use a language that allows the max size of an array by memory (c++, etc)
- Store all your millions of bitcoin addresses into columnY.
- Create an array (A) in memory to hold all your stored addresses (10m, etc).
- Read your database of bitcoin addresses, hash the addresses (something fast and small, like fnv1a64 to reduce of the number of bytes and still have a reasonable collision acceptance 2^64), if your're going to have more than 2^64 bitcoin pattern addresses, just use another small hashing algorithm.  Note that it doesn't have to be cryptographic hash)
- For each of your addresses, store the hash of the address into the array (the number of bytes should ideally be less than the length of the bitcoin address). Also store the hash of the address back into your database (if it already doesn't exist), columnZ.

Process the blockchain...
- Look in the database to see what block was processed last, or if first run, block 0
- For each block, read in all the addresses, hash them and store them into another memory array (B)
- Search through array (A) for values in (B)
- If you have a match, append the block number into columnZa (block0:block22:block88842:etc)
store the last block searched into the database


I'd use an OO language, 64 bit based libraries.
Your DB is going to be doing more reads than writes, so optimize accordingly.
You're going to hog up a lot of memory on the array where you have millions of records, so go for 128GB or higher
Don't use a virtual server, use a physical server where you can optimize your CPU configuration


That's a lot of really good information.  Thanks for that.  Smiley
legendary
Activity: 1512
Merit: 1057
SpacePirate.io
May 15, 2016, 08:21:45 PM
#2
Like everything else in life, you're going to have a million options.  I had a similar situation a few years ago (non-bitcoin related), results were in ms per query for about 15m records.  I'd do the reverse, i'd use addresses in the blockchain to search against what you have patterned. Here's one possibility...

Setup...
- Create a database to hold your bitcoin addresses and other data, tableX, columnY, columnZ, columnZa,etc
- Index Y, Z
- Use a language that allows the max size of an array by memory (c++, etc)
- Store all your millions of bitcoin addresses into columnY.
- Create an array (A) in memory to hold all your stored addresses (10m, etc).
- Read your database of bitcoin addresses, hash the addresses (something fast and small, like fnv1a64 to reduce of the number of bytes and still have a reasonable collision acceptance 2^64), if your're going to have more than 2^64 bitcoin pattern addresses, just use another small hashing algorithm.  Note that it doesn't have to be cryptographic hash)
- For each of your addresses, store the hash of the address into the array (the number of bytes should ideally be less than the length of the bitcoin address). Also store the hash of the address back into your database (if it already doesn't exist), columnZ.

Process the blockchain...
- Look in the database to see what block was processed last, or if first run, block 0
- For each block, read in all the addresses, hash them and store them into another memory array (B)
- Search through array (A) for values in (B)
- If you have a match, append the block number into columnZa (block0:block22:block88842:etc)
store the last block searched into the database


I'd use an OO language, 64 bit based libraries.
Your DB is going to be doing more reads than writes, so optimize accordingly.
You're going to hog up a lot of memory on the array where you have millions of records, so go for 128GB or higher
Don't use a virtual server, use a physical server where you can optimize your CPU configuration
full member
Activity: 399
Merit: 105
May 15, 2016, 05:04:05 AM
#1
Let's say I have a data field that looks like this: "1K15YmnotHd3TL1ZxL4p41MmxD1mAnt1ts"
I'd like to know whether that pattern was present in the last block added to the blockchain.  No problem.  

Now, assume I have 1 million similar datafields.  What is a fast way to determine whether any of them saw action in the last block?  

SQLServer stored procedure?  In memory dataset?  How would you do it to optimize speed?

Since each block could have up to 3000 transactions - the question is: "which of these 1 million exist in this group of 3000?".  I have to perform that every ten minutes.  And, the future suggests there will one day be 10 million.  I'd like my strategy to have a little room for the future.
Jump to: