Author

Topic: on indexing random data (Read 975 times)

kjj
legendary
Activity: 1302
Merit: 1026
September 07, 2012, 04:51:53 PM
#4
Why not reverse the number 1st before indexing. This way you move all the zeros to the other end and the tree will balance perfectly.

Holy shit!  That is a damn good idea.

Ooh, how about this though?  Just take the bits in the proper order from the far end.

That way, the tree lookup is just
Code:
nextChunkID = ((needle >> (depth *9)) & 0x1F) *4;
sr. member
Activity: 438
Merit: 291
September 07, 2012, 03:59:53 PM
#3
Why not reverse the number 1st before indexing. This way you move all the zeros to the other end and the tree will balance perfectly.
kjj
legendary
Activity: 1302
Merit: 1026
September 05, 2012, 07:48:55 AM
#2
I realized last night that parts of this were off by 8.  Some of the numbers I put in were for 512 byte blocks instead of 4k.  It didn't help that 512 bytes is 4096 bits.  In particular, a tree chunk has room for 512 pointers, not 64.  They can cover 9 bits of index space instead of 6.  List chunks still have about 64 entries, it depends a bit on how much metadata you want to attach to them and if you care about alignment.

Also, when a list chunk gets full and converted into a tree chunk, it isn't necessary to create all 512 new list chunks.  You can use zero as a special value to indicate an empty list rather than creating an empty chunk.

If used to index the block chain (which will have an increasing number of initial zero bits over time), the root chunk can store pointers to initial zero trees, letting the system skip quickly to where it needs to start.  Pointers to zero bit depths of 18, 27, 36, 45, 54 and 63 should keep us happy for years to come in exchange for less than 2% of the dead space in the root chunk.
kjj
legendary
Activity: 1302
Merit: 1026
September 05, 2012, 12:16:00 AM
#1
In IRC, we were talking about new index formats for the block chain.  Since most of the things we want to index by are random, this gets hairy.

I'm thinking of a tree of lists structure.  It should have pretty good performance even in the worst case, and be pretty safe under IO failures.

In practice, this is a collection of 4k chunks.  Each chunk starts with a header of some sort, probably very basic.  At a minimum, each chunk needs to indicate what type of chunk it is, and a chunk ID will make it very easy to keep a journal (see below).  4k was chosen because it is the native block size of modern hard drives.  It is also the native page size for virtual memory in lots of CPUs.  Memory mapped file IO should be FAST, not counting syncs.

The first chunk will be a tree chunk.  After the header, it has 64 pointers, each one 32 bits long.  These pointers are to chunk numbers inside the index file.  The first pointer is for the leftmost branch of a binary tree, corresponding to 000000, the second is 000001, the third is 000010, etc.  Lookups are very fast, simply isolate the 6 bits that are important, multiply by 4 and add the header size.  This could also be a different chunk type, a "root chunk" with extra metadata, but would mostly act like a tree chunk.

If a pointer is to another tree chunk (identified by the header) that subtree covers the next 6 bits, etc, etc.  In the block chain example, this means that there will be 5 mostly empty tree chunks, and the tree will be somewhat unbalanced.  The waste shouldn't be too bad, and a special case could skip through them.  If using this system for keys (or really anything but block header hashes) it will be more balanced.

If the pointer is to a list chunk, that chunk has a list of not more than 62 or 63 pointers (it depends on the size of the header) into the full block structure, each pointer is 64 bits long, just a byte offset.  Lists aren't necessarily sorted, but by the time you get there, it is just a matter of a few dozen comparisons.

Any time an insert would cause a list chunk to hit a high water mark, it triggers a reorg.  64 new list chunks and a new tree chunk are written and items from the old list chunk are distributed among the new lists.  Once those are done and synced, the parent tree chunk would be overwritten with a pointer to the new tree chunk.  Then the old chunk is marked available.  Since chunks are all the same size, the structure won't fragment.  Worst case is a crash just before updating the old chunk, which would leave 65 orphans.  Garbage collection should be trivial though, and the free chunk list easy to maintain.

For extra safety, writes could be done with a journal chunk or two, for example, reserve chunk #2 as the permanent journal, and any time you need to write, write there first, and then write the real chunk.  For performance, you'd probably want dozens or hundreds of journal chunks.  If they have IDs, any time you open the file, you can just replay anything you find there.  Zero the whole journal when you are done with it so that future writes can just start at the beginning of the journal and go in (write) order.

This system could be extended with a recent index buffer in addition to the main index.  This would be a third chunk type, but basically just a list chunk.  The difference would be a pointer back to the previous buffer chunk.  The buffer would be checked first, and the size of the buffer could be dynamic.  If your node mostly looks up recent blocks, most lookups will be buffer hits, so keep more buffer chunks.  If you get mostly buffer misses, you are wasting your time checking it, so keep fewer buffer chunks.

When the system is IO bound, the buffer could even have other uses.  For example, when used for the blockchain, during initial downloads, the buffer chain could be allowed to grow to unusual size, and kept mostly in memory, with completed chunks written only when full.  In a crash, only the most recent few dozen indexes would be lost.  When the node was caught up, it could then replay the buffers starting from the oldest chunk, and only freeing them after their contents were fully committed to the main structure.  This might cause a hole in the structure, but only once, and defragmentation would be easy.
Jump to: