Merkle tree of open transactions for lite mode?

BrightAnarchist

donator

Activity: 853

Merit: 1000

watching

TierNolan

legendary

Activity: 1232

Merit: 1094

Quote from: Mike Hearn on June 26, 2011, 04:12:39 PM

No, transactions and their state are loaded from disk. Bitcoin does not keep transactions in memory unless they are pending. It keeps the block headers in RAM but the rest is all on disk.

By "all transactions", I meant a hash for each of the transactions and there would be a hash table that maps the transaction hash to a block-id.

Now that I think about it, you could just place the transaction in a file and the filename would be determined by the hash.

My thinking was that transactions that has been incorporated into the chain would be stored with the corresponding blocks.

This does make caching more difficulty, since 2 transactions that are next to each other aren't related to each other.

Quote

Ideally, they should have added an additional field into the block representation to say which block all transaction inputs came from.

They'd all have to be recalculated after a re-org.
[/quote]

It's not necessary anyway. Sorry about that

.

When verifying the block, the client will have all the transactions involved loaded for that block.

The path from the root to the transaction could be stored in the same file (or a file with the same name in a different directory). This would indicate the hash for the other hash of the pair.

To replace a transaction with another one could be calculated from just that data. Removing is also easy if placeholder hashes are allowed. To remove a transaction just replace the hash with 0 and recalculate.

The location of the placeholder could be added to a stack. That can be efficiently stored in memory, with a cache to disk if required.

Adding a transaction would require popping a location from the stack. If the stack is empty, the new transaction would be placed in the next available slot. This would require one other transaction to have its info updated.

I think that keeps everything completely local.

Mike Hearn

legendary

Activity: 1526

Merit: 1134

Quote from: TierNolan on June 26, 2011, 08:32:13 AM

A network node already has to hold all unspent transactions in memory anyway. Otherwise, it won't be able to perform the check to see if a block is valid.

No, transactions and their state are loaded from disk. Bitcoin does not keep transactions in memory unless they are pending. It keeps the block headers in RAM but the rest is all on disk.

Quote

Ideally, they should have added an additional field into the block representation to say which block all transaction inputs came from.

They'd all have to be recalculated after a re-org.

TierNolan

legendary

Activity: 1232

Merit: 1094

I had a look at the default client and it seems to operate as follows

Code:

Start:

Scan for magic value (0xD9B4BEF9)

Scan header, goto start on error

if size > max allowed, goto Start

if size > read buffer, rewind to start of magic value and goto Start

if version > 209, read and then check checksum, if it fails goto Start

copy payload to message buffer

process message()

refresh last alive for this node

The message processing function doesn't actually do anything for unknown messages.

TierNolan

legendary

Activity: 1232

Merit: 1094

Quote from: Mike Hearn on June 26, 2011, 07:37:28 AM

Around the end of May there were about 150,000 transactions with spendable outputs. Let's assume each transaction has, on average, a single unspent output. So for a binary merkle tree you'd need log2(150,000) = 17.19 or 17 levels. A perfect binary tree with 17 levels would have 2^(17+1) - 1 = 262,143 nodes and each node is a hash of 256 bits or 32 bytes, thus it would require

(262,143 * 32) / 1024 / 1024 = 7.99 megabytes

of data to be live in memory (all intermediate levels must be stored so it can be updated). Not too bad.

A network node already has to hold all unspent transactions in memory anyway. Otherwise, it won't be able to perform the check to see if a block is valid.

Ideally, they should have added an additional field into the block representation to say which block all transaction inputs came from. This would allow network nodes to have a cache and quickly seek from disk for rare transactions.

A transaction shouldn't be added to the chain unless all of its inputs exists somewhere one the chain, so this doesn't create any uncertainties.

What happens if you send unknown messages to the default client?

Since the standard header has a payload length field, it should be possible to skip unknown messages.

Code:

Additional Message header:
F9 BE B4 D9 - Main network magic bytes
65 78 74 65 6e 73 69 6f 6e 00 00 00 - "extension" command
55 00 00 00 - Payload length
01 23 34 56 - Checksum
Message:

You could prefix a message with extension information. The standard client would skip the prefix and process the main message using the standard rules.

This could be used to inform other clients where the transaction inputs are stored for any transactions sent.

Mike Hearn

legendary

Activity: 1526

Merit: 1134

Ah sorry, then I misunderstood your proposal and my points don't apply. Let's standardise on the word "unspent" rather than open, as the codebase uses open to mean something different: a pending transaction that hasn't yet been finalized (see SignatureHash). That's what confused me.

This probably makes more sense for NameCoin where the rules can be more easily changed. Implementing this in Bitcoin would be quite hard now (possible but a big pain to do forced global upgrades).

Merkle trees don't have to be binary, though they often are. It'd be interesting to investigate different tree parameters to see what is the most efficient.

Around the end of May there were about 150,000 transactions with spendable outputs. Let's assume each transaction has, on average, a single unspent output. So for a binary merkle tree you'd need log2(150,000) = 17.19 or 17 levels. A perfect binary tree with 17 levels would have 2^(17+1) - 1 = 262,143 nodes and each node is a hash of 256 bits or 32 bytes, thus it would require

(262,143 * 32) / 1024 / 1024 = 7.99 megabytes

of data to be live in memory (all intermediate levels must be stored so it can be updated). Not too bad. This is using the naive/obvious approach. There is a more efficient implementation described by Szydlo here:

http://www.szydlo.com/szydlo-loglog.pdf

Of course you need a representation that can support multiple unspent outputs being in the tree so the true size would be larger.

TierNolan

legendary

Activity: 1232

Merit: 1094

Quote from: gmaxwell on June 24, 2011, 12:16:39 PM

Hm? I'm not asking for proof related to unconfirmed transactions, I'm asking for proof related to previously mined transactions with outputs which have not been redeemed as of the completion of a particular block. This should be a deterministic product of the block-chain and identically available to every full node.

This would mean that miners would have to refuse extend a valid main chain. Ofc, if you get >50% of the miners to agree, the other miners would have to switch in order to actually get their blocks accepted.

Assuming a deterministic algorithm for adding transactions to the merkle tree, there would be a compromise between keeping it balanced and having to re-do any hashes.

The easiest rule would just be to delete all input transaction in the current block and then add all outputs in the current block (in order) to the node that is nearest the root and then update any broken hashes.

I think it would be N*log(n).

N = number of changes
n = total number of transactions

Proving that a particular transaction was still active for the block would require that you show log(n) hashes and then it can be compared with the top-level hash.

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: bji on June 24, 2011, 02:51:54 PM

Yes, me too; which is one of the reasons that I am trying to write a client that can do this.
But my point was that any end user should still rely on several such service providers, not just one; because it would be foolish to trust just one service provider when you can increase your confidence in the results by consulting several.

Absolutely, but this doesn't eliminate the value in having the system support higher confidence absent trusted parties. They'll still be business there in a system where clients can cheaply do more validation alone, but it will just be business for the value you can offer over and above the fully distributed system (faster, immune to some attacks, SLAs, etc). The suggestion I'm making is still vulnerable to DOS attack, and feeding (mildly) stale data, and it requires more data (the tree fragment).

E.g. a system where you ask trusted parties for a signed statement about the transaction and then after getting it randomly also query for the tree to _prove_ it with reasonable confidence, then automatically announce loudly to the network when you detect lying, along with the cryptographic proof of the the inaccurate response— is a much more robust and secure system than "manually add a bunch of trusted parties; only find some were compromised and screwing you weeks later"... and also more costly since there would be no low cost alternative to trusted services (only run your own full node).

bji

member

Activity: 112

Merit: 10

Quote from: kjj on June 24, 2011, 01:02:27 PM

Quote from: bji on June 24, 2011, 11:42:23 AM

I don't quite see what the problem is. If you don't trust a peer to deliver accurate data, just make sure that you ask lots of peers and that you exclude the outliers. It's already true that the client has to trust that it will get a complete, accurate, and up-to-date complete block chain from its peers. If it's getting pre-pruned blocks from some other peer, then it should likewise not rely on just one peer for this information but get it from many. And if one of those peers reports knowing about the block chain to a certain height but doesn't give the right answer given that height, then the peer can stop trusting it and just drop its connection, leaning on peers that it does trust.

There will also be commercial services in the future to provide trustworthy information. I might even do it myself if this ever catches on.

Yes, me too; which is one of the reasons that I am trying to write a client that can do this.

But my point was that any end user should still rely on several such service providers, not just one; because it would be foolish to trust just one service provider when you can increase your confidence in the results by consulting several.

kjj

legendary

Activity: 1302

Merit: 1026

Quote from: bji on June 24, 2011, 11:42:23 AM

I don't quite see what the problem is. If you don't trust a peer to deliver accurate data, just make sure that you ask lots of peers and that you exclude the outliers. It's already true that the client has to trust that it will get a complete, accurate, and up-to-date complete block chain from its peers. If it's getting pre-pruned blocks from some other peer, then it should likewise not rely on just one peer for this information but get it from many. And if one of those peers reports knowing about the block chain to a certain height but doesn't give the right answer given that height, then the peer can stop trusting it and just drop its connection, leaning on peers that it does trust.

There will also be commercial services in the future to provide trustworthy information. I might even do it myself if this ever catches on.

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: Mike Hearn on June 24, 2011, 11:31:13 AM

The memory pools of each node are not synchronized, so you can't require the coinbase to include proof of containment.

If you can't trust the remote nodes at all, and you need to know the state of every transaction, you have no choice but to run a full node and do the bookkeeping yourself. For Namecoin this is likely not a big problem. You can just download a signed snapshot of the database from the same place you downloaded the software.

Hm? I'm not asking for proof related to unconfirmed transactions, I'm asking for proof related to previously mined transactions with outputs which have not been redeemed as of the completion of a particular block. This should be a deterministic product of the block-chain and identically available to every full node.

I think your solution for namecoin is a non-solution because it makes it not a distributed system. You have to trust that the data has not been manipulated in some impossible to validate way. The is also practically true of the software as whole, but only because everyone isn't actually going to review the whole thing, not because they couldn't review it. With the tree solution I described you could randomly spot-check such summaries automatically, making them much less risky.

For namecoin, It's also not a solution because knowing the complete state of the system going to be too much work for most namecoin users and it's also pointless.

I want to know for sure what foo.bit resolves to. I don't have the full history (because I don't need it and can't afford the enormous required storage) so I can't tell on my own. I ask my peers about foo.bit.

They tell me that foo.bit is 1.2.3.4 and that the last update to it was mined in block 100,000. We're currently at block 120,000 (or so say my peers). With the current pruned tree stuff I can tell for sure that it really was mined in block 100,000, but I _can't_ tell if it was updated after that since my peers could trivially by sibyl attackers. (If they lied about the chain height and just claimed we were at 100,000 I'd know because I'd notice if my clock was six months off)

If there is a commitment to the open transactions I can ask single peer for the merkle tree fragments from block 120000-199994 for the relevant open transaction (hopefully most of the tree would be the same in those blocks, reducing the data sent) and then be sure that either I got the correct result, that I got a slightly stale result (e.g. it was updated within whatever amount of lying about the height I'd fail to detect), or that my attacker has a serious amount of computing power being employed to give me a stale result about a name.

Quote from: bji on June 24, 2011, 11:42:23 AM

I don't quite see what the problem is. If you don't trust a peer to deliver accurate data, just make sure that you ask lots of peers and that you exclude the outliers. It's already true that the client has to trust that it will get a complete, accurate, and up-to-date complete block chain from its peers.

If you have the block headers (which are quite small), then your trust is limited to that your peers aren't lying to you about the block height and aren't telling you only about some fork. But you can trust that such a fork would not be very long due to the extreme computational cost, and that the block height could not be too outrageously wrong because you know the current time.

The fact that peers can DOS attack you by denying you information can't be avoided but it's also detectable and once detected you can try to seek more peers or at least refuse to do anything stupid.

Quote

I 100% believe that eventually end users will not be downloading the whole block chain as they do now and will have to utilize service providers that can answer queries about the block chain (themselves keeping the entire block chain); I already wrote up a proposal for this on this forum. I'm even working on my own bitcoin client that will have support for this. And I believe that the only way for it to work is for the protocol to be 'open' so that anybody can implement it, because for the reasons that the O.P. brings up and that I point out in this response, a peer just cannot trust a single other peer to provide accurate information and has to have the ability to discover and talk to many peers for the safety and security of redundancy.

Asking multiple peers is a good and fine thing, but it has _serious_ limitations. Attackers can often control your network connectivity, so they can simply make themselves all your peers an you have no way to detect this. Even without network control a sibyl attack can be quite effective: by running thousands of fake peers on a few hundred networks (e.g. just by having simple proxies on those networks) you will very likely end up as all the peers of quite a few nodes. Querying many also results in N-fold increases in network load.

If a client can use the blockchain proof of work to prove most of their required data, then it only needs to consult multiple peers to obtain confidence about the true identity of the highest block, and attacks performed against this clients are limited to attacks about the content of very recent blocks.

bji

member

Activity: 112

Merit: 10

Quote from: Mike Hearn on June 24, 2011, 11:31:13 AM

The memory pools of each node are not synchronized, so you can't require the coinbase to include proof of containment.

If you can't trust the remote nodes at all, and you need to know the state of every transaction, you have no choice but to run a full node and do the bookkeeping yourself. For Namecoin this is likely not a big problem. You can just download a signed snapshot of the database from the same place you downloaded the software.

I don't quite see what the problem is. If you don't trust a peer to deliver accurate data, just make sure that you ask lots of peers and that you exclude the outliers. It's already true that the client has to trust that it will get a complete, accurate, and up-to-date complete block chain from its peers. If it's getting pre-pruned blocks from some other peer, then it should likewise not rely on just one peer for this information but get it from many. And if one of those peers reports knowing about the block chain to a certain height but doesn't give the right answer given that height, then the peer can stop trusting it and just drop its connection, leaning on peers that it does trust.

Or maybe I am misundertanding the problem?

I 100% believe that eventually end users will not be downloading the whole block chain as they do now and will have to utilize service providers that can answer queries about the block chain (themselves keeping the entire block chain); I already wrote up a proposal for this on this forum. I'm even working on my own bitcoin client that will have support for this. And I believe that the only way for it to work is for the protocol to be 'open' so that anybody can implement it, because for the reasons that the O.P. brings up and that I point out in this response, a peer just cannot trust a single other peer to provide accurate information and has to have the ability to discover and talk to many peers for the safety and security of redundancy.

Mike Hearn

legendary

Activity: 1526

Merit: 1134

The memory pools of each node are not synchronized, so you can't require the coinbase to include proof of containment.

If you can't trust the remote nodes at all, and you need to know the state of every transaction, you have no choice but to run a full node and do the bookkeeping yourself. For Namecoin this is likely not a big problem. You can just download a signed snapshot of the database from the same place you downloaded the software.

gmaxwell

staff

Activity: 4326

Merit: 8951

As I understand it, in lite mode you can validate that a transaction is in a particular block, but you can't validate that it hasn't been redeemed by a subsequent block.

This has a couple nontrivial consequences. For example, a lite mode client can't tell if a transaction its been told about is even _possibly_ valid or if it is completely jibberish until someone mines it.

This also means that you can't make a zero trust lite mode namecoin resolver because the only way to be sure that a later transaction didn't transfer the ownership of a name to someone else is to have reviewed the entire blockchain and made your own summary of the open transactions. You can't ask a peer to do this for you unless you trust them not to lie. I think this is not just a namecoin issue, I think it also impacts various distributed contract proposals that require you to be able to see that a transaction is mined but not yet redeemed.

What if the coinbase TXN included the merkle root for a tree over all open transactions, and this was required by the network to be accurate if it is provided.

Then if you wanted to validate that a transaction was really open, you would obtain the block headers, then obtain the N most recent blocks, then ask an untrusted peer for the pruned open-transaction merkle trees leading to that the transaction in question. You are now sufficiently confident that the open transaction is still open as of the height of the blockchain you have access to.

Since full nodes must already keep track of all open transactions for validation, maintaining such a tree would be relatively cheap (log2(n) for single updates, fewer per update if more than one txn changes at a time).

Beyond the namecoin/funky distributed contract case/ this would also permit semi-lite clients which track only the headers and open transactions and are allowed to forget about very old open transactions because they can be reminded of them by untrusted peers but can still do full validation. This might be helpful for network health in the future (e.g. keep track of only recent, likely to be quickly redeemed transactions, and give them forwarding priority. If you can't validate a txn give it low priority, if its children show up again and keep bothering you, then ask a full peer to prove to you that they are are valid and then add them to your cache)

So, can someone please correct my understanding of the system if I'm wrong?

Topic: Merkle tree of open transactions for lite mode? (Read 9894 times)