Pages:
Author

Topic: PIR / Private Information Retrieval - Privacy for light wallets? - page 3. (Read 833 times)

hero member
Activity: 882
Merit: 5834
not your keys, not your coins!
Deterministic light wallet software is going to ask for information about addresses that have never been used, but will be used in the future. You can ask about every address that has ever received a transaction to hide your previously used addresses, but if you ask about an address that has never been used, and that address subsequently receives a transaction, chances are that address belongs to you. The only way to avoid asking for information about addresses that have never been used but will be used in the future is to ask for information about every address ever used, but if you do this, the server will essentially return the entire blockchain.
That's a very good point. That's why to perform traditional multi-server PIR, you would need to query for on average half of all possible addresses per server and get an XOR of the balances from all the servers so from each server you just receive a few bytes. This XOR makes the difference between downloading just a few bytes and downloading the entire blockchain. However, I'm not sure there exists a way for Bitcoin Core and Electrum to quickly get millions of address balances at once. Probably, the right indexes for this don't exist yet. However, other PIR schemes could be used which require less lookups, I think.

This isn't hard fundamentally, -- the rocket science programming has already been done: http://percy.sourceforge.net/
I know, right, that's why it would be so cool to see it 'in the field'! Cheesy
I'll definitely look into the schemes that were used in that implementation; maybe they address some of the issues in the first proposed, relatively simple scheme.

But the problem is that the computational cost the server is just astronomical.
Riight, I see. It's really a pity, because in cryptography we work in the 'ideal world' where XOR is free, databases are fast, etc... but then the challenge is really to implement it with good efficiency.

Unfortunately, already pretty much the only incentive for people to run electrum servers is to spy on the clients, and this eliminates that "benefit". Smiley
So you mean, if Electrum servers can't spy anymore due to such mechanisms, they lose the whole incentive to run such a server?
staff
Activity: 4284
Merit: 8808
This isn't hard fundamentally, -- the rocket science programming has already been done: http://percy.sourceforge.net/

Just make a database of transactions with spv proofs.  And also make another databases of addresses to tx-database-indexes.  Maybe staple a cleartext index of the address database to it to reduce the number of bisections needed.  Feed that to percy++.

Bitcoin-core has rpcs that will import txn with their proofs into the wallet-- created partially with this usage in mind!

But the problem is that the computational cost the server is just astronomical.  Under most reasonable cost metrics it's "cheaper" to just send the client the whole database.  This only becomes reasonable if you're assuming extremely expensive wireless data rates or what not, and then you still have the problem of who is going to pay for the server.

Unfortunately, already pretty much the only incentive for people to run electrum servers is to spy on the clients, and this eliminates that "benefit". Smiley
legendary
Activity: 2870
Merit: 7490
Crypto Swap Exchange
Electrum's use of bloom filters does not even address the issue without having an error rate that results in massive amounts of data being sent to the light node.

1. Electrum doesn't use bloom filter, it simply send list of address to Electrum server.
2. Bloom Filter (BIP 37) is broken and now succeed by BIP 157/158.
copper member
Activity: 2996
Merit: 2374
In this scheme, the privacy is guaranteed by the fact that the client asks various servers for on average half of the DB entries for each server. Since the client queries multiple servers, on some servers this set of entries does include the item of interest, while on others it does not. The distribution is again 50:50 so no server gains any information (not even a 50% probability reduction when guessing) about which item is actually queried.

Deterministic light wallet software is going to ask for information about addresses that have never been used, but will be used in the future. You can ask about every address that has ever received a transaction to hide your previously used addresses, but if you ask about an address that has never been used, and that address subsequently receives a transaction, chances are that address belongs to you. The only way to avoid asking for information about addresses that have never been used but will be used in the future is to ask for information about every address ever used, but if you do this, the server will essentially return the entire blockchain.

The number of addresses that have never been used is simply too large to hide addresses that have never received any transactions. There could be 100,000 light nodes each make queries to full nodes that take a year to transmit fully, and if proper RNG is used, it would be possible to deduct which addresses belong to which light node that receive transactions after the query is made.

Electrum's use of bloom filters does not even address the issue without having an error rate that results in massive amounts of data being sent to the light node. It is incorrect to say that electrum uses bloom filters.
legendary
Activity: 3472
Merit: 10611
IMO downloading a bunch of random blocks and discarding them after the addresses were extracted is a lot cheaper.
Surely the problem with this is if I send a query to a server asking for the balance of 1000 addresses last used in block X, 1000 addresses last used in block Y, and one address last used in block Z, then it is entirely obvious which address is mine.

Would a better option just be to connect to 20 different Electrum servers and ask each server to send you 1000 random addresses with a balance? (Change thr numbers as you see fit.)
We could connect to 20 different Electrum nodes too.

But to improve my original idea you could download 3000 blocks and not waste that much traffic either.
The SPV client sends the full node a SendCmpct message.
Then by having the headers already sends a GetData message for certain blocks.
The full node will return a compact block with header and short IDs (much smaller than sending the block itself).
The SPV client can now ask for any number of [full] transactions from that block by sending a GetBlockTxn message. (could be 1 tx per block).

The only issue with this (apart from putting pressure on the node to look up these blocks, constructing compact message and sending) is that I believe bitcoin core implementation of BIP-152 doesn't return old blocks (depth=5) as compact.
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
Well, looking up balances of any address is possible in core if the balance is non-zero, since that's the UTX0 set.

But indeed there's no address index in bitcoin core, if you want to find the transaction history - they have a transaction index, but oddly won't do an address index. Meh.
legendary
Activity: 2268
Merit: 18711
It's not practical option because Electrum server have request limit to prevent DDoS attack.
Well sure, but to implement any of this suggestion will require lots of new functionality to be added to Electrum servers, so there is no reason the DDoS protection couldn't be addressed too. Given that some people run Electrum wallets with >1000 addresses, asking a server for a one off batch of 1000 random addresses doesn't seem too onerous, especially since you wouldn't need a new batch of addresses everybtime you opened the wallet.
legendary
Activity: 2268
Merit: 18711
IMO downloading a bunch of random blocks and discarding them after the addresses were extracted is a lot cheaper.
Surely the problem with this is if I send a query to a server asking for the balance of 1000 addresses last used in block X, 1000 addresses last used in block Y, and one address last used in block Z, then it is entirely obvious which address is mine.

Would a better option just be to connect to 20 different Electrum servers and ask each server to send you 1000 random addresses with a balance? (Change thr numbers as you see fit.)
legendary
Activity: 3472
Merit: 10611
An idea... The UTXO set is only around 4-5GB. I would easily allow a mobile wallet to use 5GB out of my... 256? 512? GB of storage, which I don't use anyway, if that's required to get perfect privacy.
The main point of using a SPV client is to reduce the amount of data downloaded and stored. I don't think users would go for something like that which they also have to update every block. There is also the additional issue of trust which can be mitigated to some extent if multiple nodes were used but it still is a big issue.

IMO downloading a bunch of random blocks and discarding them after the addresses were extracted is a lot cheaper.

Quote
I'm not sure; can't it do a single DB request which asks for e.g. 1 million items instead of doing 1M requests at one item each? It's a while that I worked with large databases, but I'm pretty sure such 'aggregated' queries are pretty fast. DMBS could actually even allow to query an XOR of the specified items directly, which should be even more performant.
Someone else should comment on this or it has to be tested, my knowledge of databases is very limited.
hero member
Activity: 882
Merit: 5834
not your keys, not your coins!
For example: user wants to query the balance of their address 'B'. (this could be transaction information or any other blockchain information as well)
As @ymgve2 the biggest problem with this idea is that a SPV client doesn't have the blockchain so it doesn't know which addresses were used already so it can not produce addresses A and C and if they are produced randomly the whole process becomes pointless.
Well, in standard 'linear summation PIR' though, the clients also don't have the database. So they also don't yet know about which item has which ID. The theoretical examples use indexes of datasets as a simplification; of course when implementing, it has to be possible to query for addresses (in Bitcoin scenario) or generic SQL queries (in database scenario) under PIR.
Maybe however, it requires a more sophisticated PIR scheme, that's possible as well... Smiley

An idea... The UTXO set is only around 4-5GB. I would easily allow a mobile wallet to use 5GB out of my... 256? 512? GB of storage, which I don't use anyway, if that's required to get perfect privacy.

It's not enough to ask 20 addresses, that's the issue.
The idea is that you could ask for thousands or millions of addresses per server. XOR is insanely performant on most databases.
You are forgetting that the SPV client doesn't just need the balance of its address, it needs the history. Meaning if you query 1 million used addresses the Electrum node has to send back at least a million transactions. Assuming 200 bytes on average, that is at least 200 MB data which is ~4 times bigger that the entire block headers file (some addresses have more transactions and some transactions are much bigger).
Yes, it's true; I have to think about this.

This is also a big burden on the Electrum node as it would have to look up inside its database a million times per request.
I'm not sure; can't it do a single DB request which asks for e.g. 1 million items instead of doing 1M requests at one item each? It's a while that I worked with large databases, but I'm pretty sure such 'aggregated' queries are pretty fast. DMBS could actually even allow to query an XOR of the specified items directly, which should be even more performant.

Was something like this maybe attempted already?
BIP 157 & 158 have similar goal, but it works differently. One of wallet which use it is Wasabi Wallet which also use Tor by default.
That sounds very interesting; I'm happy about any other kind of approach as well! As long as SPV clients don't need to continue sending cleartext queries to servers forever... Cheesy
legendary
Activity: 2870
Merit: 7490
Crypto Swap Exchange
Was something like this maybe attempted already?

BIP 157 & 158 have similar goal, but it works differently. One of wallet which use it is Wasabi Wallet which also use Tor by default.
legendary
Activity: 3472
Merit: 10611
For example: user wants to query the balance of their address 'B'. (this could be transaction information or any other blockchain information as well)
As @ymgve2 the biggest problem with this idea is that a SPV client doesn't have the blockchain so it doesn't know which addresses were used already so it can not produce addresses A and C and if they are produced randomly the whole process becomes pointless.
One solution could be to download one or more random blocks from (regular) full nodes by sending a getdata message (SPV client already has the headers) then use a bunch of random addresses from there.

It's not enough to ask 20 addresses, that's the issue.
The idea is that you could ask for thousands or millions of addresses per server. XOR is insanely performant on most databases.
You are forgetting that the SPV client doesn't just need the balance of its address, it needs the history. Meaning if you query 1 million used addresses the Electrum node has to send back at least a million transactions. Assuming 200 bytes on average, that is at least 200 MB data which is ~4 times bigger that the entire block headers file (some addresses have more transactions and some transactions are much bigger).

This is also a big burden on the Electrum node as it would have to look up inside its database a million times per request.
hero member
Activity: 882
Merit: 5834
not your keys, not your coins!
How does a brand new client with no information about the chain know which "dummy" addresses to query for? It can't ask for the content of randomly generated addresses, because then it would be pretty obvious which one is the real one since only one of the queried addresses were ever seen on the blockchain.
I am not sure yet on the implementation; I was asking mostly about the idea / concept and I'm happy to hear any suggestions! In the theoretical / classical PIR examples, the queries are just bit vectors. All the indexes with value '1', are elements to be queried. Of course, Bitcoin isn't a database like this, where every utxo has a specific database index. However, in classical databases we also don't just query with the item ID either, but instead queries consist of various properties that need to match, and PIR does work on such databases. So I'm sure it can be implemented in Bitcoin.

Also the performance degradation from querying the balance of 20x or 50000x or whatever addresses for each real one would make this useless in practice.
Well, on regular databases at least, querying one or multiple datasets doesn't incur much overhead over querying just one item; and database XOR is also quite fast. If it wasn't, the whole research field of PIR wouldn't exist, after all. I wouldn't say it's useless without even having tried it. Especially multi-server PIR would be quite easy to setup and could be experimented with to really get tangible results.
full member
Activity: 161
Merit: 230
How does a brand new client with no information about the chain know which "dummy" addresses to query for? It can't ask for the content of randomly generated addresses, because then it would be pretty obvious which one is the real one since only one of the queried addresses were ever seen on the blockchain.

Also the performance degradation from querying the balance of 20x or 50000x or whatever addresses for each real one would make this useless in practice.
hero member
Activity: 882
Merit: 5834
not your keys, not your coins!
Maybe, due to my tiredness today, but can you explain me a little bit what's this private information retrieval or how it works in simple terms, cause I make no sense from the wikipedia article. The only thing I've understood is that you can retrieve an item from a server without having to reveal this server-side; I'm not sure for this either.
Excellent! Smiley You can retrieve a database entry without revealing to the server which entry you are looking for. That's exactly what PIR is all about!

I'm trying to acknowledge this with common sense and fail at the how the hell part. How can you really be sure that what you're sending to the server cannot be seen by it?
Yep, some schemes are secure against computationally unbounded attackers, while others are only secure against computationally bounded attackers, but in practice they are all pretty good. The more sophisticated ones use OT etc., which is often needed to overcome the issue of having multiple non-colluding servers. The cool thing in Bitcoin is that we DO already have such servers in place. So we can use one of the much simpler schemes like the 'original' PIR concept from '95.

In this scheme, the privacy is guaranteed by the fact that the client asks various servers for on average half of the DB entries for each server. Since the client queries multiple servers, on some servers this set of entries does include the item of interest, while on others it does not. The distribution is again 50:50 so no server gains any information (not even a 50% probability reduction when guessing) about which item is actually queried.

Each server returns an XOR of the queried items; when XORing all results together, the result is the one single 'item of interest', since the queries are constructed such that all other items cancel each other out through XOR.

For example, in a database with 8 elements, where element 3 is the one of desire, each server gets queries as follows:
  • A: 01101000 (random 8-bit string)
  • B: 10010001 (other random 8-bit string)
  • C: 11011001 (calculated like: 00100000 + (A + B))

This gives very high security (I think information-theoretical) because the number of 1's and 0's are on average equally distributed.
So in theory querying 2 servers like this would result in a result 'download' that has the size of the balances of all Bitcoin addresses (on average half of all addresses per server) Grin

Maybe better explanation of the above example:
Code:
A: 01101000 => server A is queried for the second, third and 5th database item
B: 10010001 => server B is queried for items 1, 4, 8
C: 11011001 => server C is queried for items 1, 2, 4, 5, 8

01101000 was chosen uniformly at random, 10010001 as well.
We want item 3, so the query would be 00100000.

Construct C:
    01101000
XOR 10010001
XOR 00100000
------------------
    11011001

The servers return the queried item, but after XORing them. In a database with 8 items it doesn't become quite obvious yet, but think of it having 100.000 items. On average, every server is queried for ~50.000 items, performs an XOR of them and returns that to the client.

XORing the files (or address balances) is fine because the query C is constructed so that XORing all of the queries (or all of the resulting datasets) nets to just 'item 3'.
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
Maybe, due to my tiredness today, but can you explain me a little bit what's this private information retrieval or how it works in simple terms, cause I make no sense from the wikipedia article. The only thing I've understood is that you can retrieve an item from a server without having to reveal this server-side; I'm not sure for this either.

The article requires you to know terms such as oblivious transfer or what's the quadratic residuosity problem which I've no idea.

I'm trying to acknowledge this with common sense and fail at the how the hell part. How can you really be sure that what you're sending to the server cannot be seen by it? Let's assume you want to know the balance of an address. In order to find out, you'll have to send it to a server and the server will return you the amount based on the search it has made. If it doesn't know the address how will it search its database?
hero member
Activity: 882
Merit: 5834
not your keys, not your coins!
Could this not be done at the client side?
Electrum has a list of servers, it asks 20 servers for information 20 addresses and just gives the user the information for the 5 they need. (made up numbers)
It's not enough to ask 20 addresses, that's the issue.
The idea is that you could ask for thousands or millions of addresses per server. XOR is insanely performant on most databases.

In theory, in true multi-server-PIR you actually construct fully random queries for all but one server, then build that query so that the results of all servers XORed together result in the 'actual query'.

For example, in a database with 8 elements, where element 3 is the one of desire, each server gets queries as follows:
  • A: 01101000 (random 8-bit string)
  • B: 10010001 (other random 8-bit string)
  • C: 11011001 (calculated like: 00100000 + (A + B))

This gives very high security (I think information-theoretical) because the number of 1's and 0's are on average equally distributed.
So in theory querying 2 servers like this would result in a result 'download' that has the size of the balances of all Bitcoin addresses (on average half of all addresses per server) Grin

I'm not sure if it is plausible to query each server for on average half of all Bitcoin addresses (even if just the XOR of them has to be transmitted). So maybe another PIR scheme may be better suited, or it may be deemed 'good enough' to just query each server for e.g. max. of 10.000 addresses or something like that.

But, it does generate a lot more traffic for the servers. Which most (all?) people are providing for free so there is that.
Well, think about the mobile wallet users Grin I don't think refreshing your wallet balance and waiting for a 5GB download (made up number Wink) would be much fun and cheap on the monthly data plan ^^
legendary
Activity: 3500
Merit: 6320
Crypto Swap Exchange
Could this not be done at the client side?
Electrum has a list of servers, it asks 20 servers for information 20 addresses and just gives the user the information for the 5 they need. (made up numbers)

Then it does internally what it needs to do a transaction and sends that through a 21st server.

The local desktop keeps a record of what server it asked what and what server it sent what thought.

Not as robust as full PIR but with enough clients asking enough questions it really makes data mining harder.

But, it does generate a lot more traffic for the servers. Which most (all?) people are providing for free so there is that.

-Dave
hero member
Activity: 882
Merit: 5834
not your keys, not your coins!
So, I was thinking a lot about privacy for users that don't run a personal Full Node + Electrum Server themselves.
Of course, with only ~10.000 nodes and millions of Bitcoin users globally, it means the vast majority of Bitcoin users is sacrificing privacy by querying (sometimes fixed, built-in) Electrum servers.

In cryptography there is the concept of PIR / Private Information Retrieval.
The trivial PIR scheme is the one where the client downloads the whole database and performs the query locally, which would equate to running their own full node.
However, since '95 with PIR there are much smarter ways to achieve private information retrieval which I think would highly benefit the majority of Bitcoin users.

The earliest, simpler (but more sophisticated than trivial PIR) PIR concepts actually rely on multiple, non-colluding servers with exact copies of the database. This is exactly what we have in Bitcoin already. Tons of nodes with an exact copy of the blockchain. This means it might be possible to implement this mostly client-side with maybe little changes to ElectrumX. Many recent improvements in PIR are mostly aimed to reduce the number of needed servers; however, we don't need to worry about this at all which would make the whole thing much simpler.

The core idea of multi-server PIR is as follows: the client doesn't query just one (Electrum) server, but instead multiple, for various different files (addresses). Only one of these files (addresses) is the one actually interesting for the client. That way any one server cannot assume that any of the datasets queried actually belong to the client.

For example: user wants to query the balance of their address 'B'. (this could be transaction information or any other blockchain information as well)
Instead of querying one server for the balance of address B, it queries multiple servers like this:
  • 'Server 1, give me balance of addresses A, B, C'
  • 'Server 2, give me balance of addresses A, C'.

Now, mostly to save on communication(I think), the servers XOR the binary representation of the balances (in the example: balanceA+balanceB+balanceC and balanceA+balanceC) locally before sending; then the client will XOR the results: (balanceA+balanceB+balanceC) + (balanceA+balanceC). The unwanted, randomly chosen addresses' balances cancel each other out and they gain the balance of address B.

To implement it into ElectrumX, and retain backwards compatibility to non-PIR queries, a PIR query would just need some kind of extra indicator that makes it clear the client isn't just querying multiple addresses for which they want the separate balances in clear; instead, that the client wants the XOR of those balances.
It would be possible to just get the balances and discard the ones a client is not actually interested in, as well, but especially with larger queries the communication overhead would get quite big. XOR of the query results solves this issue.

What do you guys think about implementing such thing in Bitcoin / ElectrumX? Was something like this maybe attempted already? Other suggestions for improvements of the idea? Or is it all superfluous for some reason after all? Glad for any input on the topic!
Pages:
Jump to: