Currently 76 bytes? Not if you're using the getwork protocol. A getwork response without HTTP headers is already close to 600 bytes... to transfer 76 bytes of data.
Good to know. Never looked inside a getwork request; I just knew it was 76 bytes of actual header information. 600 for 76 seems kinda "fat", but then again, as you point out, even at 600 bytes per header the bandwidth is trivial, so likely there was no concern with making getwork more bandwidth-efficient.
Nice analysis of the complete bandwidth economy. Really shows bandwidth is a non-issue. We will hit a wall on larger pools' computational power long before bandwidth even becomes a topic of discussion. I think a 64-bit nonce solves the pool efficiency problem more elegantly, but the brute-force method is just to convert a pool server into an associated collection of independent pool servers (i.e. deepbit goes to a.deepbit, b.deepbit, c.deepbit ... z.deepbit), each processing a portion of pool requests.
Still, had Satoshi imagined that just a few years into this experiment miners would be getting up to 3 GH/s per machine, he likely would have gone with a 64-bit nonce. When he started, a high-end CPU got what, 100 kH/s? A 4 billion nonce range lasts a long time at sub-MH/s performance.
That also shouldn't become a major issue; it's merely the current implementation that scales badly there.
Currently, to increment the extraNonce in the coinbase transaction, we rehash every transaction in the block and rebuild the whole merkle tree, so the whole thing ends up scaling with (blocksize * requestrate). Adding the obvious optimization of storing the sibling hashes along the coinbase's merkle branch and only rehashing the coinbase and its branch, we need an additional (log2(#tx in block) * 32) bytes of memory, but getwork then scales roughly with (log2(#tx in block) * requestrate).
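A toy sketch of that optimization (all function names are mine, and the "transactions" are just placeholder byte strings rather than real serialized txs): precompute the coinbase's sibling hashes once per block template, then each new extraNonce only costs one double-SHA-256 of the coinbase plus one per merkle level.

```python
import hashlib

def dsha256(b: bytes) -> bytes:
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(hashes):
    """Naive full rebuild: rehash the whole tree every time."""
    while len(hashes) > 1:
        if len(hashes) % 2:
            hashes.append(hashes[-1])          # Bitcoin duplicates the last hash
        hashes = [dsha256(hashes[i] + hashes[i + 1])
                  for i in range(0, len(hashes), 2)]
    return hashes[0]

def coinbase_branch(hashes):
    """Sibling hashes along the coinbase's (index 0, leftmost) path to the root."""
    branch = []
    while len(hashes) > 1:
        if len(hashes) % 2:
            hashes.append(hashes[-1])
        branch.append(hashes[1])               # sibling of the leftmost node
        hashes = [dsha256(hashes[i] + hashes[i + 1])
                  for i in range(0, len(hashes), 2)]
    return branch

def root_from_branch(coinbase_hash, branch):
    """Only 1 + log2(#tx) double-hashes per new extraNonce."""
    h = coinbase_hash
    for sibling in branch:
        h = dsha256(h + sibling)               # coinbase path is always the left child
    return h

# Hypothetical block: tx[0] is the coinbase.
txs = [b"coinbase:extraNonce=0", b"tx1", b"tx2", b"tx3"]
branch = coinbase_branch([dsha256(tx) for tx in txs])   # computed once per template

# Each getwork bumps extraNonce and rehashes only coinbase + branch:
new_coinbase = b"coinbase:extraNonce=1"
fast = root_from_branch(dsha256(new_coinbase), branch)
slow = merkle_root([dsha256(tx) for tx in [new_coinbase] + txs[1:]])
assert fast == slow
```

The stored branch is exactly the (log2(#tx) * 32) bytes mentioned above: one 32-byte sibling hash per tree level.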
At something like a current average block (~10 kB, ~20 transactions), that comes out to ~240 vs. ~8 SHA-256 operations per 4 GH/s worth of work.
Scaling to Visa-level 10k tx/sec (that'd be 3 GB blocks containing ~6M transactions...), it's about 50 SHA-256 operations per 4 GH/s worth of work.
So for a pool roughly the size of the current network, that'd be ~24k SHA-256/sec at current tx volume vs. ~150k SHA-256/sec at 10k tx/s.
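The back-of-envelope model behind those figures looks roughly like this (my own cost model: one SHA-256 compression per 64 bytes hashed, plus a couple of operations per merkle node; the exact constants depend on how you count compression calls, so treat these as order-of-magnitude, not the precise numbers quoted above):

```python
import math

# A 32-bit nonce range covers 2^32 hashes, so 4 GH/s needs ~1 getwork/sec.
GETWORKS_PER_SEC_PER_4GH = 4e9 / 2**32

def ops_naive(block_bytes, n_tx):
    """Full rebuild: rehash the whole block body plus the merkle tree.
    One compression per 64-byte chunk, ~2 double-hashes per tree node."""
    return block_bytes / 64 + 2 * 2 * n_tx

def ops_branch(n_tx):
    """Optimized: rehash coinbase + one double-SHA-256 per merkle level."""
    return 2 * (1 + math.ceil(math.log2(n_tx)))

# Current (~2011) averages: ~10 kB blocks, ~20 tx.
naive_now = ops_naive(10_000, 20)      # ~236 ops/getwork, matching "~240"
fast_now  = ops_branch(20)             # ~12 here vs. the ~8 quoted; same ballpark

# Visa-scale: 10k tx/s -> ~6M tx per 10-minute block.
fast_visa = ops_branch(6_000_000)      # ~48 ops/getwork, matching "about 50"
```

Multiplying the per-getwork cost by the pool's getwork rate (hashrate / 2^32 per second) then gives the ~24k vs. ~150k SHA-256/sec totals.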
And using something like "miners increment the block time themselves" (rolling ntime), this can be cut down by another factor of 60.
Scaling this for increasing hashrates due to Moore's law... well, that applies to both sides.
So for the getwork+PoW side, I just don't see any hard issues coming up.
I expect to see way bigger problems on the transaction-handling side when scaling to such massive levels. Assuming every tx has 2 inputs + 2 outputs on average, you'd be verifying about 20k ECDSA sigs/second, and on every block you're marking ~12M outputs as spent and storing ~12M new outputs in some kind of transactional fashion. Just the list of current unspent outputs would probably be on the order of 10s of GB... ugh.
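For what it's worth, the arithmetic behind those transaction-side numbers (the ~50 bytes per stored unspent output and the ~1 day average output lifetime are my own assumptions, purely to get an order of magnitude):

```python
tx_per_sec   = 10_000
tx_per_block = tx_per_sec * 600        # 10-minute blocks -> ~6M tx/block

# 2 inputs + 2 outputs per tx (average, as assumed above):
ecdsa_per_sec   = tx_per_sec * 2       # one signature check per input
spent_per_block = tx_per_block * 2     # outputs marked spent each block
new_per_block   = tx_per_block * 2     # new outputs stored each block

# If an unspent output lingers ~1 day on average (assumption) at
# ~50 bytes each (txid ref + index + value + script; assumption):
utxo_count = new_per_block * 144       # ~144 blocks per day
utxo_gb    = utxo_count * 50 / 1e9
print(ecdsa_per_sec, spent_per_block, utxo_gb)  # 20000 12000000 86.4
```

Even with generous assumptions, the unspent-output set lands in the tens of gigabytes, which is the "ugh" part.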