The timestamp at offset 4 (bytes 4 - 7 in the last block) either can't move, or must become an added nonce field, because existing hardware is modifying that location in order to extend the nonce range already, resulting in timestamps on mined blocks that are inaccurate by up to 256 seconds. Nothing else (except another nonce field) could be modified in that way without destroying the validity of the block.
That is a good point. I assumed that the time field was being incremented outside of the hardware (i.e. in cgminer) via ntime rolling, but if it is being incremented internally then yes, this would need to be a nonce field. Although it is less clear, it would be possible to extend the nonce to a 64-bit number using these bytes as the upper 32 bits. Time could be stored properly in a prior block (using the SHA definition of "block", not the Bitcoin definition) to avoid the "hack" of using wrong timestamps as an extra nonce.
The difficulty field at offset 8 (bytes 8 - 11 in the last block) can't move because that's the location that existing ASICs read in order to find out whether the hashes they're coming up with are low enough to win or not.
This is not correct. If it were the case, the hardware could only be used for solo mining and would never return any result except one that solves a block. Instead, the hardware is fed a lower share difficulty (not part of the actual block structure) and returns every hash below the corresponding share target (which by definition also includes every hash below the block target).
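To illustrate the share vs. block distinction, here is a minimal sketch in Python. The function names (`double_sha256`, `bits_to_target`, `classify`) are made up for this example and are not taken from cgminer or any ASIC firmware. It assumes the usual Bitcoin conventions: the header is hashed with double SHA-256, the digest is read as a little-endian integer, and the compact "bits" value expands to a target.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Bitcoin-style proof-of-work hash: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def bits_to_target(bits: int) -> int:
    """Expand the compact 'bits' encoding into a full integer target.
    The sign bit of the mantissa is ignored in this sketch."""
    exponent = bits >> 24
    mantissa = bits & 0x007FFFFF
    if exponent <= 3:
        return mantissa >> (8 * (3 - exponent))
    return mantissa << (8 * (exponent - 3))

def classify(header: bytes, share_target: int, block_target: int) -> str:
    """Hypothetical helper: decide what a candidate header is worth.
    share_target is the easier (numerically larger) pool target and is
    not stored anywhere in the header itself."""
    value = int.from_bytes(double_sha256(header), "little")
    if value <= block_target:
        return "block"   # solves the block, and is also a valid share
    if value <= share_target:
        return "share"   # good enough for the pool, not for the network
    return "none"
```

Since the share target is always at least as large as the block target, anything that classifies as "block" would also have been reported as a share.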
The nonce field at offset 12 (bytes 12 - 15 in the last block) can't move because ASICs are testing images of the blocks with all possible values there, and will cough out the value to overwrite that location with.
Correct.
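For reference, the offsets being discussed can be checked with a short Python sketch. It assumes the standard 80-byte header layout (version, previous hash, Merkle root, time, bits, nonce, all little-endian); the function name `last_block_fields` is hypothetical. The first 64 bytes fill the first SHA-256 block, so only the last 16 bytes of the header land in the second one.

```python
import struct

def last_block_fields(header: bytes) -> dict:
    """Split out the part of an 80-byte header that falls into the
    second 64-byte SHA-256 block, keyed by what sits at each offset."""
    if len(header) != 80:
        raise ValueError("expected an 80-byte block header")
    tail = header[64:]  # header bytes 64..79 = offsets 0..15 of the last block
    time, bits, nonce = struct.unpack_from("<III", tail, 4)
    return {
        "merkle_tail": tail[0:4],  # offset 0: last 4 bytes of the Merkle root
        "time": time,              # offset 4
        "bits": bits,              # offset 8
        "nonce": nonce,            # offset 12
    }
```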
Your point that the nonce field at offset 12 could compatibly be an 8 or 16-byte field instead of a 4-byte field is a good one. Future generations of ASICs could use that, as you say, to support faster hashing without messing with the timestamp field.
Based on your input about the timestamp field, if that is being incremented internally in hardware, I would split the 64-bit nonce into two 32-bit numbers at offsets 4 and 12. This is because we have a limited amount of space in the final block, and that would preserve the most bits. It would give us 8 bytes to improve the security of mining. Two possibilities for using those bytes would be a partial hash of the prior block, which would allow miners to verify their hashpower isn't being used for malicious purposes, and/or the elements required for opaque mining (a technique to prevent withholding attacks). However, I haven't looked closely to see if that would be possible while still remaining backwards compatible.
Final block
Offset 0 - 4 bytes AVAILABLE
Offset 4 - 4 bytes upper 32 bits of nonce64
Offset 8 - 4 bytes AVAILABLE
Offset 12 - 4 bytes lower 32 bits of nonce64
So we gain a larger nonce range and 8 bytes for additional security features while maintaining backwards compatibility with existing hardware.
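A rough Python sketch of the split, just to make the layout concrete. The names `write_nonce64` and `read_nonce64` are invented for this example, and little-endian byte order for each 32-bit half is an assumption (it matches how the existing header fields are stored).

```python
import struct

NONCE_HI_OFFSET = 4    # where the timestamp sits today
NONCE_LO_OFFSET = 12   # where the existing 32-bit nonce sits

def write_nonce64(final_block: bytearray, nonce64: int) -> None:
    """Store a 64-bit nonce as two 32-bit words in the final block,
    leaving offsets 0 and 8 untouched."""
    hi = (nonce64 >> 32) & 0xFFFFFFFF
    lo = nonce64 & 0xFFFFFFFF
    struct.pack_into("<I", final_block, NONCE_HI_OFFSET, hi)
    struct.pack_into("<I", final_block, NONCE_LO_OFFSET, lo)

def read_nonce64(final_block: bytes) -> int:
    """Reassemble the 64-bit nonce from its two halves."""
    (hi,) = struct.unpack_from("<I", final_block, NONCE_HI_OFFSET)
    (lo,) = struct.unpack_from("<I", final_block, NONCE_LO_OFFSET)
    return (hi << 32) | lo
```

Existing hardware that only iterates the word at offset 12 would simply be searching the lower half of the range for whatever upper half it was handed.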
But your answer to my original question is that the last 48 bytes of the second block (offsets 16 - 63) could be used to hold data. For example, if we wanted to put the (32 byte) Merkle root of the current UTXO set in the header, we could put it there.
Correct. The padding is just used to ensure that the format of the last block remains compatible by ensuring the 64 byte boundaries are observed. It would be possible to add an entire additional block if you needed more space.
Also, now I have a second question. Since current ASICs presume they are working on the second (and last) block, won't they be putting the length of the current rather than the revised block header in at offset 56? And therefore won't the results deviate from "real" SHA256 when they are used on the third (now last) block?
That is a good question. If that is being appended internally then you are probably right. This would make Bitcoin a nonstandard variant of SHA-256.
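The length issue can be seen by padding by hand. Below is a small Python sketch of standard SHA-256 message padding (a 0x80 byte, zero fill, then the message length in bits as an 8-byte big-endian integer at offset 56 of the last block). The function name `sha256_pad` and the 144-byte "revised header" size are assumptions made up for the example, not a proposed format.

```python
def sha256_pad(message: bytes) -> bytes:
    """Apply standard SHA-256 padding and return the padded message,
    whose length is always a multiple of 64 bytes."""
    bit_length = len(message) * 8
    padded = message + b"\x80"                           # mandatory 1 bit
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)   # zero fill to offset 56
    padded += bit_length.to_bytes(8, "big")              # length field, offsets 56..63
    return padded

current = sha256_pad(bytes(80))    # today's header: two 64-byte blocks
revised = sha256_pad(bytes(144))   # hypothetical header with one extra block

# The last block has the same shape in both cases (16 data bytes, 0x80 at
# offset 16, length at offset 56), but the length values differ.
current_len = int.from_bytes(current[-8:], "big")   # 640 bits
revised_len = int.from_bytes(revised[-8:], "big")   # 1152 bits
```

So hardware that hard-wires 640 into the last block would compute something other than real SHA-256 over a 144-byte message, unless the protocol defined it that way.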