Question 1: If these are indeed FPGAs (i.e., reprogrammable), then is it possible to extrapolate/guess performance figures for other hashing algorithms (MD5, RIPEMD, Whirlpool, etc.) based on stated performance figures for SHA256? It seems to me that these could indeed be useful in other applications when reprogrammed to something other than SHA256 - for instance, a nice little portable MD5 cracker-in-a-box.
If they are FPGAs then they can be reprogrammed; someone would need to design a bitstream to perform whatever task you want. SHA-256 hashing is roughly 5x as computationally intensive as MD5, and Bitcoin involves a double hash, so as a ballpark you could get ~10 billion MD5 hashes per second if the board really can do 1 billion double SHA-256 hashes per second. The exact level of performance would depend on how good the bitstream designer is and how effectively they can utilize the chip's LUTs to squeeze out every last hash of performance.
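To make that ballpark concrete, here is a rough back-of-the-envelope sketch. Both the 1 GH/s board rate and the 5x SHA-256-vs-MD5 cost ratio are assumptions for illustration, not measured figures:

```python
# Hypothetical extrapolation from a claimed double-SHA-256 rate to an MD5 rate.
double_sha256_rate = 1e9                      # assumed: 1 billion Bitcoin (double) hashes/s
single_sha256_rate = 2 * double_sha256_rate   # each Bitcoin hash is two SHA-256 passes
md5_cost_ratio = 5                            # assumed: SHA-256 is ~5x the work of MD5
estimated_md5_rate = single_sha256_rate * md5_cost_ratio

print(f"~{estimated_md5_rate:.0e} MD5 hashes/s")   # ~1e+10, i.e. roughly 10 billion
```

In practice the real number depends on how well an MD5 pipeline maps onto the same LUTs, so treat this purely as an order-of-magnitude guess.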
Question 2: Also assuming these are reprogrammable, what would the expected performance loss be from flashing them for a single round of SHA256 (instead of double) and then using external software and/or hardware to take care of doubling the hashes and fiddling with the nonces? Is this at all feasible, or is having everything done at once in a single chip the only way to do it?
It is certainly possible but, IMHO, not economical. We think of hashing as hard because it takes a long time to find a block, but in reality a single hash is very easy. For example, if each chip performs 1 billion hashes per second, then on average one hash takes only about a nanosecond (strictly speaking a single hash takes several nanoseconds to pass through the pipeline, but each chip is working on many hashes at once). Having two chips work together on one hash requires significant bandwidth and introduces latency issues. The board would have to be designed to handle that, and it is improbable that a multi-chip solution (working on a single hash) would be faster or cheaper.
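A small sketch of that latency-vs-throughput point, with made-up clock and pipeline-depth numbers (no real board is being described here):

```python
# Why per-hash latency and hashes-per-second are different things on a pipelined FPGA.
clock_hz = 100e6         # assumed fabric clock: 100 MHz
pipeline_stages = 64     # assumed depth of a fully unrolled SHA-256 pipeline

latency_s = pipeline_stages / clock_hz   # time for ONE hash to flow through: ~640 ns
throughput_hps = clock_hz                # but a finished hash exits EVERY clock cycle

print(f"latency per hash: {latency_s * 1e9:.0f} ns")
print(f"throughput per pipeline: {throughput_hps:.0e} hashes/s")
# Replicating such a pipeline ~10 times on one chip would give ~1e9 hashes/s
# even though any individual hash still takes hundreds of nanoseconds.
```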
Hashing is an almost perfectly parallel task, which is somewhat rare in computer science. Usually there are more dependencies between sub-tasks, which require inter-chip communication to solve the problem. Generally speaking, the more inter-chip or inter-node communication is necessary, the more overhead the parallel processing incurs (e.g. 4 chips aren't 4x as fast, they are 3.2x as fast; 16 chips aren't 16x as fast, they are only 12x as fast; etc.).
Since each hash is independent and can be computed quickly, if you want more hashing power you just use more chips, boards, or rigs.
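As an illustration of how little coordination that takes, here is a minimal CPU-side sketch: each worker gets its own nonce range and double-hashes independently, with no communication between workers. The header bytes and the difficulty target are made up for the example:

```python
import hashlib
from multiprocessing import Pool

HEADER = b"example block header"   # stand-in for a real 80-byte Bitcoin header
TARGET = 2 ** 240                  # deliberately easy, made-up difficulty target

def search(nonce_range):
    """Double SHA-256 every nonce in the range; return the first one below the target."""
    for nonce in nonce_range:
        data = HEADER + nonce.to_bytes(4, "little")
        digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
        if int.from_bytes(digest, "little") < TARGET:
            return nonce
    return None

if __name__ == "__main__":
    chunks = [range(i * 100_000, (i + 1) * 100_000) for i in range(4)]
    with Pool(4) as pool:                 # four independent "chips"
        print(pool.map(search, chunks))   # no inter-worker communication at all
```

Scaling this up is just a matter of handing out more nonce ranges; the workers never need to talk to each other.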
Example: chip #1 does the first SHA256 round, and then chip #2 does the second SHA256 round, and some external software/hardware increments the nonce and sends it back to chip #1. Perhaps I don't grok how this all works, but there are several experts here and I would like to know whether I have the idea down pat.
As above, that would require roughly 1 billion hashes per second x 256 bits (the first-stage digest handed from chip #1 to chip #2), or about 256 Gbps (gigabits per second) of inter-chip bandwidth; closer to 0.5 Tbps if you ship the full 512-bit padded block each time. You could potentially do that, but it would be a lot of work, complexity, and potential performance bottlenecks for less hashing power. Generally you want to minimize inter-element communication UNLESS a single element (SIMD group, chip, server, cluster, etc.) can't solve the problem in a reasonable amount of time.
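For reference, the bandwidth arithmetic spelled out, again treating the 1 GH/s figure as an assumption:

```python
# Inter-chip traffic if chip #1 ships every first-stage digest to chip #2.
hash_rate = 1e9           # assumed: 1 billion hash attempts per second
bits_per_transfer = 256   # the SHA-256 digest handed from chip #1 to chip #2

bandwidth_bps = hash_rate * bits_per_transfer
print(f"{bandwidth_bps / 1e9:.0f} Gbps")   # 256 Gbps, before any framing/protocol overhead
```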
TL;DR version:
You split work across chips (inter-chip processing) only when you have to, because the problem is too complex or has too many dependencies on other results to process in a single chip. Bitcoin doesn't have that problem; it is just about as perfectly parallel as problems come in computer science.