As a solo miner, I switched one of my FPGA hosts to bfgminer 2.9.3 in order to mine with getblocktemplate against bitcoind 0.7.1. This particular host has 37 FPGA devices (a mixture of BFL Singles, CM1s, and Icarus boards), and I notice that bfgminer is sub-optimal: every 2-3 minutes, it does about 20-40 getblocktemplate calls in a row (one for each FPGA device?). These 20-40 getblocktemplate calls amount to about 3-4 MB total, so on average getblocktemplate generates about 1-2 MB of network traffic per minute.
On the other hand, when mining with getwork, these 37 FPGA devices amount to about 15 Ghash/sec, which generates about 4 getwork calls per second, or about 600 kB/minute as measured by a packet sniffer.
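For what it's worth, here's the back-of-envelope arithmetic behind those figures (just a sketch using the approximate numbers above, not exact measurements):

```python
# Back-of-envelope traffic comparison using the rough numbers quoted above
# (all constants are approximations from this post, not exact measurements).
GBT_BATCH_BYTES = 3.5e6        # ~3-4 MB per burst of getblocktemplate calls
GBT_BATCH_INTERVAL_MIN = 2.5   # one burst every ~2-3 minutes
GETWORK_BYTES_PER_MIN = 600e3  # ~600 kB/min as seen by the packet sniffer

gbt_per_min = GBT_BATCH_BYTES / GBT_BATCH_INTERVAL_MIN
print(f"getblocktemplate: ~{gbt_per_min / 1e6:.1f} MB/min")
print(f"getwork:          ~{GETWORK_BYTES_PER_MIN / 1e6:.1f} MB/min")
# -> getblocktemplate: ~1.4 MB/min vs getwork: ~0.6 MB/min
```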
Bottom line: getblocktemplate as implemented in bfgminer causes my host to generate more network traffic than getwork (though fewer RPC calls). I haven't taken the time to read the code yet, but, Luke, isn't there something trivial to optimize to reduce the number of getblocktemplate calls?
As you note, the bitcoind matters above aside, bfgminer does not implement GBT optimally at this time, since it is still using code originally built around getwork; cgminer has bypassed the getwork paths for its GBT support, but also intentionally designed the GBT path to use more bandwidth (conman wants to make GBT look bad). Rewriting the pool networking code in bfgminer has been on my to-do list for a while (mainly due to other problematic bugs in it), and hopefully I'll get to that before 3.0. But even without that rewrite, each template can still scale per device, so the problem of ASICs hashing much faster is still dealt with in practice.
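To illustrate what I mean by a template scaling per device: roughly speaking, the miner can derive many distinct works from a single getblocktemplate response locally, by rolling an extranonce inside the coinbase transaction and recomputing the merkle root. Below is a simplified sketch of that idea, not bfgminer's actual code; the coinbase_prefix/coinbase_suffix split and the function names are hypothetical, and real GBT coinbase handling (per BIP 22/23) involves more than this:

```python
import hashlib

def dsha256(b: bytes) -> bytes:
    """Bitcoin's double SHA-256."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(txids: list[bytes]) -> bytes:
    """Fold a list of transaction hashes up to the merkle root."""
    layer = list(txids)
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])  # duplicate the last hash on odd-sized layers
        layer = [dsha256(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def works_from_template(coinbase_prefix: bytes, coinbase_suffix: bytes,
                        other_txids: list[bytes], count: int):
    """Yield `count` distinct (extranonce, merkle_root) pairs from one template
    by varying an extranonce in the coinbase -- no further network round trips."""
    for extranonce in range(count):
        coinbase_tx = coinbase_prefix + extranonce.to_bytes(4, "little") + coinbase_suffix
        yield extranonce, merkle_root([dsha256(coinbase_tx)] + other_txids)
```

Each of those merkle roots goes into a different block header, so one template download can feed every device (or a much faster ASIC) for as long as the template remains valid.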