Did you miss the word 'interleaved' somehow? You need all 4 hashes to have their last 4 bits zero to match the target 0x0000FFFF... I'd suggest just reading the code.
However, using only the last 64 bits of each of the 4 hashes (and effectively only the last 8-10 bits for PoW at the current difficulty) kills the math behind their cryptographic security proofs.
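To make the interleaving point concrete, here is a minimal Python sketch. It is not the coin's actual code: it assumes a plain round-robin interleave of four 64-bit words, and the names `interleave4`, `meets_target` and `per_hash_check` are made up for illustration. Which end of each hash counts as "last" depends on the byte order the real code uses; the sketch simply works with the most significant bits.

```python
def interleave4(words, bits=64):
    """Round-robin interleave: output bit 4*i + k is bit i of words[k]."""
    out = 0
    for i in range(bits):
        for k, w in enumerate(words):
            out |= ((w >> i) & 1) << (4 * i + k)
    return out


def top_zero_bits(value, total_bits):
    """Leading zero bits of value, viewed as a total_bits-wide integer."""
    return total_bits - value.bit_length()


def meets_target(words, zero_bits=16):
    """Check the combined 256-bit value against a 0x0000FFFF... style target."""
    return top_zero_bits(interleave4(words), 256) >= zero_bits


def per_hash_check(words, zero_bits=16):
    """Same test done per hash: each word needs zero_bits/4 leading zeros."""
    per_word = zero_bits // 4
    return all(w >> (64 - per_word) == 0 for w in words)
```

Under these assumptions the two checks agree whenever the number of zero bits is a multiple of 4: 16 zero bits on the combined value is the same thing as 4 zero bits on each of the 4 hashes.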
The bit interleave part actually has the opposite effect on ASIC resistance. It costs literally nothing to do a fixed bit interleave in an ASIC. Having been an ASIC designer myself for almost 10 years, I would say this is less effective than the dark/quark algorithm in terms of ASIC resistance.
The worst part of the bit interleave is that the difficulty target is directly mapped onto each hash function, which makes parallel calculation possible and simple. If one of the hash results does not meet its partial target, the rest of the calculation can be skipped.
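The early-exit idea can be sketched roughly like this in Python: a nonce is abandoned as soon as one hash misses its share of the target. This is only an illustration. `hashlib` has no HEFTY1 or Groestl, so the four functions below are stand-ins, and `check_nonce`, the 4-byte nonce layout and the use of the leading bits are all assumptions rather than the real miner's behaviour.

```python
import hashlib

# Stand-ins only: the real chain uses HEFTY1, Keccak, SHA-256, BLAKE and Groestl.
STAGES = [hashlib.sha256, hashlib.sha3_256, hashlib.blake2s, hashlib.sha512]


def top_bits_zero(digest, nbits):
    """True if the nbits most significant bits of the digest are all zero."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - nbits) == 0


def check_nonce(header, nonce, nbits=4):
    """Return (ok, evaluated): stop at the first hash that misses its partial target."""
    data = header + nonce.to_bytes(4, "little")
    evaluated = 0
    for stage in STAGES:
        evaluated += 1
        if not top_bits_zero(stage(data).digest(), nbits):
            return False, evaluated  # remaining hashes are never computed
    return True, evaluated
```

With 4 bits required per hash, about 15 out of 16 nonces are rejected after the first hash alone, so on average only a little more than one hash is evaluated per nonce.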
If I were to implement this in an FPGA, I would do it in 4 stages: hefty1+keccak, sha256, Blake, and groestl. The overall hash rate is determined by hefty1+keccak. Having reviewed the source code, my estimate is that the complexity of hefty1 is of the same order of magnitude as sha256, so overall throughput would be similar to sha256. Considering the cost of all the hashes, 1/10 of the hash throughput of an existing bitcoin FPGA miner is very easy to achieve.
Your assessment is right. We're dominated by HEFTY1+SHA256 now in our cudaminer HVC branch. And all the other hash algos do not contribute significantly to the total amount of computation anymore.
We're doing 7 MHash/s on a GTX 780 Ti and nearly 4 MHash/s on a GTX 750 Ti (both at very low power utilization), which is better than the currently available AMD miners. We're aiming for a release sometime midweek. But I am sure the AMD miners will improve too.
Christian