Each SetBip32() call costs 2048 HMACSHA512 + (path depth * 1 HMACSHA512). That is 2050 HMACSHA512 per candidate when brute forcing m/0'/0'. The 2:44 min is the time to compute 536,636,700 HMACs in total (in practice it is a lot less work than that, thanks to the "specialized" code).
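To make the arithmetic concrete, here is a small Python sketch of the count. The function name is made up for illustration, and the number of candidates is not stated anywhere above; it is only inferred by dividing the total by the per-candidate cost.

```python
PBKDF2_ROUNDS = 2048  # HMACSHA512 calls inside PBKDF2


def hmacs_per_candidate(path_depth: int) -> int:
    """Hypothetical helper: 2048 PBKDF2 rounds plus one HMAC per path level."""
    return PBKDF2_ROUNDS + path_depth


per_candidate = hmacs_per_candidate(2)        # m/0'/0' has two levels
total_hmacs = 536_636_700                     # figure quoted above
elapsed_seconds = 2 * 60 + 44                 # 2:44 min

candidates = total_hmacs // per_candidate     # inferred, not a measured value
nominal_rate = total_hmacs // elapsed_seconds # HMACs per second if nothing were skipped

print(per_candidate, candidates, nominal_rate)  # 2050 261774 3272175
```

The rate is only nominal: since a large part of each HMAC is skipped, the real number of block compressions per second is lower than this figure suggests.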
The specialized part is that FinderOuter doesn't use the general-purpose HMAC functions; everything is written to compute only what is needed. For example, each HMAC consists of at least 2 SHA512 computations, and each SHA512 has at least 2 blocks to compress. PBKDF2 (the 2048 rounds) repeats this in a loop, where roughly 50% of the work (4094 block compressions) is skipped on each call, which greatly improves the speed.
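The idea can be sketched in Python. This is not FinderOuter's code (that is C# and works on the raw SHA512 state); it only illustrates the same trick using `hashlib`'s `copy()`, and the function name is hypothetical. The two padded key blocks are compressed once and reused, so each round compresses 2 blocks instead of 4, which is where 2 * 2048 - 2 = 4094 skipped compressions would come from.

```python
import hashlib

BLOCK_SIZE = 128  # SHA-512 block size in bytes


def pbkdf2_sha512_first_block(password: bytes, salt: bytes, rounds: int) -> bytes:
    """Sketch: first 64-byte block of PBKDF2-HMAC-SHA512 with cached key states."""
    # Standard HMAC rule: keys longer than one block are hashed first.
    if len(password) > BLOCK_SIZE:
        password = hashlib.sha512(password).digest()
    key = password.ljust(BLOCK_SIZE, b"\x00")

    # Compress the two padded key blocks once; every round starts from a copy.
    inner = hashlib.sha512(bytes(b ^ 0x36 for b in key))
    outer = hashlib.sha512(bytes(b ^ 0x5C for b in key))

    def prf(message: bytes) -> bytes:
        i = inner.copy()
        i.update(message)          # one more block for a 64-byte message
        o = outer.copy()
        o.update(i.digest())       # one more block for the inner digest
        return o.digest()

    u = prf(salt + (1).to_bytes(4, "big"))   # block index 1
    result = int.from_bytes(u, "big")
    for _ in range(rounds - 1):
        u = prf(u)
        result ^= int.from_bytes(u, "big")   # XOR all rounds together
    return result.to_bytes(64, "big")
```

The output can be compared against `hashlib.pbkdf2_hmac("sha512", password, salt, rounds, 64)` to check that skipping the key blocks does not change the result.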
The only reason it takes much longer (hours) to recover using an address is issue #9. ECC is slow on its own, and my implementation of it turns out to be terribly slow.
Additionally, when the path is something like m/0/0, the final step (after the PBKDF2) has to compute public keys (so there is an ECMultiply), which is a slow process in itself. As a result the recovery becomes a lot slower, and FinderOuter's own slow implementation on top of that is why it takes that long.
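For illustration, here is a rough Python sketch of why a non-hardened level is more expensive, following the BIP32 child key derivation rules. It is a naive double-and-add over secp256k1, not FinderOuter's implementation, and all function names are made up. The hardened branch feeds the private key straight into the HMAC, while the non-hardened branch needs a full point multiplication first.

```python
import hashlib
import hmac

# secp256k1 parameters
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
HARDENED = 0x80000000


def point_add(a, b):
    """Affine point addition; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    if a[0] == b[0] and (a[1] + b[1]) % P == 0:
        return None
    if a == b:
        s = 3 * a[0] * a[0] * pow(2 * a[1] % P, P - 2, P) % P
    else:
        s = (b[1] - a[1]) * pow((b[0] - a[0]) % P, P - 2, P) % P
    x = (s * s - a[0] - b[0]) % P
    return (x, (s * (a[0] - x) - a[1]) % P)


def ec_multiply(k: int):
    """Naive double-and-add: k * G. This is the slow part."""
    result, addend = None, G
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


def child_hmac_data(parent_key: int, index: int) -> bytes:
    if index >= HARDENED:
        # Hardened: 0x00 || private key || index, no curve math needed.
        return b"\x00" + parent_key.to_bytes(32, "big") + index.to_bytes(4, "big")
    # Non-hardened: compressed public key || index, needs an ECMultiply.
    x, y = ec_multiply(parent_key)
    prefix = b"\x03" if y & 1 else b"\x02"
    return prefix + x.to_bytes(32, "big") + index.to_bytes(4, "big")


def derive_child(parent_key: int, chain_code: bytes, index: int):
    """One derivation level (the rare invalid-key cases from BIP32 are ignored here)."""
    digest = hmac.new(chain_code, child_hmac_data(parent_key, index), hashlib.sha512).digest()
    child_key = (int.from_bytes(digest[:32], "big") + parent_key) % N
    return child_key, digest[32:]
```

So for m/0'/0' each level is a single HMAC, but for m/0/0 every level adds an `ec_multiply` call per candidate, and that is before computing the final public key for the address comparison.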