I very much agree that this is one of the most well-written whitepapers around the alt community. This is the caliber of work I always hope to see and am usually disappointed not to find.
I've been over the whitepaper a few times, and over the implementation details as well. Unlike Graham, I have spotted a few possibly "obvious" things that lead to some questions and comments.
Most importantly, what prevents a pool from using indistinguishability obfuscation of garbled circuits, or Gentry-style FHE, to deliver (each block) a circuit to a worker that allows the worker to sign valid work, but nothing else? This would add quite a bit of overhead to the signing step for the worker (hitting 1kh/s might even be optimistic, heh), but pooling can still occur, and that overhead will only come down over time.
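To make the worry concrete, here is the functionality such a circuit would compute, sketched as a plain Python closure. Every name here is mine, sha256 stands in for the scheme's real hash, and a closure obviously keeps no secrets on its own; the attack is that iO/garbled circuits/FHE would let the pool ship something computationally equivalent without ever revealing the key:

[code]
import hashlib

# Hypothetical sketch of what a pool would obfuscate and ship each block.
# 'sign', the header layout, and sha256-as-PoW-hash are all placeholders;
# none of this is taken from the paper or the implementation.

def make_restricted_signer(privkey, target, sign):
    def attempt(header):
        # The circuit signs internally, exactly as a solo miner would...
        sig = sign(privkey, hashlib.sha256(header).digest())
        # ...but only releases the signature once the resulting work is
        # already valid, so the worker never gets a general signing oracle.
        final = hashlib.sha256(header + sig).digest()
        return sig if int.from_bytes(final, "big") <= target else None
    return attempt
[/code]

Evaluating that under FHE, or as a fresh garbled circuit per block, is what would push the worker down toward the 1kh/s range; nothing in the validity rules seems to rule it out.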
(This has been an open question going back to at least 2011. My understanding is that it is commonly held that preventing pooling in general is mathematically impossible, precisely because of secure function evaluation.)
Similarly, it seems like a *blinded* multisig escrow (as discussed at https://bitcointalksearch.org/topic/blind-signatures-using-bitcoin-compatible-ecdsa-440572 and elsewhere) could be employed for a decentralized pool, assuming a specific participant is trusted with redistribution of funds. (However, I may have missed a check that would preclude this. Unlike the SFE problem, this one does seem preventable, and not even all that difficult to prevent.)
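For intuition about why blinding would defeat any linkage check, here is a toy blind-Schnorr exchange over a tiny safe-prime group. (The linked thread does the harder Bitcoin-compatible ECDSA version; I'm using the Schnorr flavor only because it fits in a dozen lines, and every parameter below is illustrative.)

[code]
import hashlib, secrets

p, q, g = 2879, 1439, 4            # toy safe-prime group: g has order q

def H(*parts):
    return int.from_bytes(hashlib.sha256(b"|".join(parts)).digest(), "big") % q

x = secrets.randbelow(q); X = pow(g, x, p)   # escrow signer's keypair

r = secrets.randbelow(q); R = pow(g, r, p)   # signer commits to a nonce

# Worker blinds the commitment and the challenge; the signer never sees m.
a, b = secrets.randbelow(q), secrets.randbelow(q)
m = b"redistribute share to worker 7"
R_blind = (R * pow(g, a, p) * pow(X, b, p)) % p
c_blind = H(str(R_blind).encode(), m)
c = (c_blind + b) % q                        # all the signer ever learns

s = (r + c * x) % q                          # signer responds blindly

# Worker unblinds; (R_blind, s_blind) is an ordinary Schnorr signature on m.
s_blind = (s + a) % q
assert pow(g, s_blind, p) == (R_blind * pow(X, c_blind, p)) % p
[/code]

The signer ends up having authorized a payout it cannot later recognize, which is exactly the linkage a protocol-level check would have to break.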
Of lesser concern, I haven't seen any check during address generation or signing that the address can actually be recovered from signatures. Not all addresses can be. It isn't immediately clear to me what happens in this case; I suspect it just means stales that die somewhere under processblock and never get broadcast. (?)
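If this is a real failure mode, the fix looks cheap: a sanity check at key generation. Here is the shape of the check I mean, using the python-ecdsa package's recovery helper (the actual client is C++, so this is purely illustrative, and I'm assuming plain ECDSA recovery approximates whatever extraction the scheme does):

[code]
import hashlib
from ecdsa import SigningKey, VerifyingKey, SECP256k1

def key_is_recoverable(sk, probe=b"probe"):
    """Sign a throwaway message and confirm the public key (and hence
    the address) can be recovered from the signature alone."""
    sig = sk.sign(probe, hashfunc=hashlib.sha256)
    candidates = VerifyingKey.from_public_key_recovery(
        sig, probe, SECP256k1, hashfunc=hashlib.sha256)
    vk = sk.get_verifying_key()
    return any(c.to_string() == vk.to_string() for c in candidates)

# At address generation: redraw until the key passes the check.
sk = SigningKey.generate(curve=SECP256k1)
while not key_is_recoverable(sk):
    sk = SigningKey.generate(curve=SECP256k1)
[/code]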
Finally, it seems like the goal of avoiding immense GPU/FPGA/ASIC gains over CPU is somewhat precluded by the scheme itself. By ignoring the nonce (beyond, perhaps, the 64-way parallelism you get "for free" from the masking) and instead iterating over r point values (parallel deterministic selection of k), you can maintain an efficient single-point "wave-front" (and thus minimal memory-bandwidth utilization) against only one block instance in memory, with each hasher simply filling in its own distinct signature over the same nonce. By churning just signature+hash instead of nonce&hash+signature+hash, you eliminate the secondary bottleneck for parallelism and get performance much more comparable to "just" x11 hashing.
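In loop form, the wave-front I mean looks something like this (python-ecdsa's explicit-k signing stands in for a GPU kernel's parallel deterministic k selection, sha256 stands in for x11, and all the names and numbers are mine):

[code]
import hashlib
from ecdsa import SigningKey, SECP256k1

sk = SigningKey.generate(curve=SECP256k1)   # stand-in for the block key
header = b"\x00" * 80                       # the ONE block instance in memory
target = 1 << 235                           # illustrative difficulty

# Hash the header once; from here on, only signature+hash churns.
z = hashlib.sha256(header).digest()

def attempt(k):
    sig = sk.sign_digest(z, k=k)            # distinct r point, same nonce
    final = hashlib.sha256(header + sig).digest()
    return int.from_bytes(final, "big") <= target

# Each parallel hasher would walk its own disjoint slice of k values.
for k in range(1, 1 << 16):
    if attempt(k):
        print("share found at k =", k)
        break
[/code]

Nothing in the inner loop ever touches the header again, which is why I expect the memory-bandwidth story to end up looking much more like plain x11 than the paper hopes.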
This is some rhetoric that I'm somewhat surprised to see here. This sort of statement is something I usually see from "shadier" development camps, not from people attempting any "actual" potential innovation. Most of us doing real work usually *want* to get that work in front of peers as early as we can!
In any case, I wish you the best of luck with your noble endeavor. I hope that I am wrong and that what you are trying to do is not simply impossible.