Hi Skeeter,
I've played around with parallelizing verification (though not as part of Bitcoin Core), and once past 100k blocks I can pin all eight cores of my CPU constantly. So there is definitely a benefit to parallel verification using a fairly large number of cores. There are a few complications, though. Here are my thoughts:
Transactions refer to earlier transactions' outputs, so it is difficult to verify out-of-order unless you can do fast lookups (meaning memory bandwidth is going to wreck your performance). Further, within a block, transactions are allowed to refer to earlier transactions' outputs, but not vice-versa, so to be consensus-correct you need to verify them in order. You can solve this by doing two passes: one to make sure that all transaction references are valid (this can be done in about 30 minutes on my CPU with my crappy Rust code), and another to actually do the signature verifications. However, this is essentially what Bitcoin Core already does for the majority of the chain (full signature verification could take a couple of days), so you wouldn't gain anything over the existing Bitcoin Core behaviour.
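To make the ordering constraint concrete, here is a minimal sketch of that first reference-checking pass. The Tx/OutPoint/TxOut types are hypothetical stand-ins I made up for this email, not rust-bitcoin or Bitcoin Core APIs, and coinbase handling is elided:

    use std::collections::HashMap;

    // Hypothetical stand-in types; real code would use something like
    // rust-bitcoin's Transaction/OutPoint/TxOut instead.
    #[derive(Clone, PartialEq, Eq, Hash)]
    struct OutPoint { txid: [u8; 32], vout: u32 }
    #[derive(Clone)]
    struct TxOut { value: u64 }
    struct Tx { txid: [u8; 32], inputs: Vec<OutPoint>, outputs: Vec<TxOut> }

    // Pass one: a single in-order scan checking that every input refers
    // to an output that already exists. Outputs only become spendable
    // *after* their transaction is processed, so forward references
    // within a block are rejected, matching consensus. (Coinbase
    // handling is elided.)
    fn check_references(txs: &[Tx], utxos: &mut HashMap<OutPoint, TxOut>) -> bool {
        for tx in txs {
            // Every input must already exist in the output set...
            if !tx.inputs.iter().all(|op| utxos.contains_key(op)) {
                return false;
            }
            // ...and is consumed exactly once, so in-block double
            // spends fail too.
            for op in &tx.inputs {
                utxos.remove(op);
            }
            // Only now do this transaction's outputs become spendable.
            for (i, out) in tx.outputs.iter().enumerate() {
                utxos.insert(OutPoint { txid: tx.txid, vout: i as u32 }, out.clone());
            }
        }
        true
    }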
One thing you could do (and what I do in my own code) is to break consensus-compatibility and verify all transactions in each block in parallel. For blocks with hundreds or thousands of transactions there is a big benefit here. The way to do it (a rough sketch in Rust follows the list) is:
1. Scan the block and blindly add all transaction outputs to your output set in a linear pass. This makes sure that backreferences will work, but it will also allow forward references. (It occurred to me while typing this that it's easy to label the outputs with an index into the block, so you can still easily detect forward refs. Oops, I will fix my code to do this.)
2. Split all the transactions in the block into chunks of N_TRANSACTIONS/N_CORES, and pass each chunk to a new thread for verification. Each thread will need to be able to look up old outputs, so you will need to make sure your output set is immutable during this phase.
3. Scan the block and delete all spent outputs from your output set in a linear pass.
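Concretely, here is how I'd structure those three passes, again with the same hypothetical stand-in types (redeclared so the snippet stands alone). Scoped threads are one way to do the chunking in step 2, not the only way, and the index-labelling fix from step 1 is omitted:

    use std::collections::HashMap;
    use std::thread;

    // Hypothetical stand-in types; real code would use something like
    // rust-bitcoin's Transaction/OutPoint/TxOut instead.
    #[derive(Clone, PartialEq, Eq, Hash)]
    struct OutPoint { txid: [u8; 32], vout: u32 }
    #[derive(Clone)]
    struct TxOut { value: u64 }
    struct Tx { txid: [u8; 32], inputs: Vec<OutPoint>, outputs: Vec<TxOut> }

    impl Tx {
        // Placeholder: a real implementation would execute scripts and
        // verify signatures; here we only check the referenced outputs exist.
        fn verify(&self, utxos: &HashMap<OutPoint, TxOut>) -> bool {
            self.inputs.iter().all(|op| utxos.contains_key(op))
        }
    }

    fn verify_block_parallel(
        txs: &[Tx],
        utxos: &mut HashMap<OutPoint, TxOut>,
        n_cores: usize,
    ) -> bool {
        // Pass 1: linear scan blindly adding every output, so that
        // backreferences resolve. (Forward references are allowed too;
        // the index-labelling fix from step 1 is omitted here.)
        for tx in txs {
            for (i, out) in tx.outputs.iter().enumerate() {
                utxos.insert(OutPoint { txid: tx.txid, vout: i as u32 }, out.clone());
            }
        }

        // Pass 2: split into chunks of ~N_TRANSACTIONS/N_CORES and verify
        // on scoped threads. The shared (immutable) borrow here is what
        // guarantees the output set cannot change during this phase.
        let utxos_ro: &HashMap<OutPoint, TxOut> = utxos;
        let chunk_size = (txs.len() / n_cores).max(1);
        let all_valid = thread::scope(|s| {
            let handles: Vec<_> = txs
                .chunks(chunk_size)
                .map(|batch| s.spawn(move || batch.iter().all(|tx| tx.verify(utxos_ro))))
                .collect();
            handles.into_iter().all(|h| h.join().unwrap())
        });

        // Pass 3: linear scan deleting every output this block spent.
        for tx in txs {
            for op in &tx.inputs {
                utxos.remove(op);
            }
        }
        all_valid
    }

A nice side effect of doing it in Rust is that the immutability requirement in step 2 falls out of the type system: the verification threads hold a shared borrow of the output set, so the compiler rejects any mutation during that phase.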
I expect that by using a concurrent data structure for your output set you can merge some or all of these passes. There is room for experimentation and measurement here.
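As one possibility (an untested assumption on my part), a sharded concurrent map such as the dashmap crate, combined with rayon, turns the insert and delete scans into parallel passes as well, reusing the hypothetical types from the sketches above:

    // Assumed Cargo dependencies: dashmap = "5", rayon = "1".
    use dashmap::DashMap;
    use rayon::prelude::*;

    // Same hypothetical stand-in types as in the earlier sketches.
    #[derive(Clone, PartialEq, Eq, Hash)]
    struct OutPoint { txid: [u8; 32], vout: u32 }
    #[derive(Clone)]
    struct TxOut { value: u64 }
    struct Tx { txid: [u8; 32], inputs: Vec<OutPoint>, outputs: Vec<TxOut> }

    // All three passes become parallel iterations over the block: DashMap
    // shards its locks internally, so concurrent inserts, reads and
    // removes are safe without one big mutex around the whole set.
    fn verify_block_merged(txs: &[Tx], utxos: &DashMap<OutPoint, TxOut>) -> bool {
        // Pass 1, parallelized: concurrent inserts instead of a linear scan.
        txs.par_iter().for_each(|tx| {
            for (i, out) in tx.outputs.iter().enumerate() {
                utxos.insert(OutPoint { txid: tx.txid, vout: i as u32 }, out.clone());
            }
        });
        // Pass 2: placeholder verification, as before (scripts/sigs elided).
        let all_valid = txs
            .par_iter()
            .all(|tx| tx.inputs.iter().all(|op| utxos.contains_key(op)));
        // Pass 3, parallelized: concurrent removal of spent outputs.
        txs.par_iter().for_each(|tx| {
            for op in &tx.inputs {
                utxos.remove(op);
            }
        });
        all_valid
    }

Fully fusing the insert pass into the verification pass would take more care, since a verifying thread could race ahead of the thread inserting the output it needs; that is exactly where the measurement comes in.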
Andrew