Perhaps the benchmark methodology could be split into something like
a) online (measuring network transfers as well)
b) offline (measuring validation of data already on disk)
further broken down into
a) real-time: getting live metrics for blocks as they are added (network speed, CPU time required, I/O time required), including a "catch-up mode" where you open the -qt client while one day behind and measure how much CPU or I/O is used per MB of block data, etc. - although that is not 100% accurate due to differences in the way transactions are included in blocks.
b) non-real-time, where you could have something like --bench=height 350000-351000, which would measure I/O and CPU speed while validating a pre-selected range of 1000 blocks that are already stored on your hard disk (a rough sketch of what such a mode might report follows after this list).
Any other metrics that prove useful could also be included.
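To make the offline idea concrete, here is a minimal, hypothetical sketch of the kind of numbers such a mode could report, assuming you point it at block files already on disk (e.g. Bitcoin Core's blkNNNNN.dat files). The "validation" step is just a placeholder checksum, since real block validation needs the full node's code; the point is only the split between I/O time and CPU time.

```cpp
// Sketch of a non-real-time benchmark: wall-clock I/O time vs. CPU time
// over a set of block files already on disk. The processing step is a
// placeholder checksum, NOT real transaction/script validation.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

using Clock = std::chrono::steady_clock;

static double SecondsSince(Clock::time_point start) {
    return std::chrono::duration<double>(Clock::now() - start).count();
}

int main(int argc, char** argv) {
    // Usage: ./blockbench ~/.bitcoin/blocks/blk00123.dat [more files...]
    double io_time = 0.0, cpu_time = 0.0;
    uint64_t total_bytes = 0;

    for (int i = 1; i < argc; ++i) {
        auto t0 = Clock::now();
        std::ifstream f(argv[i], std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(f)),
                               std::istreambuf_iterator<char>());
        io_time += SecondsSince(t0);            // time spent reading from disk

        auto t1 = Clock::now();
        uint64_t checksum = 0;                  // stand-in for CPU-bound validation work
        for (char c : data) checksum += static_cast<unsigned char>(c);
        cpu_time += SecondsSince(t1);           // time spent in processing
        total_bytes += data.size();
        std::printf("%s: %zu bytes (checksum %llu)\n", argv[i], data.size(),
                    static_cast<unsigned long long>(checksum));
    }

    if (total_bytes == 0) {
        std::printf("usage: blockbench <block file> [...]\n");
        return 1;
    }
    double mb = total_bytes / (1024.0 * 1024.0);
    std::printf("I/O: %.3f s (%.1f MB/s)\n", io_time, mb / io_time);
    std::printf("CPU: %.3f s (%.1f MB/s)\n", cpu_time, mb / cpu_time);
    return 0;
}
```

A real --bench mode would of course hook into the node's own validation path instead of a checksum, but even this split shows whether a given machine is disk-bound or CPU-bound for a block range.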
I made a custom build with -march=native, but I don't really have any serious tool to measure whether my binary is faster than the official one or my distribution's binary. I also want to experiment with various GCC versions, the Intel and AMD C compilers, clang, etc., to see which gets the best performance, but I'm lacking benchmarking tools.
Using -O2 seems to get pretty close to the best variant for most things, but maybe some specific vector optimizations could make a noticeable difference.
And without a benchmark we'll never know.
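One crude way to compare toolchains in the meantime is a tiny standalone kernel compiled with each compiler and flag set. This is not Bitcoin code, just an auto-vectorizable loop that tends to expose differences between -O2, -O3 and -march=native across gcc/clang/icc:

```cpp
// Self-contained throughput kernel for comparing compilers and flags.
// Not Bitcoin code: just a vectorizable dot product timed with std::chrono.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;            // ~4M elements
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    volatile float sink = 0.0f;               // keep the result from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];   // auto-vectorizable loop
        sink = acc;
    }
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();

    // 2 floating-point ops per element per repetition
    std::printf("%.2f GFLOP/s (sink=%f)\n", 100.0 * 2.0 * n / s / 1e9, (float)sink);
    return 0;
}
```

Compiling the same file with g++ -O2, then g++ -O3 -march=native, then clang++ or icc, and comparing the printed throughput gives at least a first-order answer, though only benchmarking bitcoind itself on real block data would settle it.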
From what I have seen, RAM, DB slowness, and serialization issues are the main bottlenecks now.
The serial nature of processing causes frequent drops below the ideal of saturated bandwidth, so parallel sync is definitely needed for the fastest performance.
RAM can be traded for CPU with something like ZRAM, which increases how much data fits in memory by compressing it in real time. It's pretty handy. With LZ4 I'm getting 2.5-3x RAM compression ratios and can easily avoid swapping to disk, which is very expensive in terms of I/O time.
In theory Bitcoin could use its own RAM compression scheme, built on an off-the-shelf algorithm, for things like its caching system or other subsystems.
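As a purely illustrative sketch (this is not how Bitcoin Core's cache works today), entries could be packed with an off-the-shelf codec like LZ4 before being kept in RAM, trading CPU for space. The liblz4 calls below are the real API (link with -llz4); the cache framing around them is assumed:

```cpp
// Sketch: compress cache entries with LZ4 so more of them fit in RAM,
// decompressing on access. Illustration only, not Bitcoin Core's actual cache.
#include <lz4.h>
#include <cstdio>
#include <string>
#include <vector>

// Compress a serialized blob before storing it in the in-memory cache.
std::vector<char> CompressEntry(const std::string& raw) {
    std::vector<char> out(LZ4_compressBound(static_cast<int>(raw.size())));
    int written = LZ4_compress_default(raw.data(), out.data(),
                                       static_cast<int>(raw.size()),
                                       static_cast<int>(out.size()));
    out.resize(written > 0 ? written : 0);
    return out;
}

// Decompress on access; the original size must be stored alongside the entry.
std::string DecompressEntry(const std::vector<char>& packed, int original_size) {
    std::string raw(original_size, '\0');
    int got = LZ4_decompress_safe(packed.data(), &raw[0],
                                  static_cast<int>(packed.size()), original_size);
    return got >= 0 ? raw : std::string();
}

int main() {
    // Highly compressible dummy payload standing in for a serialized cache entry.
    std::string entry(100000, 'x');
    std::vector<char> packed = CompressEntry(entry);
    std::printf("raw %zu bytes -> packed %zu bytes (%.1fx)\n",
                entry.size(), packed.size(),
                static_cast<double>(entry.size()) / packed.size());
    std::string back = DecompressEntry(packed, static_cast<int>(entry.size()));
    std::printf("round-trip ok: %s\n", back == entry ? "yes" : "no");
    return 0;
}
```

Real cache data won't compress anywhere near as well as this dummy payload, but the same trade-off applies: a bit of CPU per access in exchange for fitting more entries before spilling to disk.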
The same goes for disk compression, which trades I/O for CPU. I tried putting the blockchain on a compressed BTRFS partition; it was slightly faster in terms of throughput and also saved 10 GB+ of space. From what I remember, Windows 7 also has compressed-folder support, so it should be doable on Windows too.
Serialization is indeed an issue, but if custom compilation can make a serial process get through it 10-20% faster, then it's worth it - until these things are fixed in the code.
For sync, I make sure network bandwidth stays as saturated as possible for as long as possible. This is a shortcut, but a practical one. If the code can't process the data at full speed, then it can be optimized - or at least there's a chance it can be.
Makes sense.