Topic: --bench (?) (Read 1352 times)

legendary
Activity: 1708
Merit: 1049
February 28, 2016, 10:08:27 AM
#14
I think it's -Ofast that can break things; from what I know, -O3 has been used extensively in various situations (even for building the kernel, without issues).

Anyway, I did a real-life benchmark today, as follows:

1st binary = x64 binary from Core
2nd binary = x64 custom binary, built with gcc 5.3.1 and -Ofast -march=native, with secp256k1 built with clang -Ofast -march=native (instead of the default gcc and -O2).

My bitcoin-qt is about two days behind, so I copied the data directory to /test1 and /test2 (the test was performed on a mechanical disk, btw).

Then I timed the syncing of the last two days with each binary, one on /test1 and one on /test2.

From start to inside the wallet: normal binary = 04'41 / custom binary = 04'42
(it was somewhere around block 4002xx)

Until block 400300: normal binary = 06'14 / custom binary = 06'04
Until block 400400: normal binary = 08'44 / custom binary = 08'05
Until block 400420: normal binary = 09'12 / custom binary = 08'32
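
To put the timings above in relative terms, here is a minimal Python sketch that converts the minutes'seconds figures to seconds and prints how much time the custom binary saved at each checkpoint. The helper names (`to_seconds`, `time_saved_percent`) are made up for this illustration; only the timings come from the post.

```python
def to_seconds(stamp: str) -> int:
    """Convert a MM'SS timing, as written in the post, into seconds."""
    minutes, seconds = stamp.split("'")
    return int(minutes) * 60 + int(seconds)


def time_saved_percent(normal: str, custom: str) -> float:
    """Percentage of the normal binary's time saved by the custom binary."""
    n, c = to_seconds(normal), to_seconds(custom)
    return (n - c) / n * 100.0


# (checkpoint, normal binary, custom binary), copied from the post
TIMINGS = [
    ("wallet loaded", "04'41", "04'42"),
    ("block 400300", "06'14", "06'04"),
    ("block 400400", "08'44", "08'05"),
    ("block 400420", "09'12", "08'32"),
]

if __name__ == "__main__":
    for label, normal, custom in TIMINGS:
        print(f"{label:>14}: {time_saved_percent(normal, custom):+.1f}%")
```

These percentages cover the whole run, including the roughly 4'41 of startup that was identical for both binaries, so the saving on the syncing part alone is larger. It is also a single run per binary, so run-to-run noise is unknown.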
legendary
Activity: 1176
Merit: 1134
February 27, 2016, 06:41:11 PM
#13
Profile-based optimizations can lead to giant wins, especially if the compiler chose the wrong default.

To get double-digit gains, you would need to find ways to take full advantage of vectorizing, and that rapidly gets CPU specific. It usually requires refactoring an algorithm to a SIMD model, which could easily make things slower depending on the exact algo.

If you are willing to go that far, then nothing beats hand-tuned assembler using whatever is best for each line of code. It is kind of an extreme analogy to using different compilers for different portions of the project.

If you are going beyond -O2, it is probably a good idea to only do it for the purely algorithmic files and not the networking or other system-oriented ones. There isn't that much time spent there anyway, and the risk of over-aggressive compiler optimizations changing the behavior enough to become incompatible just doesn't seem worth it.

James
legendary
Activity: 1708
Merit: 1049
February 27, 2016, 09:18:59 AM
#12
Quote
Thanks. Very useful info.

If only I could try a similar bench with more bitcoin functions (?)

Quote
I am assuming default uses -O2?

Yep... O2

Quote
I have found that in rare cases, -O3 and others beyond -O2 change behavior of some types of code. Not sure what class of code is broken as I didn't have time to isolate it, especially when -O2 results are so close to the best times.

Yeah, I've heard it sometimes happens, so one must check the integrity of the binary.

For me the best gains usually come from different compilers altogether. I remember back in the CPU mining days of darkcoin I had created a CPU miner that, through manual makefile tampering, combined 3-4 different compilers for different hashing steps to max out performance. I don't remember how fast it was, probably somewhere in the 10% range compared to the best single-compiler benchmark, but still, it's 10% out of nowhere.

I might give it a go and create a bitcoin-qt with secp256k1 built with clang and the rest with gcc.

Btw, there's probably more time to be found beyond -O2/-O3/-Ofast by profiling a test run of the code and then rebuilding with the generated profile. ICC was very good at that several years ago, but I haven't tried since. GCC also supports it, I think. If I'm not mistaken, Mozilla does Firefox builds with profile optimizations.
full member
Activity: 182
Merit: 107
February 27, 2016, 01:02:13 AM
#11
Quote
The problem is how to come up with a sensible benchmark.

For example the block & transaction storage was obsessively benchmarked with the time it took for the initial load to a certain block height. The end result is that it is over-tuned for that operation but performs rather badly for the regular everyday "add one block worth of transactions to the indices" operations. The original BerkeleyDB code was capable of finishing that task faster than the current LevelDB code. The LevelDB code gets bogged down in premature reorganizations of its storage levels that will be redone in a few more blocks, without amortizing the cost.

It is kinda like a file system benchmarked only for chkdsk/fsck and ignoring most of the everyday operations with frequent mounting and unmounting between relatively rare integrity checks.

Like all those web sites optimizing their images, which can be cached by the browser, while serving gobs and gobs and gobs of inline JavaScript that can't be cached by the browser because it is inline in a dynamic page.

The stupidity of many webmasters drives me nuts.

But it's the same concept. They read that smaller images mean faster page loads but don't understand what really makes their pages slow, and it is rarely the images being 12% bigger than the reduced size. It's the effing render-blocking JavaScript and the gobs and gobs of inline JavaScript.
legendary
Activity: 1176
Merit: 1134
February 26, 2016, 07:26:10 PM
#10
Thanks. Very useful info. I am assuming the default uses -O2?

I have found that in rare cases, -O3 and other levels beyond -O2 change the behavior of some types of code. I am not sure what class of code is broken, as I didn't have time to isolate it, especially when the -O2 results are so close to the best times.

I think some optimization modes apply very aggressive control-flow optimizations that might change the order in time of some side effects in the code. I think it might have been networking code that was affected, so pure algorithmic code without any external side effects should be fine with whatever valid optimizations. But of course, output built with new -O settings should be fully validated again, without assuming the code behavior is unchanged.

James
legendary
Activity: 1708
Merit: 1049
February 26, 2016, 01:13:56 PM
#9
I went to the

https://github.com/bitcoin/bitcoin/tree/master/src/secp256k1

directory and ran ./configure --enable-benchmark.

Anyway, I did some measurements on bench_internal with gcc default, gcc -O3 -march=native and clang -O3 -march=native.



I also tried an old version of Intel's ICC (12.1), but it failed to compile.

bench_sign / bench_verify were the same (123 & 187 us) in all configurations (asm?)

System is a quad-core Q8200 running at 2.4 GHz.

edit: Updated with more
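
For repeating this kind of comparison across compilers, a small timing wrapper can help. The following is only a sketch in Python, not anything shipped with secp256k1: `time_command` and `summarize` are hypothetical helper names, and the command to run (for example a bench_internal binary in the current directory) is an assumption you would supply yourself.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; check=True raises if the command fails.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Placeholder: with no arguments, time an empty Python process.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    stats = summarize(time_command(cmd))
    print(" ".join(f"{key}={value:.3f}s" for key, value in stats.items()))
```

Saved under any name you like, it could be run once per build with the benchmark binary as the argument, comparing the medians. Wall-clock time includes process startup, so this only makes sense for runs that last much longer than that.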

legendary
Activity: 2128
Merit: 1074
February 24, 2016, 03:08:31 PM
#8
You guys have the wrong methodology.

You think that one could just add a "--bench" option to the Apache web server and have a sensible benchmarking setup.

The correct methodology for benchmarking bitcoind is very similar to web server benchmarking: you need to write a load generator and come up with a representative data set exercising the client-server combination.

Other than that you could just micro-benchmark some fragmentary code paths in bitcoind.

For a while the bootstrap.dat torrent published by jgarzik was a decent data set for loading in isolation. But apparently it predates the spam attacks on Bitcoin, so it ceased to be representative even for the isolated bitcoind that isn't serving any clients.
legendary
Activity: 1708
Merit: 1049
February 24, 2016, 12:44:38 PM
#7
It depends on the algorithm. LZ4 can do GBytes/s...

edit: https://github.com/Cyan4973/lz4
scroll down to "Benchmarks"... quoted numbers are for single thread @ 1.9 GHz. So a modern system with 4 cores, better architecture, higher mem throughput and higher clockspeeds could probably do 5-8x.
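
Whether a compressor keeps up with a given stream rate is easy to measure on your own machine. Here is a minimal Python sketch; note that it uses zlib from the standard library as a stand-in, not LZ4 (which is not in the standard library), and zlib is generally much slower than LZ4, so treat the throughput as pessimistic. The sample data is synthetic and made up for the illustration; real block files will compress differently.

```python
import os
import time
import zlib


def measure_compression(data: bytes, level: int = 1):
    """Compress data once; return (compression ratio, input MB/s)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    assert zlib.decompress(compressed) == data  # sanity check: round trip is lossless
    ratio = len(data) / len(compressed)
    mb_per_s = len(data) / 1e6 / elapsed
    return ratio, mb_per_s


def synthetic_sample(size: int = 8_000_000) -> bytes:
    """Made-up test data: half repetitive bytes, half random bytes."""
    half = size // 2
    return (b"bitcoin" * (half // 7 + 1))[:half] + os.urandom(size - half)


if __name__ == "__main__":
    ratio, speed = measure_compression(synthetic_sample())
    verdict = "keeps up with" if speed >= 100 else "falls behind"
    print(f"ratio {ratio:.2f}x at {speed:.0f} MB/s: {verdict} a 100 MB/s stream")
```

This is single-threaded and measures compression only, so it says nothing about how the compressor would interact with disk or network I/O during an actual sync.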
legendary
Activity: 1176
Merit: 1134
February 24, 2016, 11:22:12 AM
#6
squashfs cut the size of the "DB" files from 25GB to ~15GB

so that automatically takes advantage of disk compression without slowing down high-bandwidth disk usage where the HDD is the bottleneck.

When streaming at 100MB/sec, there is no time for a compression stage. There isn't even time for mallocs. The compression needs to happen after all the final files are created, and it takes about the same half hour that the sync takes.

James
legendary
Activity: 1708
Merit: 1049
February 24, 2016, 11:04:21 AM
#5
Quote
Using -O2 seems to get pretty close to the best variant with most things, but maybe some specific vector optimizations could make a noticeable difference.

And without a benchmark we'll never know.

Quote
From what I have seen, RAM, DB slowness and serialization issues are the main bottlenecks now.
+
The serial nature creates a lot of dropouts from the saturated-bandwidth ideal, so parallel sync is definitely needed for fastest performance.

RAM can be traded for CPU with something like zram, where you increase the amount of data that fits into RAM by compressing it in real time. It's pretty handy. With LZ4 I'm getting 2.5-3x RAM compression ratios and can easily avoid swapping to disk, which is very expensive in terms of I/O time.

In theory Bitcoin could use its own RAM compression scheme, with an off-the-shelf algorithm, for things like its caching system or other subsystems.

Same with disk compression, which trades CPU for I/O. I tried putting the blockchain on a compressed BTRFS partition: it was slightly faster in terms of throughput and also saved 10 GB+ in size. From what I remember Windows 7 also has compressed folder support, so it should be doable on Windows too.

Serialization is indeed an issue, but if you can get a serial process to get on with it 10-20% faster due to custom compilation, then it's worth it - until these things are fixed in the code.

Quote
For sync, I make sure network bandwidth is as saturated as possible for as long as possible. This is a shortcut, but practical. If the code can't process the data at full speed, then it can be optimized. Well, at least there is a chance to.

Makes sense.
legendary
Activity: 1176
Merit: 1134
February 24, 2016, 10:47:11 AM
#4
Using -O2 seems to get pretty close to the best variant with most things, but maybe some specific vector optimizations could make a noticeable difference. Though without any compiler optimizations, truly horrible performance is possible.

From what I have seen, RAM, DB slowness and serialization issues are the main bottlenecks now.

For sync, I make sure network bandwidth is as saturated as possible for as long as possible. This is a shortcut, but practical. If the code can't process the data at full speed, then it can be optimized. Well, at least there is a chance to.

The serial nature creates a lot of dropouts from the saturated-bandwidth ideal, so parallel sync is definitely needed for fastest performance.

James
legendary
Activity: 1708
Merit: 1049
February 24, 2016, 09:55:07 AM
#3
Perhaps the benchmark methodology could be like

a) online (measuring network transfers as well)
b) offline (measuring validation of the data on disk)

further broken down to

a) real-time (getting real-life metrics of blocks as they are added, in terms of network speed, CPU time required and I/O time required), including a "catch-up mode" where you open the -qt client while you are 1 day behind and measure how much CPU or I/O is used per MB of block data etc. - although that is not 100% accurate due to differences in the way txs are included in blocks.

b) non-real-time, where you could have something like --bench=height 350000-351000 to measure I/O and CPU speeds for validating a pre-selected range of 1000 blocks that are already stored on your hard disk.

...Any alterations on metrics that are useful could also be included.

I made a custom build with -march=native flags and I don't really have any serious tool to measure whether my binary is faster than the official one, or my distribution's binary. I also want to experiment with various GCC versions, Intel and AMD C compilers, clang etc. to see which gets the best performance, but I'm lacking benchmarking tools.
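
As an illustration of what the offline mode might look like, here is a purely hypothetical Python sketch. No such option exists in bitcoind or bitcoin-qt: the flag name `--bench-height`, the helpers `parse_height_range` and `run_bench`, and the `validate_block` callback are all invented here. The range is treated as half-open, so 350000-351000 covers exactly 1000 blocks; that is an assumption, not something the proposal specifies.

```python
import argparse
import time


def parse_height_range(text):
    """Parse 'START-END' into (start, end); end is exclusive (assumption)."""
    start_text, sep, end_text = text.partition("-")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected START-END, got {text!r}")
    if not sep or start < 0 or end <= start:
        raise argparse.ArgumentTypeError(f"invalid height range {text!r}")
    return start, end


def run_bench(heights, validate_block):
    """Call validate_block(height) for each height; report CPU and wall time."""
    start, end = heights
    cpu0, wall0 = time.process_time(), time.perf_counter()
    for height in range(start, end):
        validate_block(height)
    return {
        "blocks": end - start,
        "cpu_s": time.process_time() - cpu0,
        "wall_s": time.perf_counter() - wall0,
    }


def main(argv=None):
    parser = argparse.ArgumentParser(description="hypothetical offline bench")
    parser.add_argument("--bench-height", type=parse_height_range,
                        required=True, metavar="START-END")
    args = parser.parse_args(argv)
    # Stand-in for real validation of a block already stored on disk.
    print(run_bench(args.bench_height, validate_block=lambda height: None))


if __name__ == "__main__":
    main()
```

The gap between wall time and CPU time gives only a rough hint of time spent waiting on I/O; separating the two properly would need instrumentation inside the validation code itself.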
legendary
Activity: 2128
Merit: 1074
February 12, 2016, 12:57:27 PM
#2
The problem is how to come up with a sensible benchmark.

For example the block & transaction storage was obsessively benchmarked with the time it took for the initial load to a certain block height. The end result is that it is over-tuned for that operation but performs rather badly for the regular everyday "add one block worth of transactions to the indices" operations. The original BerkeleyDB code was capable of finishing that task faster than the current LevelDB code. The LevelDB code gets bogged down in premature reorganizations of its storage levels that will be redone in a few more blocks, without amortizing the cost.

It is kinda like a file system benchmarked only for chkdsk/fsck and ignoring most of the everyday operations with frequent mounting and unmounting between relatively rare integrity checks.
legendary
Activity: 1708
Merit: 1049
February 12, 2016, 11:54:20 AM
#1
Is it possible to have a --bench option (perhaps even a GUI frontend in the Qt client) for the (non-mining) CPU-intensive tasks, in order to evaluate custom builds, software improvements and regressions as the software evolves? Performance bugs and cross-platform performance issues would also be spotted faster.

It would also allow hardware sites to include Bitcoin-related benchmarks, and help people choose the right hardware for running a full node.