As someone who has 30 years of experience plus a BS in CS and CE, and an MS in CS (from top 10 US CS/CE programs), this kind of language isn't the way to (a) make your point, or (b) get anyone to listen to you with any degree of respect.
In open source projects, if you have something like your --i and ++i change, open a pull request or at minimum link to the specific code you are talking about. Most well-written, non-student compilers handle cases like that: there will be no difference in the generated code between things like ++i and i++, except perhaps for a class that overloads the operator in some extremely obscure way. But, as I said, if it is that easy, please point out what you are talking about.
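For the curious, here is a minimal sketch of that exception (the FatIter class below is hypothetical, purely to show where pre- and post-increment can legitimately diverge):

// For a plain int, any optimizing compiler emits identical code for
// ++i and i++ in a loop header, because the expression's value is unused.
for (int i = 0; i < 10; ++i) { /* something */ }  // same codegen
for (int i = 0; i < 10; i++) { /* something */ }  // same codegen

// The obscure case is a class type: post-increment must return the old
// value, so it makes a copy the optimizer may not be able to remove.
struct FatIter {
    int v;
    FatIter& operator++()    { ++v; return *this; }                    // pre: in place
    FatIter  operator++(int) { FatIter old = *this; ++v; return old; } // post: copies
};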
If greg wants to be treated with respect, he shouldn't begin and end a reply with insults.
This --i and ++i is basic stuff and you want to argue about it? wtf have you been doing for the past 30 years?
And it's not just the speed; it's the smaller byte code, which lets you pack more code into the tiny L0 instruction cache and reduces cache misses (a miss still costs you 4 cycles when you re-fetch from L1 into L0).
It also means you can fit more code into that tiny 32 KB L1 instruction cache, so your other loops/threads can run faster by not being kicked out of the cache by other code. It also saves power on embedded systems.
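If anyone wants to check the size claim rather than take it on faith, here is a rough sketch (sum_up/sum_down are made-up names; compile each and compare the encoded bytes with objdump -d, or with cl /FAsc on MSVC):

// Two loop shapes to compare for encoded size and branch count.
int sum_up(const int *a) {          // counts up: needs a cmp against the limit
    int s = 0;
    for (int i = 0; i < 10; i++) s += a[i];
    return s;
}
int sum_down(const int *a) {        // counts down: the sub sets the flags itself
    int s = 0, i = 10;
    do { s += a[i - 1]; } while (--i);
    return s;
}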
This is what I was talking about: the world is flooded with "experts" with "30 years experience" and "50 alphabet soup titles" who still have absolutely no idea wtf actually happens inside a CPU.
Only talentless coders talk about credentials instead of the code.
This is not some super advanced stuff, this is entry level knowledge that's not even up for debate.
The information is everywhere; this took 1 second to find, look:
Which loop has better performance? Increment or decrement?
What your teacher said was an oblique statement without much clarification. It is NOT that decrementing is faster than incrementing, but that you can create a much, much faster loop with decrement than with increment.
int i;
for (i = 0; i < 10; i++){
    //something here
}
After compilation (without optimisation), the compiled version may look like this (VS2015):
-------- C7 45 B0 00 00 00 00  mov   dword ptr [i],0
-------- EB 09                 jmp   labelB
labelA   8B 45 B0              mov   eax,dword ptr [i]
-------- 83 C0 01              add   eax,1
-------- 89 45 B0              mov   dword ptr [i],eax
labelB   83 7D B0 0A           cmp   dword ptr [i],0Ah
-------- 7D 02                 jge   out1
-------- EB EF                 jmp   labelA
The whole loop is 8 instructions (26 bytes). Inside the loop itself there are actually 6 instructions (17 bytes) with 2 branches. Yes, yes, I know it can be done better (it's just an example).
Now consider this frequent construct, which you will often find written by embedded developers:
i = 10;
do{
    //something here
} while (--i);
It also iterates 10 times (yes, I know the value of i is different compared with the for loop shown, but we care about the iteration count here). This may be compiled into this:
00074EBC C7 45 B0 0A 00 00 00  mov   dword ptr [i],0Ah
00074EC3 8B 45 B0              mov   eax,dword ptr [i]
00074EC6 83 E8 01              sub   eax,1
00074EC9 89 45 B0              mov   dword ptr [i],eax
00074ECC 75 F5                 jne   main+0C3h (074EC3h)
5 instructions (18 bytes) and just one branch. Actually, there are 4 instructions in the loop (11 bytes).
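A side note that is not from the quoted answer: if the loop body still needs an ascending index, you can keep the compare-with-zero exit and recover the index from the counter, e.g.:

unsigned i = 10;
do {
    process(10 - i);  // process() is a placeholder; 10-i visits 0..9 ascending
} while (--i);        // one sub + one jne per iteration, no separate cmp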
The best thing is that some CPUs (x86/x64 compatibles included) have an instruction that decrements a register, compares the result with zero, and branches if the result is non-zero. Virtually ALL PC CPUs implement this instruction. Using it, the loop itself is actually just one (yes, one) 2-byte instruction:
00144ECE B9 0A 00 00 00 mov ecx,0Ah
label:
// something here
00144ED3 E2 FE loop label (0144ED3h) // decrement ecx and jump to label if not zero
Do I have to explain which is faster?
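If you would rather measure than argue, here is a rough harness (the names are mine, and results vary a lot: at -O2 a modern compiler will often rewrite both shapes into the same machine code, so compare the disassembly too):

#include <chrono>
#include <cstdio>

volatile int sink;   // volatile stores keep the loops from being deleted

int main() {
    using clk = std::chrono::steady_clock;
    const int N = 100000000;

    auto t0 = clk::now();
    for (int i = 0; i < N; i++) sink = i;   // counting up
    auto t1 = clk::now();

    int i = N;
    do { sink = i; } while (--i);           // counting down
    auto t2 = clk::now();

    auto ns = [](auto d) {
        return (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    std::printf("up:   %lld ns\n", ns(t1 - t0));
    std::printf("down: %lld ns\n", ns(t2 - t1));
}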
Here is more on the L0 / uop instruction cache:
Sandy Bridge made tremendous strides in improving the front-end and ensuring the smooth delivery of uops to the rest of the pipeline. The biggest improvement was a uop cache that essentially acts as an L0 instruction cache, but contains fixed length decoded uops. The uop cache is virtually addressed and included in the L1 instruction cache. Hitting in the uop cache has several benefits, including reducing the pipeline length by eliminating power hungry instruction decoding stages and enabling an effective throughput of 32B of instructions per cycle. For newer SIMD instructions, the 16B fetch limit was problematic, so the uop cache synergizes nicely with extensions such as AVX.
The Haswell uop cache is the same size and organization as in Sandy Bridge. The uop cache lines hold up to 6 uops, and the cache is organized into 32 sets of 8 cache lines (i.e., 8-way associative). A 32B window of fetched x86 instructions can map to 3 lines within a single way. Hits in the uop cache can deliver 4 uops/cycle and those 4 uops can correspond to 32B of instructions, whereas the traditional front-end cannot process more than 16B/cycle. For performance, the uop cache can hold microcoded instructions as a pointer to microcode, but partial hits are not supported. As with the instruction cache, the decoded uop cache is shared by the active threads.
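To tie this back to the loop-size argument: on Linux you can watch which front-end path feeds a hot loop with perf (a sketch; idq.dsb_uops and idq.mite_uops are Intel-specific event names and vary by generation):

// spin.cpp -- a tiny hot loop to observe the uop-cache (DSB) delivery path.
volatile long sink;
int main() {
    for (long i = 1000000000L; i; --i) sink = i;
}
// Hypothetical session:
//   g++ -O2 spin.cpp -o spin
//   perf stat -e idq.dsb_uops,idq.mite_uops ./spin
// A loop served from the uop cache reports most uops under idq.dsb_uops;
// a loop too big for it falls back to the legacy decoders (idq.mite_uops).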