As someone who has 30 years of experience plus a BS in CS and CE, and an MS in CS (from top 10 US CS/CE programs), this kind of language isn't the way to (a) make your point, or (b) get anyone to listen to you with any degree of respect.
In open source projects, if you have something like your --i and ++i change, open a pull request or at minimum link to the specific code you are talking about. Most well-written, non-student compilers handle cases like that: there will be no difference in the generated code between things like ++i and i++, except perhaps for a class that overloads the operator in some extremely obscure way. But, as I said, if it is that easy, please point out what you are talking about.
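For the curious, here is a minimal sketch of that exception (the FatIter class below is hypothetical, purely to show where pre- and post-increment can legitimately diverge):

// For a plain int, any optimizing compiler emits identical code for
// ++i and i++ in a loop header, because the expression's value is unused.
for (int i = 0; i < 10; ++i) { /* something */ }  // same codegen
for (int i = 0; i < 10; i++) { /* something */ }  // same codegen

// The obscure case is a class type: post-increment must return the old
// value, so it makes a copy the optimizer may not be able to remove.
struct FatIter {
    int v;
    FatIter& operator++()    { ++v; return *this; }                    // pre: in place
    FatIter  operator++(int) { FatIter old = *this; ++v; return old; } // post: copies
};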
If greg wants to be treated with respect, he shouldn't begin and end a reply with insults.
This --i and ++i is basic stuff and you want to argue about it? wtf have you been doing for the past 30 years?
And it's not just the speed; it's the smaller byte code, which lets you pack more code into the tiny L0 instruction cache and reduces cache misses (a miss still costs you 4 cycles when you re-fetch from L1 into L0).
It also means you can fit more code into that tiny 32 KB L1 instruction cache, so your other loops/threads can run faster by not being kicked out of the cache by other code. It also saves power on embedded systems.
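If anyone wants to check the size claim rather than take it on faith, here is a rough sketch (sum_up/sum_down are made-up names; compile each and compare the encoded bytes with objdump -d, or with cl /FAsc on MSVC):

// Two loop shapes to compare for encoded size and branch count.
int sum_up(const int *a) {          // counts up: needs a cmp against the limit
    int s = 0;
    for (int i = 0; i < 10; i++) s += a[i];
    return s;
}
int sum_down(const int *a) {        // counts down: the sub sets the flags itself
    int s = 0, i = 10;
    do { s += a[i - 1]; } while (--i);
    return s;
}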
This is what I was talking about: the world is flooded with "experts" with "30 years experience" and "50 alphabet soup titles" who still have absolutely no idea wtf actually happens inside a CPU.
Only talentless coders talk about credentials instead of the code.
This is not some super advanced stuff, this is entry level knowledge that's not even up for debate.
The information is everywhere; this took 1 second to find, look:
Which loop has better performance? Increment or decrement?
What your teacher said was an oblique statement without much clarification. It is NOT that decrementing is faster than incrementing, but that you can create a much, much faster loop with decrement than with increment.
int i;
for (i = 0; i < 10; i++){
    //something here
}
After compilation (without optimisation), the compiled version may look like this (VS2015):
-------- C7 45 B0 00 00 00 00  mov   dword ptr [i],0
-------- EB 09                 jmp   labelB
labelA   8B 45 B0              mov   eax,dword ptr [i]
-------- 83 C0 01              add   eax,1
-------- 89 45 B0              mov   dword ptr [i],eax
labelB   83 7D B0 0A           cmp   dword ptr [i],0Ah
-------- 7D 02                 jge   out1
-------- EB EF                 jmp   labelA
The whole loop is 8 instructions (26 bytes). Inside the loop itself there are actually 6 instructions (17 bytes) with 2 branches. Yes, yes, I know it can be done better (it's just an example).
Now consider this frequent construct, which you will often find written by embedded developers:
i = 10;
do{
    //something here
} while (--i);
It also iterates 10 times (yes, I know the value of i is different compared with the for loop shown, but we care about the iteration count here). This may be compiled into this:
00074EBC C7 45 B0 0A 00 00 00  mov   dword ptr [i],0Ah
00074EC3 8B 45 B0              mov   eax,dword ptr [i]
00074EC6 83 E8 01              sub   eax,1
00074EC9 89 45 B0              mov   dword ptr [i],eax
00074ECC 75 F5                 jne   main+0C3h (074EC3h)
5 instructions (18 bytes) and just one branch. Actually, there are 4 instructions in the loop (11 bytes).
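A side note that is not from the quoted answer: if the loop body still needs an ascending index, you can keep the compare-with-zero exit and recover the index from the counter, e.g.:

unsigned i = 10;
do {
    process(10 - i);  // process() is a placeholder; 10-i visits 0..9 ascending
} while (--i);        // one sub + one jne per iteration, no separate cmp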
The best thing is that some CPUs (x86/x64 compatibles included) have an instruction that decrements a register, compares the result with zero, and branches if the result is non-zero. Virtually ALL PC CPUs implement this instruction. Using it, the loop itself is actually just one (yes, one) 2-byte instruction:
00144ECE B9 0A 00 00 00 mov ecx,0Ah
label:
// something here
00144ED3 E2 FE loop label (0144ED3h) // decrement ecx and jump to label if not zero
Do I have to explain which is faster?
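If you would rather measure than argue, here is a rough harness (the names are mine, and results vary a lot: at -O2 a modern compiler will often rewrite both shapes into the same machine code, so compare the disassembly too):

#include <chrono>
#include <cstdio>

volatile int sink;   // volatile stores keep the loops from being deleted

int main() {
    using clk = std::chrono::steady_clock;
    const int N = 100000000;

    auto t0 = clk::now();
    for (int i = 0; i < N; i++) sink = i;   // counting up
    auto t1 = clk::now();

    int i = N;
    do { sink = i; } while (--i);           // counting down
    auto t2 = clk::now();

    auto ns = [](auto d) {
        return (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    std::printf("up:   %lld ns\n", ns(t1 - t0));
    std::printf("down: %lld ns\n", ns(t2 - t1));
}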
Here is more on the L0 / uop instruction cache:
Sandy Bridge made tremendous strides in improving the front-end and ensuring the smooth delivery of uops to the rest of the pipeline. The biggest improvement was a uop cache that essentially acts as an L0 instruction cache, but contains fixed length decoded uops. The uop cache is virtually addressed and included in the L1 instruction cache. Hitting in the uop cache has several benefits, including reducing the pipeline length by eliminating power hungry instruction decoding stages and enabling an effective throughput of 32B of instructions per cycle. For newer SIMD instructions, the 16B fetch limit was problematic, so the uop cache synergizes nicely with extensions such as AVX.
The Haswell uop cache is the same size and organization as in Sandy Bridge. The uop cache lines hold up to 6 uops, and the cache is organized into 32 sets of 8 cache lines (i.e., 8-way associative). A 32B window of fetched x86 instructions can map to 3 lines within a single way. Hits in the uop cache can deliver 4 uops/cycle and those 4 uops can correspond to 32B of instructions, whereas the traditional front-end cannot process more than 16B/cycle. For performance, the uop cache can hold microcoded instructions as a pointer to microcode, but partial hits are not supported. As with the instruction cache, the decoded uop cache is shared by the active threads.
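To tie this back to the loop-size argument: on Linux you can watch which front-end path feeds a hot loop with perf (a sketch; idq.dsb_uops and idq.mite_uops are Intel-specific event names and vary by generation):

// spin.cpp -- a tiny hot loop to observe the uop-cache (DSB) delivery path.
volatile long sink;
int main() {
    for (long i = 1000000000L; i; --i) sink = i;
}
// Hypothetical session:
//   g++ -O2 spin.cpp -o spin
//   perf stat -e idq.dsb_uops,idq.mite_uops ./spin
// A loop served from the uop cache reports most uops under idq.dsb_uops;
// a loop too big for it falls back to the legacy decoders (idq.mite_uops).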