Yeah, what I described is clearly Step 3 or later. Intel also seems to have finally sold a "Step 3" type of device in the Phi, depending on what it can actually do.
Intel seems more interested in incorporating the CPU into the GPU, while AMD is incorporating the GPU into the CPU. Totally different mindsets/endgames/results.
They both want branch/loop-happy, highly parallel computation. The Radeon's biggest "problem" (and I'm using the term loosely) is that wavefronts are run in lockstep: when lanes disagree on a branch, the whole wavefront sits through both sides, with the lanes not on the current side effectively executing no-ops, and loops whose lengths are set at runtime (instead of statically at compile time) are just as nasty.
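To put rough numbers on the lockstep penalty, here's a toy cost model (my own simplification, not how any real Radeon accounts for cycles). It assumes a 64-lane wavefront, that a divergent branch costs the sum of both sides, and that a loop runs as long as its slowest lane. The function names and instruction counts are made up for illustration.

```python
# Toy model of lockstep (SIMD-style) execution cost for one wavefront.
# Assumption: every lane shares one program counter, so the group pays
# for every path that at least one lane needs.

WAVEFRONT_SIZE = 64  # assumed lane count, for illustration only


def branch_cost(takes_if, if_len, else_len):
    """Instruction slots the whole group spends on one if/else.

    takes_if: one bool per lane, True if that lane takes the 'if' side.
    """
    cost = 0
    if any(takes_if):        # at least one lane needs the 'if' side
        cost += if_len
    if not all(takes_if):    # at least one lane needs the 'else' side
        cost += else_len
    return cost


def loop_cost(trip_counts, body_len):
    """Slots spent on a loop whose trip count differs per lane.

    The group keeps iterating until the slowest lane is done;
    lanes that finished early just idle.
    """
    return max(trip_counts, default=0) * body_len


# Every lane agrees: only the 'if' side is paid for.
uniform = branch_cost([True] * WAVEFRONT_SIZE, if_len=10, else_len=30)

# Lanes alternate: the group pays for both sides back to back.
divergent = branch_cost(
    [lane % 2 == 0 for lane in range(WAVEFRONT_SIZE)], if_len=10, else_len=30
)

# 63 lanes loop once, one lane loops 100 times: everyone waits for it.
runtime_loop = loop_cost([1] * (WAVEFRONT_SIZE - 1) + [100], body_len=5)

print(uniform, divergent, runtime_loop)  # prints: 10 40 500
```

In this model a single straggler lane makes the whole wavefront pay 500 slots for work that 63 of the 64 lanes finished in 5, which is the kind of thing that makes runtime-length loops hurt.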
CPUs, otoh, can't do highly parallel calculations because all the hardware dedicated to branching, branch prediction, cache prediction, etc. takes up a lot of room, produces a lot of heat, and uses a lot of power. I wonder how much stuff Intel removed to put 50 cores on a card.