I did incorporate that change into my kernel. It turns out that even though my hardware reports 4 as the preferred vector width, the kernel is faster with 2. I assume many people have seen the same, so I've made the default 2 whenever the hardware reports a preferred vector width larger than 1.
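A minimal sketch of that default, in plain C on the host side. The function name choose_vector_width is made up for illustration and is not taken from the actual kernel code. I'm assuming the reported width comes from something like clGetDeviceInfo with CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, but the query itself is left out so the snippet builds without an OpenCL SDK.

```c
/* Hypothetical helper (name is an assumption, not from the original code):
 * map the device's reported preferred vector width to the width the
 * kernel is actually built with. */
unsigned choose_vector_width(unsigned preferred)
{
    /* Anything wider than scalar is capped at 2, since 2 turned out
     * faster than the reported 4. A scalar device (or a reported 0)
     * stays scalar. */
    return preferred > 1u ? 2u : 1u;
}
```

The result would typically be passed to the kernel build as a compile-time define, with an explicit user setting still able to override it.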
The cause is the high GPR usage of the wider vectors: it is high enough to offset the poorer ALUPacking that uint2 gets compared with uint4. In fact I found 3-component vectors to work best, and they should be supported by the OpenCL 1.1 standard, but the OpenCL compiler is buggy and generates bad code with uint3. Interleaving uint2 and uint operations works, though.
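To show what the uint2 plus uint workaround looks like, here is a rough plain-C model. Everything in it is an assumption made for illustration: u32x2 stands in for OpenCL's uint2, lanes3 and lanes3_mix are invented names, and the add-then-rotate step is only a placeholder for whatever the real kernel computes. The point is just the shape: each step is written once for the 2-wide part and once for the scalar, so three independent values are processed without ever using uint3.

```c
#include <stdint.h>

/* Plain-C stand-in for OpenCL's uint2 (assumption for illustration). */
typedef struct { uint32_t x, y; } u32x2;

/* Three independent lanes: a 2-wide part plus one scalar. */
typedef struct { u32x2 v; uint32_t s; } lanes3;

/* Rotate left by n bits; n must be in 1..31. */
static uint32_t rotl32(uint32_t a, unsigned n)
{
    return (a << n) | (a >> (32u - n));
}

/* Placeholder step: add, then rotate. The 2-wide work and the scalar
 * work are interleaved statement by statement. */
lanes3 lanes3_mix(lanes3 a, lanes3 b, unsigned n)
{
    lanes3 r;
    /* 2-wide part (would be a single uint2 operation in OpenCL C) */
    r.v.x = rotl32(a.v.x + b.v.x, n);
    r.v.y = rotl32(a.v.y + b.v.y, n);
    /* scalar part, issued right after the matching 2-wide step */
    r.s = rotl32(a.s + b.s, n);
    return r;
}
```

In a real kernel the two x/y statements collapse into one uint2 expression, and the scalar statement follows it for every step of the computation.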