The Salsa20 function is a long string of data-dependent 4x32-bit vector integer operations (i.e. the output of one operation is used as the input to the next).
And the execution latencies for the most-used instructions in the Salsa20 core (shift right/left by immediate, add, xor) are all 2 clocks on K8/K10 and all 1 clock on Atom/Core/Core2/Nehalem/Sandy Bridge.
End result: SSE2 Salsa20 needs roughly twice the clocks per round on AMD compared to any modern Intel.
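
To make the dependency chain concrete, here is a minimal scalar sketch of one Salsa20 quarter-round in C, plus a very crude latency estimate. The quarter-round follows the published Salsa20 definition. The helper est_cycles_per_round and its "4 dependent ops per step" figure are my own simplifying assumptions (add, shift, xor, xor, with the two shifts overlapping, and shuffles and register copies ignored), not measured numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). SSE2 has no packed
   rotate, so a vector version needs two shifts plus an OR/XOR instead. */
uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

/* One Salsa20 quarter-round on scalar words. An SSE2 version would
   typically run four of these side by side, one per 32-bit lane. */
void quarterround(uint32_t y[4]) {
    y[1] ^= rotl32(y[0] + y[3], 7);   /* uses only the inputs         */
    y[2] ^= rotl32(y[1] + y[0], 9);   /* needs the y[1] just computed */
    y[3] ^= rotl32(y[2] + y[1], 13);  /* needs the y[2] just computed */
    y[0] ^= rotl32(y[3] + y[2], 18);  /* needs the y[3] just computed */
}

/* Hypothetical latency model (an assumption, not a measurement):
   4 steps per round, each a chain of add -> shift -> xor -> xor,
   so about 16 dependent vector ops per round. */
unsigned est_cycles_per_round(unsigned op_latency) {
    const unsigned steps_per_round = 4;
    const unsigned dependent_ops_per_step = 4;
    return steps_per_round * dependent_ops_per_step * op_latency;
}

/* Self-check against the quarterround example in the Salsa20 spec. */
void quarterround_selfcheck(void) {
    uint32_t y[4] = {0x00000001, 0, 0, 0};
    quarterround(y);
    assert(y[0] == 0x08008145 && y[1] == 0x00000080);
    assert(y[2] == 0x00010200 && y[3] == 0x20500000);
}
```

Since every step waits on the previous one, the CPU has little independent work to overlap, and instruction latency rather than throughput sets the pace. Under this rough model a 2-clock latency gives about 32 clocks per round and a 1-clock latency about 16, which is where the factor of two comes from.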
Thank you for your insight, ArtForz!
Yes, I think I have read somewhere that, since the Core architecture, Intel CPUs can actually process a full 128-bit SSE register at a time.
I have never been too fond of Intel, but it's nice to see that sometimes you get what you pay for!