From the reviews, Bulldozer seemed capable with heavily multithreaded apps in some cases, and it boasts much more cache than the anemic cache on the Phenom II.
Total cache can be deceptive. As I indicated in the earlier speculation thread, scrypt is VERY L1 cache dependent.
While Bulldozer has more total cache (L1+L2+L3), it has less L1 data cache (L1 cache is divided into discrete data & instruction caches).
Phenom II has 64KB of L1 data cache per core.
Bulldozer has 16KB of L1 data cache per integer core.
The hypothesis I proposed in the speculation thread was that Bulldozer would do better (8 cores vs 6 cores) if scrypt's lookup table fit in Bulldozer's L1 cache. Your benchmark just answered that question.
The larger L2 & L3 caches of Bulldozer are immaterial. There is a 3 clock cycle latency to L2 cache and a 20 clock cycle (IIRC) latency to L3 cache. It would appear that the scrypt lookup table can't fit in 16KB, so the CPU is idled thousands of times per hash waiting for data to s-l-o-w-l-y make its way from L2 -> L1. The L3 cache is likely completely unused for the datasets used by scrypt.
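If anyone wants to see that latency cliff on their own box, here is a rough pointer-chasing sketch in C. The buffer sizes, iteration count and timing method are just my picks, nothing taken from the benchmark above; compile with something like gcc -O2 chase.c (-lrt on older glibc) and watch the ns-per-load jump once the working set stops fitting in L1.

/* Rough pointer-chasing latency probe.  Walks a random cyclic
   permutation so the hardware prefetcher can't hide cache misses. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t sink;   /* keeps the chase loop from being optimized away */

static double ns_per_load(size_t bytes, size_t iters)
{
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    size_t i, p = 0;
    struct timespec t0, t1;

    if (!next) return 0.0;
    for (i = 0; i < n; i++) next[i] = i;
    /* Sattolo's algorithm: turns the identity array into one big cycle. */
    for (i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) p = next[p];   /* dependent loads, one per step */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;
    free(next);

    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
    /* working-set sizes chosen to straddle the L1 sizes discussed above */
    size_t kb[] = { 8, 16, 32, 64, 256, 4096 };
    int i;
    for (i = 0; i < 6; i++)
        printf("%5zu KB working set: %6.2f ns per load\n",
               kb[i], ns_per_load(kb[i] * 1024, 20u * 1000 * 1000));
    return 0;
}

The shuffled chain defeats the hardware prefetcher, so every load pays the full latency of whichever cache level the working set lands in.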
The nice thing is you have shown that 16KB of L1 cache is likely insufficient. We know 64KB is sufficient. That gives us upper and lower bounds.
The i5 series CPUs have 32KB of L1 data cache per core. Clock for clock they tend to underperform the Phenom II but still do OK. My guess is that there may be some cache misses, but not too many, which allows decent performance.
On edit: looks like I was incorrect. Clock for clock the i3/i5/i7 series outperforms the Phenom II. The Phenom II has higher overall performance, but that is due to more cores & a higher overclock. That would indicate 32KB is sufficient.
The Fermi GF100 series Tesla cards can be configured to use 48KB of L1 cache per SM (streaming multiprocessor)*. Thus it *may* be possible that scrypt could be GPU accelerated. Granted, the high cost of Tesla cards makes benchmarking a very expensive test. The Teslas in Amazon EC2 are low end w/ only 16KB of L1 cache, so they aren't very interesting.
*The 48KB of L1 cache on M2050/M2070/M2090 series Teslas is per SM (streaming multiprocessor), a group of 32 SP. That likely means the card isn't parallel enough to justify its cost: 448 SP is effectively 14 independent SMs, and the cost ($1400+) doesn't really justify only a 14x performance increase.
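For what it's worth, the 48KB L1 mode on Fermi is something the host code has to ask for; it isn't the default. A minimal host-side sketch using the CUDA runtime API (assuming a reasonably recent toolkit; the file name and build line are my own, e.g. save as l1pref.cu and build with something like nvcc l1pref.cu -o l1pref):

/* Host-side sketch: ask a Fermi card for the 48KB-L1 / 16KB-shared split. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    struct cudaDeviceProp prop;
    cudaError_t err;

    /* Device-wide preference; cudaFuncSetCacheConfig() does the same per kernel.
       It's only a hint -- pre-Fermi parts ignore it. */
    err = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetCacheConfig: %s\n", cudaGetErrorString(err));
        return 1;
    }

    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        printf("%s: %d SMs, compute capability %d.%d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}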
Since you got me thinking, I looked up some L1 data cache sizes:
AMD 5xxx/6xxx GPU - 8KB per SIMD (group of 8 SP)
NVidia Fermi-based GPU - 16KB per SM (group of 32 SP)
AMD Bulldozer - 16KB per integer core
Intel Core i3/i5/i7 - 32KB per core
NVidia GF100 series Tesla cards - 48KB per SM (group of 32 SP)
Phenom II - 64KB per core
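On the CPU side you don't have to dig these numbers out of spec sheets; on Linux/glibc they are exposed through sysconf(). Quick sketch in C (the _SC_LEVEL* names are glibc extensions, so it won't build everywhere):

/* Print this machine's cache sizes as reported by glibc. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_LEVEL* are glibc extensions; sysconf() reports bytes,
       and returns 0 or -1 when the value isn't known. */
    long l1d = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2  = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3  = sysconf(_SC_LEVEL3_CACHE_SIZE);

    printf("L1 data cache: %ld KB\n", l1d > 0 ? l1d / 1024 : 0);
    printf("L2 cache:      %ld KB\n", l2  > 0 ? l2  / 1024 : 0);
    printf("L3 cache:      %ld KB\n", l3  > 0 ? l3  / 1024 : 0);
    return 0;
}

A Phenom II should report 64 KB on the first line and Bulldozer 16 KB per integer core, matching the list above.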