There's some coarse-grained sharing between threads. I haven't found it to be a problem on a 2-socket machine, but on a huge 4-socket AMD machine with 8 NUMA domains, things got bad.
Interesting that it works on Windows. That gives me hope that it's a mingw bug or something similar, not a "my use of critical sections or cond vars is wrong" bug. This is *not* mingw, right? Is there some way I should be compiling for Windows peeps other than with mingw on Linux? I'd love to provide an official Windows binary.
MSYS is a Unix-like environment for Windows that uses the mingw compilers, so my build isn't a cross-compile. (Still working fine for over 40 minutes now, and it seems very slightly faster with 4 cores.) Hopefully the mingw Linux cross-compile environment will be updated soon after gcc 4.9 is released, which might fix whatever bug you're running into.
It's using only 30% of the memory that b14 does with the same -s setting?