Topic: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10 - page 2. (Read 24802 times)

legendary
Activity: 1540
Merit: 1002
So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?


And i5, at least on my MacBook Pro.
member
Activity: 111
Merit: 10
So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?
newbie
Activity: 2
Merit: 0
64-bit Gentoo / Intel Core i7

W/O 4way: 4324294
With 4way: 7649415



32-bit Ubuntu VM on XP host /  Intel Core 2 Duo

W/O 4way: 1751518
With 4way: 793100
legendary
Activity: 1148
Merit: 1001
Model: Intel Atom n330 (2 cores, 4 virtual).

OS: Ubuntu 10.04 64bit

Using the -4way option I get half the speed I get with no option.
sr. member
Activity: 520
Merit: 253
555
I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.

Thanks, works fine now.
newbie
Activity: 44
Merit: 0
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+

w/o -4way  "hashespersec" : 2539397

with -4way  "hashespersec" : 2108791

Linux, Debian, 32 bit.
founder
Activity: 364
Merit: 7553
I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.
sr. member
Activity: 520
Merit: 253
555
On a Core 2 Duo T7200, the default code gives about 1.8 Mhash/s, and 4way is slower at 1.0 Mhash/s. It has 4 MB of L2 cache, so it is probably not a question of cache size, as suggested at some point.

Unfortunately, the code (from svn) no longer compiles on ARM, as it now has SSE intrinsics hardcoded. I have removed the -msse2 and -DFOURWAYSSE2 flags from the makefile, and it still produces errors like this:

Code:
sha256.cpp:8:23: error: xmmintrin.h: No such file or directory
sha256.cpp:34: error: ‘__m128i’ does not name a type

but hopefully this is easy to fix.
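
One way to keep a non-x86 build working is to put both the intrinsics header and every use of __m128i behind the same macro, so nothing SSE-specific is seen unless FOURWAYSSE2 is defined. A minimal sketch, assuming the macro name used in this thread; add4 and kHaveFourWay are made-up names for illustration and are not taken from sha256.cpp:

```cpp
#include <cstdint>

#ifdef FOURWAYSSE2
#include <emmintrin.h>  // SSE2 intrinsics; only included when the macro is defined

// Hypothetical helper: add four 32-bit lanes with one SSE2 add.
static void add4(const std::uint32_t* a, const std::uint32_t* b, std::uint32_t* out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
static const bool kHaveFourWay = true;
#else
// Portable fallback: same result, one lane at a time, no SSE headers needed.
static void add4(const std::uint32_t* a, const std::uint32_t* b, std::uint32_t* out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}
static const bool kHaveFourWay = false;
#endif
```

Built without -DFOURWAYSSE2, the compiler never sees the SSE header or the __m128i type, so the kind of errors shown above should not come up for this file.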
sr. member
Activity: 337
Merit: 285
@satoshi: Oops, I meant -march=amdfam10. Sorry.

@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.

Concerning Hyperthreading: It seems to give a little performance gain, maybe from running load/store instructions in parallel with arithmetic instructions. There are only a few plain x86 instructions for gluing the function into the ABI. They take less than about 2% of the total CPU time (measured with gprof).
sr. member
Activity: 252
Merit: 268
More importantly, about how long should it take 10 Mhash/sec to verify difficulty 1 blocks?

After the 64-bit Linux hashing bug was fixed I generated a block or two in short order, but since that one or two blocks, I have not generated a single block. It's starting to seem a little fishy.

I'm currently testing Bitcoin on two Linux 64-bit computers. Is there anything in the code blocking early block verification?

Edit: Never mind. I used the Bitcoin Generation Calculator and divided out the difficulty. Everything is fine here, I've generated a couple blocks with 4way. About to start testing without 4way.

Another Edit: My test only verifies that hashing works. It does not verify whether I'm really getting the displayed speed.
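
For a rough sanity check on the question above: at difficulty D a block takes about D x 2^32 hashes on average, so the expected time is that figure divided by the hash rate. A small sketch of the arithmetic (the function name is made up here, and the result is only an average; actual times vary a lot):

```cpp
#include <cassert>
#include <cmath>

// Expected seconds to find one block, using the usual approximation that
// difficulty 1 corresponds to about 2^32 hashes on average.
// Hypothetical helper name, for illustration only.
static double expected_seconds_per_block(double difficulty, double hashes_per_sec)
{
    assert(difficulty > 0.0 && hashes_per_sec > 0.0);
    const double hashes_at_difficulty_1 = std::ldexp(1.0, 32);  // 2^32
    return difficulty * hashes_at_difficulty_1 / hashes_per_sec;
}
```

With difficulty 1 and 10 Mhash/sec this comes out at roughly 430 seconds, a bit over 7 minutes per block on average.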
legendary
Activity: 1596
Merit: 1100
Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4
cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.


Does bitcoin perform any self-tests at startup, to verify that hashing is working?


newbie
Activity: 35
Merit: 0
I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?
o_O... good luck hashing, you're gonna need it!
I guess that should read either Mhash/sec or THOUSANDS of khash/sec... but hey, what's 3 orders of magnitude among friends?

Perhaps that typographical error is why nobody has answered whether or not a nearly 100% speedup from the -4way option is at all realistic? I'm not convinced the crypto hashing is really taking place at the rate of 11000 khash/sec on my desktop box.
sr. member
Activity: 252
Merit: 268
I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?
o_O... good luck hashing, you're gonna need it!

With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.
Hey, you may be onto something!

Hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers?  What CPU is that?
Here are the results from my very poor memory on an i7 860 2.8 GHz with Ubuntu 10.04 amd64. Some of the numbers may be a bit off.

Without 4way, with HT, 4/8 virtual cores, 4.5-5 Mhash/sec
Without 4way, with HT, 8/8 virtual cores, a bit less than above, but basically the same

With 4way, with HT, 8/8 virtual cores, 6.5-8 Mhash/sec (It may be my imagination, but it seems noticeably more variable.)
With 4way, with HT, 4/8 virtual cores, 5-6 Mhash/sec

Without 4way, without HT, 4/4 physical cores, 4.5-5 Mhash/sec (But a bit slower than the first result.)
With 4way, without HT, 4/4 physical cores, 5-6 Mhash/sec
founder
Activity: 364
Merit: 7553
Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4
cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.
member
Activity: 111
Merit: 10
No winners for 4way in my other three Intel machines either:

Intel(R) Core(TM)2 Duo CPU     E8500 @ 3.16GHz (64-bit Linux)
4way: 1565  std: 3002

Intel(R) Xeon(TM) CPU 3.00GHz (32-bit Linux)
4way: 1243  std: 2048

Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
4way: 932   std: 1733

(All running 0.3.10, -1 proclimit)
Experiments with proclimit weren't any better.

legendary
Activity: 1596
Merit: 1100

Update for
Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4

Machine has 4 cores, each with 2 hyperthreads.  /proc/cpuinfo shows 8 virtual processors.

without -4way, setgen 4:    5.7 Mhash/sec
without -4way, setgen 8:    5.0 Mhash/sec

with -4way, setgen 4:   7.0 Mhash/sec
with -4way, setgen 8:   9.3 Mhash/sec

So, the old wisdom of "hyperthreading slows things down" is now shattered, on this machine.
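
The percentages quoted earlier in the thread follow from these four measurements. A quick sketch of the arithmetic, with a made-up helper name:

```cpp
#include <cassert>
#include <cmath>

// Percentage gain of `measured` over `baseline`, rounded to the nearest whole percent.
// Hypothetical helper, for illustration only.
static long speedup_percent(double baseline, double measured)
{
    assert(baseline > 0.0);
    return std::lround((measured / baseline - 1.0) * 100.0);
}
```

speedup_percent(5.7, 7.0) gives 23 (-4way alone), speedup_percent(5.7, 9.3) gives 63 (-4way plus hyperthreading), and speedup_percent(7.0, 9.3) gives 33 (hyperthreading on top of -4way).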
lfm
full member
Activity: 196
Merit: 104
model name      : Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz,   linux 64

no difference at about 4950 khash/s


newbie
Activity: 55
Merit: 0
http://www.google.com/search?q=amdfamk10

I think he misremembered it since AMD arches are K#.
founder
Activity: 364
Merit: 7553
try -march=amdfam10
That works.

That's strange...  are we sure that's the same thing?  tcatm, try amdfam10 and make sure you get the same speed measurement.
newbie
Activity: 55
Merit: 0
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
GCC 4.3.3 doesn't support -march=amdfamk10.  I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch
try -march=amdfam10