
Topic: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10 - page 3.

newbie
Activity: 35
Merit: 0
I have two quad-core Phenom II 64-bit Linux machines (both Ubuntu 9.10), and the -4way option increases my hashing speed so much that I'm suspicious. Without -4way, I previously got about 5-6khash/sec on these boxes. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement is more than I expected, and it makes me wonder whether my boxes are really hashing that much faster, or whether the math operations could possibly be getting skipped for some reason, causing illusory speed and an inability to actually generate blocks.
lfm
full member
Activity: 196
Merit: 104
model name      : AMD Phenom(tm) II X4 940 Processor (3.0 GHz, 64-bit Linux)

with -4way     "hashespersec" : 11132770

without      "hashespersec" : 5877668
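
To put the two "hashespersec" readings above in perspective, here is a minimal sketch that computes the ratio between them. The function name `speedup` is made up for illustration; only the two numbers come from the post.

```python
def speedup(with_4way: float, without_4way: float) -> float:
    """Return how many times faster the -4way run is than the stock run."""
    if without_4way <= 0:
        raise ValueError("baseline hash rate must be positive")
    return with_4way / without_4way


# "hashespersec" values reported above for the Phenom II X4 940
ratio = speedup(11132770, 5877668)
print(f"-4way is {ratio:.2f}x the stock rate")
```

The ratio comes out at roughly 1.89, which matches the "almost doubles" description. It says nothing about whether the hashes are correct; that would need a separate check.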

founder
Activity: 364
Merit: 7248
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
GCC 4.3.3 doesn't support -march=amdfamk10.  I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch


With 4way, I get significantly better performance when all my virtual cores are enabled. With hyper-threading turned off, I think I get about the same number of hashes with or without 4way.
Hey, you may be onto something!

Hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers?  What CPU is that?
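
For readers unfamiliar with the term, "4-way" refers to working on four 32-bit words at once, since a 128-bit SSE2 register holds four 32-bit lanes. The sketch below is not tcatm's code. It only illustrates the lane-wise idea in plain Python using SHA-256's small sigma0 function (ROTR 7 xor ROTR 18 xor SHR 3). The helper names are hypothetical, and the assumption that each lane would carry a word from a different candidate hash is mine.

```python
from typing import List

MASK32 = 0xFFFFFFFF  # SHA-256 works on 32-bit words


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def sigma0_scalar(x: int) -> int:
    """SHA-256 small sigma0 on one word: ROTR7 ^ ROTR18 ^ SHR3."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma0_4way(lanes: List[int]) -> List[int]:
    """Apply sigma0 to four independent words in lockstep.

    Each step below is done for all four lanes before moving on,
    which is the pattern a single SIMD instruction would follow.
    """
    if len(lanes) != 4:
        raise ValueError("expected exactly four 32-bit lanes")
    r7 = [rotr(x, 7) for x in lanes]
    r18 = [rotr(x, 18) for x in lanes]
    s3 = [x >> 3 for x in lanes]
    return [a ^ b ^ c for a, b, c in zip(r7, r18, s3)]


lanes = [0x00000000, 0x00000001, 0x80000000, 0xFFFFFFFF]
# The 4-way result must match the scalar result lane by lane.
assert sigma0_4way(lanes) == [sigma0_scalar(x) for x in lanes]
```

SSE2 has no packed rotate instruction, so a real implementation would typically build each rotate from two shifts and an OR, which is one reason the speedup can vary so much between CPU generations.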
legendary
Activity: 1596
Merit: 1100

My -4way results:  slower for two older boxes, faster for newer one.


("model name" comes from Linux's /proc/cpuinfo, which reports directly from CPU)

1) model name   : Intel(R) Pentium(R) D CPU 3.00GHz

total cores: 2
without -4way:    0.999 Mhash/sec
with -4way: 0.850 Mhash/sec

2) model name   : Dual Core AMD Opteron(tm) Processor 280

total cores: 4
without -4way:   4.6 Mhash/sec
with -4way:    4.0 Mhash/sec

3) model name   : Genuine Intel(R) CPU             000  @ 3.20GHz

total cores: 4
without -4way:   5.7 Mhash/sec
with -4way:    7.0 Mhash/sec
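
Since the results above are keyed by the "model name" field from /proc/cpuinfo, here is a minimal sketch for pulling that field out, along with whether "sse2" appears in the "flags" line. The function name and the returned dictionary layout are made up for illustration, and the parsing assumes the x86 Linux layout of "key : value" lines.

```python
def parse_cpuinfo(text: str) -> dict:
    """Extract the first "model name" and whether "sse2" is listed in "flags"."""
    info = {"model_name": None, "has_sse2": False}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "model name" and info["model_name"] is None:
            info["model_name"] = value
        elif key == "flags":
            info["has_sse2"] = info["has_sse2"] or "sse2" in value.split()
    return info


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(parse_cpuinfo(f.read()))
    except OSError:
        print("/proc/cpuinfo not available on this system")
```

Having SSE2 only means -4way can run. As the numbers above show, it does not mean -4way will be faster.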

full member
Activity: 307
Merit: 102
I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2
You might want to add columns for whether hyper-threading is enabled, the number of physical cores, and how many cores Bitcoin is using. Without 4way, I get very slightly better results when only half of my virtual cores are hashing. With 4way, I get significantly better performance when all my virtual cores are enabled. With hyper-threading turned off, I think I get about the same number of hashes with or without 4way.

I've updated the page with your suggestions and also added footnotes to explain some of the fields.
sr. member
Activity: 252
Merit: 268
I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2
You might want to add columns for whether hyper-threading is enabled, the number of physical cores, and how many cores Bitcoin is using. Without 4way, I get very slightly better results when only half of my virtual cores are hashing. With 4way, I get significantly better performance when all my virtual cores are enabled. With hyper-threading turned off, I think I get about the same number of hashes with or without 4way.
sr. member
Activity: 337
Merit: 285
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
Good :D
Will this also work on Windows OS?
Didn't try it, but CFLAGS are not OS dependent at all so I guess it'll work.
staff
Activity: 4270
Merit: 1209
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
Good :D
Will this also work on Windows OS?
sr. member
Activity: 337
Merit: 285
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
full member
Activity: 307
Merit: 102
I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2
newbie
Activity: 16
Merit: 0
Running 32-bit Linux on an AMD Athlon 64 X2, I get the following results:

  normal: 2850 khash/s
  with -4way: 1708 khash/s

I haven't checked if the hashes are correct, just the speed.
sr. member
Activity: 337
Merit: 285
Did anyone verify that it produces correct results on 32-bit hosts?
sr. member
Activity: 337
Merit: 285
-4way: 12518 khash/s
without: 6550 khash/s

It's a little bit slower than my patch (~14000 khash/s).

edit: I ran the binary on an older AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ with the same effect we see on older Intel CPUs:
-4way: 1120khash/s
without: 2012khash/s
member
Activity: 111
Merit: 10
5,911 khash with -4way
11,260 without
(Dual Xeon E5450, 64-bit, 8 threads)
sr. member
Activity: 337
Merit: 285
If I didn't know better, I would say the key is the CPU cache size. It seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard cache.

That's unlikely. The loop accesses 432 bytes of data. That should fit in most caches.
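
As a rough sanity check on that point, the sketch below compares the 432-byte figure with the cache sizes mentioned in this thread. The function name is hypothetical, and treating "MB" as 1024 * 1024 bytes is an assumption.

```python
LOOP_BYTES = 432  # working-set size quoted in the post above


def cache_fraction(working_set_bytes: int, cache_mb: float) -> float:
    """Return the share of a cache (given in MB) that the working set occupies."""
    cache_bytes = cache_mb * 1024 * 1024  # assumes binary megabytes
    return working_set_bytes / cache_bytes


# Cache sizes mentioned in the thread: 1 MB, 2 MB and 3 MB
for size_mb in (1, 2, 3):
    share = cache_fraction(LOOP_BYTES, size_mb)
    print(f"{size_mb} MB cache: {share:.4%} used by the loop data")
```

Even for the smallest cache listed, the loop data takes up well under a tenth of a percent, so cache capacity alone is an unlikely explanation for the slowdown on older CPUs.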
member
Activity: 61
Merit: 10
Okay, makes sense. I have an i7 930 I'll try and test out with too.
founder
Activity: 364
Merit: 7248
I just uploaded a quick build so testers can check if I built it right.  (I don't have an i5 or AMD)  If it checks out, I'll put together the full package and do all the release stuff.
member
Activity: 61
Merit: 10
Where is the code for this? I'm on a CentOS 5.5 box and need to build it myself. Once I do that, I will report back with results for 32-bit Linux on a Xeon with 1MB cache.
founder
Activity: 364
Merit: 7248
I hope someone can test an i5 or AMD to check that I built it right.  I don't have either to test with.

I'm also curious if it performs much worse on 32-bit linux vs 64-bit.
sr. member
Activity: 308
Merit: 258
I did a quick test and will report back when I try it on more machines.

Pentium E5300 Dual-Core 2.6 GHz (2MB cache, FSB 800MHz)
Processor info: http://en.wikipedia.org/wiki/Wolfdale_%28microprocessor%29
Stock = 2261 khash/s
4-way = 1103 khash/s (64 bit)

Pentium 4 - 3.0GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 1024 khash/s (32 bit)
4-way = 658 khash/s (32 bit)

Pentium 4 - 2.8GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 917 khash/s (64 bit)
4-way = 747 khash/s (64 bit)


If I didn't know better, I would say the key is the CPU cache size. It seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard cache.