tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10 - page 3.

gridecon

newbie

Activity: 35

Merit: 0

I have two quad-core Phenom II 64-bit Linux machines (both Ubuntu 9.10), and the -4way option increases my hashing speed so much that I'm suspicious. Without -4way I get about 5-6 Mhash/sec on these boxes. With -4way I get over 11 Mhash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder whether my boxes are really hashing that much faster, or whether there could possibly be an issue where the math operations are actually being skipped for some reason, causing illusory speed and an inability to actually generate blocks.

lfm

full member

Activity: 196

Merit: 104

model name : AMD Phenom(tm) II X4 940 Processor at 3.0 GHz, 64-bit Linux

with -4way "hashespersec" : 11132770

without "hashespersec" : 5877668

satoshi

founder

Activity: 364

Merit: 7553

Quote from: tcatm on August 15, 2010, 07:43:39 PM

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

GCC 4.3.3 doesn't support -march=amdfamk10. I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch

Quote from: NewLibertyStandard on August 15, 2010, 08:49:01 PM

With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

Hey, you may be onto something!

Hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers? What CPU is that?

jgarzik

legendary

Activity: 1596

Merit: 1100

My -4way results: slower for two older boxes, faster for newer one.

("model name" comes from Linux's /proc/cpuinfo, which reports directly from the CPU)
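For anyone collecting these strings for the results page, here is a minimal sketch of pulling the "model name" field out of /proc/cpuinfo-style text. The function name `model_names` and the `SAMPLE` text are made up for illustration; nothing here comes from the Bitcoin source.

```python
def model_names(cpuinfo_text):
    """Return the 'model name' value of every logical CPU entry."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            names.append(value.strip())
    return names


# Sample text in the /proc/cpuinfo layout (two logical CPUs); illustrative only.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Pentium(R) D CPU 3.00GHz\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Intel(R) Pentium(R) D CPU 3.00GHz\n"
)

if __name__ == "__main__":
    # On a Linux box, read the real file instead of SAMPLE:
    #   with open("/proc/cpuinfo") as f: text = f.read()
    for name in model_names(SAMPLE):
        print(name)
```

The file has one entry per logical CPU, so the same name normally shows up once per core (or per hyperthread).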

1) model name   : Intel(R) Pentium(R) D CPU 3.00GHz

total cores: 2
without -4way: 0.999 Mhash/sec
with -4way: 0.850 Mhash/sec

2) model name   : Dual Core AMD Opteron(tm) Processor 280

total cores: 4
without -4way: 4.6 Mhash/sec
with -4way: 4.0 Mhash/sec

3) model name   : Genuine Intel(R) CPU 000 @ 3.20GHz

total cores: 4
without -4way: 5.7 Mhash/sec
with -4way: 7.0 Mhash/sec
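
To put the three results above on one scale, a small sketch that turns each pair of rates into a ratio (with -4way divided by without). The helper name `speedup` and the shortened labels are only for illustration; the numbers are copied from the list above.

```python
def speedup(with_4way, without_4way):
    """Ratio of -4way throughput to baseline throughput (>1 means faster)."""
    if without_4way <= 0:
        raise ValueError("baseline rate must be positive")
    return with_4way / without_4way


# (label, Mhash/s without -4way, Mhash/s with -4way), copied from the post above.
RESULTS = [
    ("Pentium D 3.00GHz", 0.999, 0.850),
    ("Dual Core Opteron 280", 4.6, 4.0),
    ("Genuine Intel CPU 000 @ 3.20GHz", 5.7, 7.0),
]

if __name__ == "__main__":
    for label, base, fourway in RESULTS:
        ratio = speedup(fourway, base)
        verdict = "faster" if ratio > 1 else "slower"
        print(f"{label}: {ratio:.2f}x ({verdict})")
```

That works out to roughly 15% and 13% slower on the two older boxes and roughly 23% faster on the newer one.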

aceat64

full member

Activity: 307

Merit: 102

Quote from: NewLibertyStandard on August 15, 2010, 08:49:01 PM

Quote from: aceat64 on August 15, 2010, 07:37:54 PM

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

I've updated the page with your suggestions, and I've also added footnotes to explain some of the fields.

NewLibertyStandard

sr. member

Activity: 252

Merit: 268

Quote from: aceat64 on August 15, 2010, 07:37:54 PM

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

tcatm

sr. member

Activity: 337

Merit: 285

Quote from: HostFat on August 15, 2010, 07:47:23 PM

Quote from: tcatm on August 15, 2010, 07:43:39 PM

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

Good

Will this also work on Windows OS?

Didn't try it, but CFLAGS are not OS dependent at all so I guess it'll work.

HostFat

staff

Activity: 4270

Merit: 1209

I support freedom of choice

Quote from: tcatm on August 15, 2010, 07:43:39 PM

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

Good

Will this also work on Windows OS?

tcatm

sr. member

Activity: 337

Merit: 285

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

aceat64

full member

Activity: 307

Merit: 102

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

gebler

newbie

Activity: 16

Merit: 0

Running 32-bit Linux on an AMD Athlon 64 X2, I get the following results:

normal: 2850 khash/s
with -4way: 1708 khash/s

I haven't checked if the hashes are correct, just the speed.

tcatm

sr. member

Activity: 337

Merit: 285

Did anyone verify it to produce correct results on 32 bit hosts?
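
One way to check correctness independently of speed is to compare an implementation's output against a known-good double SHA-256, the hash used for Bitcoin's proof of work. Below is a rough sketch using Python's standard `hashlib` as the reference. The `candidate` callable is hypothetical: it stands in for whatever wrapper exposes the hash under test, and the sample inputs are dummy data, not real block headers.

```python
import hashlib


def sha256d(data: bytes) -> bytes:
    """Reference double SHA-256, computed with the standard library."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def check_against_reference(candidate, inputs):
    """Return the inputs for which candidate(data) differs from sha256d(data).

    `candidate` is a hypothetical callable wrapping the implementation under test.
    """
    return [data for data in inputs if candidate(data) != sha256d(data)]


if __name__ == "__main__":
    # 80 bytes matches the size of a Bitcoin block header; contents are dummy data.
    samples = [bytes([i]) * 80 for i in range(4)]
    # Checking the reference against itself just demonstrates the call shape.
    mismatches = check_against_reference(sha256d, samples)
    print("mismatches:", len(mismatches))
```

An empty list means the two agree on every input tried. That is evidence, not proof, so more inputs give more confidence.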

tcatm

sr. member

Activity: 337

Merit: 285

-4way: 12518 khash/s
without: 6550 khash/s

It's a little bit slower than my patch (~14000 khash/s).

edit: I ran the binary on an older AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ and saw the same effect we see on older Intel CPUs:
-4way: 1120 khash/s
without: 2012 khash/s

Ground Loop

member

Activity: 111

Merit: 10

5,911 khash with -4way
11,260 without
(Dual Xeon E5450, 64-bit, 8 threads)

tcatm

sr. member

Activity: 337

Merit: 285

Quote from: knightmb on August 15, 2010, 12:02:16 PM

If I didn't know better, I would say the key is the CPU cache size. Seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard CPU cache.

That's unlikely. The loop accesses 432 bytes of data. That should fit in most caches.
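
A back-of-the-envelope version of that argument, as a sketch. The 432-byte figure is from the post above and the 1, 2 and 3 MB cache sizes are the ones mentioned in this thread. The 64-byte cache line is an assumption (a common x86 value), and the data is assumed to be contiguous and aligned, which may not hold exactly.

```python
import math

WORKING_SET_BYTES = 432   # figure quoted in the post above
CACHE_LINE_BYTES = 64     # assumption: typical x86 cache line size


def cache_lines_needed(nbytes, line_bytes=CACHE_LINE_BYTES):
    """Minimum number of cache lines covering nbytes of contiguous, aligned data."""
    return math.ceil(nbytes / line_bytes)


def fraction_of_cache(nbytes, cache_bytes):
    """Share of a cache of cache_bytes that nbytes would occupy."""
    return nbytes / cache_bytes


if __name__ == "__main__":
    print("cache lines:", cache_lines_needed(WORKING_SET_BYTES))
    for mib in (1, 2, 3):
        share = fraction_of_cache(WORKING_SET_BYTES, mib * 1024 * 1024)
        print(f"{mib} MB cache: {share:.4%} used")
```

Under those assumptions the loop's data covers 7 cache lines, well under a tenth of a percent of even the 1 MB caches listed in this thread.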

sgtstein

member

Activity: 61

Merit: 10

Okay, makes sense. I have an i7 930 that I'll try to test with too.

satoshi

founder

Activity: 364

Merit: 7553

I just uploaded a quick build so testers can check if I built it right. (I don't have an i5 or AMD) If it checks out, I'll put together the full package and do all the release stuff.

sgtstein

member

Activity: 61

Merit: 10

Where is the code for this? I'm on a CentOS 5.5 box and need to build it myself. Once I do that I will report back with results for 32-bit Linux on a Xeon with 1MB cache.

satoshi

founder

Activity: 364

Merit: 7553

I hope someone can test an i5 or AMD to check that I built it right. I don't have either to test with.

I'm also curious whether it performs much worse on 32-bit Linux vs 64-bit.

knightmb

sr. member

Activity: 308

Merit: 258

I did a quick test, will report back when I try it on more machines.

Pentium E5300 Dual-Core 2.6 GHz (2MB cache, FSB 800MHz)
Processor info: http://en.wikipedia.org/wiki/Wolfdale_%28microprocessor%29
Stock = 2261 khash/s
4-way = 1103 khash/s (64 bit)

Pentium 4 - 3.0GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 1024 khash/s (32 bit)
4-way = 658 khash/s (32 bit)

Pentium 4 - 2.8GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 917 khash/s (64 bit)
4-way = 747 khash/s (64 bit)

If I didn't know better, I would say the key is the CPU cache size. Seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard CPU cache.
