tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10 - page 2.

nelisky

legendary

Activity: 1540

Merit: 1002

Quote from: Ground Loop on August 18, 2010, 06:00:08 PM

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?

And i5, at least on my MacBook Pro.

Ground Loop

member

Activity: 111

Merit: 10

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?

denaje

newbie

Activity: 2

Merit: 0

64-bit Gentoo / Intel Core i7

W/O 4way: 4324294
With 4way: 7649415

32-bit Ubuntu VM on XP host / Intel Core 2 Duo

W/O 4way: 1751518
With 4way: 793100

hugolp

legendary

Activity: 1148

Merit: 1001

Model: Intel Atom n330 (2 cores, 4 virtual).

OS: Ubuntu 10.04 64bit

With the -4way option I get half the speed I get without it.

teknohog

sr. member

Activity: 520

Merit: 253

Quote from: satoshi on August 16, 2010, 08:38:01 AM

I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.

Thanks, works fine now.

tommy

newbie

Activity: 44

Merit: 0

model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+

w/o -4way "hashespersec" : 2539397

with -4way "hashespersec" : 2108791

Linux, Debian, 32 bit.

satoshi

founder

Activity: 364

Merit: 7553

I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.

teknohog

sr. member

Activity: 520

Merit: 253

On a Core 2 Duo T7200, the default code gives about 1.8 Mhash/s, and 4way is slower at 1.0 Mhash/s. It has 4 MB of L2 cache, so it is probably not a question of cache size, as suggested at some point.

Unfortunately, the code (from svn) no longer compiles on ARM, as it now has SSE intrinsics hardcoded. I have removed the -msse2 and -DFOURWAYSSE2 flags from the makefile, and it still produces errors like this:

Code:

sha256.cpp:8:23: error: xmmintrin.h: No such file or directory
sha256.cpp:34: error: ‘__m128i’ does not name a type

but hopefully this is easy to fix.
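
One common way to keep a file like this building on non-x86 targets is to guard both the intrinsics header and every use of `__m128i` behind the feature macro, with a plain scalar fallback. The sketch below is an illustration only, not the actual sha256.cpp: `add4`, `add4_selftest` and `HAVE_4WAY` are hypothetical names, and the `__SSE2__` check assumes a GCC-compatible compiler, which predefines that macro when SSE2 code generation is enabled.

```cpp
#include <cassert>
#include <cstdint>

// Only pull in SSE2 intrinsics when the build asks for the 4-way path
// (FOURWAYSSE2, as in the thread) AND the compiler is targeting SSE2.
// __SSE2__ is an assumption about GCC-compatible compilers.
#if defined(FOURWAYSSE2) && defined(__SSE2__)
#include <emmintrin.h>  // declares __m128i and the SSE2 integer intrinsics
#define HAVE_4WAY 1
#else
#define HAVE_4WAY 0
#endif

// Hypothetical helper: add four 32-bit lanes, wrapping on overflow.
inline void add4(const std::uint32_t a[4], const std::uint32_t b[4],
                 std::uint32_t out[4]) {
#if HAVE_4WAY
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
#else
    // Scalar path: no SSE header or type is referenced, so this builds on ARM.
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}

// Hypothetical check that whichever path was compiled gives the scalar answer.
inline void add4_selftest() {
    const std::uint32_t a[4] = {1u, 2u, 3u, 0xFFFFFFFFu};
    const std::uint32_t b[4] = {10u, 20u, 30u, 1u};
    std::uint32_t out[4] = {0u, 0u, 0u, 0u};
    add4(a, b, out);
    for (int i = 0; i < 4; ++i) assert(out[i] == a[i] + b[i]);
}
```

With neither macro defined the file compiles without touching any SSE header, which is the property the ARM build needs.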

tcatm

sr. member

Activity: 337

Merit: 285

@satoshi: Oops, I meant -march=amdfam10. Sorry.

@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.

Concerning Hyperthreading: It seems to give a little performance gain, maybe from running load/store instructions in parallel with arithmetic instructions. There are only a few plain x86 instructions for gluing the function into the ABI. They take less than ~2% of the total CPU time (measured with gprof).

NewLibertyStandard

sr. member

Activity: 252

Merit: 268

More importantly, about how long should it take 10 Mhash/sec to verify difficulty 1 blocks?

After the 64-bit Linux hashing bug was fixed I generated a block or two in short order, but since then I have not generated a single block. It's starting to seem a little fishy.

I'm currently testing Bitcoin on two Linux 64-bit computers. Is there anything in the code blocking early block verification?

Edit: Never mind. I used the Bitcoin Generation Calculator and divided out the difficulty. Everything is fine here, I've generated a couple blocks with 4way. About to start testing without 4way.

Another Edit: My test only verifies that hashing works. It does not verify whether I'm really getting the displayed speed.
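
As a rough check on the question above: at difficulty D a block takes about D × 2^32 hash attempts on average, so the expected time is that figure divided by the hash rate. A minimal sketch of the arithmetic follows. `expected_seconds_per_block` is a hypothetical helper name, and the result is only an average, since finding a block is a matter of chance.

```cpp
#include <cassert>

// Average hash attempts per block are roughly difficulty * 2^32.
// Returns the expected (mean) number of seconds to find one block.
inline double expected_seconds_per_block(double difficulty, double hashes_per_sec) {
    assert(difficulty > 0.0 && hashes_per_sec > 0.0);
    const double attempts_per_unit_difficulty = 4294967296.0;  // 2^32
    return difficulty * attempts_per_unit_difficulty / hashes_per_sec;
}
```

For difficulty 1 at 10 Mhash/sec this gives about 429 seconds, a little over seven minutes per block on average.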

jgarzik

legendary

Activity: 1596

Merit: 1100

Quote from: satoshi on August 15, 2010, 11:36:59 PM

Quote from: jgarzik on August 15, 2010, 10:35:28 PM

Code:

cpu family : 6
model : 26
model name : Genuine Intel(R) CPU 000 @ 3.20GHz
stepping : 4

cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.

Does bitcoin perform any self-tests at startup, to verify that hashing is working?

gridecon

newbie

Activity: 35

Merit: 0

Quote from: NewLibertyStandard on August 16, 2010, 12:02:31 AM

Quote from: gridecon on August 15, 2010, 10:15:44 PM

I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?

o_O... good luck hashing, you're gonna need it!

I guess that should read either mhash/sec or THOUSANDS of khash/sec...but hey, what's 3 orders of magnitude among friends?

Perhaps that typographical error is why nobody has answered whether a nearly 100% speedup from the -4way option is at all realistic? I'm not convinced the crypto hashing is really taking place at the rate of 11000 khash/sec on my desktop box.

NewLibertyStandard

sr. member

Activity: 252

Merit: 268

Quote from: gridecon on August 15, 2010, 10:15:44 PM

I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?

o_O... good luck hashing, you're gonna need it!

Quote from: satoshi on August 15, 2010, 09:57:57 PM

Quote from: NewLibertyStandard on August 15, 2010, 08:49:01 PM

With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

Hey, you may be onto something!

hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers? What CPU is that?

Here are the results from my very poor memory on an i7 860 2.8 GHz with Ubuntu 10.04 amd64. Some of the numbers may be a bit off.

Without 4way, with HT, 4/8 virtual cores, 4.5-5 Mhash/sec
Without 4way, with HT, 8/8 virtual cores, a bit less than above, but basically the same

With 4way, with HT, 8/8 virtual cores, 6.5-8 Mhash/sec (It may be my imagination, but it seems noticeably more variable.)
With 4way, with HT, 4/8 virtual cores, 5-6 Mhash/sec

Without 4way, without HT, 4/4 physical cores, 4.5-5 Mhash/sec (But a bit slower than the first result.)
With 4way, without HT, 4/4 physical cores, 5-6 Mhash/sec

satoshi

founder

Activity: 364

Merit: 7553

Quote from: jgarzik on August 15, 2010, 10:35:28 PM

Code:

cpu family : 6
model : 26
model name : Genuine Intel(R) CPU 000 @ 3.20GHz
stepping : 4

cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.

Ground Loop

member

Activity: 111

Merit: 10

No winners for 4way in my other three Intel machines either:

Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (64-bit Linux)
4way: 1565 std: 3002

Intel(R) Xeon(TM) CPU 3.00GHz (32-bit Linux)
4way: 1243 std: 2048

Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
4way: 932 std: 1733

(All running 0.3.10, -1 proclimit)
Experiments with proclimit weren't any better.

jgarzik

legendary

Activity: 1596

Merit: 1100

Update for

Code:

cpu family : 6
model : 26
model name : Genuine Intel(R) CPU 000 @ 3.20GHz
stepping : 4

Machine has 4 cores, each with 2 hyperthreads. /proc/cpuinfo shows 8 virtual processors.

without -4way, setgen 4: 5.7 Mhash/sec
without -4way, setgen 8: 5.0 Mhash/sec

with -4way, setgen 4: 7.0 Mhash/sec
with -4way, setgen 8: 9.3 Mhash/sec

So, the old wisdom of "hyperthreading slows things down" is now shattered, on this machine.
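
The percentages quoted elsewhere in the thread for this machine follow directly from these four measurements. A small sketch of the calculation, with `speedup_percent` as a hypothetical helper name:

```cpp
#include <cassert>

// Percentage gain of `rate` over `baseline`; both must use the same unit
// (e.g. Mhash/sec). A negative result means `rate` is slower.
inline double speedup_percent(double baseline, double rate) {
    assert(baseline > 0.0);
    return (rate / baseline - 1.0) * 100.0;
}
```

With the figures above, speedup_percent(5.7, 7.0) is about 23 (4way alone, setgen 4), speedup_percent(5.7, 9.3) is about 63 (4way plus hyperthreading against the best non-4way result), and speedup_percent(7.0, 9.3) is about 33 (hyperthreading on top of 4way).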

lfm

full member

Activity: 196

Merit: 104

model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz, linux 64

no difference at about 4950 khash/s

Vasiliev

newbie

Activity: 55

Merit: 0

http://www.google.com/search?q=amdfamk10

I think he misremembered it since AMD arches are K#.

satoshi

founder

Activity: 364

Merit: 7553

Quote from: Vasiliev on August 15, 2010, 10:17:07 PM

try -march=amdfam10

That works.

That's strange... are we sure that's the same thing? tcatm, try amdfam10 and make sure you get the same speed measurement.

Vasiliev

newbie

Activity: 55

Merit: 0

Quote from: satoshi on August 15, 2010, 09:57:57 PM

Quote from: tcatm on August 15, 2010, 07:43:39 PM

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

GCC 4.3.3 doesn't support -march=amdfamk10. I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch

try -march=amdfam10
