4 hashes parallel on SSE2 CPUs for 0.3.6

hugolp

legendary

Activity: 1148

Merit: 1001

Tried Bitcoin 3.10 on Ubuntu Lucid 64 bit on Intel Atom 330.

Using the -4way option produces about half the hash/s of running without it. I tried 1 to 4 (virtual) cores, and -4way was always slower than no option (around half). It's probably due to the Intel thing.

satoshi

founder

Activity: 364

Merit: 7553

On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner. So now it doesn't look like it's the version of GCC, it's something else, maybe just the luck of how the stack is aligned.

I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.

I found the problem with Crypto++ on MinGW 4.5.0. Here's the patch for that:

Code:

--- \old\sha.cpp	Mon Jul 26 13:31:11 2010
+++ \new\sha.cpp	Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
 	ROUND(14, 0, eax, ecx, edi, edx)
 	ROUND(15, 0, ecx, eax, edx, edi)
 
-	ASL(1)
+    ASL(label1)   // Bitcoin: fix for MinGW GCC 4.5
 	AS2(add WORD_REG(si), 4*16)
 	ROUND(0, 1, eax, ecx, edi, edx)
 	ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
 	ROUND(14, 1, eax, ecx, edi, edx)
 	ROUND(15, 1, ecx, eax, edx, edi)
 	AS2(	cmp		WORD_REG(si), K_END)
-	ASJ(	jne,	1, b)
+    ASJ(    jne,    label1,  )   // Bitcoin: fix for MinGW GCC 4.5
 
 	AS2(	mov		WORD_REG(dx), DATA_SAVE)
 	AS2(	add		WORD_REG(dx), 64)

sgtstein

member

Activity: 61

Merit: 10

Well, reporting back.

I got it to compile by specifying -msse and -msse2 to gcc when compiling. At first I was hashing about 692kh/s (50% of SVN r130 [1400kh/s]), but I recompiled and am now getting about 1120kh/s. This is currently the equivalent of using both of my CPUs without HyperThreading, though I can verify that it IS using HyperThreading. With HyperThreading turned off, I get ~1350kh/s. Pretty close to the stock build.

Also, does the git contain the patched and updated code?

Code:

// SVN r130 Using HT.
08/14/10 19:02 hashmeter   4 CPUs   1392 khash/s
08/14/10 19:32 hashmeter   4 CPUs   1387 khash/s
08/14/10 20:02 hashmeter   4 CPUs   1386 khash/s
08/14/10 20:32 hashmeter   4 CPUs   1380 khash/s
08/14/10 21:02 hashmeter   4 CPUs   1363 khash/s
// With -msse -msse2, first run. Using HT.
08/14/10 21:32 hashmeter   4 CPUs    692 khash/s
08/14/10 22:06 hashmeter   4 CPUs   1011 khash/s
08/14/10 22:11 hashmeter   4 CPUs   1104 khash/s
08/14/10 22:16 hashmeter   4 CPUs   1120 khash/s
// NOT using HT.
08/14/10 22:21 hashmeter   2 CPUs   1359 khash/s
08/14/10 22:26 hashmeter   2 CPUs   1340 khash/s

Just wanted to tell my story and help with whatever information I could.

satoshi

founder

Activity: 364

Merit: 7553

MinGW GCC 4.5.0:
Crypto++ doesn't work, X86_SHA256_HashBlocks() never returns
I only got 4-way working with test.cpp but not when called by BitcoinMiner

MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV

GCC is definitely not aligning __m128i.

Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.

By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only:
#define Ch(b, c, d) ((b & c) ^ (~b & d))
#define Maj(b, c, d) ((b & c) ^ (b & d) ^ (c & d))
#define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n))
#define SHR(x, n) _mm_srli_epi32(x, n)

But that's with -O0.
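The four macros above can be sketched with explicit SSE2 intrinsics instead of operators on __m128i. This is only an illustration, assuming a C++11 compiler and an x86 target with SSE2 enabled (-msse2 on 32-bit). The names Ch4, Maj4, Rotr4, LoadLanes, StoreLanes and LanesMatchScalar are hypothetical and not from the patch. It uses unaligned loads and stores so the sketch itself does not depend on how the stack is aligned; it does not show whether the compiler-generated temporaries problem on MinGW goes away.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)

// Each __m128i holds four independent 32-bit lanes, one per hash in flight.
static inline __m128i Ch4(__m128i b, __m128i c, __m128i d) {
    // (b & c) ^ (~b & d); _mm_andnot_si128(b, d) computes (~b) & d
    return _mm_xor_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

static inline __m128i Maj4(__m128i b, __m128i c, __m128i d) {
    return _mm_xor_si128(
        _mm_xor_si128(_mm_and_si128(b, c), _mm_and_si128(b, d)),
        _mm_and_si128(c, d));
}

// Rotate each 32-bit lane right by N bits (0 < N < 32).
template <int N>
static inline __m128i Rotr4(__m128i x) {
    return _mm_or_si128(_mm_srli_epi32(x, N), _mm_slli_epi32(x, 32 - N));
}

// Unaligned load/store: no 16-byte alignment requirement on p.
static inline __m128i LoadLanes(const uint32_t* p) {
    assert(p != nullptr);
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

static inline void StoreLanes(uint32_t* p, __m128i v) {
    assert(p != nullptr);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), v);
}

// Returns true when every lane agrees with the scalar definitions.
bool LanesMatchScalar(const uint32_t b[4], const uint32_t c[4],
                      const uint32_t d[4]) {
    uint32_t ch[4], maj[4], rot[4];
    const __m128i vb = LoadLanes(b), vc = LoadLanes(c), vd = LoadLanes(d);
    StoreLanes(ch, Ch4(vb, vc, vd));
    StoreLanes(maj, Maj4(vb, vc, vd));
    StoreLanes(rot, Rotr4<7>(vb));
    for (int i = 0; i < 4; ++i) {
        const uint32_t chRef = (b[i] & c[i]) ^ (~b[i] & d[i]);
        const uint32_t majRef = (b[i] & c[i]) ^ (b[i] & d[i]) ^ (c[i] & d[i]);
        const uint32_t rotRef = (b[i] >> 7) | (b[i] << 25);
        if (ch[i] != chRef || maj[i] != majRef || rot[i] != rotRef) return false;
    }
    return true;
}
```

Calling LanesMatchScalar with any three arrays of four 32-bit words compares the vector results against the plain scalar formulas, which is a cheap sanity check before timing anything.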

satoshi

founder

Activity: 364

Merit: 7553

Got the test working on 32-bit with MinGW GCC 4.5. Exactly 50% slower than stock with Core 2.

satoshi

founder

Activity: 364

Merit: 7553

If you haven't already, try aligning thash. It might matter. Couldn't hurt.
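For reference, one way to request and check 16-byte alignment is sketched below. It assumes a C++11 compiler that honours alignas on the target (which is exactly what was in doubt for the MinGW builds discussed here). AlignedHashState and IsAligned are hypothetical names, and the layout is not the real declaration of thash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4-way hash state: 8 words x 4 lanes.
// alignas(16) asks the compiler to place every instance on a 16-byte boundary.
struct alignas(16) AlignedHashState {
    uint32_t words[8][4];
};

// True when p is a multiple of `alignment` bytes.
bool IsAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A runtime check such as IsAligned(&state, 16) just before the first aligned SSE2 load would show whether a crash is an alignment fault or something else.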

Quote from: tcatm on August 13, 2010, 07:53:07 PM

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

No help from -O0, same error.

MinGW is GCC 3.4.5. Probably the problem.

I'll see if I can get a newer version of MinGW.

tcatm

sr. member

Activity: 337

Merit: 285

Quote from: satoshi on August 13, 2010, 07:49:18 PM

MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make: *** [obj/sha256.o] Error 1

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

tcatm

sr. member

Activity: 337

Merit: 285

Quote from: sgtstein on August 13, 2010, 06:17:51 PM

1. Do we know why it doesn't work on 32bit? Is it because it's using 128bits and if so, would it help if we dropped it to 64?

No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don't have an SSE2-capable 32bit system. The additional registers in 64bit mode are also useful. I don't know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old.

Btw, did anyone with Intel CPU compare performance with Hyperthreading enabled/disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy and I can imagine Hyperthreading might decrease performance.

satoshi

founder

Activity: 364

Merit: 7553

MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make: *** [obj/sha256.o] Error 1

sgtstein

member

Activity: 61

Merit: 10

Quote from: tcatm on August 13, 2010, 04:27:14 PM

1. Does not work on 32-bit (though that's not a problem with the algorithm).
2. Patch is against older SVN. There's a git repo at http://github.com/tcatm/bitcoin-cruncher
3. Compiles on every 64bit Linux.

It's not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I'm planning a pluggable bitcoinminer someday. But at current difficulty it's easier to work for bitcoins than finding faster ways for mining.

1. Do we know why it doesn't work on 32bit? Is it because it's using 128bits and if so, would it help if we dropped it to 64?

2. Thank you, I will look into implementing it on my 64bit systems.
3. Excellent to hear. I'm looking forward to using it.

I was planning on using it on a PE2650 dual proc Xeon @3.2GHz w/HT. I would really like to get this figured out to utilize that system. I am planning one as well. At current difficulty I would agree, except when the system needs to be run anyway and latency isn't an issue.

NewLibertyStandard

sr. member

Activity: 252

Merit: 268

I would really like to have this feature included in an official build sometime soon, along with an internal speed test to determine which algorithm to use. You can always remove the speed test later once you figure out how to determine whether it will be faster or slower without running the speed test.
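A rough sketch of such a speed test, assuming each hashing engine can be wrapped in a callable that hashes a given number of nonces. HashEngine and PickFastestEngine are hypothetical names; nothing like this exists in the client or the patch as far as this thread shows.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical engine interface: hash `count` nonces and return some value
// derived from the work so the compiler cannot drop the computation.
using HashEngine = std::function<uint32_t(uint32_t count)>;

// Runs every engine once on the same workload and returns the index of the
// one that finished soonest.
std::size_t PickFastestEngine(const std::vector<HashEngine>& engines,
                              uint32_t count) {
    assert(!engines.empty());
    std::size_t best = 0;
    auto bestTime = std::chrono::steady_clock::duration::max();
    volatile uint32_t sink = 0;  // keeps each engine's result observable
    for (std::size_t i = 0; i < engines.size(); ++i) {
        const auto start = std::chrono::steady_clock::now();
        sink = sink ^ engines[i](count);
        const auto elapsed = std::chrono::steady_clock::now() - start;
        if (elapsed < bestTime) {
            bestTime = elapsed;
            best = i;
        }
    }
    return best;
}
```

A single timed run is noisy, so a real test would probably repeat the measurement a few times and use a workload long enough to dwarf timer resolution.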

tcatm

sr. member

Activity: 337

Merit: 285

1. Does not work on 32-bit (though that's not a problem with the algorithm).
2. Patch is against older SVN. There's a git repo at http://github.com/tcatm/bitcoin-cruncher
3. Compiles on every 64bit Linux.

It's not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I'm planning a pluggable bitcoinminer someday. But at current difficulty it's easier to work for bitcoins than finding faster ways for mining.

sgtstein

member

Activity: 61

Merit: 10

Just a question for whoever, trying to wrap up the information in this thread.

Does this:

1. Work on 32-bit?
2. Patch the SVN (r130 as of current) or Git?
3. Compile on CentOS?

If anyone has any answers I would greatly appreciate them.

Cheater

newbie

Activity: 13

Merit: 0

I'll just pitch in that Phenom and Phenom II processors doubled (roughly).
No difference between the two that I can tell.

Sorry, I don't have anything older than Phenoms available right now.
Might be able to access an old X2 in a week.

tcatm

sr. member

Activity: 337

Merit: 285

Would be interesting to try it out on older AMD64. There's been a change that would explain it there:
http://developer.amd.com/documentation/articles/pages/682007171.aspx

Maybe Intel did something similar without announcing it?

vess

full member

Activity: 141

Merit: 100

My core i5 doubled in speed. My Core 2 Duo is the same speed.

satoshi

founder

Activity: 364

Merit: 7553

That big of a difference in speed, by a factor of 4 or 6, feels like it's likely to be some quirky weak spot or instruction that the old chip is slow with. Unless it's a touted feature of the i5 that they made SSE2 six times faster.

A quick summary:
Xeon Quad 41% slower
Core 2 Duo 55% slower
Core 2 Duo same (vess)
Core 2 Quad 50% slower
Core i5 200% faster (nelisky)
Core i5 100% faster (vess)
AMD Opteron 105% faster

aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,

impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
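The "factor of 4 or 6" follows from the summary figures if "N% slower" and "N% faster" are read as percent change in hash rate relative to the stock build. That reading is an assumption, since the posters don't define their terms. A small sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Percent change in hash rate from `before` to `after`; negative means slower.
double PercentChange(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

// Same figure rounded to the nearest whole percent.
long RoundedPercentChange(double before, double after) {
    return std::lround(PercentChange(before, after));
}

// How many times larger chip B's speed-up is than chip A's, given each chip's
// percent change versus stock (e.g. -50 for "50% slower", 200 for "200% faster").
double SpeedFactor(double pctA, double pctB) {
    assert(pctA > -100.0);
    return (100.0 + pctB) / (100.0 + pctA);
}
```

With aceat64's numbers, RoundedPercentChange(7100, 4200) gives -41, matching the "41% slower" line. SpeedFactor(-50, 200) gives 6 and SpeedFactor(-50, 100) gives 4, which is where the two factors come from when a Core 2 at 50% slower is compared with the two i5 reports.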

nelisky

legendary

Activity: 1540

Merit: 1002

Quote from: tcatm on August 08, 2010, 06:52:53 AM

It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.

My Core2Quad (Q6600) slowed down 50%, my i5 improved ~200%, thus I don't think what you state is accurate. Maybe starting at some specific Core2?

nimnul

sr. member

Activity: 252

Merit: 250

Can we implement a speed test, so different hashing engines are tried and the fastest is chosen?

tcatm

sr. member

Activity: 337

Merit: 285

It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.
