
Topic: 4 hashes parallel on SSE2 CPUs for 0.3.6 (Read 21945 times)

legendary
Activity: 1148
Merit: 1001
August 17, 2010, 02:23:11 PM
#88
Tried Bitcoin 3.10 on Ubuntu Lucid 64-bit on an Intel Atom 330.

Using the -4way option produces about half the hash/s of running without it. I tried 1 to 4 (virtual) cores, and -4way was always slower than no option (around half). It's probably due to the Intel thing.
founder
Activity: 364
Merit: 3077
August 14, 2010, 11:40:29 PM
#87
On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner.  So now it doesn't look like it's the version of GCC, it's something else, maybe just the luck of how the stack is aligned.

I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.
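
To make the alignment suspicion concrete, here is a minimal sketch of checking whether a buffer sits on the 16-byte boundary that the aligned SSE2 load requires, with a fallback to the unaligned load. IsAligned16, Load4 and FirstLane are hypothetical names for illustration, not functions from the patch. It only covers loads the code issues itself and would not help if the compiler spills an __m128i temporary to a misaligned stack slot.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// True if p sits on a 16-byte boundary, which _mm_load_si128 requires.
inline bool IsAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Loads four 32-bit words, using the unaligned load when p is not
// 16-byte aligned. (Hypothetical helper, not part of the patch.)
inline __m128i Load4(const std::uint32_t* p) {
    return IsAligned16(p)
               ? _mm_load_si128(reinterpret_cast<const __m128i*>(p))
               : _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// Returns the lowest 32-bit lane, handy for spot checks.
inline std::uint32_t FirstLane(__m128i v) {
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));
}
```

Printing IsAligned16(&var) for the __m128i variables inside the miner thread and inside test.cpp would show whether the two call paths really see different alignment.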

I found the problem with Crypto++ on MinGW 4.5.0.  Here's the patch for that:
Code:
--- \old\sha.cpp Mon Jul 26 13:31:11 2010
+++ \new\sha.cpp Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
  ROUND(14, 0, eax, ecx, edi, edx)
  ROUND(15, 0, ecx, eax, edx, edi)
 
- ASL(1)
+    ASL(label1)   // Bitcoin: fix for MinGW GCC 4.5
  AS2(add WORD_REG(si), 4*16)
  ROUND(0, 1, eax, ecx, edi, edx)
  ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
  ROUND(14, 1, eax, ecx, edi, edx)
  ROUND(15, 1, ecx, eax, edx, edi)
  AS2( cmp WORD_REG(si), K_END)
- ASJ( jne, 1, b)
+    ASJ(    jne,    label1,  )   // Bitcoin: fix for MinGW GCC 4.5
 
  AS2( mov WORD_REG(dx), DATA_SAVE)
  AS2( add WORD_REG(dx), 64)
member
Activity: 61
Merit: 10
August 14, 2010, 11:19:31 PM
#86
Well, reporting back.

I got it to compile by passing -msse and -msse2 to gcc. At first I was hashing about 692kh/s (50% of SVN r130 [1400kh/s]), but I recompiled and am now getting ~1120kh/s. This is currently the equivalent of using both of my CPUs without HyperThreading, though I can verify that it IS using HyperThreading. With HyperThreading turned off, I get ~1350kh/s. Pretty close to the stock build.

Also, does the git contain the patched and updated code?

Code:
// SVN r130 Using HT.
08/14/10 19:02 hashmeter   4 CPUs   1392 khash/s
08/14/10 19:32 hashmeter   4 CPUs   1387 khash/s
08/14/10 20:02 hashmeter   4 CPUs   1386 khash/s
08/14/10 20:32 hashmeter   4 CPUs   1380 khash/s
08/14/10 21:02 hashmeter   4 CPUs   1363 khash/s
// With -msse -msse2, first run. Using HT.
08/14/10 21:32 hashmeter   4 CPUs    692 khash/s
08/14/10 22:06 hashmeter   4 CPUs   1011 khash/s
08/14/10 22:11 hashmeter   4 CPUs   1104 khash/s
08/14/10 22:16 hashmeter   4 CPUs   1120 khash/s
// NOT using HT.
08/14/10 22:21 hashmeter   2 CPUs   1359 khash/s
08/14/10 22:26 hashmeter   2 CPUs   1340 khash/s


Just wanted to tell my story and help with whatever information I could.
founder
Activity: 364
Merit: 3077
August 14, 2010, 06:06:13 PM
#85
MinGW GCC 4.5.0:
Crypto++ doesn't work, X86_SHA256_HashBlocks() never returns
I only got 4-way working with test.cpp but not when called by BitcoinMiner

MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV

GCC is definitely not aligning __m128i.

Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.

By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only:
#define Ch(b, c, d)  ((b & c) ^ (~b & d))
#define Maj(b, c, d)  ((b & c) ^ (b & d) ^ (c & d))
#define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n))
#define SHR(x, n)  _mm_srli_epi32(x, n)

But that's with -O0.
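
For comparison, a minimal sketch of the same helpers written with explicit SSE2 intrinsics instead of operators on __m128i, with the inputs kept in alignas(16) arrays. The names Ch4, Maj4, Rotr4, Rotr1 and LanesMatchScalar and the test values are assumptions made for illustration, not code from the patch. Note that alignas only states the requirement; whether a given MinGW build honors it on the stack is exactly what is in question here.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical 4-lane helpers; same formulas as the macros above.
inline __m128i Ch4(__m128i b, __m128i c, __m128i d) {
    // _mm_andnot_si128(b, d) computes (~b) & d
    return _mm_xor_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

inline __m128i Maj4(__m128i b, __m128i c, __m128i d) {
    return _mm_xor_si128(
        _mm_xor_si128(_mm_and_si128(b, c), _mm_and_si128(b, d)),
        _mm_and_si128(c, d));
}

// The rotate count is a template parameter so both shifts get
// compile-time constants.
template <int N>
inline __m128i Rotr4(__m128i x) {
    return _mm_or_si128(_mm_srli_epi32(x, N), _mm_slli_epi32(x, 32 - N));
}

// Scalar reference for one lane (valid for 0 < n < 32).
inline std::uint32_t Rotr1(std::uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

// Runs the helpers on aligned inputs and compares every lane against
// the scalar formula. Returns true if all four lanes agree.
inline bool LanesMatchScalar() {
    alignas(16) std::uint32_t b[4] = {0x6a09e667u, 0xbb67ae85u, 0x3c6ef372u, 0xa54ff53au};
    alignas(16) std::uint32_t c[4] = {0x510e527fu, 0x9b05688cu, 0x1f83d9abu, 0x5be0cd19u};
    alignas(16) std::uint32_t d[4] = {0x00000000u, 0xffffffffu, 0x12345678u, 0x80000001u};
    alignas(16) std::uint32_t ch[4], maj[4], rot[4];

    const __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i vc = _mm_load_si128(reinterpret_cast<const __m128i*>(c));
    const __m128i vd = _mm_load_si128(reinterpret_cast<const __m128i*>(d));

    _mm_store_si128(reinterpret_cast<__m128i*>(ch), Ch4(vb, vc, vd));
    _mm_store_si128(reinterpret_cast<__m128i*>(maj), Maj4(vb, vc, vd));
    _mm_store_si128(reinterpret_cast<__m128i*>(rot), Rotr4<7>(vb));

    for (int i = 0; i < 4; ++i) {
        if (ch[i] != ((b[i] & c[i]) ^ (~b[i] & d[i]))) return false;
        if (maj[i] != ((b[i] & c[i]) ^ (b[i] & d[i]) ^ (c[i] & d[i]))) return false;
        if (rot[i] != Rotr1(b[i], 7)) return false;
    }
    return true;
}
```

Calling LanesMatchScalar() once at startup gives a cheap sanity check that a particular compiler and optimization level produce correct 4-way results before any mining is attempted.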

founder
Activity: 364
Merit: 3077
August 14, 2010, 01:55:37 PM
#84
Got the test working on 32-bit with MinGW GCC 4.5.  Exactly 50% slower than stock with Core 2.
founder
Activity: 364
Merit: 3077
August 14, 2010, 12:22:29 AM
#83
If you haven't already, try aligning thash.  It might matter.  Couldn't hurt.

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
No help from -O0, same error.

MinGW is GCC 3.4.5.  Probably the problem.

I'll see if I can get a newer version of MinGW.

sr. member
Activity: 337
Merit: 263
August 13, 2010, 08:53:07 PM
#82
MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make: *** [obj/sha256.o] Error 1


Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
sr. member
Activity: 337
Merit: 263
August 13, 2010, 08:50:28 PM
#81
1. Do we know why it doesn't work on 32bit? Is it because it's using 128bits and if so, would it help if we dropped it to 64?

No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don't have a SSE2 capable 32bit system. The additional registers in 64bit mode are also useful. I don't know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old.

Btw, did anyone with Intel CPU compare performance with Hyperthreading enabled/disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy and I can imagine Hyperthreading might decrease performance.
founder
Activity: 364
Merit: 3077
August 13, 2010, 08:49:18 PM
#80
MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make: *** [obj/sha256.o] Error 1
member
Activity: 61
Merit: 10
August 13, 2010, 07:17:51 PM
#79
1. Does not work on 32-bit (though that's not a problem with the algorithm).
2. Patch is against older SVN. There's a git repo at http://github.com/tcatm/bitcoin-cruncher
3. Compiles on every 64bit Linux.

It's not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I'm planning a pluggable bitcoinminer someday. But at current difficulty it's easier to work for bitcoins than finding faster ways for mining.

1. Do we know why it doesn't work on 32bit? Is it because it's using 128bits and if so, would it help if we dropped it to 64?

2. Thank you, I will look into implementing it on my 64bit systems.
3. Excellent to hear. I'm looking forward to using it.

I was planning on using it on a PE2650 dual proc Xeon @3.2GHz w/HT. I would really like to get this figured out to utilize that system. I am planning one as well. At current difficulty I would agree, except when the system needs to be run anyway and latency isn't an issue.
sr. member
Activity: 252
Merit: 255
August 13, 2010, 06:27:25 PM
#78
I would really like to have this feature included in an official build sometime soon, along with an internal speed test to determine which algorithm to use. You can always remove the speed test later once you figure out how to determine whether it will be faster or slower without running the speed test.
sr. member
Activity: 337
Merit: 263
August 13, 2010, 05:27:14 PM
#77
1. Does not work on 32-bit (though that's not a problem with the algorithm).
2. Patch is against older SVN. There's a git repo at http://github.com/tcatm/bitcoin-cruncher
3. Compiles on every 64bit Linux.

It's not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I'm planning a pluggable bitcoinminer someday. But at current difficulty it's easier to work for bitcoins than finding faster ways for mining.
member
Activity: 61
Merit: 10
August 13, 2010, 05:10:33 PM
#76
Just a question for whoever, trying to wrap up the information in this thread.

Does this:
  • 1. Work on 32-bit?
  • 2. Patch the SVN (r130 as of current) or Git?
  • 3. Compile on CentOS?

If anyone has any answers I would greatly appreciate them.
newbie
Activity: 13
Merit: 0
August 13, 2010, 02:27:23 AM
#75
I'll just pitch in that Phenom and Phenom II processors doubled (roughly).
No difference between the two that I can tell.

Sorry, I don't have anything older than Phenoms available right now.
Might be able to access an old X2 in a week.
sr. member
Activity: 337
Merit: 263
August 12, 2010, 08:42:47 PM
#74
Would be interesting to try it out on older AMD64. There's been a change that would explain it there:
http://developer.amd.com/documentation/articles/pages/682007171.aspx

Maybe Intel did something similar without announcing it?
full member
Activity: 141
Merit: 100
August 12, 2010, 06:09:50 PM
#73
My core i5 doubled in speed. My Core 2 Duo is the same speed.
founder
Activity: 364
Merit: 3077
August 12, 2010, 06:07:23 PM
#72
That big of a difference in speed, by a factor of 4 or 6, feels like it's likely to be some quirky weak spot or instruction that the old chip is slow with.  Unless it's a touted feature of the i5 that they made SSE2 six times faster.

A quick summary:
Xeon Quad      41% slower
Core 2 Duo     55% slower
Core 2 Duo     same (vess)
Core 2 Quad    50% slower
Core i5        200% faster (nelisky)
Core i5        100% faster (vess)
AMD Opteron    105% faster

aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,

impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
legendary
Activity: 1540
Merit: 1000
August 12, 2010, 08:57:58 AM
#71
It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.

My Core2Quad (Q6600) slowed down 50%, my i5 improved ~200%, thus I don't think what you state is accurate. Maybe starting at some specific Core2?
sr. member
Activity: 252
Merit: 250
August 12, 2010, 08:18:23 AM
#70
Can we implement a speed test, so different hashing engines are tried and the fastest is chosen?
sr. member
Activity: 337
Merit: 263
August 08, 2010, 07:52:53 AM
#69
It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.