tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

jwalck

newbie

Activity: 17

Merit: 0

Quote from: BeeCee1 on January 20, 2011, 09:11:10 PM

I made a couple of changes to the sse2 source that sped it up about 5 or 6%:

Nice find of data dependencies!

Get about the same increase here on my AMD Opteron 6128 server. From ~32.4M to ~34.2M with all 16 cores. Too bad it will have plenty of other things to do soon. ;)

BeeCee1

member

Activity: 115

Merit: 10

I made a couple of changes to the sse2 source that sped it up about 5 or 6%:

I changed:
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
to
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(x0, x1),_mm_add_epi32( x2,x3))

It is just re-ordering the adds. In the original there is a data dependency: each add depends on the result of the one before. The way I reordered it, two of the adds are independent. This function is called a lot of times, so that little change can add up. (On an older machine it made no difference, so YMMV.)
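
A minimal sketch of the two forms side by side, assuming an x86 compiler with SSE2 enabled. The names add4_chain, add4_tree, add4_forms_agree and check_add4 are made up for illustration and are not in the miner source. Because 32-bit addition wraps modulo 2^32 and is associative, both orderings produce the same lanes; only the dependency chain differs (three adds deep versus two).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Original form: three adds in a chain, each waiting on the previous result.
static inline __m128i add4_chain(__m128i x0, __m128i x1, __m128i x2, __m128i x3) {
    return _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3);
}

// Reordered form: (x0 + x1) and (x2 + x3) do not depend on each other.
static inline __m128i add4_tree(__m128i x0, __m128i x1, __m128i x2, __m128i x3) {
    return _mm_add_epi32(_mm_add_epi32(x0, x1), _mm_add_epi32(x2, x3));
}

// Hypothetical check: builds four vectors from the given seeds and returns
// true if both forms agree in all four 32-bit lanes.
bool add4_forms_agree(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const __m128i lane = _mm_set_epi32(3, 2, 1, 0);  // make the four lanes differ
    const __m128i x0 = _mm_add_epi32(_mm_set1_epi32(static_cast<int>(a)), lane);
    const __m128i x1 = _mm_add_epi32(_mm_set1_epi32(static_cast<int>(b)), lane);
    const __m128i x2 = _mm_add_epi32(_mm_set1_epi32(static_cast<int>(c)), lane);
    const __m128i x3 = _mm_add_epi32(_mm_set1_epi32(static_cast<int>(d)), lane);
    const __m128i eq = _mm_cmpeq_epi32(add4_chain(x0, x1, x2, x3),
                                       add4_tree(x0, x1, x2, x3));
    return _mm_movemask_epi8(eq) == 0xFFFF;  // all 16 bytes compared equal
}

// Small self-check, including values that wrap around.
void check_add4() {
    assert(add4_forms_agree(1, 2, 3, 4));
    assert(add4_forms_agree(0xFFFFFFFFu, 0x80000000u, 7u, 0xDEADBEEFu));
}
```

Any gain depends on the CPU having a spare add unit to run the two independent adds at once, which may be why an older machine showed no difference.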

A portion of the nonce calculation is repeated over and over, even though the result is the same. I moved
nonce = _mm_set1_epi32(In[3]);
nonce = _mm_add_epi32(nonce, offset);
out of the "for(k = 0; k ..." loop.

Here's a diff:
153c153
> __m128i nonce,preNonce;
---
< __m128i nonce;
157d156
> preNonce = _mm_add_epi32(_mm_set1_epi32(In[3]),offset);
179,182c178,180
> //nonce = _mm_set1_epi32(In[3]);
> //nonce = _mm_add_epi32(nonce, offset);
> //nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
> nonce = _mm_add_epi32(preNonce,_mm_set1_epi32(k));
---
< nonce = _mm_set1_epi32(In[3]);
< nonce = _mm_add_epi32(nonce, offset);
< nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));

I have been running this for a couple of days on the mining pool and have generated shares.
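
The hoisting can be sketched as follows, assuming SSE2 and a made-up loop bound and step (the real loop header is cut off above). hoisted_nonce_matches and check_hoisting are hypothetical helpers, and the offset values are an assumption standing in for whatever offset holds in the miner. The helper compares the original three-step computation with the preNonce form on every iteration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Returns true if the hoisted form yields the same nonce vector as the
// original form for every k in [0, iterations).
bool hoisted_nonce_matches(uint32_t in3, int iterations) {
    const __m128i offset = _mm_set_epi32(3, 2, 1, 0);  // assumed per-lane offsets

    // Loop-invariant part, computed once before the loop.
    const __m128i preNonce =
        _mm_add_epi32(_mm_set1_epi32(static_cast<int>(in3)), offset);

    for (int k = 0; k < iterations; k++) {
        // Original form: rebuilds the invariant part on every pass.
        __m128i original = _mm_set1_epi32(static_cast<int>(in3));
        original = _mm_add_epi32(original, offset);
        original = _mm_add_epi32(original, _mm_set1_epi32(k));

        // Patched form: one add per pass.
        const __m128i hoisted = _mm_add_epi32(preNonce, _mm_set1_epi32(k));

        const __m128i eq = _mm_cmpeq_epi32(original, hoisted);
        if (_mm_movemask_epi8(eq) != 0xFFFF) return false;
    }
    return true;
}

// Small self-check, including a base value close to wrap-around.
void check_hoisting() {
    assert(hoisted_nonce_matches(0x12345678u, 64));
    assert(hoisted_nonce_matches(0xFFFFFFF0u, 64));
}
```

An optimizing compiler may already perform this hoisting on its own, so the measured gain can vary by compiler and flags.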

Ground Loop

member

Activity: 111

Merit: 10

It's still not cost effective.
These are HP BL460c blades.. around $6k each. That buys a lot of fresh CUDA!

It's a fun way to do "burn in", but not a smart use of resources.
12 hyperthreading Xeon cores, though.. each.

22,500 khash/sec with -4way, and only 13,400 without, so yeah, it's not subtle.

LZ

legendary

Activity: 1722

Merit: 1072

P2P Cryptocurrency

Cool! This means that the processor in practice can catch up with the video card!

Ground Loop

member

Activity: 111

Merit: 10

4way pays off on one of the HP blade machines..
It's a 12-core Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Running 24 threads with 4way, I get 22,569 khash/sec.

Yow.

lfm

full member

Activity: 196

Merit: 104

Quote from: Ground Loop on August 30, 2010, 07:48:33 PM

Seriously? Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.

I don't have free electricity but I am running a number of electric heaters that look like computers.

The bits produced are a by-product.

Ground Loop

member

Activity: 111

Merit: 10

Seriously? Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.

lfm

full member

Activity: 196

Merit: 104

Quote from: Gespenster on August 29, 2010, 06:11:32 AM

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if performance is OK. Probably won't be the most efficient khash/Watt though.

My Pentium-D died, but it was generally just two P4s in one package and would probably do bitcoin like that. Yes, it was terribly power hungry. The 4way code doesn't do very well on P4s in general. I only get about 900 khash/s on a 3.4 GHz P4 without -4way. With -4way it's in the 600s.

Gespenster

newbie

Activity: 15

Merit: 0

@sgtstein: Intel's Sandy Bridge (to be released Q4 2010) will also support AVX 256-bit SIMD registers. That means 8 simultaneous hash calculations/thread would be possible, in principle.
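
The lane count is simple arithmetic: SHA-256 works on 32-bit words, so one register holds a word from each of (register width / 32) independent hashes. A tiny sketch with made-up function names, assuming the CPU actually offers integer operations at the full register width:

```cpp
#include <cassert>

// Number of 32-bit lanes, i.e. hashes processed side by side, per register.
constexpr int hashes_per_register(int register_bits) {
    return register_bits / 32;
}

// Small self-check for the two register widths mentioned in the thread.
void check_lanes() {
    assert(hashes_per_register(128) == 4);  // SSE2, the current 4-way code
    assert(hashes_per_register(256) == 8);  // 256-bit registers
}
```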

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if performance is OK. Probably won't be the most efficient khash/Watt though.

satoshi

founder

Activity: 364

Merit: 7553

The simplification is intentional. There will only be more than one thash[7]=0 in one out of 134,217,728 cases. It only makes it 0.0000007% slower.
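
One way to arrive at that figure, assuming a batch of 32 hashes (the number mentioned in tcatm's post) and a 32-bit thash[7] that is zero with probability 1 in 2^32: there are roughly 32 chances out of 2^32 that a second hash in the same batch also qualifies. This is a reading of the arithmetic, not something spelled out in the post, and the names below are invented.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kWordValues = 1ULL << 32;  // possible values of a 32-bit word
constexpr uint64_t kBatchSize = 32;           // hashes examined per batch (assumed)

// Approximate "one in N" odds of a second zero top word in the same batch.
constexpr uint64_t second_hit_one_in() {
    return kWordValues / kBatchSize;
}

// The same odds expressed as a percentage.
constexpr double second_hit_percent() {
    return 100.0 / static_cast<double>(second_hit_one_in());
}

// Small self-check against the figures quoted in the post.
void check_odds() {
    assert(second_hit_one_in() == 134217728ULL);  // 2^32 / 32 = 2^27
    assert(second_hit_percent() > 7.4e-7 && second_hit_percent() < 7.5e-7);
}
```

That works out to about 0.00000075%, in line with the 0.0000007% quoted.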

tcatm

sr. member

Activity: 337

Merit: 285

I just reviewed the source code as I had a few ideas to optimize it further, and I noticed that 4way is partly broken:

from main.cpp:

Code:

                for (int j = 0; j < NPAR; j++) 
                {    
                    if (thash[7][j] == 0)
                    {                        
                        for (int i = 0; i < sizeof(hash)/4; i++) 
                          ((unsigned int*)&hash)[i] = thash[i][j];
                        pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                    }    
                }

The code will only process one hash (the last with thash[7] == 0) out of 32 hashes even when there is more than one hash that might be a correct one.

Something like this should fix it, but it won't be safe at higher difficulties. Also, I'm not sure whether the byte order should be reversed or not. Could someone review this?

Code:

                unsigned int min_hash = ~0u;  // largest 32-bit value (was ~1, i.e. 0xFFFFFFFE)
                for (int j = 0; j < NPAR; j++) 
                {    
                    if (thash[7][j] == 0)
                    {    
                        if(thash[6][j] < min_hash) {
                          min_hash = thash[6][j];
                          for (int i = 0; i < sizeof(hash)/4; i++) 
                            ((unsigned int*)&hash)[i] = thash[i][j];
                          pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                        }    
                    }    
                }
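
A sketch of one way to pick a single winner without relying on thash[6] alone, assuming word index 7 is the most significant 32-bit word and the words are already in a byte order that compares correctly (the open question raised above). HashBatch, best_lane and check_best_lane are hypothetical names; this is a standalone model, not a patch against main.cpp.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int NPAR = 32;  // hashes per batch, as in the post

// thash[word][lane]: 8 words of 32 bits for each of NPAR hashes.
using HashBatch = std::array<std::array<uint32_t, NPAR>, 8>;

// Returns the lane with the numerically lowest hash among lanes whose top
// word is zero, or -1 if no lane qualifies.
int best_lane(const HashBatch& thash) {
    int best = -1;
    for (int j = 0; j < NPAR; j++) {
        if (thash[7][j] != 0) continue;           // not a candidate
        if (best < 0) { best = j; continue; }     // first candidate seen
        // Compare the remaining words from most to least significant.
        for (int w = 6; w >= 0; w--) {
            if (thash[w][j] != thash[w][best]) {
                if (thash[w][j] < thash[w][best]) best = j;
                break;
            }
        }
    }
    return best;
}

// Small self-check: two candidates that tie on word 6 and differ on word 5.
void check_best_lane() {
    HashBatch t;
    for (auto& row : t) row.fill(0xFFFFFFFFu);
    assert(best_lane(t) == -1);
    t[7][3] = 0; t[6][3] = 5;
    t[7][9] = 0; t[6][9] = 5; t[5][9] = 1;
    assert(best_lane(t) == 9);
}
```

Comparing every word avoids the tie on thash[6] that the shorter version cannot resolve, at the cost of a few extra comparisons in a path that is almost never taken.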

satoshi

founder

Activity: 364

Merit: 7553

Quote from: ArtForz on August 21, 2010, 11:56:31 AM

AMD K10: 2 128bit units
intel nehalem: 3 128bit units

This probably explains why hyperthreading increases performance with -4way. If three SSE2 units is excessive, then hyperthreading would help keep them all busy.

sgtstein

member

Activity: 61

Merit: 10

Anybody catch the new AMD Bulldozer press release? If I understand correctly, it should be capable of processing 8 64bit hashes, per core, at the same time. Would be quite a speed boost using this same code design.

Slashdot has the article.
PC Perspective has the details.

Was also covered by AnandTech back in November, 2009.

Ground Loop

member

Activity: 111

Merit: 10

Intel Atom 230 @ 1.60GHz. Linux 32-bit.
(Acer Aspire Revo)

Stock: 438 khash/sec (1 proc gives 354)
4way: 254 khash/sec

So you can take this one off the powerhouse list.

satoshi

founder

Activity: 364

Merit: 7553

Thanks for clearing that up. I read the link someone posted about AMD making that change around 2007, but I didn't know what the story was for Intel.

There's no hope for Core/Core2 then. They only have half the SSE2 hardware.

Strange that Intel has 3 128bit units, but AMD with 2 128bit units is the faster one.

ArtForz

sr. member

Activity: 406

Merit: 257

The difference between new and older CPUs is pretty easy to explain.
Older microarchitectures have 64-bit mmx/sse execution units and split 128bit sse ops into 2 64bit microops.
Newer archs have 128bit sse units.

AMD K8: 2 64bit units
intel Core/Core2: 3 64bit units
AMD K10: 2 128bit units
intel nehalem: 3 128bit units

K10 = Opterons with 4 or more cores, Phenom, Phenom II, Athlon II
nehalem = xeon 34xx/35xx/36xx/55xx/56xx/65xx/75xx, i3/i5/i7

nelisky

legendary

Activity: 1540

Merit: 1002

Quote from: satoshi on August 19, 2010, 02:07:43 PM

Quote from: nelisky on August 18, 2010, 06:02:25 PM

And i5, at least on my macbookpro

Good, so I take it that's a confirmation that it's working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac. I don't think makefile.osx on SVN has it yet, just the built version.

Yep, it's working all right. The numbers I had posted were from an old svn revision patched with tcatm's changes, but today I compiled trunk. I once again had to tweak the makefile, but after that it works great, with numbers matching what I saw before.

The changes I made for my system are below. Some are cosmetic, like removing wx-config from the bitcoind build to avoid warnings if you don't have it installed. Others are system specific, like the DEPS dir, and the fact that I don't have 32-bit libs, which makes the link step fail if -arch i386 is there.
The bsddb changes are, I believe, for a typo. Includes and libs point to db46, but then the object list for the linker states db48. Anyway, here's the diff for what got me going:

Code:

Index: makefile.osx
===================================================================
--- makefile.osx	(revision 139)
+++ makefile.osx	(working copy)
@@ -6,29 +6,29 @@
 # Laszlo Hanyecz ([email protected])
 
 CXX=llvm-g++
-DEPSDIR=/Users/macosuser/bitcoin/deps
+DEPSDIR=/opt/local
 
 INCLUDEPATHS= \
- -I"$(DEPSDIR)/include"
+ -I"$(DEPSDIR)/include"  -I"$(DEPSDIR)/include/db46"
 
 LIBPATHS= \
- -L"$(DEPSDIR)/lib"
+ -L"$(DEPSDIR)/lib"  -L"$(DEPSDIR)/lib/db46"
 
-WXLIBS=$(shell $(DEPSDIR)/bin/wx-config --libs --static)
+WXLIBS=
 
 LIBS= -dead_strip \
- $(DEPSDIR)/lib/libdb_cxx-4.8.a \
- $(DEPSDIR)/lib/libboost_system.a \
- $(DEPSDIR)/lib/libboost_filesystem.a \
- $(DEPSDIR)/lib/libboost_program_options.a \
- $(DEPSDIR)/lib/libboost_thread.a \
+ $(DEPSDIR)/lib/db46/libdb_cxx-4.6.a \
+ $(DEPSDIR)/lib/libboost_system-mt.a \
+ $(DEPSDIR)/lib/libboost_filesystem-mt.a \
+ $(DEPSDIR)/lib/libboost_program_options-mt.a \
+ $(DEPSDIR)/lib/libboost_thread-mt.a \
  $(DEPSDIR)/lib/libcrypto.a 
 
-DEFS=$(shell $(DEPSDIR)/bin/wx-config --cxxflags) -D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0
+DEFS=-D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0 -DFOURWAYSSE2
 
 DEBUGFLAGS=-g -DwxDEBUG_LEVEL=0
 # ppc doesn't work because we don't support big-endian
-CFLAGS=-mmacosx-version-min=10.5 -arch i386 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
+CFLAGS=-mmacosx-version-min=10.5 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
 HEADERS=headers.h strlcpy.h serialize.h uint256.h util.h key.h bignum.h base58.h \
     script.h db.h net.h irc.h main.h rpc.h uibase.h ui.h noui.h init.h
 
@@ -42,6 +42,7 @@
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
+    obj/sha256.o \
     cryptopp/obj/cpu.o
 	
 
@@ -55,7 +56,7 @@
 	$(CXX) -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_ASM -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
-	$(CXX) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+	$(CXX) $(shell $(DEPSDIR)/bin/wx-config --cxxflags) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(shell $(DEPSDIR)/bin/wx-config --libs --static) $(LIBS)
 
 
 obj/nogui/%.o: %.cpp $(HEADERS)

satoshi

founder

Activity: 364

Merit: 7553

Quote from: Ground Loop on August 18, 2010, 06:14:26 PM

Any non-Mac i5 love?
Windows i5 64-bit got slower here.

That's the first I've heard anyone say i5 was slower. Everyone else has said 4way was faster on i5. More so with hyperthreading enabled.

Quote from: nelisky on August 18, 2010, 06:02:25 PM

And i5, at least on my macbookpro

Good, so I take it that's a confirmation that it's working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac. I don't think makefile.osx on SVN has it yet, just the built version.

vess

full member

Activity: 141

Merit: 100

My Core i5 laptop (Ubuntu) doubled in speed. Actually, it didn't double in speed. It stayed the same speed, but only uses half the CPU now. I can't get it to go back to full CPU usage. That said, my laptop is a lot cooler when generating blocks now. I'll post back if I see it successfully go up to 100% usage.

Ground Loop

member

Activity: 111

Merit: 10

Any non-Mac i5 love?
Windows i5 64-bit got slower here.
[correction -- not true. Windows doesn't have -4way, and the Linux machines are Xeons.]
