4 hashes parallel on SSE2 CPUs for 0.3.6 - page 3.

tcatm

sr. member

Activity: 337

Merit: 285

Did it crash with a segfault? If so, can you provide a backtrace (gdb bitcoind; run; bt)?

impossible7

newbie

Activity: 18

Merit: 0

r121 from svn, patched with the patch from post #21, running on an Opteron (x86_64).

tcatm

sr. member

Activity: 337

Merit: 285

Did you run it on 32-bit machines? Which version of the patch did you use?

impossible7

newbie

Activity: 18

Merit: 0

I kept running the patched version on 2 machines and the following has happened 5 times: bitcoind crashes and debug.log contains the following:

Code:

proof-of-work found
  hash: 00000000001c3530e42b2c7e1a20de01436882d0c1de0b63db6be8e6194255dd
target: 00000000010c5a00000000000000000000000000000000000000000000000000
CBlock(hash=00000000001c3530, ver=1, hashPrevBlock=0000000000253ab5, hashMerkleRoot=89541f, nTime=1280867359, nBits=1c010c5a, nNonce=3915571979, vtx=2)
  CTransaction(hash=4fcb8e, ver=1, vin.size=1, vout.size=1, nLockTime=0)
    CTxIn(COutPoint(000000, -1), coinbase 045a0c011c021b04)
    CTxOut(nValue=50.00000000, scriptPubKey=0xCE5264238BAC29160CDC9C)
  CTransaction(hash=8f2466, ver=1, vin.size=1, vout.size=1, nLockTime=0)
    CTxIn(COutPoint(77aaae, 1), scriptSig=0x01F561A9044BF348CEF6F4)
    CTxOut(nValue=5.00000000, scriptPubKey=OP_DUP OP_HASH160 0xB13A)
  vMerkleTree: 4fcb8e 8f2466 89541f
08/03/10 20:29 generated 50.00
AddToWallet 4fcb8e  new
AddToBlockIndex: new best=00000000001c3530  height=72112
ProcessBlock: ACCEPTED
sending: inv

I guess this means that a new block has been generated. But when I restart bitcoind the balance is still zero. When I ask for a list of generated blocks I get the following:

Code:

$ ./bitcoind listgenerated
[
    {
        "value" : 50.00000000000000,
        "maturesIn" : -1,
        "accepted" : false,
        "confirmations" : 0,
        "genTime" : 1280867359
    }
]

(listgenerated is from the patch at http://www.alloscomp.com/bitcoin/)

I guess this means that my client produced a block but it crashed before it was able to broadcast it.
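The "target" line in the debug.log excerpt above is just the expanded form of nBits=1c010c5a, and "proof-of-work found" means the block hash is at or below that value. A minimal sketch of that check, assuming the usual compact encoding (top byte is the length in bytes, lower three bytes are the leading digits); the function name compact_to_target is made up for illustration:

```python
def compact_to_target(n_bits: int) -> int:
    """Expand the compact nBits field into the full 256-bit target."""
    exponent = n_bits >> 24          # length of the target in bytes
    mantissa = n_bits & 0x007FFFFF   # leading three bytes of the target
    if exponent <= 3:
        return mantissa >> (8 * (3 - exponent))
    return mantissa << (8 * (exponent - 3))


# Values copied from the debug.log excerpt in the post above.
n_bits = 0x1C010C5A
block_hash = int(
    "00000000001c3530e42b2c7e1a20de01436882d0c1de0b63db6be8e6194255dd", 16
)

target = compact_to_target(n_bits)
print(f"target: {target:064x}")
print("hash <= target:", block_hash <= target)
```

So the log is consistent with a valid proof of work. It says nothing about whether the rest of the network ever saw the block, which is what listgenerated is reporting on.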

vess

full member

Activity: 141

Merit: 100

Is anyone able to send me a compiled version of this for Windows? I'm interested in trying it out on my AMD server.

nelisky

legendary

Activity: 1540

Merit: 1002

Well, kudos to you for trying. Now if I can just get your code merged with the old CUDA version on my MacBook Pro, I'll be a happy camper.

tcatm

sr. member

Activity: 337

Merit: 285

i5 is a different architecture than Core2. Maybe SSE in Core2 is broken and was fixed in i5. That means the original client is close to the fastest you can get on Core2. It's not a compiler thing. I compared the output for different architectures and -march=amdfam10 produces the fastest and smallest code. I would be surprised if a longer loop using the same instructions was faster on an older CPU.

nelisky

legendary

Activity: 1540

Merit: 1002

SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 222110 ms
average speed: 627 khash/s

So slightly better, but still far from good... As for AMD vs. Intel: on my Mac, which has an Intel i5, the performance boost was almost 100%, so maybe it's some compiler thing? I did have to remove -arch i386 from makefile.osx to get it to build on OS X 10.6, but there's no such flag for g++ on Linux, and I'm pretty sure the 64-bit g++ will not compile 32-bit code anyway.
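To put a number on "slightly better": the khash/s figure the test program prints is total hashes divided by total time in milliseconds. A small sketch comparing this run with the other run nelisky posted in this thread (235480 ms); the helper name khash_per_s is made up for illustration:

```python
def khash_per_s(total_hashes: int, total_ms: int) -> float:
    """Hashes per millisecond is the same number as khash per second."""
    return total_hashes / total_ms


# Figures copied from the two test runs posted in this thread.
stock_object = khash_per_s(139463136, 235480)   # run with the original sha256.o
amd64_object = khash_per_s(139463136, 222110)   # run with the replacement sha256.o

print(f"{stock_object:.1f} -> {amd64_object:.1f} khash/s")
print(f"speedup: {amd64_object / stock_object - 1:.1%}")
```

That works out to roughly a 6% gain, a long way from the doubling reported on the Opteron and the i5.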

tcatm

sr. member

Activity: 337

Merit: 285

Thanks for the object!

There are two things I noticed:
1) The Intel object runs at 3269 khash/s on my AMD64 (vs. 3778 khash/s with the AMD64 object), so it's less optimized than the AMD64 code.

2) AMD64 moves less data around and does more calculations. Sometimes it even abuses floating point instructions for integers.

Could you drop in my sha256.o from http://ul.to/2ckndx to cryptopp/obj/, delete test (not the .cpp!!) and recompile test using make -f makefile.unix test (take care it doesn't recompile sha256.cpp to sha256.o). Then run test again. It should be using AMD64 code now. Maybe it works better...

If not, we've found that AMD64 is about four times faster than Intel at SSE2 integer vector arithmetic. Anyone working on a floating point SHA256 implementation? ;)

nelisky

legendary

Activity: 1540

Merit: 1002

datla@bah:~/src/bitcoin/bitcoin-cruncher$ ./test blocks.txt
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 235480 ms
average speed: 592 khash/s

I'll send you the obj file now

tcatm

sr. member

Activity: 337

Merit: 285

Quote from: nelisky on August 02, 2010, 06:52:27 PM

I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?

Can you mail me a copy of cryptopp/obj/sha256.o to [email protected]? I still fear Intel's microcode wasn't made for such tight loops of SSE code. Have you run the test program? How many khash/s does it crunch?

nelisky

legendary

Activity: 1540

Merit: 1002

I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?

tcatm

sr. member

Activity: 337

Merit: 285

To use the test program, download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program will try to find the correct nonce in each block, which checks that the hash function works correctly. It'll also benchmark the algorithm.
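For readers who want to see the idea rather than the real thing, here is a rough sketch of what such a test does: scan nonces over an 80-byte header, double-SHA256 each candidate, compare against the target, and time it. This is not the actual test program and says nothing about the format of its input file. The header below is a placeholder of zero bytes and the target is deliberately easy (about one hit per 4096 hashes) so the loop finishes quickly; both are assumptions for illustration.

```python
import hashlib
import struct
import time


def block_hash(header76: bytes, nonce: int) -> int:
    """Double SHA-256 of the 80-byte header, read as a little-endian integer."""
    header = header76 + struct.pack("<I", nonce)
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")


def find_nonce(header76: bytes, target: int, max_nonce: int = 2**32):
    """Return (nonce, hashes_tried); nonce is None if nothing met the target."""
    for nonce in range(max_nonce):
        if block_hash(header76, nonce) <= target:
            return nonce, nonce + 1
    return None, max_nonce


# Placeholder data, not a real block: 76 zero bytes plus the 4-byte nonce.
header76 = bytes(76)
easy_target = (1 << 244) - 1

start = time.perf_counter()
nonce, tried = find_nonce(header76, easy_target)
elapsed = max(time.perf_counter() - start, 1e-9)
print(f"nonce={nonce} after {tried} hashes, {tried / elapsed / 1000:.0f} khash/s")
```

A real check would feed in headers from known blocks and confirm that the search lands on the nonce recorded in the chain. If an optimized hash function never finds it, the function is broken even if it is fast.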

From what I've heard the patch does not work on 32 bit systems. I don't know why. I've developed it on an AMD64 machine and it works fine. If it's slower on Intel, try to disable Hyperthreading. The big loop in the SSE2 code doesn't contain any "normal" x86 except for one jump at the end.

Btw, there's a git repo at http://github.com/tcatm/bitcoin-cruncher/

impossible7

newbie

Activity: 18

Merit: 0

After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.

The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn't work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.
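The 0.011% figure can be reproduced by treating block discovery as a Poisson process. A minimal sketch, assuming the target in force was the one for nBits=1c010c5a (taken from the debug.log excerpt posted elsewhere in this thread, not stated in this post) and a steady 51,000 khash/s for the whole 52 hours; the function name prob_no_blocks is made up for illustration:

```python
import math


def prob_no_blocks(hashrate_hps: float, seconds: float, target: int) -> float:
    """Poisson probability of finding zero blocks in the given time."""
    p_per_hash = (target + 1) / 2**256       # chance that one hash meets the target
    expected_blocks = hashrate_hps * seconds * p_per_hash
    return math.exp(-expected_blocks)


# Assumption: target expanded from nBits=1c010c5a (difficulty around 244).
target = 0x010C5A << 200

p = prob_no_blocks(51_000e3, 52 * 3600, target)
print(f"P(no blocks in 52 h) = {p:.3%}")
```

Under those assumptions about nine blocks would be expected in 52 hours, so finding none is strong evidence that something is wrong with the patched build on those machines.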

jgarzik

legendary

Activity: 1596

Merit: 1100

FWIW, GCC has -mstackrealign and -mpreferred-stack-boundary=NUM.

satoshi

founder

Activity: 364

Merit: 7553

Is it 2x as fast on AMD and half as fast on Intel?

Quote from: tcatm on July 31, 2010, 05:12:38 AM

Btw, why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compile time?

Tried that, but it doesn't work for things on the stack. I ran some tests.

It doesn't even cause an error, it just doesn't align it.

petree

newbie

Activity: 4

Merit: 0

Quote from: impossible7 on August 02, 2010, 04:31:44 AM

Quote from: Ground Loop on August 02, 2010, 04:17:07 AM

With the patch above, I was unable to build the test program. You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.

Quote from: petree on August 02, 2010, 04:22:29 AM

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

Yes, since applying this patch I've generated 2 blocks.

nelisky

legendary

Activity: 1540

Merit: 1002

Quote from: impossible7 on August 02, 2010, 04:00:55 AM

Quote from: knightmb on August 02, 2010, 03:47:04 AM

Is it a AMD only optimization perhaps?

Or a 64-bit only optimization.

I'm trying it on a Q6600 running 64-bit Linux (Ubuntu Server) and it makes things slower there, so it's not 64-bit only. I'm also running it on my Mac laptop, which sports an Intel i5 (also 64-bit, OS X 10.6), with a great speed improvement there, so it's not AMD only.

impossible7

newbie

Activity: 18

Merit: 0

Quote from: Ground Loop on August 02, 2010, 04:17:07 AM

With the patch above, I was unable to build the test program. You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.

Quote from: petree on August 02, 2010, 04:22:29 AM

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

petree

newbie

Activity: 4

Merit: 0

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

Is there a way we can confirm that the variables are being aligned properly? I'm wondering if the Intel procs are less tolerant of misalignment than the AMDs.
