Topic: 4 hashes parallel on SSE2 CPUs for 0.3.6 - page 3. (Read 22072 times)

sr. member
Activity: 337
Merit: 285
August 03, 2010, 05:00:35 PM
#48
did it crash with a segfault and can you provide a backtrace (gdb bitcoind; run; bt)?
newbie
Activity: 18
Merit: 0
August 03, 2010, 04:56:47 PM
#47
r121 from svn, patched with the patch from post #21, running on an Opteron/x86_64
sr. member
Activity: 337
Merit: 285
August 03, 2010, 04:53:54 PM
#46
did you run it on 32 bit machines? which version of the patch did you use?
newbie
Activity: 18
Merit: 0
August 03, 2010, 04:49:12 PM
#45
I kept running the patched version on 2 machines and the following has happened 5 times: bitcoind crashes and debug.log contains the following:

Code:
proof-of-work found
  hash: 00000000001c3530e42b2c7e1a20de01436882d0c1de0b63db6be8e6194255dd
target: 00000000010c5a00000000000000000000000000000000000000000000000000
CBlock(hash=00000000001c3530, ver=1, hashPrevBlock=0000000000253ab5, hashMerkleRoot=89541f, nTime=1280867359, nBits=1c010c5a, nNonce=3915571979, vtx=2)
  CTransaction(hash=4fcb8e, ver=1, vin.size=1, vout.size=1, nLockTime=0)
    CTxIn(COutPoint(000000, -1), coinbase 045a0c011c021b04)
    CTxOut(nValue=50.00000000, scriptPubKey=0xCE5264238BAC29160CDC9C)
  CTransaction(hash=8f2466, ver=1, vin.size=1, vout.size=1, nLockTime=0)
    CTxIn(COutPoint(77aaae, 1), scriptSig=0x01F561A9044BF348CEF6F4)
    CTxOut(nValue=5.00000000, scriptPubKey=OP_DUP OP_HASH160 0xB13A)
  vMerkleTree: 4fcb8e 8f2466 89541f
08/03/10 20:29 generated 50.00
AddToWallet 4fcb8e  new
AddToBlockIndex: new best=00000000001c3530  height=72112
ProcessBlock: ACCEPTED
sending: inv

I guess this means that a new block has been generated. But when I restart bitcoind the balance is still zero. When I ask for a list of generated blocks I get the following:

Code:
$ ./bitcoind listgenerated
[
    {
        "value" : 50.00000000000000,
        "maturesIn" : -1,
        "accepted" : false,
        "confirmations" : 0,
        "genTime" : 1280867359
    }
]
(listgenerated is from the patch at http://www.alloscomp.com/bitcoin/)

I guess this means that my client produced a block but it crashed before it was able to broadcast it.
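
The numbers in the log can be sanity-checked offline. Below is a minimal Python sketch, assuming the usual compact encoding of nBits (high byte is the target's length in bytes, low three bytes are its leading digits). The function name is illustrative and not taken from the client source.

```python
def target_from_nbits(nbits: int) -> int:
    """Expand the compact nBits field into the full 256-bit target."""
    exponent = nbits >> 24           # length of the target in bytes
    mantissa = nbits & 0x007FFFFF    # leading three bytes of the target
    return mantissa << (8 * (exponent - 3))  # sketch: only handles exponent >= 3


# Values copied from the debug.log excerpt in this post.
nbits = 0x1C010C5A
block_hash = int(
    "00000000001c3530e42b2c7e1a20de01436882d0c1de0b63db6be8e6194255dd", 16
)

target = target_from_nbits(nbits)
print(format(target, "064x"))   # compare with the "target:" line in the log
print(block_hash <= target)     # the proof-of-work condition

# Difficulty relative to the easiest target (nBits 0x1d00ffff).
difficulty = target_from_nbits(0x1D00FFFF) / target
print(round(difficulty, 1))
```

If the encoding assumption holds, the hash sits below the target and the difficulty comes out at roughly 244, so the block itself looks valid. That would point at the crash or the wallet bookkeeping rather than at the hashing.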
full member
Activity: 141
Merit: 100
August 03, 2010, 03:36:04 PM
#44
Anyone able to send me a compiled version of this for windows? I'm interested to try it out on my AMD server.
legendary
Activity: 1540
Merit: 1002
August 02, 2010, 09:23:48 PM
#43
Well, kudos to you for trying. Now if I can just get your code merged with the old cuda version on my macbook pro, I'll be a happy camper :)
sr. member
Activity: 337
Merit: 285
August 02, 2010, 09:21:50 PM
#42
i5 is a different architecture than Core2. Maybe SSE in Core2 is broken and was fixed in i5. That means the original client is close to the fastest you can get on Core2. It's not a compiler thing. I compared the output for different architectures and -march=amdfam10 produces the fastest and smallest code. I would be surprised if a longer loop using the same instructions was faster on an older CPU.
legendary
Activity: 1540
Merit: 1002
August 02, 2010, 09:04:22 PM
#41
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 222110 ms
average speed: 627 khash/s

So slightly better, but still far from good... As for AMD vs Intel, on my Mac, which is an Intel i5, the performance boost was almost 100%, so maybe it's some compiler thing? I did have to remove the -arch i386 from makefile.osx to have it build on OS X 10.6, but there's no such flag in Linux's g++ and I'm pretty sure the 64-bit g++ will not compile 32-bit anyway.
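
For anyone comparing runs: the "average speed" line appears to be total hashes divided by elapsed milliseconds, since hashes per millisecond is the same thing as khash/s. A small Python sketch, assuming the test program truncates to an integer (the helper name is made up for illustration):

```python
def khash_per_s(total_hashes: int, total_ms: int) -> int:
    """Hashes per millisecond equals khash/s; truncate like integer division."""
    return total_hashes // total_ms


# (total hashes, total time in ms) as printed by the two test runs in this thread.
runs = {
    "run reporting 592 khash/s": (139463136, 235480),
    "run reporting 627 khash/s": (139463136, 222110),
}

for label, (hashes, ms) in runs.items():
    print(label, "->", khash_per_s(hashes, ms), "khash/s")

# Relative gain of the second run over the first, in percent.
gain = (235480 / 222110 - 1) * 100
print(f"speed-up: {gain:.1f}%")
```

Both reported figures are reproduced by this arithmetic, and the gain from swapping the object file works out to about 6%.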
sr. member
Activity: 337
Merit: 285
August 02, 2010, 08:17:53 PM
#40
Thanks for the object!

There are two things I noticed:
1) The Intel object runs at 3269 khash/s on my AMD64 (vs. 3778 khash/s) so it's less optimized than the AMD64 code.

2) AMD64 moves less data around and does more calculations. Sometimes it even abuses floating point instructions for integers.

Could you drop my sha256.o from http://ul.to/2ckndx into cryptopp/obj/, delete test (not the .cpp!!) and recompile test using make -f makefile.unix test (take care it doesn't recompile sha256.cpp to sha256.o)? Then run test again. It should be using AMD64 code now. Maybe it works better...

If not, we've found that AMD64 is about four times faster than Intel at SSE2 integer vector arithmetic. Anyone working on a floating point SHA256 implementation? ;)
legendary
Activity: 1540
Merit: 1002
August 02, 2010, 07:46:12 PM
#39
datla@bah:~/src/bitcoin/bitcoin-cruncher$ ./test blocks.txt
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 235480 ms
average speed: 592 khash/s

I'll send you the obj file now
sr. member
Activity: 337
Merit: 285
August 02, 2010, 07:21:41 PM
#38
Can you mail me a copy of cryptopp/obj/sha256.o to [email protected]? I still fear Intel's microcode in their CPUs wasn't made for such tight loops of SSE code. Have you run the test program? How many khash/s does it crunch?
legendary
Activity: 1540
Merit: 1002
August 02, 2010, 06:52:27 PM
#37
I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?
sr. member
Activity: 337
Merit: 285
August 02, 2010, 04:07:56 PM
#36
To use the test program download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program will try to find the correct nonce for each block and check whether the hash function works correctly. It'll also benchmark the algorithm.
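
To illustrate what such a check involves, here is a rough Python sketch of verifying one nonce against a block header. It is not the test program's code and the layout of the downloaded file is not assumed; the function names are hypothetical, and the sample header is the well-known genesis block, which is not part of this thread.

```python
import hashlib
import struct


def target_from_nbits(nbits: int) -> int:
    """Expand compact nBits into the full target (sketch, exponent >= 3)."""
    return (nbits & 0x007FFFFF) << (8 * ((nbits >> 24) - 3))


def block_hash(version, prev_hex, merkle_hex, time, nbits, nonce) -> int:
    """Double SHA-256 of the 80-byte header, returned as an integer."""
    header = (
        struct.pack("<I", version)
        + bytes.fromhex(prev_hex)[::-1]      # hashes are stored byte-reversed
        + bytes.fromhex(merkle_hex)[::-1]
        + struct.pack("<III", time, nbits, nonce)
    )
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")  # displayed form is byte-reversed


def nonce_is_valid(version, prev_hex, merkle_hex, time, nbits, nonce) -> bool:
    h = block_hash(version, prev_hex, merkle_hex, time, nbits, nonce)
    return h <= target_from_nbits(nbits)


# Sample input: the genesis block header (assumption: used here only as a
# publicly known test vector).
genesis = dict(
    version=1,
    prev_hex="00" * 32,
    merkle_hex="4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b",
    time=1231006505,
    nbits=0x1D00FFFF,
)

print(format(block_hash(nonce=2083236893, **genesis), "064x"))
print(nonce_is_valid(nonce=2083236893, **genesis))
print(nonce_is_valid(nonce=2083236894, **genesis))
```

A hash routine that is fast but wrong would fail this kind of check on every known block, which is why running the test matters before trusting any khash/s figure.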

From what I've heard the patch does not work on 32 bit systems. I don't know why. I've developed it on an AMD64 machine and it works fine. If it's slower on Intel, try to disable Hyperthreading. The big loop in the SSE2 code doesn't contain any "normal" x86 except for one jump at the end.

Btw, there's a git repo at http://github.com/tcatm/bitcoin-cruncher/
newbie
Activity: 18
Merit: 0
August 02, 2010, 03:49:25 PM
#35
After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.

The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn't work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.
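
The 0.011% figure can be reproduced with a Poisson estimate. This is a sketch under two assumptions: the difficulty stayed at the level implied by nBits 1c010c5a (the value that shows up in a debug.log excerpt elsewhere in this thread) for the whole 52 hours, and "51,000 khash/s" means 51 million hashes per second. The helper name is illustrative.

```python
import math


def target_from_nbits(nbits: int) -> int:
    """Expand compact nBits into the full target (sketch, exponent >= 3)."""
    return (nbits & 0x007FFFFF) << (8 * ((nbits >> 24) - 3))


hash_rate = 51_000 * 1_000               # hashes per second
seconds = 52 * 3600                      # 52 hours
target = target_from_nbits(0x1C010C5A)   # assumption: constant difficulty

p_hash = (target + 1) / 2**256           # chance that one hash meets the target
expected_blocks = hash_rate * seconds * p_hash
p_no_block = math.exp(-expected_blocks)  # Poisson probability of zero blocks

print(f"expected blocks in 52 h: {expected_blocks:.2f}")
print(f"P(no block): {p_no_block:.3%}")
```

Under those assumptions about nine blocks would be expected, and the chance of finding none is on the order of 0.01%, in line with the figure quoted above.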
legendary
Activity: 1596
Merit: 1100
August 02, 2010, 02:15:23 PM
#34
FWIW, there exists -mstackrealign and -mpreferred-stack-boundary=NUM
founder
Activity: 364
Merit: 7553
August 02, 2010, 02:02:46 PM
#33
Is it 2x fast on AMD and 1/2 fast on Intel?

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compile time?
Tried that, but it doesn't work for things on the stack.  I ran some tests.

It doesn't even cause an error, it just doesn't align it.
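
For readers unfamiliar with the trick: the body of alignup<16> is not shown in this thread, so the following is only a guess at the address arithmetic such a helper performs, written in Python purely to show the rounding (Python itself has no control over stack placement). The names align_up and is_aligned are hypothetical.

```python
ALIGN = 16


def align_up(addr: int, align: int = ALIGN) -> int:
    """Round addr up to the next multiple of align (align is a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def is_aligned(addr: int, align: int = ALIGN) -> bool:
    """Check to run on a buffer address before handing it to SSE2 code."""
    return addr % align == 0


# Whatever address the compiler picks for a stack buffer, reserving 15 spare
# bytes guarantees that an aligned address exists inside the buffer.
for base in range(0x7FFF0000, 0x7FFF0000 + ALIGN):
    aligned = align_up(base)
    assert is_aligned(aligned)
    assert 0 <= aligned - base < ALIGN

print(hex(align_up(0x7FFF0009)))
```

The same is_aligned test, applied to the real pointer value at runtime, is one way to confirm whether a variable ended up on a 16-byte boundary.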
newbie
Activity: 4
Merit: 0
August 02, 2010, 11:12:33 AM
#32
With the patch above, I was unable to build the test program.  You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.


The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client.  I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

Yes, since applying this patch I've generated 2 blocks.
legendary
Activity: 1540
Merit: 1002
August 02, 2010, 08:05:48 AM
#31
Is it an AMD-only optimization perhaps?

Or a 64-bit only optimization.

I'm trying on a Q6600 running 64-bit Linux (Ubuntu server) and it makes things slower there, so not 64-bit only. And I'm running on my Mac laptop, which sports an Intel i5 (also 64-bit, OS X 10.6), with a great speed improvement there, so not AMD only.
newbie
Activity: 18
Merit: 0
August 02, 2010, 04:31:44 AM
#30
With the patch above, I was unable to build the test program.  You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.


The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client.  I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
newbie
Activity: 4
Merit: 0
August 02, 2010, 04:22:29 AM
#29
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client.  I was even able to port its minor changes to 0.3.7 successfully, with the same results.

Is there a way we can confirm that the variables are being aligned properly?  I'm wondering if the Intel procs are less tolerant of misalignment than the AMD's.