Topic: bitcoind stops responding to RPC requests (Read 3974 times)

vip
Activity: 447
Merit: 258
April 22, 2011, 09:28:52 AM
#20
Coin{Pal,Card} are now running a nightly build including the deadlock changes.  I'll report here if bitcoind hangs again.

Was CoinPal's April 18th service issue related to this? Your post mentioned "I've restarted some server components and the site appears to be working fine now".

I'm glad you brought this up sgornick.  I should have mentioned it here.  That particular outage was caused by an error in my code causing it to leak open file handles.  It wasn't related to bitcoind.

Since upgrading to a nightly build on April 15th, I haven't had any problems with bitcoind hanging.  I almost certainly would have seen one by now if the problem were still present.  Thanks all for your help diagnosing and fixing the bug.
legendary
Activity: 2506
Merit: 1010
Coin{Pal,Card} are now running a nightly build including the deadlock changes.  I'll report here if bitcoind hangs again.

Was CoinPal's April 18th service issue related to this? Your post mentioned "I've restarted some server components and the site appears to be working fine now".
legendary
Activity: 2506
Merit: 1010
Just an FYI -- gjs278 shared a monit script to restart:

  Restart bitcoind automatically if it crashes or dies using Monit:
    https://bitcointalksearch.org/topic/guide-restart-bitcoind-automatically-if-it-crashes-or-dies-using-monit-5911
vip
Activity: 447
Merit: 258
Coin{Pal,Card} are now running a nightly build including the deadlock changes.  I'll report here if bitcoind hangs again.
legendary
Activity: 1386
Merit: 1097
This patch is already in Bitcoin upstream; it looks like more people were watching it :). I'll try to use it in the pool tomorrow...
vip
Activity: 447
Merit: 258
mndrix, did you successfully test jgarzik's patch?

I haven't tested the patch.  My feeble attempts to compile Bitcoin from source have failed (which speaks to my ignorance, not to a problem with Bitcoin).  Does anyone know if the patch is available in a release candidate build for Linux yet?
legendary
Activity: 1386
Merit: 1097
Today I had a problem similar to mndrix's: bitcoind froze during payouts. It was the second time in the pool's history, but the first time with the sendmany command.

mndrix, did you successfully test jgarzik's patch?
legendary
Activity: 1596
Merit: 1100
Pull request: https://github.com/bitcoin/bitcoin/pull/136

Direct link to commit (patch): https://github.com/jgarzik/bitcoin/commit/4feff786546448e2c436956ad77b9081167e3124

Unfortunately the commit is larger than it should be for easy reading, because large blocks of code were un-indented.

vip
Activity: 447
Merit: 258
Well done.  Let me know when a patch makes it into a beta/nightly build and I'll run it in production to test.
sr. member
Activity: 406
Merit: 257
Another one
setaccount
    CRITICAL_BLOCK(cs_mapAddressBook)
        GetAccountAddress(strOldAccount)
            CRITICAL_BLOCK(cs_mapWallet)

processmessages:
CRITICAL_BLOCK(cs_main)
    ProcessMessage(pfrom, strCommand, vMsg)
        AddToWalletIfMine()
            AddToWallet(wtx)
                CRITICAL_BLOCK(cs_mapWallet)
                    walletdb.WriteName(PubKeyToAddress(vchDefaultKey), "")
                        CRITICAL_BLOCK(cs_mapAddressBook)
sr. member
Activity: 360
Merit: 250
@Gavin: Document? Always a good thing. This is tricky stuff, as ArtForz has shown. My own experience goes like: 1: If you don't really have to lock, push into a serial action queue; 2: when you really do have to lock, prepare everything beforehand, then lock, alter and unlock as swiftly as possible; and 3: er, yeh, document, at least so that you can recall what the heck you were up to when you decided you needed that lock.

Obviously, this becomes real hard when we're dealing with what are essentially library primitives for manipulating the dataset.

If I were sober at the moment I'd produce a precompiler macro that would flag potential nested locks in the control flow. Fortunately, I'm not sober.
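
The idea of flagging nested locks can be illustrated at runtime rather than with a precompiler macro. The sketch below is only an illustration: LockOrderChecker, Enter and Leave are hypothetical names, not anything in the Bitcoin source. It records which lock was held when another was taken and reports when the same pair later shows up in the opposite order. It assumes single-threaded bookkeeping; real use would need a per-thread stack of held locks and a mutex around the shared set.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical runtime checker (not part of bitcoind): remembers every
// (outer, inner) pair of nested locks and flags a later inversion.
class LockOrderChecker
{
public:
    // Call when a lock is acquired. Returns false if this acquisition
    // reverses an ordering that was observed earlier.
    bool Enter(const std::string& name)
    {
        bool ok = true;
        for (const std::string& held : held_) {
            if (held == name)
                continue; // recursive acquisition, not an ordering
            if (seen_.count(std::make_pair(name, held)))
                ok = false; // earlier: name then held; now: held then name
            seen_.insert(std::make_pair(held, name));
        }
        held_.push_back(name);
        return ok;
    }

    // Call when the most recently acquired lock is released.
    void Leave()
    {
        if (!held_.empty())
            held_.pop_back();
    }

private:
    std::vector<std::string> held_;                       // locks currently held, oldest first
    std::set<std::pair<std::string, std::string>> seen_;  // (outer, inner) pairs seen so far
};
```

Feeding it cs_mapWallet then cs_main, releasing both, and then feeding cs_main then cs_mapWallet makes the last Enter return false, which is the pattern in ArtForz's traces. A checker like this only reports orderings that actually execute, so it complements a manual review rather than replacing it.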
sr. member
Activity: 406
Merit: 257
Well, a quick manual check suggests that, for cs_main + cs_mapWallet, only sendfrom and sendmany in rpc.cpp are doing the wrong thing.
legendary
Activity: 1470
Merit: 1006
Bringing Legendary Har® to you since 1952
Oops. Should RPCs be run with the BFL held?
D'oh!

sendfrom should definitely CRITICAL_BLOCK(cs_main).  Nice catch ArtForz.


Which version will the patch be scheduled for?
legendary
Activity: 1652
Merit: 2301
Chief Scientist
Does anybody have experience with Helgrind (valgrind --tool=helgrind) or other automated tools for finding potential deadlocks?

Running it on bitcoind I'm getting a huge number of false positives...

Should we just document every method that holds one or more locks?  I'm worried there are other possible deadlocks lurking.
sr. member
Activity: 360
Merit: 250
ArtForz, you've got BTC 5.00 incoming from me for spotting this. Very well done.
legendary
Activity: 1652
Merit: 2301
Chief Scientist
Oops. Should RPCs be run with the BFL held?
D'oh!

sendfrom should definitely CRITICAL_BLOCK(cs_main).  Nice catch ArtForz.
legendary
Activity: 1526
Merit: 1134
Oops. Should RPCs be run with the BFL held?
sr. member
Activity: 406
Merit: 257
I think we got a deadlock in there...

rpc:
sendfrom
    CRITICAL_BLOCK(cs_mapWallet)
        SendMoneyToBitcoinAddress(strAddress, nAmount, wtx)
            SendMoney(scriptPubKey, nValue, wtxNew, fAskFee)
                CRITICAL_BLOCK(cs_main)
                    ...

processmessages:
CRITICAL_BLOCK(cs_main)
    ProcessMessage(pfrom, strCommand, vMsg)
        AddToWalletIfMine()
            AddToWallet(wtx)
                CRITICAL_BLOCK(cs_mapWallet)
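
The two traces above take the same pair of locks in opposite orders, so each thread can end up holding one lock while waiting for the other. A minimal sketch of the usual remedy, which is to have every path take the locks in one agreed order, might look like the following. This is not the actual patch: std::mutex stands in for CRITICAL_BLOCK, and the names ending in Demo plus the two globals are hypothetical.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-ins for cs_main and cs_mapWallet.
std::mutex cs_main_demo;
std::mutex cs_wallet_demo;

// Hypothetical shared state guarded by the two locks.
int g_chainHeight = 0;
std::map<std::string, long> g_walletTx;

// RPC-style path. Taking cs_main first, then the wallet lock, matches the
// order used by the message-processing path below.
bool SendFromDemo(const std::string& txid, long amount)
{
    std::lock_guard<std::mutex> mainLock(cs_main_demo);
    std::lock_guard<std::mutex> walletLock(cs_wallet_demo);
    g_walletTx[txid] = -amount;  // record an outgoing payment
    return g_chainHeight >= 0;   // placeholder for "send succeeded"
}

// Message-processing path: cs_main first, then the wallet lock.
void ProcessMessageDemo(const std::string& txid, long amount)
{
    std::lock_guard<std::mutex> mainLock(cs_main_demo);
    ++g_chainHeight;             // placeholder for chain-state work
    std::lock_guard<std::mutex> walletLock(cs_wallet_demo);
    g_walletTx[txid] = amount;   // record an incoming payment
}
```

Because both functions acquire cs_main_demo before cs_wallet_demo, neither can hold the wallet lock while waiting for the main lock. Where a fixed order is hard to enforce, std::lock can acquire several mutexes at once with built-in deadlock avoidance.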
legendary
Activity: 1596
Merit: 1100
Have you played around with -rpctimeout ?
vip
Activity: 447
Merit: 258
While operating CoinPal, I've had the bitcoin daemon hang several times.  The behavior has been the same each time.  RPC calls time out without a response.  I restart the daemon, it catches up with the blockchain and works correctly again for several hours or days before it happens again.

Here's the information I have.  When I notice the daemon has hung, the tail of debug.log always looks about like this.  I can watch the log indefinitely and see only similar messages streaming by as normal:

Code:
IRC got join
IRC got join
AddAddress()
IRC got new address
IRC got join
IRC got join

If I look backwards in the debug log to the last activity not related to addresses and IRC, I usually get something similar to this:

Code:
IRC got join
received: inv (37 bytes)
  got inventory: tx 1d95d66a217e5fbe49bd  new
askfor tx 1d95d66a217e5fbe49bd   0
sending getdata: tx 1d95d66a217e5fbe49bd
sending: getdata (37 bytes)
received: inv (37 bytes)
  got inventory: tx 1d95d66a217e5fbe49bd  new
askfor tx 1d95d66a217e5fbe49bd   1300914187000000
received: inv (37 bytes)
  got inventory: tx 1d95d66a217e5fbe49bd  new
askfor tx 1d95d66a217e5fbe49bd   1300914307000000
received: tx (617 bytes)
ThreadRPCServer method=sendfrom
IRC got join

That last "sendfrom" request never sends a response.

When the daemon hangs again, what information should I collect so that developers can diagnose the problem?