Pages:
Author

Topic: [20 BTC] Multithreaded Keep-alive Implementation in Bitcoind - page 4. (Read 31453 times)

legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
Thanks for helping me make these patches better. If hope I didn't come across as hostile. I know we both want the same thing -- a network than can remain stable and reliable even with exponential growth.
hero member
Activity: 755
Merit: 515
On the next version of the hub mode patches, I will reduce the number of outgoing connections further. In hub mode 2, recommended for mining controllers, I will reduce the output connections from 48 to 32. I feel this is needed because otherwise there is too much of a delay on a restart before the node has an adequate view of the network. I will reduce hub mode 3 from 96 to 32 as well. Hub mode 3 is intended for stable network nodes, and there is no particular reason to need a quick network view.
32 seems like a reasonable number of outgoing connections.  Just keep in mind (or I guess others who haven't upgraded should keep in mind) that the reason it takes so long for nodes to get a good connection is not because of a poor network, it is because nodes have not upgraded to 0.3.24.  If you connect to a pre-.24 node and are more than a day or two out of date with the blockchain, you will be disconnected before you are able to get updated, which causes so many of the recent problems.  Seriously people, when it is said that you NEED TO UPGRADE TO 0.3.24, there is a reason behind it.
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
I wasnt aware you had dropped the outgoing cap, but 96 is still way, way, way too much.  If you want more outgoing connections, fine, but maybe make 10 or 20, not 96.  Accepting more incoming is great, and that should be encouraged. However, accepting more incoming than you make outgoing is not a solution, nor does it add more to the network than it takes away.  Keep in mind that nodes attempt to keep a diverse connection pool (something that could be implemented better, though) so accepting 1000 connections does not make up for making 100 connections because you are never going to get that many connections and because you are taking 100 slots on nodes in different /24s.
I don't think this argument is valid. It seems valid if you think about one node, but think about 100 nodes on different /24s. Those 100 nodes running in hub mode are adding 500 plus connection slots each and in total they are doing so on 100 different /24s.

Of course any one node can't add IP diversity. But so long as each node adds more capacity than it takes, there should be a net gain in IP diversity.

It was based on 0.3.23 until just recently. I absolutely agree that running in hub mode without that patch is a problem.

On the next version of the hub mode patches, I will reduce the number of outgoing connections further. In hub mode 2, recommended for mining controllers, I will reduce the outbound connections from 48 to 32. I feel this is needed because otherwise there is too much of a delay on a restart before the node has an adequate view of the network. I will reduce hub mode 3 from 96 to 32 as well. Hub mode 3 is intended for stable network nodes, and there is no particular reason to need a quick network view.

 const unsigned HubModes[4][4]=
 { // outbound connections, total connections, IP mask, multithreaded connect
  {   8, 125, 0x0000ffff, 0 }, // Normal mode
- {  32, 200, 0x0000ffff, 0 }, // Small hub mode
- {  48, 384, 0x00ffffff, 1 }, // Medium hub mode
- {  96, 640, 0xffffffff, 1 }  // Large hub mode
+ {  16, 200, 0x0000ffff, 0 }, // Small hub mode
+ {  32, 384, 0x00ffffff, 1 }, // Medium hub mode
+ {  32, 640, 0x00ffffff, 1 }  // Large hub mode
 };

Update: This change is in the current bitcoin-4diff that is up on my web server.
hero member
Activity: 755
Merit: 515
The newer versions decrease the number of outgoing connections made over the initial versions. In the largest hub mode now, the outgoing connections is capped at 96. In exchange, a much larger number of inbound connections (over 500 instead of 125) is accepted, so on balance the network gains in the number of connections it can accept.
I wasnt aware you had dropped the outgoing cap, but 96 is still way, way, way too much.  If you want more outgoing connections, fine, but maybe make 10 or 20, not 96.  Accepting more incoming is great, and that should be encouraged. However, accepting more incoming than you make outgoing is not a solution, nor does it add more to the network than it takes away.  Keep in mind that nodes attempt to keep a diverse connection pool (something that could be implemented better, though) so accepting 1000 connections does not make up for making 100 connections because you are never going to get that many connections and because you are taking 100 slots on nodes in different /24s.  Again, let me just point out that though there is a valid argument for having something to help you get a better connection to more up-to-date nodes, hub mode is probably the worst way to fix the problem possible.  
As a side note, if you are going to encourage people to patch 0.3.23, please, please include https://github.com/bitcoin/bitcoin/commit/497317453422611a077f7f195eb193d3bb597a9c as without it, you really aren't helping the network at all. Sorry, for some reason I still thought this was based on .23, not sure why I thought that.
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
Can you explain the "hub" mode a little better?
Don't use it...period.  Hub mode creates a ton more outgoing connections in a much more aggressive manner than stock bitcoin.  This is very bad for the network as the number of nodes accepting incoming connections is relatively low and filling all of their slots that you can is really quite terrible.
This was a valid criticism against the first hub mode patches. I was not aware of this issue and have changed the patches in such a way that I believe they help solve this problem rather than making it worse.

The newer versions decrease the number of outgoing connections made over the initial versions. In the largest hub mode now, the outgoing connections is capped at 96. In exchange, a much larger number of inbound connections (over 500 instead of 125) is accepted, so on balance the network gains in the number of connections it can accept.

I 100% agree that nodes that are behind NAT or cannot accept inbound connections should never ever run in hub mode. Otherwise, I don't think this particular criticism is valid anymore. In my opinion, people who run well-maintained, stable clients on the latest build in the largest hub mode with good network connectivity are doing the bitcoin network a favor.
hero member
Activity: 755
Merit: 515
Can you explain the "hub" mode a little better?
Don't use it...period.  Hub mode creates a ton more outgoing connections in a much more aggressive manner than stock bitcoin.  This is very bad for the network as the number of nodes accepting incoming connections is relatively low and filling all of their slots that you can is really quite terrible.  That said, hub mode is an attempt at solving actual problems, namely the "islanding" of nodes.  This is largely caused by a large number of people on the network running old nodes, specifically those older than version 0.3.24 (including 0.3.23).  This causes nodes to not correctly forward recent blocks.  If you are running 0.3.23 or older (without having applied https://github.com/bitcoin/bitcoin/commit/497317453422611a077f7f195eb193d3bb597a9c) you are doing the network a great disservice and causing problems for others.
If you are running 0.3.24 and are, like some, seeing islanding and not being able to download the latest blocks, hub mode theoretically solves the issue in a very brute-force way, to the detriment of others.  The proper solution here is to -addnode the nodes of other large pools/miners/fallback nodes who are virtually guaranteed to have the latest blocks.
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
Can you explain the "hub" mode a little better?
The bitcoin code makes few outbound connections and makes them very slowly. It tried to make the connections to numerically diverse IP addresses. And it has a few design decisions that makes it waste CPU under certain circumstances involving having larger numbers of connections.

The hub modes are aimed at two things. First, for mining controllers, they allow the client to get a larger mix of connections and to reconnect itself to more of the network more rapidly on a restart. Second, for people who want to run bitcoind nodes to provide the network with a reduced diameter, they make it possible to more easily run nodes with many hundreds of connections.

It is still a work in progress. The ultimate goal is to improve network reliability and reduce network diameter and propagation delay by greater interconnectedness between well-connected nodes.
full member
Activity: 175
Merit: 102
I've applied the patch to 0.3.24 and rebuilt, running now apparently without problems.

I'm doing a heavier load test today so I will let you know how that goes.

Can you explain the "hub" mode a little better?
member
Activity: 111
Merit: 10
★Trash&Burn [TBC/TXB]★
Oh I forgot to add. We generated our first live block with the 4-diff preview code. So all seems to be working.

There are some awesome crash test pilots miners in our irc channel that are eager to test new features on our public test pool. So if there is anything we can try to break for you, please ask!
member
Activity: 111
Merit: 10
★Trash&Burn [TBC/TXB]★
We had both bitcoin-4diff.txt (.95)+ updates.txt +  0.3.23 and just 4diff (.95) + 0.3.23 in production for about a week @ 100+ GH/s each. Both produced blocks and we had no invalids. Neither crashed and ran un-interrupted under load for the week long test. On a side note, it seemed just using the 4diff and no updates we generated more blocks. But that really doesn't mean anything.
newbie
Activity: 21
Merit: 0
I've been testing and debugging the patches against the 0.3.23 version.  I think I've got most of the problems figured out but there is still a critical problem that's made me lose out on 3 blocks so far.

First, the fast getwork parsing doesn't work if the client doesn't send the id without quotes.  This will allow it to send the id with or without quotes:

ThreadRPCServer3 in rpc.cpp:
Code:
       if ((strRequest.find("\"getwork\"")!=std::string::npos) && strRequest.find("[]")!=std::string::npos)
            std::string id;
            size_t p = strRequest.find("\"id\":");
            if (p != std::string::npos)
            {
                size_t ep = strRequest.find(" ", p+5), e;
                if((e=strRequest.find(",", p+5))                if((e=strRequest.find("}", p+5))                id = strRequest.substr(p+5, ep-p-5);
            }

Second, the locking problems seen are mostly due to the mutex being non-recursive.  Changing mGetWork to recursive_mutex instead of mutex, along with the iterators for it, should fix the problem.

Third, the CheckWork done in getwork assigns a value that should be returned but to a variable in the wrong scope.  This has been addressed earlier.

Some of the locking seems to be off.  An example would be:

CommitTransaction in main.cpp:
Code:
       // Broadcast
        if (!wtxNew.AcceptToMemoryPool())
        {
            // This must not fail. The transaction has already been signed and recorded.
            printf("CommitTransaction() : Error: Transaction not valid");
            return false;
        }
        wtxNew.RelayWalletTransaction();
+        SyncGetWork(4);
    }
-    SyncGetWork(4);

The comment for SyncGetWork says cs_main must be held but before being called, but that's not the case with the patch before the above change.

Another one would be:

AddToBlockIndex in main.cpp:
Code:
   if (pindexNew == pindexBest)
    {
        // Notify UI to display prev block's coinbase if it was ours
        static uint256 hashPrevBestCoinBase;
-        CRITICAL_BLOCK(cs_mapWallet)
+        CRITICAL_BLOCK(cs_mapWallet) {
            vWalletUpdated.push_back(hashPrevBestCoinBase);
+            hashPrevBestCoinBase = vtx[0].GetHash();
+        }
-         hashPrevBestCoinBase = vtx[0].GetHash();
    }

This one may or may not matter depending on how many threads call it (I believe only one thread runs this function but I'm not 100% sure).  However, this related to crashes I've been seeing right after writing out the block and before sending out the proof of work.  The call to vtx[0].GetHash causes a crash in the Serialize function due to the assertion that nSize >= 0.  I believe it's due to vtx[0] being corrupted but the event is rare at such high difficulties.  I don't see the problem with the test network but it happens on the live one.

One thought I had is that the live one has more transactions and appending enough of them would trigger the crash.  If this is the case, one cause might be the changes JoelKatz has made that'd allow vtx to get updated elsewhere (from another client calling getwork?).  Another one may be vtx is a vector and whenever it resizes, the structures related to each object inside vtx isn't correctly copied over (I didn't see any copy constructors).  I was able to get a block through when I commented the above out so that might be a sign that I'm on the right track but I'd like to see it get a few more blocks before I can rule out that it wasn't just a random event (initially, crashes on the test network were random and I've yet to see any on it since the fixes mentioned earlier).

These high difficulty levels are making it really troublesome to debug locking/race conditions.  It'd be nice if one of the big pools try out various things to see if they also experience the crashes (and possibly end up not getting credit for the block) and try out various fixes to see if it fixes the problem.  I haven't seen anyone mention these crashes so for all I know, it's just some compiler/machine issue.  But then again, it doesn't seem like anyone's tried it with the fixes that doesn't lock up.
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
Do you know offhand how that CPU usage compares with what you were seeing before?
For a basis, this server is using an Intel Q8400. Looks to be around half the usage at "idle". During a LP; it hits about 40-50% usage. Bunches better than stock bitcoind, and about 10% less than the diff-4 v.95.
Great to hear. Thanks.
member
Activity: 111
Merit: 10
★Trash&Burn [TBC/TXB]★
Do you know offhand how that CPU usage compares with what you were seeing before?

For a basis, this server is using an Intel Q8400. Looks to be around half the usage at "idle". During a LP; it hits about 40-50% usage. Bunches better than stock bitcoind, and about 10% less than the diff-4 v.95.
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
Patch at http://davids.webmaster.com/~davids/preview.diff compiles perfectly on the 0.3.24 release Ubuntu x64 10.04 + 10.10. Was working properly with load on test net and I have pushed to live with about 100 GHash. About an hour in, so far so good. Thank you so much for all your hard work Joel!
Thanks for the success report. The very first preview had a serious bug that would make it crash immediately. The final release had, other than that fix, only cosmetic changes (like fixing extra spaces and such). So you can stick with the build you've got, since it obviously doesn't have the 'makes it crash immediately' bug.

Do you know offhand how that CPU usage compares with what you were seeing before?
member
Activity: 111
Merit: 10
★Trash&Burn [TBC/TXB]★
Patch at http://davids.webmaster.com/~davids/preview.diff compiles perfectly on the 0.3.24 release Ubuntu x64 10.04 + 10.10. Was working properly with load on test net and I have pushed to live with about 100 GHash. About an hour in, so far so good. Thank you so much for all your hard work Joel!

Code:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19263 pool      20   0 4262m 139m 9396 S    2  2.5   2:24.12 bitcoind-24prev

Yummy!
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
Just to clarify, is this preview a potential fix for the lockup issue I'm experiencing or is that part of number 2?
This does not contain the optimization that I believe is responsible for the lockup issue. That optimization was not contained in 3diff or 4diff but was an additional optimization (introduced in the 'update' patch) specifically to reduce the 'hiccup' in 'getwork' responses when a new transaction hits the network.
legendary
Activity: 1260
Merit: 1000
That fixed it!

Just to clarify, is this preview a potential fix for the lockup issue I'm experiencing or is that part of number 2?

legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
It is against the 0.3.24 release. I just made a small, but very critical, fix. So redownload it. (That's likely the problem in your post above.)

I've finished my own testing and auditing. So if the latest works for you, I'm ready to release it.

Performance is about 3,000 getwork's per second on my test machine (a Core 2 Quad Q9550 running 64-bit Linux).
legendary
Activity: 1260
Merit: 1000
Ok, confirmed that a fresh copy of the .24 tar file with your preview gives the above result.  Here's another debug.log output from the fresh copy:

Bitcoin version 0.3.24-beta
Default data directory /home/xxx/.bitcoin
Bound to port 8333
Loading addresses...
dbenv.open strLogDir=/home/xxx/.bitcoin/database strErrorFile=/home/xxx/.bitcoin/db.log
Loaded 357138 addresses
 addresses               942ms
Loading block index...
LoadBlockIndex(): hashBestChain=0000000000000a1fe384  height=136619
 block index            1925ms
Loading wallet...
nFileVersion = 32400
fGenerateBitcoins = 0
nTransactionFee = 0
addrIncoming = 255.255.255.255:8333
fMinimizeToTray = 0
fMinimizeOnClose = 0
fUseProxy = 0
addrProxy = 127.0.0.1:9050
 wallet                  175ms
Done loading
mapBlockIndex.size() = 136639
nBestHeight = 136619
mapKeys.size() = 172
setKeyPool.size() = 101
mapPubKeys.size() = 172
mapWallet.size() = 190
mapAddressBook.size() = 2
Loading addresses from DNS seeds (could take a while)
AddAddress(62.155.236.249:8333)
AddAddress(109.75.176.193:8333)
AddAddress(174.120.185.74:8333)
AddAddress(69.163.132.101:8333)
AddAddress(178.79.147.99:8333)
AddAddress(91.85.220.84:8333)
48 addresses found from DNS seeds
sending: version (85 bytes)
ThreadRPCServer started
ipv4 eth0: x.x.x.x
addrLocalHost = x.x.x.x:8333
ThreadSocketHandler started
ThreadIRCSeed started
ThreadOpenConnections started
ThreadMessageHandler started
IRC :pelican.heliacal.net NOTICE AUTH :*** Looking up your hostname...
IRC :pelican.heliacal.net NOTICE AUTH :*** Couldn't look up your hostname
IRC SENDING: NICK x265309750^M
IRC SENDING: USER x265309750 8 * : x265309750^M
trying connection 213.111.82.119:8333 lastseen=-0.2hrs lasttry=-364126.5hrs
IRC :pelican.heliacal.net 001 x265309750 :Welcome to the LFNet Internet Relay Chat Network x265309750
IRC :pelican.heliacal.net 002 x265309750 :Your host is pelican.heliacal.net[173.246.103.92/6667], running version hybri
d-7.2.3
IRC :pelican.heliacal.net 003 x265309750 :This server was created Jun 28 2011 at 14:26:11
IRC :pelican.heliacal.net 004 x265309750 pelican.heliacal.net hybrid-7.2.3 CDGabcdfgiklnorsuwxyz biklmnopstveI bkloveI
connected 213.111.82.119:8333
sending: version (85 bytes)


Bitcoind exits at this point.
legendary
Activity: 1260
Merit: 1000
I am going to pull a fresh copy of .24 and reapply just to be sure I didn't do something odd - but the patch applied just fine.


EDIT - whoops... I guess you are patching against the tar and not the repo?  The above report is off the tar and it applied fine.  Just tried against the repo and it didn't like that.

Pages:
Jump to: