Author

Topic: Two days of debugging for a single '&' (Read 1281 times)

hero member
Activity: 668
Merit: 501
April 13, 2012, 07:53:11 AM
#4
What is the lesson learnt then ? I think there are two many more:
i think this shows us that in any complex codebase at some point you can't simply rule out errors just by a human looking at it. therefore you need to rely on as much automated error checking as possible.
maybe its different for you, but in my experience this tells me:
  • use a statically compiled, strongly typed language with good IDE support
  • stay away from dynamic programming like bytecode manipulation or metaprogramming
  • use as many automated tools as possible for error checking (compiler, IDE inspections, findbugs..)
  • when dealing with threads a whole new class of errors are possible that are not easy to detect for tools, so be extra careful. avoid threading where it makes sense, use proven platform tools where it is neccesary. if possible, stick to an actor library.
  • garbage collection is your friend. except when writing a 3d shooter.
  • unit test things wich are critical but are not run often in production, and possibly more
  • only use libs where you can access the source code to be able to understand and debug them
again, i will certainly not force anyone to adopt this methodology. i'm just saying those are the principles i use to avoid a vast range of errors, including memory issues, buffer overflows, bad macro behavior.
for very simple problems or problems with very specific performance requirements you have to be pragmatic and drop some of those restrictions.

i am very happy that there is a very good implementation of Bitcoin in Java. Any developer who agrees with the issues i raised should consider using BitcoinJ instead of a c++ based Bitcoind. It is not perfect but at the moment the safest choice.
legendary
Activity: 1708
Merit: 1066
April 13, 2012, 05:09:03 AM
#3
I think the worst debugging nightmare I had was when I was working ages ago at an electronics company.
We used to simulate the behaviour of circuits with a language called Spice. You would hook up up the wiring in software and then simulate the signals coming in and see what happened.

I had one circuit model in a spice file that worked fine and another that just crashed.
After a few days I had whittled it down to two seemingly identical files, one of which worked fine and one which crashed.

In the end I discovered what the problem was by...

printing the two files out on the cheapest paper we had.
putting the two pieces of paper one on top of each other.
holding them up to the light to see if there were any differences.

I noticed the two sheets of paper were identical but slightly misaligned.

One had a carriage return at the start of the file, which was enough to crash the spice interpreter.

Grrrrhhhhhh!
legendary
Activity: 1890
Merit: 1086
Ian Knowles - CIYAM Lead Developer
April 13, 2012, 03:52:40 AM
#2
Macros are always evil and I have never really seen the attraction of BOOST_FOREACH myself especially now that we have "auto" types and can write our loops as follows:

for( auto i = container.begin( ); i != container.end( ); ++i )
{
...
}

[now it will be time for others to chime in with the "C++ is the root of all evil and should never be used" type stuff we get on this forum]
newbie
Activity: 26
Merit: 24
April 13, 2012, 03:38:23 AM
#1
Debugging code is both tedious and contagious. You are held in suspense between your curiosity, stubbornness and immense boredom. It is one of these needle in a haystack tasks that can last for days and during the endeavor you feel awfully unproductive. However, there are often some lessons to be learnt, and sharing debugging stories is for code gurus like sharing fishing stories for fishermen.

A famous story is the one from Fortran that caused a satellite to crash (Mariner Venus) due to '-' for '~' error... This story is in C++ and about a missing '&', and like the satellite story it also begins with a crash.

I have had my customized bitcoin server (based on libcoin) running on different mac and windows desktop machines running for weeks with no issues, it survived reorganizations and the recent date based protocol changes (BIP0016/BIP0030) silently, and seemed pretty stable. I hence installed it in its production environment a Ubuntu machine at Linode. It had compiled and functioned several times before on linux, except that sometimes running on small virtualized machines I had seen crashes due to memory issues. Nothing to be really alarmed about, and the instance running on an amazon large instance had never crashed. I hence compiled to code and started it on Linode. Blockchain download... a reorganization and crash! Reason was bad_alloc, so most likely the memory issue, and afterall the Linode instance had only 1Gig of RAM. In my custom server I also keep some rather large data structures in memory, so I disabled these during chain download and tried again. Reorganize, and crash! Again due to bad_alloc.

Annoyingly, following the crashes the databases got corrupted as well. I still believed that the crashes were due to low memory, but I wanted to make sure in the future that I could at least recover gracefully from database corruptions. So I tried to figure out what state the database was left in.

In the reorganize method of the BlockChain class (basically similar to the reorganize method in bitcoin/bitcoin) there are a lot of nice tests to check if things fails - if blocks can't be read from disk or if disconnect fails. If this is the case the series of accumulated database transactions are aborted (TxnAbort) leaving the database in the same state as before, and ensuring that even thought the reorganization failed (and will fail again) the database is left in the same state and the same error will reappear on restart and can hence easily be debugged.
In my case it was a memory exception thrown in the loop where all the txes are pushed on a stack to be resurrected after the chain reorganization. The exception is caught, but it is not caught before in the message processing method (ProcessMessage in bitcoin, handleMessage in libcoin). Here it results in an error message, but the program continues and the database transaction is never aborted (TxnAbort). The failed reorganize will repeat it self and the database transaction log fills up till 500 uncommitted transactions and then the BDB makes the program die. However, at some point in between, at least in my code, the best height was sat in the database and that caused the corruption.

I fixed the issue by inserting a try catch around everything in reorganize up to the TxnAbort ensuring correct behavior even in case of a memory exception.

Instead of rebuilding the database (from scratch download) I tried to rebuild the database: Forcing a reorganize 10 steps back and cleaning the blockchainindex for all non main chain blocks. However, that immediately caused another reorganize crash due to a bad_alloc ! This time the reason was clearly not a filed up memory. I debugged the code and the same loop as before was the culprit:
Code:
    // Queue memory transactions to resurrect
    BOOST_FOREACH(const Transaction& tx, block.getTransactions())
        if (!tx.isCoinBase()) {
            vResurrect.push_back(tx);
        }
The Block::getTransactions() was defined like:
Code:
    const TransactionList getTransactions() const { return _transactions; }
libcoin encapsulates all classes, and hence looping over a (const version) of transactions in a block is done this way. I have other part of the code (in Block) where:
Code:
    BOOST_FOREACH(const Transaction& tx, _transactions)
        // do something with tx

is called multiple times with no problem, yet the code crashed there and it was tx that was corrupted! I changed the loop to a normal for loop using block.getNumTransactions() and block.getTransaction(int) - now the code ran perfectly up to block 171091 - which every bitcoinerd will recognize as the first block of March the 15th ! BIP0030... My BIP0030 code is:
Code:
        if (pindex->nTime > _chain.timeStamp(Chain::BIP0030))
            BOOST_FOREACH(const Transaction& tx, block.getTransactions()) {
                TxIndex txindexOld;
                if (ReadTxIndex(tx.getHash(), txindexOld))
                    BOOST_FOREACH(const DiskTxPos &pos, txindexOld.getSpents())
                    if (pos.isNull())
                        return false;
            }
Hey, again the BOOST_FOREACH loop! That construct was apparently somehow polluted - a google for 'BOOST_FOREACH const reference' gave this like:

http://www.richelbilderbeek.nl/CppBOOST_FOREACHExample1.htm

Note from Richel's page:
Code:
//BAD: Compiles and crashes program
  //std::clog << __func__ << '\n'; BOOST_FOREACH(const std::string& s, f()) { t+=s; }
So you cannot use a function returning a copy of a vector in BOOST_FOREACH, however a const reference to it works, but will spew a compiler warning. Changing Block::getTransactions() to :
Code:
    const TransactionList& getTransactions() const { return _transactions; }
//                       ^ note the single &
made everything work smoothly again. So the culprit was a missing '&' or more correctly that BOOST_FOREACH that didn't behave correctly. However, it does behave correctly when compiled on visual studio and Xcode. Hence I never saw that error before...

What is the lesson learnt then ? I think there are two:
* Use either exceptions OR false/true functions in a program not both! Somethimes it is better to crash than to keep trying.
* Use macros (BOOST_FOREACH) with great care, the behavior is often hard to predict and also hard to debug. I will over time change all BOOST_FOREACH in libcoin to normal for / while loops.

Thanks for reading this - I hope you found it useful!

Cheers,

Michael
Jump to: