
Topic: Starting preliminary 0.94 testing - "Headless fullnode"

hero member
Activity: 563
Merit: 500
Like I said, this didn't happen in the first couple of 0.94 pre-versions you came up with, not sure what you can infer from that.

The usual conclusion in this case is: there always was an issue with this code, and changes in its vicinity revealed it. Otherwise it was probably occurring at low frequency, passing itself off as a scan segfault of sorts.

The usual case is "OMG how can that ever have worked?" :-)
legendary
Activity: 3640
Merit: 1345
Armory Developer
Like I said, this didn't happen in the first couple of 0.94 pre-versions you came up with, not sure what you can infer from that.

The usual conclusion in this case is: there always was an issue with this code, and changes in its vicinity revealed it. Otherwise it was probably occurring at low frequency, passing itself off as a scan segfault of sorts.
legendary
Activity: 3430
Merit: 3071
I heard there were new processor instructions to abstract a bit of the complexity away from the programming logic, but Intel screwed up the implementation somehow and it's not quite "on the streets" yet. But all the specs are finalised as far as I remember; I believe they're calling it "lock elision" in C++, although I may be remembering wrong. So there's some more C++ to render the old standard obsolete for us all to look forward to Cheesy.

That's pretty fancy. It doesn't really simplify much. It looks like it's meant to speed up interlocked operations that don't overlap in memory, and it's only in some Haswell CPUs so far. I suspect it's the kind of extra assembly optimization that I will personally never have to use, as either compilers or OSes will upgrade their locks with HLE. The way it looks, there is little reason not to standardize it for every lock operation.

It's supposed to be some kind of speculative execution; the majority of the time that a lock is applied, there are no writes to data structures that break the intention of the code. I forget the full details though, this was years ago. Haswell was where it was borked; it was disabled in the firmware for all lines IIRC (TSX is what Intel call it). Supposedly the next 14nm Intel chip out in Q4 this year has it working correctly.
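
For the curious, here's a rough sketch of what opting into lock elision can look like from the programmer's side, assuming GCC's HLE-flavoured __atomic builtins on x86 (compile with -mhle). This is just an illustration, not Armory code; on CPUs without working TSX the XACQUIRE/XRELEASE prefixes are ignored and it degrades to an ordinary spinlock:

Code:
#include <immintrin.h> // _mm_pause

static int lockVar = 0; // 0 = free, 1 = held

void hleLock()
{
    // XACQUIRE-prefixed exchange: the CPU may elide the write and run
    // the critical section as a transaction instead of taking the lock.
    while (__atomic_exchange_n(&lockVar, 1,
           __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
    {
        // Spin on plain reads so we don't hammer the cache line.
        while (__atomic_load_n(&lockVar, __ATOMIC_RELAXED))
            _mm_pause();
    }
}

void hleUnlock()
{
    // XRELEASE-prefixed store: commits the elided region, or retries
    // with a real lock if the transaction aborted.
    __atomic_store_n(&lockVar, 0,
        __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}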

Just to totally clarify, I cannot get this or the build from your last commit to behave in any way except for this bug. It does this every time I load it up. Carefully deleting old Db and logs every run. Tested that Armory 93.1 & 93.2 work normally on the same VM (I cloned it from a working bitcoin-configured VM to begin with, but you never know).

And that bit is what's disturbing me, considering this part uses the same code across versions. Again, this is pretty old code, so I'm puzzled as to why it starts failing now. Anyways, I think I got a clue as to what's going on.

Like I said, this didn't happen in the first couple of 0.94 pre-versions you came up with, not sure what you can infer from that.
hero member
Activity: 563
Merit: 500
Could well be just chance since the behaviour before was clearly somewhat nondeterministic.  But I did the above and it synced my wallets perfectly first time (and I hadn't managed to get a successful sync at all up till now).  If I get the chance tomorrow I'll nuke the databases and try again to confirm whether this is repeatable.

Well, yes and no. I expect the issue is object lifespan: one type of thread creates data that gets cleaned up before the next type of thread is done with it. If you reduce the number of threads per group to the bare minimum, you also reduce the "surface area" for the bug and thus the chance it will occur.
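
To illustrate the shape of that fix (a generic sketch with made-up names, not Armory's actual code): if the consumer threads share ownership of the data, it can't be cleaned up from under them regardless of how the thread groups are scheduled:

Code:
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

struct BlockData { std::vector<uint8_t> raw; }; // hypothetical payload

int main()
{
    // shared_ptr: the object is freed only once the producer AND every
    // consumer have dropped their references, so no consumer can read
    // data the producer has already cleaned up.
    auto data = std::make_shared<BlockData>();
    data->raw.resize(1024);

    std::vector<std::thread> consumers;
    for (int i = 0; i < 4; i++)
        consumers.emplace_back([data] {   // copying bumps the refcount
            uint8_t byte = data->raw[0];  // safe even after reset() below
            (void)byte;
        });

    data.reset(); // producer is done; consumers keep the object alive

    for (auto& thr : consumers)
        thr.join();

    return 0;
}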

Quote
Just to confirm, only change I made was as you suggested:

That's the proper change to force thread count for all groups down to 1.

Is there any point in my trying the new version or will this version just behave the same for me?

I'd like you to run it; the main fix in this version is supposed to cover your issue.

Ah, glad I asked, then :-)

I initially assumed that since I wasn't seeing a segfault there wasn't much point in trying it.  I'll give it a go and report back.
legendary
Activity: 3640
Merit: 1345
Armory Developer
Could well be just chance since the behaviour before was clearly somewhat nondeterministic.  But I did the above and it synced my wallets perfectly first time (and I hadn't managed to get a successful sync at all up till now).  If I get the chance tomorrow I'll nuke the databases and try again to confirm whether this is repeatable.

Well, yes and no. I expect the issue is object lifespan: one type of thread creates data that gets cleaned up before the next type of thread is done with it. If you reduce the number of threads per group to the bare minimum, you also reduce the "surface area" for the bug and thus the chance it will occur.

Quote
Just to confirm, only change I made was as you suggested:

That's the proper change to force thread count for all groups down to 1.

Is there any point in my trying the new version or will this version just behave the same for me?

I'd like you to run it; the main fix in this version is supposed to cover your issue.
hero member
Activity: 563
Merit: 500
Could well be just chance since the behaviour before was clearly somewhat nondeterministic.  But I did the above and it synced my wallets perfectly first time (and I hadn't managed to get a successful sync at all up till now).  If I get the chance tomorrow I'll nuke the databases and try again to confirm whether this is repeatable.

Well, yes and no. I expect the issue is object lifespan: one type of thread creates data that gets cleaned up before the next type of thread is done with it. If you reduce the number of threads per group to the bare minimum, you also reduce the "surface area" for the bug and thus the chance it will occur.

Quote
Just to confirm, only change I made was as you suggested:

That's the proper change to force thread count for all groups down to 1.

Is there any point in my trying the new version or will this version just behave the same for me?
legendary
Activity: 3640
Merit: 1345
Armory Developer
Just to totally clarify, I cannot get this or the build from your last commit to behave in any way except for this bug. It does this every time I load it up. Carefully deleting old Db and logs every run. Tested that Armory 93.1 & 93.2 work normally on the same VM (I cloned it from a working bitcoin-configured VM to begin with, but you never know).

And that bit is what's disturbing me, considering this part uses the same code across versions. Again, this is pretty old code, so I'm puzzled as to why it starts failing now. Anyways, I think I got a clue as to what's going on.

I heard there were new processor instructions to abstract a bit of the complexity away from the programming logic, but Intel screwed up the implementation somehow and it's not quite "on the streets" yet. But all the specs are finalised as far as I remember; I believe they're calling it "lock elision" in C++, although I may be remembering wrong. So there's some more C++ to render the old standard obsolete for us all to look forward to Cheesy.

That's pretty fancy. It doesn't really simplify much. It looks like it's meant to speed up interlocked operations that don't overlap in memory, and it's only in some Haswell CPUs so far. I suspect it's the kind of extra assembly optimization that I will personally never have to use, as either compilers or OSes will upgrade their locks with HLE. The way it looks, there is little reason not to standardize it for every lock operation.
legendary
Activity: 3430
Merit: 3071
I'm pretty sure what you're saying about deadlocks sounds about right, from what I recall about how threading works.

Deadlocks can be a nightmare to debug. Threads are great in many ways. Debugging is not one of them. Sad

I heard there were new processor instructions to abstract a bit of the complexity away from the programming logic, but Intel screwed up the implementation somehow and it's not quite "on the streets" yet. But all the specs are finalised as far as I remember; I believe they're calling it "lock elision" in C++, although I may be remembering wrong. So there's some more C++ to render the old standard obsolete for us all to look forward to Cheesy.
sr. member
Activity: 255
Merit: 250
Senior Developer - Armory
I'm pretty sure what you're saying about deadlocks sounds about right, from what I recall about how threading works.

Deadlocks can be a nightmare to debug. Threads are great in many ways. Debugging is not one of them. Sad
legendary
Activity: 3430
Merit: 3071
How long do you let this part run until you conclude it is hanging?

5-10 mins. Could something intentional be happening for longer than that while consuming ~1% CPU on all cores? It's not consistent with the behaviour of your first couple of 0.94 commits, which took seconds to handle that stage.

Glancing at the code, it's pretty old and inefficient, but that wouldn't warrant that long a wait. This is probably some sort of deadlock. Unfortunately I haven't managed to reproduce it in this version (which is why I thought it was fixed), so it's gonna take me some time to figure it out. I expect this is the last hurdle left for this code to be stable and functional.
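
For anyone following along, this is the textbook shape of such a hang, along with the standard C++ way around it (a generic illustration, not the code in question). Two threads taking the same pair of mutexes in opposite order will each wait on the other forever while using ~0% CPU, which matches the idle hang described above:

Code:
#include <mutex>

std::mutex mutexA, mutexB;

void threadOne() // lock order: A then B
{
    std::lock_guard<std::mutex> la(mutexA);
    std::lock_guard<std::mutex> lb(mutexB);
    // ... critical section ...
}

void threadTwo() // lock order: B then A -- inverted, can deadlock
{
    std::lock_guard<std::mutex> lb(mutexB);
    std::lock_guard<std::mutex> la(mutexA);
    // ... critical section ...
}

void threadSafe() // fix: acquire both locks atomically with std::lock
{
    std::unique_lock<std::mutex> la(mutexA, std::defer_lock);
    std::unique_lock<std::mutex> lb(mutexB, std::defer_lock);
    std::lock(la, lb); // immune to lock-order inversion
    // ... critical section ...
}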

Just to totally clarify, I cannot get this or the build from your last commit to behave in any way except for this bug. It does this every time I load it up. Carefully deleting old Db and logs every run. Tested that Armory 93.1 & 93.2 work normally on the same VM (I cloned it from a working bitcoin-configured VM to begin with, but you never know).

I'm pretty sure what you're saying about deadlocks sounds about right, from what I recall about how threading works.
legendary
Activity: 3640
Merit: 1345
Armory Developer
How long do you let this part run until you conclude it is hanging?

5-10 mins. Could something intentional be happening for longer than that while consuming ~1% CPU on all cores? It's not consistent with the behaviour of your first couple of 0.94 commits, which took seconds to handle that stage.

Glancing at the code, it's pretty old and inefficient, but that wouldn't warrant that long a wait. This is probably some sort of deadlock. Unfortunately I haven't managed to reproduce it in this version (which is why I thought it was fixed), so it's gonna take me some time to figure it out. I expect this is the last hurdle left for this code to be stable and functional.
legendary
Activity: 3430
Merit: 3071
How long do you let this part run until you conclude it is hanging?

5-10 mins. Could something intentional be happening for longer than that while consuming ~1% CPU on all cores? It's not consistent with the behaviour of your first couple of 0.94 commits, which took seconds to handle that stage.
legendary
Activity: 3640
Merit: 1345
Armory Developer
How long do you let this part run until you conclude it is hanging?
legendary
Activity: 3430
Merit: 3071
Double-checked the BTCARMORY_BUILD string, definitely compiling from the latest commit.
legendary
Activity: 3640
Merit: 1345
Armory Developer
That's quite interesting considering I added verbose logging at this stage for this specific reason (I was curious how deep it would get before failing). I guess it doesn't even get that far.
legendary
Activity: 3430
Merit: 3071
Same problems (not segfault).

The DB build gets this far:

Code:
[Thread 0x7fffca34b700 (LWP 3812) exited]
-DEBUG - 1434706596: (Blockchain.cpp:214) Organizing chain w/ rebuild
-INFO  - 1434706599: (BlockUtils.cpp:1385) Looking for first unrecognized block
-INFO  - 1434706599: (BlockUtils.cpp:1389) Updating Headers in DB

...and gets stuck; the python process is more or less idling at ~1% CPU. No errors in the logs. Quitting Armory doesn't work in this state, though it does respond to killing the process.
legendary
Activity: 3640
Merit: 1345
Armory Developer
New version. This one should fix the segfaults mid-scan. Still working on the other ones. Pull away =)
legendary
Activity: 3640
Merit: 1345
Armory Developer
4. At block 360206, armoryd says "BDM thread failed: The scanning process interrupted unexpectedly" and asks me to rebuild and rescan. I killed armoryd with Control-C.

That's consistent with the bug I'm going after currently. That error is too aggressive, half the time you can ignore it and restart without rescanning.

Quote
5. I started armory-qt. I didn't request a rescan but it seems to be doing that automatically.

There will be an empty file named either rescan.flag or rebuild.flag in your datadir when the previous instance of Armory wants to signal the next one to rebuild/rescan. Delete it and it won't.
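
A sketch of how that kind of flag-file handshake typically works (hypothetical code, not Armory's actual implementation): the exiting instance drops the empty file in the datadir, and the next instance checks for it, consumes it, and acts on it:

Code:
#include <cstdio>
#include <fstream>
#include <string>

// Returns true if the flag file existed; deletes it so the signal
// only fires once.
bool checkAndClearFlag(const std::string& datadir, const std::string& name)
{
    const std::string path = datadir + "/" + name;
    if (!std::ifstream(path).good())
        return false; // no flag, nothing was requested
    std::remove(path.c_str());
    return true;
}

// Usage sketch:
//   if (checkAndClearFlag(datadir, "rescan.flag"))  startRescan();
//   if (checkAndClearFlag(datadir, "rebuild.flag")) startRebuild();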

Quote
7. I restored the previous databases folder from the last ffreeze. It was incomplete; I had stopped it at about block 300,000 after more than a week of scanning.
8. I started armory-qt. No re-scan.

That's odd; it should try to catch up from 300,000 onwards.

Quote
9. I imported this private key: 5JdeC9P7Pbd1uGdFVEsJ41EkEnADbbHGq6p1BwFxm6txNBsQnsw (SHA256 of "1") for the address 12AKRNHpFhDSBDD9rSn74VAzZSL3774PxQ. It only shows the 2 txs in 2014 but not the latest ones of 2015.

That's to be expected if the scan ended around block 300,000.

Quote
So I guess both databases are somehow corrupted?

Shouldn't be, but I guess they are. Actually, what's off is that they don't pick up where they left off. I'd say hang on to your 300k DB; I intend to add some more checks to allow the DB to fix itself (instead of rebuild/rescan).
legendary
Activity: 1792
Merit: 1087
New version out in ffreeze, please test away. There's a wealth of robustness changes and it should also be a bit faster. However no one at Armory has experienced the issues Carlton Banks and btcchris ran into, so most of these changes were speculative. It's possible these bugs will still be there partially or in full.

As usual, looking forward to your bug reports and logs, and again thank you for the time and effort you put in helping us to test things out =)

Have you made any big change re supernode? In the previous ffreeze it did not complete even after a week. In this version it took only a few hours to complete.

Quote

I made changes to some stuff that makes supernode faster (some weird opts I should have rolled out earlier, but figured they could wait for fear they would be unstable). The other issue, with pulling block data and the object lifespan, remains. It's a rare occurrence but common enough to trip certain setups quite often (2 out of my 3 testers T_T). I ended up rolling the optimizations in with the stability changes (it didn't cost that much more to add the opts at that stage).

Now I'm also convinced that the code speeding up supernode is not responsible for the instability. That gives me room to add some more optimizations (again, supernode only) without fear of making the bugs more convoluted (and thus harder to isolate).

Not having a clue what the issue was, I ended up fixing what you were suffering from (which was a lot more obvious than the remaining issue) in an attempt to fix the bug they are experiencing. In the grand scheme of things it doesn't matter what order I fix these in, since I have to go after both anyways. In the current scope it doesn't speak so well of me, since I thought I was fixing one thing and got the other instead =P

Quote
I can now get the details of a random tx with "armoryd.py gettransaction txid". I guess this means the supernode is running properly?

Not really, that just means you are indeed using supernode (which is now the only DB format that supports random txhash resolution), but that doesn't mean supernode is working per se. What you need to do to test it is to add a random bitcoin address to a wallet and getledger on the wallet, which will display that random address's history along with the rest.

You can also verify this in the UI by importing some publicly known private keys into a wallet (kind of an oxymoron, I know) and verifying they get imported quasi-instantly and that their history shows up in the main ledger.
It fails again. This is what I did and saw:

1. I compiled the latest ffreeze. Removed previous databases folder.
2. I started armoryd in supernode mode. It finished "Reading raw blocks finished at file xxx offset xxxxx" in a few hours.
3. At this time, I could get the details of a random tx with "armoryd.py gettransaction txid". However, I didn't test by importing a private key.
4. At block 360206, armoryd says "BDM thread failed: The scanning process interrupted unexpectedly" and asks me to rebuild and rescan. I killed armoryd with Control-C.
5. I started armory-qt. I didn't request a rescan but it seems to be doing that automatically.
6. The console started to show "Finished applying blocks up to xxxxxx". This message did not show up before the crash. The progress became slower and slower. I knew it would never complete and closed armory-qt.
7. I restored the previous databases folder from the last ffreeze. It was incomplete; I had stopped it at about block 300,000 after more than a week of scanning.
8. I started armory-qt. No re-scan.
9. I imported this private key: 5JdeC9P7Pbd1uGdFVEsJ41EkEnADbbHGq6p1BwFxm6txNBsQnsw (SHA256 of "1") for the address 12AKRNHpFhDSBDD9rSn74VAzZSL3774PxQ. It only shows the 2 txs in 2014 but not the latest ones of 2015.

So I guess both databases are somehow corrupted?
legendary
Activity: 3640
Merit: 1345
Armory Developer
Could well be just chance since the behaviour before was clearly somewhat nondeterministic.  But I did the above and it synced my wallets perfectly first time (and I hadn't managed to get a successful sync at all up till now).  If I get the chance tomorrow I'll nuke the databases and try again to confirm whether this is repeatable.

Well, yes and no. I expect the issue is object lifespan: one type of thread creates data that gets cleaned up before the next type of thread is done with it. If you reduce the number of threads per group to the bare minimum, you also reduce the "surface area" for the bug and thus the chance it will occur.

Quote
Just to confirm, only change I made was as you suggested:

That's the proper change to force thread count for all groups down to 1.