
Topic: [ATTN: POOL OPERATORS] PoolServerJ - scalable java mining pool backend - page 7. (Read 31168 times)

sr. member
Activity: 266
Merit: 254
If you're unable to get it working, you could try this one from ArtForz:
https://github.com/ArtForz/namecoin/commit/127deb4aff13965741130dba7304073330a4adea

I think it's only for the duplicate work issue but at a quick glance it looks like it's applicable to the merged mining version.
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
A quick grep search shows the patch is not applied. I will apply it after I modify 4diff.txt for the namecoind source code changes, unless you have a copy that Google is hiding.

Well the good news is after reviewing your logs I'm pretty sure 4diff will solve your problem.

If you look at the consolidated log I sent you, you can see that towards the end of the period the namecoin daemon suddenly starts sending masses of duplicate works, and at one point reaches a duplicate rate of nearly 100%.  This is a known bug that the 4diff patch fixes.  The problem here is that poolserverj checks work coming in from the daemon for duplicates and discards any it finds.  If this bug is happening, you're likely to be getting only one unique work/second from the daemon.  Because it can't fill its cache, psj keeps issuing getwork requests to the daemon and keeps discarding them when they come back as duplicates.

If poolserverj didn't behave like this, you would be issuing the same work to all your miners. You and they would think everything is fine, but eventually you'd notice that your pool is not finding anywhere near as many blocks as it should.  If you've got 10 miners all working on the same work, you're effectively only getting one miner's worth of work done.  IMHO it's better for the server to crash in this case; then at least you'll know something is wrong.

Pushpool does not check incoming work for duplicates so this is why you saw it wasn't working nearly as hard.  It was feeding the duplicates to your miners then going back to sleep whereas psj was caning your daemon trying to get valid work and eventually everything went down in flames...

You will see massive improvements with a 4diff patched daemon.

Thanks, I will apply the patch, document it, and provide the public with an edited version of 4diff, as it does not patch the namecoind code correctly.
legendary
Activity: 1428
Merit: 1000
Is it feasible to include the mining proxy (for merged mining) directly in PoolServerJ?

If the namecoin daemon dies, the bitcoin daemon could still deliver getworks (or vice versa).

I am working on that right now... You can thank Davinci for tempting me with a fat bounty or I probably wouldn't have.  I'm waiting on some detail from one of the namecoin devs before I can start implementing...

+1 to Davinci and you ;)
sr. member
Activity: 266
Merit: 254
Is it feasible to include the mining proxy (for merged mining) directly in PoolServerJ?

If the namecoin daemon dies, the bitcoin daemon could still deliver getworks (or vice versa).

I am working on that right now... You can thank Davinci for tempting me with a fat bounty or I probably wouldn't have.  I'm waiting on some detail from one of the namecoin devs before I can start implementing...
legendary
Activity: 1428
Merit: 1000
Is it feasible to include the mining proxy (for merged mining) directly in PoolServerJ?

If the namecoin daemon dies, the bitcoin daemon could still deliver getworks (or vice versa).
sr. member
Activity: 266
Merit: 254
A quick grep search shows the patch is not applied. I will apply it after I modify 4diff.txt for the namecoind source code changes, unless you have a copy that Google is hiding.

Well the good news is after reviewing your logs I'm pretty sure 4diff will solve your problem.

If you look at the consolidated log I sent you, you can see that towards the end of the period the namecoin daemon suddenly starts sending masses of duplicate works, and at one point reaches a duplicate rate of nearly 100%.  This is a known bug that the 4diff patch fixes.  The problem here is that poolserverj checks work coming in from the daemon for duplicates and discards any it finds.  If this bug is happening, you're likely to be getting only one unique work/second from the daemon.  Because it can't fill its cache, psj keeps issuing getwork requests to the daemon and keeps discarding them when they come back as duplicates.

If poolserverj didn't behave like this, you would be issuing the same work to all your miners. You and they would think everything is fine, but eventually you'd notice that your pool is not finding anywhere near as many blocks as it should.  If you've got 10 miners all working on the same work, you're effectively only getting one miner's worth of work done.  IMHO it's better for the server to crash in this case; then at least you'll know something is wrong.

Pushpool does not check incoming work for duplicates so this is why you saw it wasn't working nearly as hard.  It was feeding the duplicates to your miners then going back to sleep whereas psj was caning your daemon trying to get valid work and eventually everything went down in flames...

You will see massive improvements with a 4diff patched daemon.
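
For readers who want to see the effect in miniature, here is a rough Python sketch of the duplicate filter described above. It is not poolserverj's code (psj is written in Java); `fill_cache`, `healthy_daemon`, `buggy_daemon` and the cache size are hypothetical names and values chosen for illustration. It only shows why a daemon that hands out the same work repeatedly forces far more getwork calls to fill the same cache.

```python
from itertools import count

def fill_cache(fetch_work, cache_size, max_requests):
    """Request work until the cache holds cache_size unique items.

    fetch_work is a hypothetical stand-in for a getwork call to the daemon.
    Returns (cache, requests_made, duplicates_discarded).
    """
    seen = set()
    cache = []
    requests = duplicates = 0
    while len(cache) < cache_size and requests < max_requests:
        work = fetch_work()
        requests += 1
        if work in seen:
            duplicates += 1  # discard the duplicate and ask the daemon again
            continue
        seen.add(work)
        cache.append(work)
    return cache, requests, duplicates

def healthy_daemon():
    """Simulated daemon that returns fresh work on every call."""
    ids = count()
    return lambda: f"work-{next(ids)}"

def buggy_daemon(repeats):
    """Simulated daemon that returns each work `repeats` times in a row."""
    ids = count()
    return lambda: f"work-{next(ids) // repeats}"

good = fill_cache(healthy_daemon(), cache_size=10, max_requests=1000)
bad = fill_cache(buggy_daemon(repeats=20), cache_size=10, max_requests=1000)
print("healthy daemon: requests =", good[1], "duplicates =", good[2])
print("buggy daemon:   requests =", bad[1], "duplicates =", bad[2])
```

Without the `seen` check, the buggy daemon would fill the cache in 10 requests too, but every miner would be handed the same work.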
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
Quote
Does your daemon have the 4diff patch?

I assumed the latest version of namecoind has those patches but I'm not sure.

I don't know for sure, but I'd be surprised if it's included in the stock namecoind.  Perhaps best to check with the devs.  I haven't got far enough with native merged mining to need a patched namecoind yet, so I haven't checked...

A quick grep search shows the patch is not applied. I will apply it after I modify 4diff.txt for the namecoind source code changes, unless you have a copy that Google is hiding.
sr. member
Activity: 266
Merit: 254
Quote
Does your daemon have the 4diff patch?

I assumed the latest version of namecoind has those patches but I'm not sure.

I don't know for sure, but I'd be surprised if it's included in the stock namecoind.  Perhaps best to check with the devs.  I haven't got far enough with native merged mining to need a patched namecoind yet, so I haven't checked...
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
So I thought that if I add a PSJ node to the round robin, only the users running poclbm will roll over to the PSJ nodes, as they stick to one server until there is a problem (and there are lots of problems with pushpoold), while the CGIMiners don't use the round-robin domain name nmcbit.com.  As a result, the PSJ node is serving over 10 requests per second with no problems on a micro server.

Going back to the other servers, the CGIMiners are HAMMERING it with 20+ requests per second. I don't know what they are doing, but PSJ does not like it.

Code:
[2011-09-22 00:50:17.16420] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.71675] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.125674] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.181142] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.232956] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.288010] ::ffff:84.207.224.3 "/LP"
[2011-09-22 00:50:17.322250] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.375163] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.430217] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.484228] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.539632] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.592514] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.648209] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.701122] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.757177] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.809239] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.865398] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.917665] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.959539] ::ffff:84.207.224.3 "/"

I took out the user name, but as you can see from the pushpool log, the same user is hitting pushpool like crazy!
I will set up another node and point those guys to it and log what you are requesting.
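
As a rough check on the "20+ requests per second" figure, the excerpt above can be measured with a short Python sketch. One assumption is labeled here and in the code: the digits after the seconds are read as unpadded microseconds (so ".16420" means 16420 microseconds), since that is the only reading under which the timestamps increase monotonically. The regular expression and variable names are illustrative, not part of pushpool.

```python
import re
from datetime import datetime, timedelta

LOG = """\
[2011-09-22 00:50:17.16420] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.71675] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.125674] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.181142] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.232956] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.288010] ::ffff:84.207.224.3 "/LP"
[2011-09-22 00:50:17.322250] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.375163] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.430217] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.484228] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.539632] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.592514] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.648209] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.701122] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.757177] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.809239] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.865398] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.917665] ::ffff:84.207.224.3 "/"
[2011-09-22 00:50:17.959539] ::ffff:84.207.224.3 "/"
"""

LINE_RE = re.compile(
    r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.(\d+)\] (\S+) "([^"]*)"$'
)

def parse(line):
    """Return (timestamp, client, path) for one log line."""
    stamp, micros, client, path = LINE_RE.match(line).groups()
    # ASSUMPTION: the fractional field is microseconds without zero padding.
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return when + timedelta(microseconds=int(micros)), client, path

entries = [parse(line) for line in LOG.splitlines()]
span = (entries[-1][0] - entries[0][0]).total_seconds()
rate = (len(entries) - 1) / span  # intervals per second
longpolls = sum(1 for _, _, path in entries if path.startswith("/LP"))

print(f"{len(entries)} requests over {span:.3f}s, about {rate:.1f} requests/sec")
print(f"long-poll requests in the sample: {longpolls}")
```

Under that reading, the single client in the excerpt is making a request roughly every 50 ms.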


Quote
Does your daemon have the 4diff patch?

I assumed the latest version of namecoind has those patches but I'm not sure.


Overall I'm impressed with how smoothly PSJ is running.
sr. member
Activity: 266
Merit: 254
There's definitely something wrong here; one miner with scan=1 should request 1 getwork/second.  With 5 ECU I'd be surprised if psj had a hiccup at anything less than 500-1000/sec, *provided* the daemon is keeping up.

Does your daemon have the 4diff patch?

I'd like to see some running stats from the server.  Can you try running this bash script then zip up the log directory and send it to me.

Code:
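#!/bin/bash
# Poll poolserverj's getsourcestats page every INTERVAL and save each response to its own log file.

mkdir -p log
rm -f log/wget.log

INTERVAL=2s

while true; do
    echo poke...
    # Quote the URL so the shell never treats '?' as a glob character.
    wget -a log/wget.log -O "log/getsourcestats-$(date +%Y%m%d-%H%M%S).log" --connect-timeout=10 --read-timeout=10 "http://localhost:8997/?method=getsourcestats"
    sleep "$INTERVAL"
done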

Just save it to a file 'logger.sh', then:
chmod +x logger.sh
./logger.sh

Start it just after poolserverj comes online and keep it running until you have a longpoll.

Edit: Also it would be helpful if you could send me a copy of your properties file as well (remove your passwords).  PM me on IRC for an email address.
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
So I upgraded to a large EC2 Amazon server and tried 100% of the load, and PSJ could not take it.  :'(

The users of CGIMiner cannot use nmcbit.com, as I use round robin and that miner does not stick to a server until it fails like poclbm does.
So I have all my CGIMiners accessing one pushpool server.  When I change the domain's IP address to point to the PSJ server, they take PSJ DOWN, even on a large EC2 Amazon server!

Here is a look at the linux server values...

Code:
top - 18:13:22 up  1:52,  2 users,  load average: 0.72, 0.65, 0.43
Tasks: 116 total,   1 running, 115 sleeping,   0 stopped,   0 zombie
Cpu(s): 34.4%us, 14.1%sy,  0.0%ni, 38.3%id,  0.0%wa,  0.0%hi,  2.4%si, 10.9%st
Mem:   7645956k total,  1158644k used,  6487312k free,    38172k buffers
Swap:        0k total,        0k used,        0k free,   648712k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23621 ubuntu    20   0 2292m 140m  11m S   83  1.9  16:03.16 java
  611 root      20   0  236m  15m 8860 S   60  0.2  24:05.11 namecoind
  616 ubuntu    20   0  225m  63m 8924 S    9  0.9   3:18.40 bitcoind


If I get rid of the CGI miner guys, PSJ acts like no one is touching the server; meanwhile PSJ is inserting 15 to 20 shares every 15 seconds.


Code:
top - 18:27:19 up  2:06,  2 users,  load average: 0.12, 0.18, 0.29
Tasks: 116 total,   1 running, 115 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  0.1%sy,  0.0%ni, 99.2%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   7645956k total,  1141728k used,  6504228k free,    39484k buffers
Swap:        0k total,        0k used,        0k free,   649556k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23621 ubuntu    20   0 2292m 121m  11m S    1  1.6  21:24.45 java
  595 mysql     20   0  168m  26m 6916 S    0  0.3   0:09.45 mysqld
  611 root      20   0  236m  15m 8860 S    0  0.2  27:34.66 namecoind
24076 ubuntu    20   0 19352 1292  944 R    0  0.0   0:08.55 top
    1 root      20   0 24144 2208 1324 S    0  0.0   0:00.29 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd


There has to be a setting to kick those CGI guys to the curb. ;)

If I add just ONE, I repeat, JUST 1 CGI miner with a scantime of 1, I get this...


Code:
top - 18:48:56 up  2:27,  2 users,  load average: 0.98, 0.46, 0.33
Tasks: 116 total,   2 running, 114 sleeping,   0 stopped,   0 zombie
Cpu(s): 36.4%us, 19.0%sy,  0.0%ni, 32.5%id,  0.1%wa,  0.0%hi,  2.5%si,  9.5%st
Mem:   7645956k total,  1206340k used,  6439616k free,    41564k buffers
Swap:        0k total,        0k used,        0k free,   651896k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7895 ubuntu    20   0 2236m 178m  11m S   91  2.4   3:05.97 java
  611 root      20   0  236m  16m 8860 S   62  0.2  33:25.10 namecoind
  616 ubuntu    20   0  225m  63m 8924 S    4  0.9   4:20.02 bitcoind


If there is an LP, it's over: the system is down. (An LP occurred shortly after I copied this.) Remember, this is on a large server!
CGIMiner with scan settings set to 1 is like a DoS attack.

The reason no other pool is seeing this is that I told my users to set the CGIMiner scan rate to a low number; this reduces efficiency but also reduces the number of rejects people were seeing.  So please try not to dismiss my issue, as this would take down pools with just a few miners.


hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
When a new block is found by my pool, stales go up big time, but not on PSJ!  The average guy has 3 to 6% stales; I have 0.63, and only 3 miners have stales. The rest are at ZERO after a block is found!  Amazing!

I'm going to upgrade my server, then put the full load of users on it.  Then I will fool around with clustering PSJ on micro servers to see if I can get that working.

EDIT: Spoke too soon, the PSJ server was throttled.
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
I know that. I just want to learn about clustering and its issues before having a large number of users.  I want to see how the server reacts and how to fail over to a server that can handle all the load.  Basically I am educating myself on all this stuff and learning Linux at the same time.

A micro instance is probably a very good tool for learning with.  I'm just pointing out that psj on a micro is pushing a square peg into a round hole... If it's just for experimentation, that's fine, but you will likely see some behaviours on a micro that you wouldn't on a standard EC2 instance.  Having said that, I'm running the psj website on a micro... I've run the odd bitcoind on it and even a test psj instance, but I didn't expect too much of it when it...

So I am running PSJ on a micro and I have thrown virtually all of my hashing power on it and I am getting better speeds from my miners than pushpool.  I have to say PSJ is kick ass solid if someone wanted to create a small pool using micro ec2.


sr. member
Activity: 266
Merit: 254
I know that. I just want to learn about clustering and its issues before having a large number of users.  I want to see how the server reacts and how to fail over to a server that can handle all the load.  Basically I am educating myself on all this stuff and learning Linux at the same time.

A micro instance is probably a very good tool for learning with.  I'm just pointing out that psj on a micro is pushing a square peg into a round hole... If it's just for experimentation, that's fine, but you will likely see some behaviours on a micro that you wouldn't on a standard EC2 instance.  Having said that, I'm running the psj website on a micro... I've run the odd bitcoind on it and even a test psj instance, but I didn't expect too much of it when it...
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
You are not going to be able to control other people's miner settings, so you will have to deal with miners who have high scan rates.  I repeat, a micro instance is not suitable for running a pool of any scale.

In recent tests on a large production pool using a dedicated server, it took over 1500GH/s before psj had its first hiccup.  At that point it was servicing over 4000 getworks/sec.  The problem you're having is the CPU throttling on the micro instance.  Once the throttling kicks in, your CPU capacity is squeezed from 2 ECU down to 0.35 ECU, so if your baseline load is 0.5 ECU your server is effectively squashed.

I know that. I just want to learn about clustering and its issues before having a large number of users.  I want to see how the server reacts and how to fail over to a server that can handle all the load.  Basically I am educating myself on all this stuff and learning Linux at the same time.
sr. member
Activity: 266
Merit: 254
You are not going to be able to control other people's miner settings, so you will have to deal with miners who have high scan rates.  I repeat, a micro instance is not suitable for running a pool of any scale.

In recent tests on a large production pool using a dedicated server, it took over 1500GH/s before psj had its first hiccup.  At that point it was servicing over 4000 getworks/sec.  The problem you're having is the CPU throttling on the micro instance.  Once the throttling kicks in, your CPU capacity is squeezed from 2 ECU down to 0.35 ECU, so if your baseline load is 0.5 ECU your server is effectively squashed.
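
The throttling arithmetic can be written out as a tiny Python sketch. The three ECU figures are the ones quoted in the post; the `utilisation` helper is a hypothetical name, and treating load as a simple ratio of baseline to capacity is a simplification.

```python
def utilisation(baseline_ecu, capacity_ecu):
    """Fraction of the available CPU consumed by the baseline load (above 1.0 means overloaded)."""
    return baseline_ecu / capacity_ecu

BURST_ECU = 2.0       # micro instance while bursting (figure from the post)
THROTTLED_ECU = 0.35  # micro instance once throttled (figure from the post)
BASELINE_ECU = 0.5    # example baseline load used in the post

burst = utilisation(BASELINE_ECU, BURST_ECU)
throttled = utilisation(BASELINE_ECU, THROTTLED_ECU)

print(f"bursting:  baseline uses {burst:.0%} of capacity")
print(f"throttled: baseline uses {throttled:.0%} of capacity")
```

A baseline that needs only a quarter of the burst capacity needs well over 100% of the throttled capacity, so the server falls behind before any miner traffic is added.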
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
So I was running some tests. I put all 6GH I have on PSJ using poclbm for 12 hrs: NO problem. PSJ on a micro server does not even flinch as it posts 30+ shares every 15 seconds to the database.

AWESOME!

I take 1 GPU using CGIMiner default settings... NO problem.  Change scans to 1 and PSJ collapses in a ball of flames.

THAT WAS JUST 1 miner!

I'm going to re-test that and post again if I have more info.
EDIT:
It's confirmed: taking the scantime setting down to a low number hits the server very hard. I'm going to make some adjustments to the settings to see if I can prevent this hit.  Also going to try CGIMiner by itself with scan time high.
sr. member
Activity: 266
Merit: 254
BTW you can't go from small to large, as a small is a 32bit computer... unless Linux has some magic code I did not know about that allows it to reboot into 64bit. :)

Sorry the one I was thinking of was High-CPU Medium Instance which is 32bit with 5 EC2 compute units.  2x price/hour but 5x CPU capacity.
hero member
Activity: 780
Merit: 510
Bitcoin - helping to end bankster enslavement.
Still on 2.9 and it works WAY WAY better than pushpool.  I could kiss you, shadders  :-*

HOWEVER, I have only got it to go with half my load before it falls down, because it sucks up 100% of the Amazon micro CPU. This is not good, as Amazon only gives you bursts of 100% usage, then throttles the VM down.

Pushpool will take 100% of my load but I get these errors...

Code:
2011-09-19 14:05:09: Listener for "coinserver3 test": 19/09/2011 14:05:09, Problems communicating with bitcoin RPC 0 2
2011-09-19 14:07:01: Listener for "coinserver3 test": 19/09/2011 14:07:01, Problems communicating with bitcoin RPC 0 2

Still trying to fine-tune it so that I get the best performance; I need to upgrade and code in a control panel for my web app.  Man, I knew when I read your web site that my 10 BTC donation was worth every BTC cent.

If I can get poolserverj handling 25 GHash/s on a micro server (with bitcoind and namecoind running), I will kick you some more BTC or, one better, hire you on as a part-time consultant, but we will see.

Davinci

The CPU burst profile of an EC2-micro instance really isn't suitable for PSJ.  PSJ will always have a baseline level of load due to keeping the cache full. This is an advantage when load increases, but an EC2-micro is designed for scenarios where most of the time the load is near zero and bursts occasionally.  If you have a constant baseline load, it won't allow the CPU to burst.  So I think, from memory, you end up with about 0.2 of an EC2 compute unit as your constant capacity.  In pushpool's case it basically doesn't do anything until it gets a request, so it's probably using the full 2 EC2 compute unit burst capacity.  For low-load pools pushpool will probably run better on a micro than psj.  You would be better off running it on a small instance to get a constant 1 EC2 compute unit, then raising it to a large as your capacity needs increase.

Amazon have a very detailed article on this but I can't find it... This talks about the same sort of thing though : http://huanliu.wordpress.com/2010/09/10/amazon-ec2-micro-instances-deeper-dive/


Right now I have PSJ running and handling 2GH nicely. Compared to pushpool, my miners are cleanly getting LP new blocks, and thus have low stales and run a lot faster than before.
3 of the miners got a Timeout error from PSJ once, but I have not seen it again after 2 other blocks were found.

Pushpool always gives me timeout errors for LP and the above error.

Not once have I gotten this from Pushpool...
Code:
2011-09-19 22:52:09: Listener for "coinserver5 test": 19/09/2011 22:52:09, long poll: new block 0000589b63ba445d
2011-09-19 22:52:15: Listener for "coinserver5 test": 19/09/2011 22:52:15, Using new LP URL /LP/
2011-09-19 22:52:15: Listener for "coinserver5 test": 19/09/2011 22:52:15, LP connected to test1.nmcbit.com:8332

It's always been a timeout Exception.

So anyways it's looking good with 2GH and my currently incorrect settings.  I will adjust see what happens.


BTW you can't go from small to large, as a small is a 32bit computer... unless Linux has some magic code I did not know about that allows it to reboot into 64bit. :)
sr. member
Activity: 266
Merit: 254
I would like to know is this memcache or something else internal to java like .NET cached objects?

Poolserverj does not use memcached.  All of its caches, queues, maps etc. are internal, which is part of the reason psj eats a lot more memory than the pushpoold process.  It's an awful lot faster this way because we can avoid LRU overhead, don't have to traverse the network stack, and in many cases can avoid using map keys altogether.

The maps are using the obscenely efficient trove library which boosted getwork performance by about 50% when it was implemented as well as some intelligent hashing/comparison strategies which would be impossible with memcached. 

e.g. string comparisons for duplicate checks don't just check char by char and abort when they find one that isn't equal.  They check the chars in a different order, starting with the ones that are most likely to differ first.  Hash codes only use about 5 chars of a getwork data string and from memory I think they pick those from the merkle root which should be unique to every work.  This provides more than enough uniqueness for the hashmap to work efficiently but saves about 80-90% of the work involved in a hashmap put or get.
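
A rough Python sketch of the idea, not poolserverj's actual Java code: the `WorkKey` class, the sampled positions and the `make_data` helper are hypothetical. The field offsets assume the getwork data string is a 256-character hex string beginning with the 80-byte block header (version, previous hash, merkle root, time, bits, nonce), which puts the merkle root at hex characters 72 to 135.

```python
MERKLE_START, MERKLE_END = 72, 136       # ASSUMED hex offsets of the merkle root
HASH_POSITIONS = (72, 85, 98, 111, 124)  # hypothetical: 5 chars sampled from the merkle root

class WorkKey:
    """Wraps a getwork data string with a cheap hash and a fail-fast comparison."""
    __slots__ = ("data",)

    def __init__(self, data):
        self.data = data

    def __hash__(self):
        # Hash only a handful of characters instead of the whole string.
        return hash(tuple(self.data[i] for i in HASH_POSITIONS))

    def __eq__(self, other):
        if not isinstance(other, WorkKey):
            return NotImplemented
        a, b = self.data, other.data
        if len(a) != len(b):
            return False
        # Compare the region most likely to differ first, so unequal works bail out early.
        if a[MERKLE_START:MERKLE_END] != b[MERKLE_START:MERKLE_END]:
            return False
        return a[:MERKLE_START] == b[:MERKLE_START] and a[MERKLE_END:] == b[MERKLE_END:]

def make_data(merkle_hex):
    """Build a fake 256-char data string around the given 64-char merkle root."""
    header = "00000001" + "ab" * 32 + merkle_hex + "4e7a9f3b" + "1a0abbcf" + "00000000"
    return header + "0" * 96  # padding up to 128 bytes

w1 = WorkKey(make_data("11" * 32))
w2 = WorkKey(make_data("22" * 32))
w3 = WorkKey(make_data("11" * 32))  # same content as w1

seen = {w1}
print("w3 is a duplicate:", w3 in seen)
print("w2 is a duplicate:", w2 in seen)
```

The hash only has to spread entries across buckets; the full comparison still runs on a bucket hit, so sampling a few characters does not cause false duplicates.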