Thank you DSTM for your software.
Last week I used it with 2500 GPUs, but I had a lot of problems: 10% of the GPUs crash every 30-60 minutes. I really want to work with you to solve that problem. We plan to switch more GPUs over, but we need more stability first; after that, we will have special requests for the JSON API.
Questions:
1- What is the ideal NVIDIA driver version? We use the latest: 384.90
2- What is the best Ubuntu version? We use Ubuntu 14.04
3- Which log should I follow to try to find the problem?
When a GPU fails, can you recover it? I see with nvidia-smi that the GPU is still working. I think EWBF recovers failed GPUs.
Last week I used it with 2500 GPUs, but I had a lot of problems: 10% of the GPUs crash every 30-60 minutes. I really want to work with you to solve that problem. We plan to switch more GPUs over, but we need more stability first; after that, we will have special requests for the JSON API.
I haven't had any reports of crashes so far, but development is pretty fast-paced at the moment, so there could be bugs, of course. I think the fastest and easiest way to resolve the issue is for me to get ssh access to one of your systems that crashes.
JSON API: this is pretty easy to extend. Currently only a basic set is implemented; it was meant for testing.
1- What is the ideal NVIDIA driver version? We use the latest: 384.90
2- What is the best Ubuntu version? We use Ubuntu 14.04
3- Which log should I follow to try to find the problem?
1. I've tested zm on 375.66 and 384.90; both perform equally, without issues.
2. I've tested zm on Ubuntu 16.04, so I can't make a robust statement about 14.04.
3. It would be much faster and easier if you could provide ssh access to one of your systems.
When a GPU fails, can you recover it? I see with nvidia-smi that the GPU is still working. I think EWBF recovers failed GPUs.
ZM is designed so that every GPU is separate and independent, so yes, in some cases a crashed GPU is recoverable (not always!). Recovery is not implemented yet, but it's planned.
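Until recovery is built in, one rough way to narrow down which GPUs stall, and when, is to poll nvidia-smi from outside the miner and log the cards whose utilization drops. The sketch below is only an illustration and is not part of zm. It assumes that a GPU dropped by the miner shows up as low utilization (or an unreadable field) in nvidia-smi, which may not hold on every rig. The function names and the 50% threshold are made up for this example.

```python
import subprocess
import time

# Fields requested from nvidia-smi; --query-gpu and --format=csv are standard options.
QUERY = "index,temperature.gpu,utilization.gpu"


def read_gpu_status():
    """Run nvidia-smi once and return its CSV output as text."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def to_int(field):
    """Convert a CSV field to int; unreadable values become None."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_status(csv_text):
    """Turn nvidia-smi CSV text into a list of per-GPU dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip lines that do not match the query
        index, temp, util = fields
        rows.append({
            "index": to_int(index),
            "temp_c": to_int(temp),
            "util_pct": to_int(util),
        })
    return rows


def idle_gpus(rows, min_util=50):
    """Return indexes of GPUs whose utilization is unreadable or below min_util.

    The threshold is an assumption for this sketch; tune it for your rigs.
    """
    return [r["index"] for r in rows
            if r["util_pct"] is None or r["util_pct"] < min_util]


def watch(interval_s=60):
    """Poll forever and print a timestamped line whenever GPUs look stalled."""
    while True:
        stalled = idle_gpus(parse_gpu_status(read_gpu_status()))
        if stalled:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "stalled GPUs:", stalled)
        time.sleep(interval_s)
```

Calling `watch()` on one of the affected rigs while the miner runs gives a timestamped list of suspect GPU indexes, which can then be matched against the miner's own output for the same period.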