Hello all.
By way of introduction: I run a ~100 node S9 farm that I brought up in early December. Prior to that I ran a small cluster of around 8 S7s and S9s. I got lucky and bought the bulk (80) of my machines during the brief 14TH/s sale.
The vast majority of the machines run just fine. I've had a few APW3++ power supply issues, but swapping those out for new ones generally fixes those problems.
I have also had a few hash card problems: things that new power supplies just don't fix, like this obvious one where a couple of chips are bad:
http://puu.sh/zmqcQ/be27eff6a1.png
Anyhow, yesterday one of my nodes had a card go offline. Happens. Normally a reboot and all is well. Not this time. When I rebooted, the system came up, but never started hashing. Upon investigation I found the single board speed test seemed to be hanging... and I let one pass run over 12 hours to be sure. OK, no biggie, standard process of disconnecting one board at a time and rebooting should identify whatever was wrong. Nope. Went through all 3 boards with no change!
Well, if it wasn't the boards, I figured it must be the controller... and fortunately I received some spares from Bitmain recently that I purchased just for this case. For reference, the original controller card was version 1.20, and my replacements are 1.30... if that matters.
In any case, I swapped the controller and powered up. All three cards are now running fine, but at 400MHz.
Also for reference:
Hardware Version 16.8.1.3
Kernel Version Linux 3.14.0-xilinx-gb190cb0-dirty #57 SMP PREEMPT Fri Dec 9 14:49:22 CST 2016
File System Version Sun Jul 30 20:19:24 CST 2017
Interesting tidbits from the kernel log:
Chain[J6] has 63 asic
Chain[J7] has 63 asic
Chain[J8] has 63 asic
Chain[J6] has no freq in PIC, set default freq=400M
Chain[J6] has no core num in PIC
Chain[J7] has no freq in PIC, set default freq=400M
Chain[J7] has no core num in PIC
Chain[J8] has no freq in PIC, set default freq=400M
Chain[J8] has no core num in PIC
Which explains the 400M clock speed.
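With ~100 nodes, eyeballing kernel logs for this gets old fast, so here is a minimal sketch of how those lines could be summarized per chain. The sample text is copied from the log above; the function name `summarize_chains` and the dictionary layout are just illustrative choices of mine, and I'm assuming other firmware versions word these messages the same way, which may not hold.

```python
import re

# Sample lines copied verbatim from the kernel log above.
KERNEL_LOG = """\
Chain[J6] has 63 asic
Chain[J7] has 63 asic
Chain[J8] has 63 asic
Chain[J6] has no freq in PIC, set default freq=400M
Chain[J6] has no core num in PIC
Chain[J7] has no freq in PIC, set default freq=400M
Chain[J7] has no core num in PIC
Chain[J8] has no freq in PIC, set default freq=400M
Chain[J8] has no core num in PIC
"""

# Matches "Chain[J6] has <rest of message>".
CHAIN_RE = re.compile(r"Chain\[(J\d+)\] has (.+)")


def summarize_chains(log_text):
    """Collect per-chain ASIC count and any 'missing in PIC' notes.

    Hypothetical helper: returns a dict keyed by chain name (e.g. "J6").
    """
    chains = {}
    for line in log_text.splitlines():
        match = CHAIN_RE.search(line)
        if not match:
            continue  # not a chain status line
        name, rest = match.group(1), match.group(2).strip()
        info = chains.setdefault(
            name,
            {"asics": None, "default_freq": None, "missing_core_num": False},
        )
        asic = re.fullmatch(r"(\d+) asic", rest)
        freq = re.search(r"no freq in PIC, set default freq=(\w+)", rest)
        if asic:
            info["asics"] = int(asic.group(1))
        elif freq:
            # The board fell back to a default clock instead of a stored one.
            info["default_freq"] = freq.group(1)
        elif "no core num in PIC" in rest:
            info["missing_core_num"] = True
    return chains


if __name__ == "__main__":
    for chain, info in sorted(summarize_chains(KERNEL_LOG).items()):
        print(chain, info)
```

Any chain that comes back with a non-empty `default_freq` is one that fell back to the default clock rather than a stored frequency, which is the situation all three of my boards are in.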
This leads me to my primary question: How do I get the controller to run its speed test and auto-tune these back up to their proper setting?
I tried using
http://172.16.4.155/cgi-bin/minerAdvanced.cgi to set the starting speed to 550M, hoping that would kick start the process. The 2nd time I tried that, it appeared to work, but all cards are at the fixed frequency I specified.
Guessing I just need to flash one of the auto-freq binaries, but wanted to check and make sure before I potentially trash a $60 (plus ridiculous shipping) card.
Secondary question: Can all the controllers support both fixed and auto-freq binaries? If so, is there any reason to run fixed on any of my older miners that came with it? Related, is it then just the mix of hash cards that determines if a miner is 13.5TH or 14TH?
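For a rough sense of scale on that last part, here is a back-of-the-envelope sketch of nominal hash rate versus clock. The 63 chips per board and 3 boards come from my kernel log; the 114 cores per chip and the "one hash per core per clock" model are assumptions on my part (commonly quoted for the S9's chips, but not something the log confirms), so treat the output as an estimate only.

```python
CHIPS_PER_BOARD = 63   # from the kernel log: "has 63 asic"
BOARDS = 3             # J6, J7, J8
CORES_PER_CHIP = 114   # ASSUMPTION: commonly quoted figure, not from the log


def nominal_ths(freq_mhz, boards=BOARDS):
    """Estimate nominal TH/s, assuming one hash per core per clock cycle."""
    hashes_per_second = boards * CHIPS_PER_BOARD * CORES_PER_CHIP * freq_mhz * 1e6
    return hashes_per_second / 1e12


def freq_for_ths(target_ths, boards=BOARDS):
    """Inverse of nominal_ths: clock in MHz needed for a target TH/s."""
    cores = boards * CHIPS_PER_BOARD * CORES_PER_CHIP
    return target_ths * 1e12 / (cores * 1e6)


if __name__ == "__main__":
    for mhz in (400, 550, 650):
        print(f"{mhz} MHz -> {nominal_ths(mhz):.2f} TH/s")
    for ths in (13.5, 14.0):
        print(f"{ths} TH/s -> about {freq_for_ths(ths):.0f} MHz")
```

Under those assumptions the 400M default works out to well under 9 TH/s, and the gap between a 13.5TH and a 14TH unit is only a couple dozen MHz of clock, which is why I suspect it comes down to what the individual hash cards can sustain.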
Final question: Anybody know of a trustworthy US based repair shop? I don't mind sending in individual cards for repair, but no way am I going to send something like that miner with a few bad chips off to China for a few months of travel, losing all that hash the whole time. I've previously used BitmainWarrenty (now MyRig), but am not 100% happy with them due to what seems like excessive repair time (several months).