
Topic: Canaan Avalon 1246 Temperature Anomaly? (Read 178 times)

newbie
Activity: 15
Merit: 10
March 11, 2021, 02:46:57 PM
#13
If temp is really over 180C then the board would discolor and char. Also a safe bet that the chip would be dead by now from semiconductor failures in it.

Historically Canaan has always had temp sensors inside of each chip (hence the per-chip readout) so considering it is otherwise running normally I go with the idea that it is a false positive.

Agreed, my hypothesis is that if that temperature were correct, the chip would have been fried already, or it would have led to some other failure.

So for posterity, is it possible it's a faulty temperature sensor inside of the chip? I'm not sure how Avalon's CGMiner implementation works against their hardware. I took a look at their GitHub, thinking maybe it was their software misreporting, but nothing jumped out at me at first glance.

This was running for quite an extended period of time at that reported temp, and hashing on that chip is normal and working without issue. I can't say too much for the chip architecture, but the likelihood of one of the ASICs even reaching that high temperature while being in such close proximity to the heat sink (even if the thermal paste wasn't 100%) AND the fans cranking seems VERY unlikely. For example: The unit will be completely shut off, ambient room/device temperature is 55F. I turn the unit on and that ASIC reports 180C immediately.
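To put the cold-start observation in numbers, here is a rough plausibility check. It is only a sketch: the function names and the 40C warm-up allowance are assumptions made for illustration, not anything taken from Canaan's firmware or CGMiner.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def plausible_at_power_on(reading_c: float, ambient_c: float,
                          max_rise_c: float = 40.0) -> bool:
    """Return True if a first reading after a cold start is believable.

    max_rise_c is a hypothetical allowance for how far a die could sit
    above ambient in the first moments after power-on.
    """
    return reading_c <= ambient_c + max_rise_c


ambient_c = fahrenheit_to_celsius(55.0)
print(round(ambient_c, 1))                      # 12.8
print(plausible_at_power_on(180.0, ambient_c))  # False
```

A chip that starts at roughly 13C cannot be at 180C the instant power is applied, so a reading like that points at the sensor or the reporting path rather than real heat.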

Apologies for the hypothesizing, I'm just trying to pinpoint the false positive and whether it's a hardware or software issue.
legendary
Activity: 3822
Merit: 2703
Evil beware: We have waffles!
March 11, 2021, 02:15:16 PM
#12
Melt the PCB? No. They are made of FR4 fiberglass & epoxy resin. That stuff cannot melt.
The solder, however, is lead-free with a typical melting point around 217C, and the reported chip temp is still below that.

If temp is really over 180C then the board would discolor and char. Also a safe bet that the chip would be dead by now from semiconductor failures in it.

Historically Canaan has always had temp sensors inside of each chip (hence the per-chip readout) so considering it is otherwise running normally I go with the idea that it is a false positive.
newbie
Activity: 15
Merit: 10
March 11, 2021, 01:17:27 PM
#11
You can see my Troubleshooting guide for older Avalons for reference (link is in my signature space). Good cpu paste will do.

Great, thanks a lot. I was waiting to hear back from Canaan before opening anything up, as I followed up with them with latest logs showing that same ASIC running at 120C... well...

Latest update:
That same chip is now showing 180C+ (sometimes 198C!) for the same voltage in the miner logs, and Canaan replied that it is most likely a false positive and everything is fine (but with pretty sparse details on their end).

Could this be a false positive? Wouldn't 180C melt the PCB?

Keep in mind, it's not rebooting or anything, and seems to be chugging along fine. All other indicators seem okay.
legendary
Activity: 4326
Merit: 8950
'The right to privacy matters'
March 03, 2021, 09:13:47 PM
#10
re: thermal pads - if needed yes they will bridge gaps better than paste BUT - that ability comes at the expense of added thermal resistance. Assuming the fitup between chip and heat sink is good (flat contact with no gaps) using a SLIGHT amount of paste is better.

FYI: The design goal of thermal paste is to fill microscopic pits/pores in very flat surfaces that fit together well. It is not for filling voids or gaps and as such only minimal amounts are to be used.

hopefully the case is some underapplied paste vs a slight dip in the heatsink or pcb board.

i would go paste first then switch to a pad if paste does not help.

lastly it could be a shitty chip, and then paste or pad does not help.
legendary
Activity: 3822
Merit: 2703
Evil beware: We have waffles!
March 03, 2021, 09:02:23 PM
#9
re: thermal pads - if needed yes they will bridge gaps better than paste BUT - that ability comes at the expense of added thermal resistance. Assuming the fitup between chip and heat sink is good (flat contact with no gaps) using a SLIGHT amount of paste is better.

FYI: The design goal of thermal paste is to fill microscopic pits/pores in very flat surfaces that fit together well. It is not for filling voids or gaps and as such only minimal amounts are to be used.
legendary
Activity: 4326
Merit: 8950
'The right to privacy matters'
March 03, 2021, 06:29:08 PM
#8
possibly a pad like this

https://www.newegg.com/arctic-actpd00011a/p/2MB-000F-00063

sometimes it is a larger gap and thermal pads work better.
legendary
Activity: 2506
Merit: 1714
Electrical engineer. Mining since 2014.
March 03, 2021, 05:18:32 PM
#7
You can see my Troubleshooting guide for older Avalons for reference (link is in my signature space). Good cpu paste will do.
newbie
Activity: 15
Merit: 10
March 03, 2021, 01:18:34 PM
#6
I'm willing to take a look, not a big deal. It's a shame too, because the room is fairly cold (45F).

Do you have any resources you can point in my direction for things to be aware of during disassembly/inspection? Will typical cpu thermal paste do, or do you recommend a specific type?
legendary
Activity: 2506
Merit: 1714
Electrical engineer. Mining since 2014.
March 03, 2021, 10:29:50 AM
#5
Would it be a difficult operation to re-paste the A1246 hash board heat sink?
That would be my suggestion for a fix procedure.
legendary
Activity: 3822
Merit: 2703
Evil beware: We have waffles!
March 03, 2021, 10:15:49 AM
#4
The log shows that the one chip is both running at a higher core voltage and temp than all the others.

As to why, kind of a chicken-egg situation: core voltage and temps are related in that a higher temp can require a higher core voltage for low error rate and at the same time, higher voltage results in higher temps. Possibly that one chip is not making proper contact with the heat sink?
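For anyone who wants to check this without eyeballing 120 numbers, here is a minimal sketch that flags chips standing out on both temperature and voltage. It uses the PVT_T0 and PVT_V0 arrays from the log posted in this thread (with the bold tags stripped). The modified z-score with the 0.6745 constant and the 3.5 cutoff is a common convention chosen here for illustration; it is not how CGMiner or Canaan evaluate chips.

```python
from statistics import median

# Hash board 0 per-chip readings, copied from the log in this thread.
PVT_T0 = [
    71, 70, 69, 71, 70, 70, 70, 70, 69, 74, 70, 69, 74, 69, 72, 75, 72, 69, 72, 72,
    73, 73, 68, 71, 69, 73, 72, 72, 71, 70, 70, 71, 70, 73, 72, 70, 68, 69, 74, 69,
    70, 69, 65, 70, 69, 68, 69, 66, 64, 66, 68, 67, 64, 63, 62, 105, 64, 62, 64, 63,
    63, 57, 57, 61, 59, 60, 62, 61, 60, 59, 61, 62, 64, 64, 60, 61, 66, 64, 65, 64,
    63, 59, 66, 65, 64, 63, 64, 63, 66, 65, 69, 64, 62, 62, 67, 68, 69, 68, 64, 65,
    68, 68, 69, 69, 64, 64, 68, 70, 69, 66, 63, 65, 73, 68, 67, 69, 62, 62, 67, 69,
]
PVT_V0 = [
    294, 299, 303, 292, 295, 297, 301, 308, 306, 292, 291, 294, 294, 292, 295, 301, 299, 295, 292, 289,
    289, 307, 306, 309, 290, 288, 286, 290, 290, 297, 290, 291, 292, 291, 291, 294, 291, 295, 300, 290,
    292, 296, 297, 301, 302, 300, 294, 297, 298, 304, 299, 301, 300, 299, 305, 337, 297, 300, 297, 302,
    303, 298, 297, 299, 298, 304, 302, 303, 308, 300, 302, 303, 306, 310, 306, 308, 306, 306, 296, 295,
    299, 301, 305, 300, 297, 297, 294, 299, 295, 296, 297, 298, 307, 296, 303, 308, 294, 297, 302, 299,
    298, 301, 296, 294, 293, 305, 302, 294, 295, 293, 291, 306, 300, 303, 298, 299, 296, 298, 296, 298,
]


def robust_outliers(values, threshold=3.5):
    """Return indices whose median/MAD-based modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to compare against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


hot = set(robust_outliers(PVT_T0))
high_v = set(robust_outliers(PVT_V0))
print(sorted(hot & high_v))  # zero-based positions flagged in both arrays
```

A median-based score is used instead of mean and standard deviation so that the one extreme reading does not inflate the spread it is being measured against. The printed positions are zero-based array indices; whether they match the chip numbering on the board is not something the log confirms.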
newbie
Activity: 15
Merit: 10
March 03, 2021, 09:06:45 AM
#3
Ah! Thank you! I think my eyes are shot. Good catch.



I contacted Canaan, and they told me not to worry about it unless it runs consistently over 105C (the most I've seen that chip get to was 100C for a few minutes before dropping down to 88C). Not sure if it's worth cracking this open and checking everything, or sending it back and losing a month or two over one ASIC.
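Canaan did not say what "consistently" means, so the sketch below picks one reading of it: the chip counts as over the limit only if it stays above 105C for several samples in a row. The function name, the run length of 5, and the sample values are all assumptions for illustration.

```python
def sustained_over(readings_c, limit_c=105.0, min_consecutive=5):
    """Return True if readings stay above limit_c for min_consecutive samples in a row."""
    run = 0
    for temp in readings_c:
        run = run + 1 if temp > limit_c else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical samples shaped like the behavior described: a short peak at 100C, then 88C.
samples = [88, 95, 100, 100, 92, 88, 88]
print(sustained_over(samples))  # False
```

With that definition, a brief peak at 100C never trips the check, which matches the advice not to worry yet.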

Also, funny enough, it found a block this morning, no kidding (you're welcome, slushpool).



Posting this, maybe someone can help shed some light. Also replied to Canaan with more log data.

Code:
PVT_T0
[71 70 69 71 70 70 70 70 69 74 70 69 74 69 72 75 72 69 72 72 73 73 68 71 69 73 72 72 71 70 70 71 70 73 72 70 68 69 74 69 70 69 65 70 69 68 69 66 64 66 68 67 64 63 62 [b]105[/b] 64 62 64 63 63 57 57 61 59 60 62 61 60 59 61 62 64 64 60 61 66 64 65 64 63 59 66 65 64 63 64 63 66 65 69 64 62 62 67 68 69 68 64 65 68 68 69 69 64 64 68 70 69 66 63 65 73 68 67 69 62 62 67 69]

PVT_V0
[294 299 303 292 295 297 301 308 306 292 291 294 294 292 295 301 299 295 292 289 289 307 306 309 290 288 286 290 290 297 290 291 292 291 291 294 291 295 300 290 292 296 297 301 302 300 294 297 298 304 299 301 300 299 305 [b]337[/b] 297 300 297 302 303 298 297 299 298 304 302 303 308 300 302 303 306 310 306 308 306 306 296 295 299 301 305 300 297 297 294 299 295 296 297 298 307 296 303 308 294 297 302 299 298 301 296 294 293 305 302 294 295 293 291 306 300 303 298 299 296 298 296 298]

MW0
[82 87 96 74 69 53 83 81 74 50 75 77 69 76 55 72 78 87 52 83 61 79 81 80 77 70 66 67 72 16 71 88 85 75 76 73 66 88 75 57 81 85 88 68 75 73 86 77 96 48 79 61 77 67 80 72 71 82 68 77 81 67 64 69 82 92 78 78 57 83 63 72 63 95 76 84 78 71 71 80 74 82 60 72 74 90 80 80 92 53 46 74 79 83 87 72 86 81 82 91 66 46 74 90 69 100 91 73 104 86 67 88 85 91 69 85 78 70 73 81]

There is definitely something funky going on with the ASIC there.
legendary
Activity: 2506
Merit: 1714
Electrical engineer. Mining since 2014.
March 01, 2021, 12:39:40 AM
#2
You have one chip reporting a 91C temperature (on the first hash board, see the PVT_T0 values). Maybe a bad ASIC chip or bad contact with the heat sink?
The TMax literally just shows the highest reading among all the chips.
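A quick way to confirm that is to take the maximum of the per-chip array and note where it sits. This is just a sketch against the PVT_T0 line from the original post; `hottest_chip` is a made-up helper name, and the index is a zero-based array position that may or may not match the chip numbering on the board.

```python
# PVT_T0 from the original post's log (hash board 0, one value per chip).
PVT_T0 = [
    70, 70, 68, 69, 69, 69, 69, 69, 68, 72, 68, 68, 72, 68, 71, 74, 70, 68, 70, 70,
    71, 71, 66, 70, 68, 71, 70, 69, 69, 69, 69, 69, 68, 71, 71, 69, 67, 68, 73, 69,
    69, 68, 65, 69, 68, 68, 68, 65, 64, 66, 67, 67, 63, 62, 62, 91, 63, 61, 64, 63,
    64, 57, 57, 62, 60, 60, 62, 61, 60, 60, 62, 62, 63, 64, 60, 61, 66, 64, 64, 64,
    63, 59, 65, 65, 64, 63, 63, 62, 66, 64, 69, 63, 61, 62, 66, 68, 69, 68, 64, 65,
    67, 67, 67, 68, 64, 64, 67, 69, 67, 66, 63, 64, 72, 67, 66, 69, 62, 61, 67, 68,
]


def hottest_chip(temps):
    """Return (zero-based index, temperature) of the highest reading."""
    idx = max(range(len(temps)), key=temps.__getitem__)
    return idx, temps[idx]


print(hottest_chip(PVT_T0))  # (55, 91)
```

The 91 matches TMax[91] in the same log, and every other chip on that board reads 74C or lower.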
newbie
Activity: 15
Merit: 10
March 01, 2021, 12:33:01 AM
#1
I've got an Avalon 1246, and it has been reporting a pretty high TMax (over 85C) in the logs. However, the actual chip temperatures (the MTavg values) don't seem to match up. The hash rate is solid (between 80-82 TH/s on low power mode), and the Dashboard shows everything is fine, but I was concerned about this high value being reported. Sometimes it reads up to 93C.

Has anyone seen anything like this with Canaan, or have any ideas about why there's a discrepancy? No audible issues and no major fluctuations in hashing, mind you. Faulty temperature sensor?

Code:
Temp[31]
TMax[91]
TAvg[65]
Fan1[4360] Fan2[4330] Fan3[4330] Fan4[4315]
FanR[57%]
MTmax[91 76 70]
MTavg[66 67 63]
PVT_T0[ 70 70 68 69 69 69 69 69 68 72 68 68 72 68 71 74 70 68 70 70 71 71 66 70 68 71 70 69 69 69 69 69 68 71 71 69 67 68 73 69 69 68 65 69 68 68 68 65 64 66 67 67 63 62 62 91 63 61 64 63 64 57 57 62 60 60 62 61 60 60 62 62 63 64 60 61 66 64 64 64 63 59 65 65 64 63 63 62 66 64 69 63 61 62 66 68 69 68 64 65 67 67 67 68 64 64 67 69 67 66 63 64 72 67 66 69 62 61 67 68]

PVT_T1[ 66 71 70 69 67 69 69 67 70 72 70 69 66 69 71 71 74 66 71 72 70 70 76 71 69 71 71 73 70 69 71 74 74 69 73 70 71 72 70 70 73 71 66 71 69 68 69 68 66 64 65 66 67 66 64 64 66 66 64 65 60 57 58 62 64 62 64 65 66 64 65 65 61 64 62 63 67 70 66 64 65 64 66 67 69 66 64 66 67 68 67 65 64 67 67 67 69 68 66 64 67 69 68 65 62 62 68 65 67 65 66 65 66 68 67 65 65 64 63 69]

PVT_T2[ 68 69 69 66 65 65 63 68 69 65 65 68 65 64 69 67 69 67 67 68 66 70 68 69 66 65 66 69 66 68 65 65 69 70 67 65 66 67 68 69 66 67 63 66 66 64 64 64 58 63 64 61 62 59 57 60 59 58 59 57 56 56 57 59 58 58 62 60 60 61 60 61 60 63 63 63 61 62 63 64 64 61 62 64 64 63 62 64 64 64 61 66 62 63 66 63 67 67 64 64 64 63 67 63 61 65 66 60 61 61 63 64 65 63 66 63 64 61 61 67]
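The log above is a series of `Name[values]` fields, so it can be pulled apart with a small regular expression. The sketch below is not Canaan's tooling; `parse_fields` is a hypothetical helper, and it assumes the name and its bracket sit on the same line with no markup inside the brackets.

```python
import re

# One field looks like Name[...]; several fields may share a line.
FIELD = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_fields(text):
    """Map each Name[...] field to a list of ints (non-digits such as % are dropped)."""
    return {name: [int(n) for n in re.findall(r"\d+", body)]
            for name, body in FIELD.findall(text)}


LOG = """\
Temp[31]
TMax[91]
TAvg[65]
Fan1[4360] Fan2[4330] Fan3[4330] Fan4[4315]
FanR[57%]
MTmax[91 76 70]
MTavg[66 67 63]
"""

fields = parse_fields(LOG)
print(fields["TMax"], fields["MTmax"], fields["MTavg"])
print(fields["TMax"][0] == max(fields["MTmax"]))  # True
```

In this log the three MTmax values (91, 76, 70) equal the highest readings in PVT_T0, PVT_T1 and PVT_T2, and TMax equals the largest of them. That would explain the mismatch: a single hot reading drives TMax while the averages stay in the mid 60s.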