Preventing overheating of my hp2510p

A few weeks ago, when it was fairly hot outside, my notebook suddenly decided to shut down while compiling a new kernel. The problem was that the processor kept running at full speed right up to the shutdown, which is not supposed to happen: thermal monitoring by ACPI and the kernel should throttle the system before it reaches critical temperatures.

But that only works if the system is designed correctly. Now I must admit that the fan was already failing (it has in the meantime been replaced by HP under warranty), so that may have contributed, but it was still working sufficiently.

What also contributed is that I have the notebook in a docking station. This is great, but its design is such that it basically blocks most of the air flow from the fan...

After an extensive investigation with the excellent help of kernel developer Zhang Rui, the cause was found to be in the thermal zones defined in the notebook.

There are 6 thermal zones. Below is some info from /proc/acpi/thermal/.

TZ0/temperature:temperature:     60 C
TZ0/trip_points:critical (S5):  256 C
TZ0/trip_points:passive:         99 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ1/temperature:temperature:     60 C
TZ1/trip_points:critical (S5):  110 C
TZ3/trip_points:critical (S5):  105 C
TZ3/trip_points:passive:         95 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ4/trip_points:critical (S5):  110 C
TZ4/trip_points:passive:         60 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ5/temperature:temperature:     50 C
TZ5/trip_points:critical (S5):  110 C
TZ6/temperature:temperature:     25 C
TZ6/trip_points:critical (S5):   70 C
TZ6/trip_points:passive:         60 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1

The key here is the "passive" and "critical" trip points. At the first, the processor will be throttled. At the second the system will enter into an emergency shutdown. TZ0 is the zone monitoring the temperature of the processor itself; I'm not sure what exactly the other zones correspond to.

You may have noticed that zones TZ1 and TZ5 do not have a passive trip point. And that was exactly the problem. Some testing showed that those two zones can reach quite high temperatures while the other zones are still OK. So the thermal protection never triggers, and the system gets shut down when the critical limit for zone TZ1 or TZ5 is reached.
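Spotting the zones without a passive trip point by eye works for six zones, but it can also be scripted. The following is only a minimal sketch in Python: it assumes the grep-style "zone/file:content" lines shown in the listing above, and the function names are my own invention, not part of any ACPI tool.

```python
import re
from collections import defaultdict

# A few lines copied from the listing above, in the same grep-style format.
SAMPLE = """\
TZ0/trip_points:critical (S5):  256 C
TZ0/trip_points:passive:         99 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ1/trip_points:critical (S5):  110 C
TZ5/trip_points:critical (S5):  110 C
TZ6/trip_points:critical (S5):   70 C
TZ6/trip_points:passive:         60 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
"""

# Matches e.g. "TZ1/trip_points:critical (S5):  110 C" and captures
# the zone name, the trip point kind and the temperature in degrees Celsius.
LINE_RE = re.compile(r"^(TZ\d+)/trip_points:(critical|passive)[^:]*:\s*(\d+) C")


def parse_trip_points(text):
    """Return {zone: {kind: temperature}} for every trip point line found."""
    zones = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            zone, kind, temp = match.groups()
            zones[zone][kind] = int(temp)
    return dict(zones)


def zones_without_passive(text):
    """List the zones that define no passive (throttling) trip point."""
    trips = parse_trip_points(text)
    return sorted(zone for zone, kinds in trips.items() if "passive" not in kinds)


if __name__ == "__main__":
    print(zones_without_passive(SAMPLE))
```

For the sample above this reports TZ1 and TZ5, the two zones that can only ever trigger an emergency shutdown.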

But as the system is running open source software, there's a solution. A kernel patch makes it possible to load a custom DSDT ACPI table from the initramfs (initrd). So after diving into the DSDT code I came up with three small modifications, and I now have an extra passive trip point for TZ1:

TZ1/temperature:temperature:     60 C
TZ1/trip_points:critical (S5):  110 C
TZ1/trip_points:passive:         95 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1

Now that's one of the main reasons I still enjoy working with Linux so much. Maybe I should also fix that crazy critical temperature for TZ0...


Update

Via IRC Matthew Garrett offered the following alternative:

If there's no passive zone defined in the DSDT then there should be a /sys/class/thermal/thermal_zoneN/passive that you can write a temperature into. So you can do this without a custom DSDT.

I gave it a quick try for TZ5, but the result was that the system became very sluggish and any task suddenly used much more CPU than normal. In principle it seems like a very nice alternative, but it looks as if there's a bug to be fixed first.


Update 2

The performance problem was due to a very simple mistake. I entered the temperature in degrees Celsius instead of in millidegrees Celsius, which immediately caused the system to be throttled down to unusability. Well, at least that proved the mechanism works :-)

So now I have the following simple lines in /etc/sysfs.conf:

  class/thermal/thermal_zone2/passive = 95000
  class/thermal/thermal_zone5/passive = 95000

(thermal_zone2 matches TZ1 because of some weird ordering in ACPI.)
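
For a one-off experiment without sysfs.conf, the same write can be done from a small script that takes the temperature in degrees Celsius and does the millidegree conversion itself, which avoids the mistake described above. This is only a sketch: it assumes the kernel exposes the passive attribute Matthew mentions, the helper names are hypothetical, and the 40 to 120 degree sanity range is an arbitrary choice of mine. Writing to the real sysfs tree requires root.

```python
from pathlib import Path

# Location of the thermal zones in sysfs, as used in the sysfs.conf lines above.
SYSFS_THERMAL = Path("/sys/class/thermal")


def celsius_to_millidegrees(celsius):
    """Convert degrees Celsius to the millidegrees that sysfs expects."""
    return int(round(celsius * 1000))


def set_passive_trip(zone_index, celsius, base=SYSFS_THERMAL):
    """Write a passive trip point for thermal_zone<zone_index>.

    The temperature is given in degrees Celsius. Values outside an
    (arbitrarily chosen) plausible range are rejected, so that passing
    95000 "degrees" by accident does not end up in sysfs.
    """
    millideg = celsius_to_millidegrees(celsius)
    if not 40_000 <= millideg <= 120_000:
        raise ValueError(f"implausible passive trip point: {celsius} C")
    target = Path(base) / f"thermal_zone{zone_index}" / "passive"
    target.write_text(f"{millideg}\n")
    return millideg


if __name__ == "__main__":
    # Same values as the sysfs.conf lines above: 95 C for zones 2 and 5.
    for zone in (2, 5):
        set_passive_trip(zone, 95)
```

The base parameter exists only so the function can be tried against a scratch directory before pointing it at the real /sys/class/thermal.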