Debian 11 (Bullseye) boot and freeze issues with Kernel panics on HP Proliant DL380 G7 servers


Published on September 16th 2021 - Listed in Hardware Linux


Since the first Release Candidate (RC1) of Debian 11 (Bullseye) became available, I have been testing the new version on an older HP Proliant DL380 G7 server. Although old and EOL, this server model still runs smoothly and performs well with Linux installed. But I ran into unexpected boot failures - even with RC2 and the official production release.

Bullseye boot fails - often

Interestingly though, once Bullseye was installed, the Operating System often failed to boot properly and froze the server. Neither the CTRL+ALT+DEL key combo nor a momentary press of the power button had any effect. Only a server (hard) reset would release the machine and force a power down.

With normal boot parameters, the "crash" would not be visible. But with an additional "debug" added to the Kernel command line in the Grub config, the following problems could be spotted.

Bullseye boot error/freeze on HP Proliant DL380 G7
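
On Debian, Kernel parameters such as "debug" (or the "idle=nomwait" tried later in this article) are usually added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub. The following is only a minimal sketch of that step: the helper name add_kernel_param is made up for illustration, and it assumes the variable sits on a single double-quoted line.

```shell
#!/bin/sh
# Hypothetical helper: append a Kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# Assumes the variable is defined on one line and uses double quotes.
# If no such line exists, the file content is left unchanged.
add_kernel_param() {
    file=$1
    param=$2
    # Current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    # Append the parameter; drop the leading space if the value was empty.
    new=$(printf '%s %s' "$current" "$param" | sed 's/^ //')
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" \
        "$file" > "$file.tmp" && cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Example (as root):
# add_kernel_param /etc/default/grub debug && update-grub
```

The helper is idempotent, so calling it twice with the same parameter does not duplicate the entry.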

I also recorded a video of the failed Bullseye boot.

With every failed boot, the logged error mentioned cpuidle_enter. This raised a couple of questions concerning the source of the problem:

  • Is the newer Kernel (Bullseye comes with 5.10) incompatible with this kind of processor (Intel Xeon X5660)? Although possible, it is highly unlikely that support for the widely used Xeon processor architecture would all of a sudden be removed from the Kernel.
  • Could it be a problem with the power management? The error mentions cpuidle, so maybe this has something to do with how the HP servers try to reduce power consumption (by using the CPU C states). But even with all kinds of different power modes and with C states disabled, the boot problem continued.
  • Is the hardware defective? The very same boot issues were seen and could be reproduced on two DL380 G7 servers. It would be a big surprise if both machines had failing hardware at the same time.
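
On a system that does boot, the cpuidle question can be looked at directly: the Kernel exposes the active cpuidle driver and the idle states of each CPU below /sys/devices/system/cpu. The following is a rough sketch rather than a diagnostic tool. The function name show_cpuidle is invented here, and the optional argument only exists so the base directory can be pointed somewhere else.

```shell
#!/bin/sh
# Hypothetical helper: print the active cpuidle driver and the idle states of cpu0.
# The optional argument overrides the sysfs base directory.
show_cpuidle() {
    base=${1:-/sys/devices/system/cpu}
    if [ -r "$base/cpuidle/current_driver" ]; then
        echo "driver: $(cat "$base/cpuidle/current_driver")"
    else
        echo "driver: unknown"
    fi
    # Each stateN directory describes one idle state (C state) of the CPU.
    for state in "$base"/cpu0/cpuidle/state*; do
        [ -r "$state/name" ] || continue
        echo "${state##*/}: $(cat "$state/name")"
    done
}

show_cpuidle
```

If the BIOS setting "No C-States" really takes effect, one would expect the list of states to shrink accordingly, but that is an assumption worth verifying on the actual machine.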

But then at the next boot it would work again, booting all the way to the login prompt. If there really was a hardware incompatibility, why would the server sometimes boot and sometimes run into the freeze?

Detailed boot log with Debian 11 (Bullseye)

To figure out how stable (or unstable) Bullseye on this HP Proliant DL380 G7 actually is, I decided to do mass boot testing. Once Debian Bullseye would boot 10x in a row without a hiccup, I would call it stable.

 Boot Attempt   Success or Fail   Changes / Description
 #1             FAIL              -
 #2             FAIL              BIOS -> Power Management
                                  HP Power Profile changed to Custom
                                  HP Power Regulator changed to "OS Control Mode"
 #3             OK
 #4             OK
 #5             OK
 #6             OK
 #7             FAIL
 #8             OK
 #9             OK                Install "intel-microcode" package
 #10            OK                Freeze of system after 2 minutes -> FAIL
 #11            FAIL              BIOS -> Power Management
                                  Power Management Options -> Advanced Power Management Options
                                  -> Minimum Proc Idle Power Core State set to "No C-States"
                                  -> Minimum Proc Idle Power Package State set to "No Package States"
 #12            OK
 #13            FAIL
 #14            OK
 #15            FAIL
 #16            FAIL              Noticed the following error in the console:
                                  iTCO_wdt unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
 #17            OK
 #18            FAIL
 #19            OK                Added "idle=nomwait" to Grub config (Kernel cmdline)
 #20            FAIL
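
The stability criterion (ten successful boots in a row) can be written down as a small check over the recorded results. This is just an illustrative sketch: longest_ok_streak is a made-up helper, and boot #10 is counted as FAIL here because the system froze after two minutes.

```shell
#!/bin/sh
# Hypothetical helper: print the longest run of consecutive "OK" results.
longest_ok_streak() {
    best=0
    run=0
    for result in "$@"; do
        if [ "$result" = "OK" ]; then
            run=$((run + 1))
            if [ "$run" -gt "$best" ]; then
                best=$run
            fi
        else
            run=0   # any failure resets the current streak
        fi
    done
    echo "$best"
}

# Results of the 20 Bullseye boots listed above (#10 counted as FAIL).
streak=$(longest_ok_streak FAIL FAIL OK OK OK OK FAIL OK OK FAIL \
                           FAIL OK FAIL OK FAIL FAIL OK FAIL OK FAIL)
if [ "$streak" -ge 10 ]; then
    echo "stable ($streak successful boots in a row)"
else
    echo "not stable (longest streak: $streak)"
fi
# prints: not stable (longest streak: 4)
```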

As you can see from all the boot tests, the desired ten successful boots in a row did not happen. Is this problem caused by Kernel 5.10 or by something specific to the new Debian Bullseye (a certain configuration, maybe)?

To answer this, another Operating System (Ubuntu 20.04) was installed and tested.

Detailed boot log with Ubuntu 20.04 (Focal)

The idea behind installing Ubuntu 20.04 was to test the older Kernel 5.4. Would this one work? How does the Ubuntu installation behave in general compared to Debian?

 Boot Attempt   Success or Fail   Changes / Description
 #1             OK
 #2             OK
 #3             OK
 #4             OK
 #5             OK
 #6             OK
 #7             OK
 #8             OK
 #9             OK
 #10            OK

Surprise, surprise! The server booted correctly 10x in a row with Ubuntu 20.04! To be honest, I did not expect that. A hardware defect can therefore definitely be ruled out in this case. But is it the Kernel version causing problems or something distribution specific?

ILO freezes and causes Kernel panic with NMI

Meanwhile, back with Debian 11 installed again, the tests were about to continue. I prepared myself to run a Kernel bisect to find out which Kernel version would actually trigger the boot issues. Unfortunately the documentation seems to be so out of date that all attempts to correctly run a bisect failed. At that point I wanted to focus on preparing ILO3 for additional hardware monitoring (using the check_ilo2_health monitoring plugin) and then get back to the bisect procedure.

While testing the monitoring plugin with a read-only user, ILO completely froze: not only the XML Remote API but also the User Interface in the browser. Nothing too bad, I thought, I'll just stop the monitoring and let ILO recover. But I quickly realized that the server itself had stopped responding to pings. A look at the console revealed that the server (again running Debian Bullseye) had frozen and run into a Kernel panic.

According to the console output, the crash was triggered by an NMI:

<NMI>
dump_stack+0x6b/0x83
panic+0x101/0x2d7
nmi_panic.cold+0xc/0xc
hpwdt_pretimeout+0x7f/0xd0 [hpwdt]
nmi_handle+0x58/0x100
default_do_nmi+0x98/0x130
exc_nmi+0x12f/0x150
end_repeat_nmi+0x16/0x55
RIP: 0010:mwait_idle_with_hints.constprop.0+0x4b/0x90
Code: 65 48 8b 04 25 c0 7b 01 00 0f 01 c8 48 8b 00 a8 08 75 17 e9 07 [...]
RSP: 0018:ffffffff87203e58 EFLAGS: 00000046
RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff873ae6c0 RDI: 0000000000000001
RBP: ffffccdbff218e00 R08: 0000068289c6860d R09: 0000068439b22cc1
R10: 0000000000000f8e R11: 00000000001d86ca R12: 0000000000000002
R13: ffffffff873ae7a8 R14: 0000000000000002 R15: 0000000000000000
? mwait_idle_with_hints.constprop.0+0x4b/0x90
? mwait_idle_with_hints.constprop.0+0x4b/0x90
</NMI>
intel_idle+0x1f/0x30
cpuidle_enter_state+0x89/0x350
cpuidle_enter+0x29/0x40
do_idle+0x1ef/0x2b0
cpu_startup_entry+0x19/0x20
start_kernel+0x587/0x5a8
secondary_startup_64_no_verify+0xb0/0xbb

The logged entries looked eerily similar to the ones seen during the boot freezes (cpuidle_enter), but with one major difference: at the beginning of this Kernel panic, the module responsible can be spotted:

hpwdt_pretimeout+0x7f/0xd0 [hpwdt]
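
In a Kernel call trace, frames that belong to a loadable module end with the module name in square brackets, so the modules involved can be pulled out of a saved trace mechanically. A minimal sketch, assuming the trace is available as text on stdin (the function name modules_in_trace is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: list the Kernel modules named in a call trace read from stdin.
# Matches frames of the form "symbol+0xOFFSET/0xSIZE [module]".
modules_in_trace() {
    sed -n 's|.*+0x[0-9a-f]*/0x[0-9a-f]* \[\([A-Za-z0-9_]*\)\]$|\1|p' | sort -u
}

# Example: feed the current Kernel log through the filter.
# dmesg | modules_in_trace
```

Applied to the panic above, this would print hpwdt as the only module appearing in the trace.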

hpwdt: The HP hardware watchdog Kernel module

With this new lead, the next step was to find out more about the hpwdt Kernel module and what it does. It turns out that this Kernel module triggers an NMI on the Operating System in certain situations, for example if ILO is hanging:

If the system gets into a bad state and hangs, the HPE ProLiant iLO timer register will not be updated in a timely fashion and a hardware system reset (also known as an Automatic Server Recovery (ASR)) event will occur.

The pretimeout seen in the trace above is actually a module parameter:

 pretimeout  - allows the user to set the watchdog pretimeout value.
               This is the number of seconds before timeout when an
               NMI is delivered to the system. Setting the value to
               zero disables the pretimeout NMI.
               Default value is 9 seconds.

This explains the Kernel panic from above: ILO was not responding, the HP watchdog could no longer communicate with it, and the pretimeout (9 seconds before the watchdog timeout) forced an NMI on the system.
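
For illustration only: module parameters like pretimeout are normally set through an "options" line in a modprobe.d file. This was not tested in the course of this article, so take the following as a sketch of the mechanism rather than a recommended fix. The helper name and the file name hpwdt-options.conf are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d "options" line for the hpwdt module.
# Arguments: target file, pretimeout value.
set_hpwdt_pretimeout() {
    conf=$1
    value=$2
    printf 'options hpwdt pretimeout=%s\n' "$value" > "$conf"
}

# Example (as root). According to the parameter description quoted above,
# a value of zero disables the pretimeout NMI:
# set_hpwdt_pretimeout /etc/modprobe.d/hpwdt-options.conf 0
# update-initramfs -k all -u
```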

But could hpwdt also be responsible for the boot problems? After all, the logged errors in the console looked very similar.

Ubuntu: We disable hpwdt!

Additional research led to a mailing list conversation (HPWDT watchdog module leads to panics), based on the reported Ubuntu bug #1432837:

We have been seeing random crashs from various HP systems, this has been tracked to loading of the hpwdt watchdog modules.  Basically these modules are a loaded gun and unless you know exactly what you are doing you are likely to take off your own head.  For this reason we already blacklist "all" of these modules in kmod/module-in-tools blacklists.

This basically means: Ubuntu disables (blacklists) the hpwdt module, and has done so since 2015! This would perfectly explain why Ubuntu 20.04 booted successfully, ten times in a row.

But was this bug only reported in Ubuntu? No! Almost every Linux distribution got a bug report concerning panics with hpwdt.
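
Whether a given installation already blacklists the module can be checked by searching the modprobe configuration directories. A small sketch, assuming the usual directory locations (the function name is_blacklisted is made up, and the directory list can be overridden):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a "blacklist <module>" line exists in any of the
# given modprobe.d directories. Without directories, common defaults are searched.
is_blacklisted() {
    module=$1
    shift
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /lib/modprobe.d /usr/lib/modprobe.d
    # -r: search directories recursively, -s: ignore missing directories, -q: no output
    grep -Eqrs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*\$" "$@"
}

if is_blacklisted hpwdt; then
    echo "hpwdt is blacklisted"
else
    echo "hpwdt is not blacklisted"
fi
```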

Blacklisting hpwdt in Debian Bullseye

Let's talk turkey! Assuming the hpwdt module really causes the boot problems, let's disable (blacklist) it the same way Ubuntu describes in the bug report, and then update the initramfs:

root@bullseye:~# echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf
root@bullseye:~# cat /etc/modprobe.d/blacklist-hp.conf
blacklist hpwdt
root@bullseye:~# update-initramfs -k all -u
root@bullseye:~# update-grub
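
A quick check that could be run after the next reboot to confirm the blacklist took effect: the loaded modules are listed in /proc/modules (the same data lsmod shows). The helper below is a hypothetical sketch; its second argument only exists to point it at a different file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the module is listed in /proc/modules.
# The optional second argument overrides the file to read.
module_loaded() {
    module=$1
    list=${2:-/proc/modules}
    # The module name is the first whitespace-separated field of each line.
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$list" 2>/dev/null
}

if module_loaded hpwdt; then
    echo "hpwdt is still loaded"
else
    echo "hpwdt is not loaded"
fi
```

Note that a "blacklist" entry only stops the module from being loaded automatically through its aliases; loading it by hand with modprobe hpwdt would still work.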

After this change, the server was rebooted and the multi boot testing started again:

 Boot Attempt   Success or Fail   Changes / Description
 #1             OK
 #2             OK
 #3             OK
 #4             OK
 #5             OK
 #6             OK
 #7             OK
 #8             OK
 #9             OK
 #10            OK

And here we go! Ten boots in a row without any freezes or crashes, once hpwdt was disabled!

Is it safe to disable the hpwdt module?

To determine whether or not a Kernel module should be loaded (or unloaded) there's actually one question to be answered: Do you need it?

Looking at the specifics of hpwdt and what it actually does:

 The HPE iLO NMI Watchdog driver is a kernel module that provides basic watchdog functionality and handler for the iLO "Generate NMI to System" virtual button.

How often does someone need to trigger an NMI (non-maskable interrupt) via ILO in production? Probably never. I have done that once or twice on test machines in the last 10 years. So at least in our situation we're fine disabling this module, all the more so when weighing the positives (more stable system, no boot problems) against the negatives (can't launch an NMI from ILO).

