Last night there was a crash of a virtual Linux server, running on SLES11 SP1 which uses Kernel 2.6.32 (126.96.36.199-0.5-default to be exact).
At first I suspected a problem with the vmware tools as vSphere client showed the following error at 05.38am this morning:
This virtual machine reset by HA. Reason: VMware Tools heartbeat failure. A screenshot is saved at.....
One of the first things to look at a vm crash is the cpu and memory usage. It often happens that a process gone haywire increases its memory allocation or cpu usage until the machine gives up and eventually dies.
But in this case it's different. At the moment of the crash, a spike of CPU usage is shown:
First thing I saw in the messages log were thousands of kernel entries, starting with this one yesterday at 11.15pm:
Apr 3 23:15:29 linuxserver kernel: [18446744063.848170] ------------[ cut here ]------------
Apr 3 23:15:29 linuxserver kernel: [18446744063.848179] WARNING: at /usr/src/packages/BUILD/kernel-default-188.8.131.52/linux-2.6.32/kernel/sched.c:3847 update_c
Apr 3 23:15:29 linuxserver kernel: [18446744063.848183] Hardware name: VMware Virtual Platform
Apr 3 23:15:29 linuxserver kernel: [18446744063.848184] Modules linked in: af_packet cifs(X) nls_utf8 nfs lockd fscache nfs_acl auth_rpcgss sunrpc softdog vso
ck(X) vmmemctl(X) acpiphp microcode fuse ext2 loop ppdev parport_pc parport sg tpm_tis tpm floppy vmci(X) tpm_bios pcspkr sr_mod cdrom rtc_cmos i2c_piix4 mpt
ctl rtc_core shpchp rtc_lib pci_hotplug intel_agp i2c_core container button ac dm_mirror dm_region_hash dm_log linear sd_mod crc_t10dif dm_snapshot edd dm_mo
d ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata mptspi mptscsih mptbase scsi_transport_spi thermal thermal_sys hwm
on vmw_pvscsi scsi_mod vmxnet3 vmxnet(X)
Apr 3 23:15:29 linuxserver kernel: [18446744063.848343] Supported: Yes
Apr 3 23:15:29 linuxserver kernel: [18446744063.848345] Pid: 4691, comm: db2sysc Tainted: G X 184.108.40.206-0.5-default #1
Apr 3 23:15:29 linuxserver kernel: [18446744063.848347] Call Trace:
Apr 3 23:15:29 linuxserver kernel: [18446744063.848358] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848363] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848368] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848372] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848376] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848379] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848382] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.848386] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.849847] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.849852] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.849855] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.849858] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.849863] [
Apr 3 23:15:29 linuxserver kernel: [18446744063.849872] [<00002aaab0fccb93>] 0x2aaab0fccb93
Apr 3 23:15:29 linuxserver kernel: [18446744063.849874] ---[ end trace 298e520e5f7997c9 ]---
Apr 3 23:15:29 linuxserver kernel: [18446744063.849875] update_cpu_power: cpu_power = 2147741369; scale_rt = 18428729677347811001
Apr 3 23:15:29 linuxserver kernel: [18446744063.851737] update_cpu_power: cpu_power = 2151321416; scale_rt = 18428729677351391048
Apr 3 23:15:29 linuxserver kernel: klogd 1.4.1, ---------- state change ----------
Apr 3 23:15:29 linuxserver kernel: [18446744063.852491] update_cpu_power: cpu_power = 2152131909; scale_rt = 18428729677352201541
Apr 3 23:15:29 linuxserver kernel: [18446744063.854241] update_cpu_power: cpu_power = 2153881873; scale_rt = 18428729677353951505
Apr 3 23:15:29 linuxserver kernel: [18446744063.855667] update_cpu_power: cpu_power = 2155303608; scale_rt = 18428729677355373240
Apr 3 23:15:29 linuxserver kernel: [18446744063.855734] update_cpu_power: cpu_power = 2155375208; scale_rt = 18428729677355444840
Apr 3 23:15:29 linuxserver kernel: [18446744063.858768] update_cpu_power: cpu_power = 2158407876; scale_rt = 18428729677358477508
Apr 3 23:15:29 linuxserver kernel: [18446744063.859665] update_cpu_power: cpu_power = 2159301732; scale_rt = 18428729677359371364
Apr 3 23:15:29 linuxserver kernel: [18446744063.859730] update_cpu_power: cpu_power = 2159370975; scale_rt = 18428729677359440607
Apr 3 23:15:29 linuxserver kernel: [18446744063.862231] update_cpu_power: cpu_power = 2161871656; scale_rt = 18428729677361941288
Apr 3 23:15:34 linuxserver kernel: [18446744068.857877] __ratelimit: 1286 callbacks suppressed
There were actually more than 40'000 entries of these "update_cpu_power" lines in the log, the value cpu_power increasing with each entry.
The last entry at the crash was the following:
Apr 4 05:35:14 linuxserver kernel: [22764.197655] update_cpu_power: cpu_power = 22776205338843; scale_rt = 18428752451405408475
By doing a research of "update_cpu_power: cpu_power" one eventually lands on this VMware community forum topic: VM becomes unresponsive, some vCPUs are 100% utilized. In this topic the last entry contains a link to the archive of the Linux kernel mailing list where a bug occurs after at least 208 days of uptime.
Novell makes a great summary of this bug on their support document Possible system freeze after more than 208 days, where they describe it this way:
Currently we are evaluating/fixing an issue with possible system freezes/crash when the uptime exceeds 208 days. The issue can only happen if the system meets the following conditions:
CPU vendor is Intel, /proc/cpuinfo contains both the CPU flags "constant_tsc" and "nonstop_tsc", dmesg and/or /var/log/boot.msg does not contain the string "Marking TSC unstable", kernel flavor is not xen
Only if all four conditions are present the system is affected.
The freeze/crash can happen anytime after 208 days uptime. It cannot happen before the system reaches 208 days uptime.
For critical production systems that cannot update their kernel version, we currently recommend a reboot prior the system reaching 208 days uptime.
A fix for this issue has been released in kernel version 220.127.116.11-0.3.
Important note: This kernel bug does not only affect SLES (SuSE Linux Enterprise Server) systems, all Linux distros are affected.
Workaround: Reboot the machine before 208 days of uptime.
Final Solution: Use newer kernel than 2.6.32 (SLES11 SP2 uses kernel 3.0) or update to at least 18.104.22.168-0.3 (Linux kernel 5732, released February 6th 2012)
Update April 8th 2012:
The same bug was fixed in the main kernel in version 22.214.171.124 (look for sched_clock in changelog), released December 9th 2011.
In Debian the bug was fixed in linux-2.6 (2.6.32-40), released on December 23rd 2011.
Benjamin from wrote on May 22nd, 2012:
Just wanted to throw a big thank you.
I had a kernel panic on a server with this problem (not rebootet since the fix came out) and your description was immensely helpful. Especially the hint to look for sched_clock made it easy to see which ubuntu kernel package got the fix.
Jason from wrote on Apr 6th, 2012:
IBM recently released a firmware update for their Storwize v7000 series of SANs that was affected by this same bug... it's worth checking out if you use anything Linux-based...
AWS Android Ansible Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Container Containers CouchDB DB DNS Database Databases Docker ELK ElasticSearch Elasticsearch Filebeat FreeBSD GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Icingaweb2 InfluxDB Internet Java Kibana Kubernetes LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Rancher SSL Security Shell SmartOS Solaris Surveillance SystemD TLS Tomcat Ubuntu Unix VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder