Linux server crash due to defective memory

Written by - 0 comments

Published on - Listed in Linux Rant


Just recently I had to handle two crashes of the same Linux server. Both times the machine crashed as soon as I launched an I/O-intensive process (rsync in my case).

The following log entries were written to kern.log.

First crash:

Apr 25 20:12:15  kernel: [12156.863672] BUG: unable to handle kernel NULL pointer dereference at (null)
Apr 25 20:12:15  kernel: [12156.863728] IP: [] writeback_inodes_wb+0xf6/0x4ff
Apr 25 20:12:15  kernel: [12156.863765] PGD 0
Apr 25 20:12:15  kernel: [12156.863787] Oops: 0002 [#1] SMP
Apr 25 20:12:15  kernel: [12156.863812] last sysfs file: /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
Apr 25 20:12:15  kernel: [12156.863862] CPU 4
Apr 25 20:12:15  kernel: [12156.863883] Modules linked in: acpi_cpufreq cpufreq_conservative cpufreq_powersave cpufreq_stats cpufreq_userspace ext3 jbd loop snd_pcm snd_timer i2c_i801 snd soundcore snd_page_alloc i2c_core video wmi button output pcspkr evdev ext4 mbcache jbd2 crc16 dm_mod aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ahci libata ehci_hcd r8169 xhci scsi_mod usbcore thermal nls_base mii processor thermal_sys [last unloaded: scsi_wait_scan]
Apr 25 20:12:15  kernel: [12156.864195] Pid: 9876, comm: flush-253:1 Not tainted 2.6.32-5-amd64 #1 System Product Name
Apr 25 20:12:15  kernel: [12156.864246] RIP: 0010:[]  [] writeback_inodes_wb+0xf6/0x4ff
Apr 25 20:12:15  kernel: [12156.864298] RSP: 0018:ffff88043b4c9d00  EFLAGS: 00010286

Second crash, very similar log entries:

Apr 26 11:11:12 kernel: [ 2942.917788] BUG: unable to handle kernel NULL pointer dereference at (null)
Apr 26 11:11:12 kernel: [ 2942.917838] IP: [<(null)>] (null)
Apr 26 11:11:12 kernel: [ 2942.917862] PGD 0
Apr 26 11:11:12 kernel: [ 2942.917884] Oops: 0010 [#1] SMP
Apr 26 11:11:12 kernel: [ 2942.917907] last sysfs file: /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
Apr 26 11:11:12 kernel: [ 2942.917952] CPU 0
Apr 26 11:11:12 kernel: [ 2942.917971] Modules linked in: acpi_cpufreq cpufreq_conservative cpufreq_powersave cpufreq_stats cpufreq_userspace ext3 jbd loop i2c_i801 i2c_core video snd_pcm evdev output wmi snd_timer snd soundcore snd_page_alloc pcspkr button ext4 mbcache jbd2 crc16 dm_mod aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ahci libata ehci_hcd scsi_mod xhci r8169 mii thermal usbcore nls_base processor thermal_sys [last unloaded: scsi_wait_scan]
Apr 26 11:11:12 kernel: [ 2942.918246] Pid: 1288, comm: flush-253:1 Not tainted 2.6.32-5-amd64 #1 System Product Name
Apr 26 11:11:12 kernel: [ 2942.918292] RIP: 0010:[<0000000000000000>]  [<(null)>] (null)
Apr 26 11:11:12 kernel: [ 2942.918320] RSP: 0018:ffff88043b651c28  EFLAGS: 00010087
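To compare the two crashes side by side, the relevant fields can be pulled out of the log with a few regular expressions. The following is only a minimal sketch: the function names (summarize_oopses, decode_oops_code) are made up for this example, the patterns are tailored to the 2.6.32 log lines quoted above and may not match other kernel versions, and the decoding assumes the number after "Oops:" is the x86 page-fault error code.

```python
import re

# Patterns tailored to the kern.log lines quoted in this post (assumption:
# other kernel versions may format these lines differently).
BUG_RE = re.compile(r"\[\s*(\d+\.\d+)\] BUG: (.+)$")
OOPS_RE = re.compile(r"Oops: ([0-9a-fA-F]{4}) \[#\d+\]")
PID_RE = re.compile(r"Pid: (\d+), comm: (\S+)")

# Low bits of the x86 page-fault error code: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
]


def decode_oops_code(code_hex):
    """Translate the hex value after 'Oops:' into readable flags."""
    code = int(code_hex, 16)
    parts = [on if code & mask else off for mask, on, off in PF_BITS]
    if code & 0x08:
        parts.append("reserved bit set in page table")
    if code & 0x10:
        parts.append("instruction fetch")
    return parts


def summarize_oopses(lines):
    """Return one dict per 'BUG:' line, filled with the Oops code, PID and comm that follow it."""
    reports = []
    current = None
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            current = {"uptime": float(m.group(1)), "bug": m.group(2).strip(),
                       "oops": None, "pid": None, "comm": None}
            reports.append(current)
            continue
        if current is None:
            continue  # nothing to attach this line to yet
        m = OOPS_RE.search(line)
        if m:
            current["oops"] = m.group(1)
        m = PID_RE.search(line)
        if m:
            current["pid"] = int(m.group(1))
            current["comm"] = m.group(2)
    return reports


# Three lines from each crash, copied from the log excerpts above.
sample = [
    "Apr 25 20:12:15  kernel: [12156.863672] BUG: unable to handle kernel NULL pointer dereference at (null)",
    "Apr 25 20:12:15  kernel: [12156.863787] Oops: 0002 [#1] SMP",
    "Apr 25 20:12:15  kernel: [12156.864195] Pid: 9876, comm: flush-253:1 Not tainted 2.6.32-5-amd64 #1 System Product Name",
    "Apr 26 11:11:12 kernel: [ 2942.917788] BUG: unable to handle kernel NULL pointer dereference at (null)",
    "Apr 26 11:11:12 kernel: [ 2942.917907] Oops: 0010 [#1] SMP",
    "Apr 26 11:11:12 kernel: [ 2942.918292] Pid: 1288, comm: flush-253:1 Not tainted 2.6.32-5-amd64 #1 System Product Name",
]

for report in summarize_oopses(sample):
    print(report["comm"], report["pid"], report["oops"], decode_oops_code(report["oops"]))
```

Run against the excerpts, this shows the same writeback thread (flush-253:1) failing in two different ways: once on a write to a page that is not present, once on an instruction fetch from address zero. To scan a whole log, an open file object (for example for /var/log/kern.log) can be passed to summarize_oopses in place of the sample list.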

At first I suspected a kernel bug in the ext4 file system code, but an extended hardware stress test revealed a defective memory DIMM.

After replacing the DIMM I launched the same rsync process again, and this time it ran through without any crashes.


