I am an egg boiler, too, said the hard drive and caused application issues

Written by - 0 comments

Published on August 8th 2019 - Listed in Hardware Linux


Sometimes you really need to dig deep to find the needle in the haystack. In the past three days we helped on solving an issue in an application API. The only hint we got from monitoring the API was that around every 90 minutes, the API's response time would spike up. 

API response times spikes

The first (obvious) assumption was that this could be caused by regular jobs within the application or triggered by a cron job or similar. But none of the kind were the cause. Over multiple hours we also closely monitored the API access requests - but no increased traffic or weird API calls were logged.

Finally we switched the focus to the secondary (standby) API node. This node is in sync with the primary node but does not receive API traffic; it just sits there IDLE in case of a failover. And this is where we noted the CPU spikes:

CPU spikes

A closer look to the CPU stats revealed long periods of IOWAIT. Interestingly in ~90 minute intervals. Sounds familiar, right? And we also saw that these IOWAITs seemed to start since we noticed issues with the API. 

Once logged in on this physical server (right during such a 30min high IOWAIT period), it became clear that something was not right. Some simple commands took a very long time to complete. For example listing the LVM's logical volumes took an extreme 18 seconds:

[root@server02 ~]# time lvs
  LV      VG              Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_opt  vg_server02 -wi-ao----  10,00g                                                   
  lv_root vg_server02 -wi-ao---- 200,00g                                                   
  lv_swap vg_server02 -wi-ao----  16,00g                                                   

real    0m18.741s
user    0m0.003s
sys    0m0.006s

Compared to the primary server, where the same command finished much faster:

[root@server01 ~]# time lvs
  LV      VG              Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_opt  vg_server01 -wi-ao----  10,00g                                                   
  lv_root vg_server01 -wi-ao---- 200,00g                                                   
  lv_swap vg_server01 -wi-ao----  16,00g                                                   

real    0m0.042s
user    0m0.006s
sys    0m0.001s

Dmesg revealed that a couple of "Kernel hung" events were logged:

INFO: task c1_cred:27410 blocked for more than 120 seconds.
      Not tainted 2.6.32-754.10.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
c1_cred       D 0000000000000001     0 27410  18151 0x00000080
 ffff88101cdabd98 0000000000000086 0000000000000000 ffff88101cdabdc8
 ffff88101cdabe40 0000000000000001 000005eb68734899 ffff88101cdabdb8
 00000001005ed093 000000000000279c ffff881026047ad8 ffff88101cdabfd8
Call Trace:
 [] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
 [] ? autoremove_wake_function+0x0/0x40
 [] jbd2_complete_transaction+0x68/0xb0 [jbd2]
 [] ext4_sync_file+0x121/0x1d0 [ext4]
 [] vfs_fsync_range+0xa6/0x100
 [] vfs_fsync+0x1d/0x20
 [] do_fsync+0x3e/0x60
 [] ? system_call_after_swapgs+0xa2/0x152
 [] sys_fdatasync+0x13/0x20
 [] system_call_fastpath+0x35/0x3a
 [] ? system_call_after_swapgs+0xae/0x152

From this point on we could not rule out a hardware failure of that particular server, but according to the hardware status of the iDRAC card, everything seemed to be normal:

Physical drive status in Dell iDrac

However, by reading the drive's SMART table with the check_smart monitoring plugin, something interesting was revealed:

[root@server02 ~]# /usr/lib/nagios/plugins/check_smart.pl -i megaraid,0 -d /dev/sda
CRITICAL: Disk temperature is higher than maximum|sent_blocks=1974874228 temperature=100;;68

 The drive was at a cooking 100 degrees Celsius (= 212 degrees Fahrenheit)! Compared to this heat, the second drive of the RAID-1 array was at a chilling 30 degrees.

[root@server02 ~]# /usr/lib/nagios/plugins/check_smart.pl -i megaraid,1 -d /dev/sda
OK: no SMART errors detected. |sent_blocks=1830397621 temperature=30;;68

Now the second value (68) indicates the critical temperature when drives stop being operational or reliable. This value is set by the drive vendor. 100 degrees is clearly way above the threshold. I added additional monitoring and graphed the temperature using Icinga 2, Influx and Grafana. Once we stopped the application on that server (and therefore decreasing disk access) the temperature slowly decreased (but then increased again). But still now, while waiting for the replacement drive, and powered on but idle for the last 20 hours, the degrees is again at 81 degrees:

Drive Temperature

[root@server02 ~]# smartctl -d megaraid,0 -a /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.10.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE
Product:              ST9300653SS    
Revision:             YS0A
User Capacity:        300.000.000.000 bytes [300 GB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c50076f80dc7
Serial number:        XXXXXXXX
Device type:          disk
Transport protocol:   SAS
Local Time is:        Wed Aug  7 14:45:02 2019 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     81 C
Drive Trip Temperature:        68 C
Manufactured in week 28 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  47
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1975376908
  Blocks received from initiator = 20663588
  Blocks read from cache and sent to initiator = 197566454
  Number of read and write commands whose size <= segment size = 211742236
  Number of read and write commands whose size > segment size = 64
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 43547,12
  number of minutes until next internal SMART test = 60

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3545575637        0         0  3545575637          0       5066,173           0
write:         0        0         1         1          1     195856,879           0
verify: 1103591886        2         0  1103591888          5      74753,575           1

Non-medium error count:       20

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  32   43539                 - [-   -    -]
# 2  Background short  Completed                  32      26                 - [-   -    -]
# 3  Reserved(7)       Completed                  32      25                 - [-   -    -]
# 4  Background short  Completed                  32      24                 - [-   -    -]

Long (extended) Self Test duration: 2052 seconds [34,2 minutes]

Once the drive was replaced you could see the temperature drop of the first drive:

Temperature drop after drive replaced

The server was taken back into the API cluster, the API continued to run smoothly. The IOWAITs were gone. It was indeed the cooking hard drive which caused these high re-occurring API latencies.

Once again such a case shows that monitoring a single endpoint (in this case the Hardware using Dell iDrac) is never enough. Although, this was not an easy find and honestly nobody would have thought of a cooking hard drive as cause, a SMART disk monitoring from the beginning would have helped reduce the troubleshooting time.


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.