rm: cannot remove due to Input/output error (EXT4-fs error)

Written by Claudio Kuenzler - 1 comments

Published on March 2nd 2013 - Listed in Linux Hardware

On some servers I use rsnapshot as a backup method. It's fast, easy to manage and reliable. But a backup a few days ago returned the following error in the daily backup log:

echo 5190 > /var/run/rsnapshot.pid
/bin/rm -rf /backup/rsnapshot/daily.9/
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/home/mail/web98p1/Maildir/cur/1359872706S=12695,W=12855:2,S': Input/output error
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/home/mail/web98p8/Maildir/cur/1360095843S=4225,W=4321:2,S': Input/output error
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/var/www/web136/files/g2data/cache/entity/3/2': Input/output error
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily

[Backflash] This particular server suffered a disk defect in the past days and the raid recovery had some issues to resynchronize. I removed the defect disk out of the raid array and told the guys in the data center to replace the defect disk. And here's comes the problem: They made a big mistake by making a short SMART test which resulted in no errors and decided, the disk was good and rebooted the server. As the server booted, the Kernel recognized two disks with each a raid configuration on them. But guess what? mdraid tells the kernel that these disks are not in the same raid (because I previously removed the defect disk, remember?).
So by accident, the kernel took the defect disk, created a weird new raid array (called md127) and mounted the filesystems on it while the good disk was just mounted as read-only block device. During a day or so the server was running like this until I realized what happened. I then hoped I could at least resynchronize the defect disk with the good disk, so the new data would be synchronized. But due to too many I/O errors from the bad disk (because the disk is bad, told ya!!) the raid recovery failed at around 20%. So some data was resynchronized, some not. Some probably even contain invalid data as the disk was going to die. After a lot of nightly effort I could restore the server to a more or less stable system, thanks to the help of a very good friend. But one question remained open: What did the failed raid recovery do to the good disk? [/Backflash]

Back to present. My guess is that the input/output errors on that filesystem originate from the failed raid rebuild where some corrupt data might have come from the defect disk. I tried to manually remove the daily.9 folder to see what would happen:

rm -r daily.9
rm: cannot remove `daily.9/localhost/home/mail/web98p1/Maildir/cur/1359872706,S=12695,W=12855:2,S': Input/output error
rm: cannot remove `daily.9/localhost/home/mail/web98p8/Maildir/cur/1360095843,S=4225,W=4321:2,S': Input/output error
rm: cannot remove `daily.9/localhost/var/www/web136/files/g2data/cache/entity/3/2': Input/output error

Same issue. Let's see if dmesg tells us something useful:

dmesg
[116836.420063] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24274320
[116842.656399] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24400033
[116842.574064] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24273584

The "fs error" tells it all. There's some issues on that file system. Next step: File system check.
As this is the backup file system, separately mounted, I can unmount it and make the file system check while the server continues to do its job.

umount /backup

fsck.ext4 /dev/mapper/vg0-backup
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/vg0-backup contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure

Entry '1360095843,S=4225,W=4321:2,S' in /rsnapshot/daily.9/localhost/home/mail/web98p8/Maildir/cur (8674221) has deleted/unused inode 24274320. Clear? yes

Entry '2' in /rsnapshot/daily.9/localhost/var/www/web136/files/g2data/cache/entity/3 (24400027) has deleted/unused inode 24400033. Clear? yes

Entry '1359872706,S=12695,W=12855:2,S' in /rsnapshot/daily.9/localhost/home/mail/web98p1/Maildir/cur (8671256) has deleted/unused inode 24273584. Clear? yes

Heyyy.. these entries look familiar. Don't they? So these files were referenced in an inode which does not exist anymore. To my luck this is an older backup and I can permit myself to clear the inodes from the file system. Well... what choice would I have anyway. The fsck continues:

Pass 3: Checking directory connectivity

Pass 4: Checking reference counts
Inode 939825 ref count is 10, should be 9. Fix? yes
Inode 939826 ref count is 10, should be 9. Fix? yes
Inode 939827 ref count is 10, should be 9. Fix? yes
Inode 939843 ref count is 10, should be 9. Fix? yes

On Pass 4 there came a lot of Inode ref count fix suggestions. Inode ref counts are nothing more than a count of hard links to a particular file. rsnapshot users know that rsnapshot works with hard links to save disk space with daily backups.

Pass 5: Checking group summary information
Block bitmap differences: -(97030728--97030731) -(97033838--97033839) -97529271
Fix? yes

Free blocks count wrong for group #2961 (3626, counted=3632).
Fix? yes

Free blocks count wrong for group #2976 (8930, counted=8931).
Fix? yes

Free blocks count wrong (50341877, counted=50341884).
Fix? yes

Inode bitmap differences: -24273584 -24274320 -24400033
Fix? yes

Free inodes count wrong for group #2963 (2017, counted=2019).
Fix? yes

Free inodes count wrong for group #2978 (3021, counted=3022).
Fix? yes

Directories count wrong for group #2978 (4051, counted=4050).
Fix? yes

Free inodes count wrong (26902779, counted=26902782).
Fix? yes

/dev/mapper/vg0-backup: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg0-backup: 5865218/32768000 files (0.1% non-contiguous), 80730116/131072000 blocks

Pass 5 fixed the count of free inodes and free blocks on the file system. Must also be an artifact from the failed raid recovery. That was the end of the fsck.
Another check to see if everything is alright now:

fsck.ext4 /dev/mapper/vg0-backup
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/vg0-backup: clean, 5865218/32768000 files, 80730116/131072000 blocks

So the file system should now be clean. Falsely referenced inodes were deleted and rsnapshot should do its job correctly now. Let me check the rsnapshot logs:

echo 9385 > /var/run/rsnapshot.pid
/bin/rm -rf /backup/rsnapshot/daily.9/
mv /backup/rsnapshot/daily.8/ /backup/rsnapshot/daily.9/
[...]

Yep, rsnapshot continues its fine job once more. Over and out.

Add a comment

Show form to leave a comment

Comments (newest first)

Jake from wrote on Nov 27th, 2015:

Thank you for this well written walk through!

Blog Tags:

AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PHP Perl Personal PostgreSQL PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder Linux