On some servers I use rsnapshot as a backup method. It's fast, easy to manage, and reliable. But a few days ago a backup run returned the following errors in the daily backup log:
echo 5190 > /var/run/rsnapshot.pid
/bin/rm -rf /backup/rsnapshot/daily.9/
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/home/mail/web98p1/Maildir/cur/1359872706,S=12695,W=12855:2,S': Input/output error
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/home/mail/web98p8/Maildir/cur/1360095843,S=4225,W=4321:2,S': Input/output error
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/var/www/web136/files/g2data/cache/entity/3/2': Input/output error
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
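For context: rsnapshot is driven by cron jobs that call the individual backup levels, and the retention policy lives in /etc/rsnapshot.conf. A minimal sketch of the relevant pieces (illustrative values, not this server's actual configuration; note that rsnapshot.conf wants tabs between the fields):
# /etc/rsnapshot.conf (excerpt; fields must be tab-separated)
snapshot_root   /backup/rsnapshot/
retain  daily   10          # older rsnapshot versions call this "interval"
backup  /home/      localhost/
backup  /var/www/   localhost/
# /etc/cron.d/rsnapshot (example): run the daily level every night
30 3 * * *   root   /usr/bin/rsnapshot daily
The "rm -rf .../daily.9/" line in the log above is simply rsnapshot rotating out the oldest daily snapshot before shifting the remaining ones up.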
[Backflash] This particular server had suffered a disk defect in the days before, and the RAID recovery had trouble resynchronizing. I removed the defective disk from the RAID array and told the guys in the data center to replace it. And here comes the problem: they made a big mistake by running only a short SMART test, which reported no errors, decided the disk was good, and rebooted the server. As the server booted, the kernel recognized two disks, each with a RAID configuration on it. But guess what? mdraid told the kernel that these disks do not belong to the same array (because I had previously removed the defective disk, remember?).
So, by accident, the kernel took the defective disk, assembled a weird new RAID array from it (called md127) and mounted the filesystems on it, while the good disk was just sitting there as a read-only block device. The server ran like this for about a day until I realized what had happened. I then hoped I could at least resynchronize the defective disk with the good disk, so that the newer data would be carried over. But due to too many I/O errors from the bad disk (because the disk is bad, told ya!!) the RAID recovery failed at around 20%. So some data was resynchronized, some was not, and some of it probably even contains invalid data, as the disk was dying. After a lot of late-night effort, and thanks to the help of a very good friend, I was able to restore the server to a more or less stable state. But one question remained open: what did the failed RAID recovery do to the good disk? [/Backflash]
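As a side note, for reference: removing a failing member from an md array is normally done by marking it as failed and then removing it, and /proc/mdstat shows what the kernel has assembled. A sketch with placeholder device names (md0 and sdb1 are examples, not necessarily this server's devices):
cat /proc/mdstat                             # show which arrays the kernel has assembled and their state
mdadm --manage /dev/md0 --fail /dev/sdb1     # mark the defective member as faulty
mdadm --manage /dev/md0 --remove /dev/sdb1   # remove it from the array
mdadm --detail /dev/md0                      # the array should now be listed as degraded
Had the md superblock on the pulled disk been wiped (mdadm --zero-superblock) before it was plugged back in, the kernel would not have auto-assembled it into md127.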
Back to the present. My guess is that the input/output errors on that file system originate from the failed RAID rebuild, where some corrupt data might have come from the defective disk. I tried to manually remove the daily.9 folder to see what would happen:
rm -r daily.9
rm: cannot remove `daily.9/localhost/home/mail/web98p1/Maildir/cur/1359872706,S=12695,W=12855:2,S': Input/output error
rm: cannot remove `daily.9/localhost/home/mail/web98p8/Maildir/cur/1360095843,S=4225,W=4321:2,S': Input/output error
rm: cannot remove `daily.9/localhost/var/www/web136/files/g2data/cache/entity/3/2': Input/output error
Same issue. Let's see if dmesg tells us something useful:
dmesg
[116836.420063] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24274320
[116842.656399] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24400033
[116842.574064] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24273584
The "fs error" tells it all. There's some issues on that file system. Next step: File system check.
As this is the backup file system, mounted separately, I can unmount it and run the file system check while the server continues to do its job.
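Before unmounting, it is worth making sure that no process (a still-running rsnapshot, for example) is holding files open below the mount point; a quick check, as a sketch:
fuser -vm /backup       # lists any processes that still use the /backup file system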
umount /backup
fsck.ext4 /dev/mapper/vg0-backup
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/vg0-backup contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '1360095843,S=4225,W=4321:2,S' in /rsnapshot/daily.9/localhost/home/mail/web98p8/Maildir/cur (8674221) has deleted/unused inode 24274320. Clear
Entry '2' in /rsnapshot/daily.9/localhost/var/www/web136/files/g2data/cache/entity/3 (24400027) has deleted/unused inode 24400033. Clear
Entry '1359872706,S=12695,W=12855:2,S' in /rsnapshot/daily.9/localhost/home/mail/web98p1/Maildir/cur (8671256) has deleted/unused inode 24273584. Clear
Heyyy... these entries look familiar, don't they? So these directory entries pointed to inodes which no longer exist. Luckily this is an older backup, so I can afford to have these entries cleared from the file system. Well... what choice would I have anyway?
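As a side note: if you want to inspect such an inode yourself before letting e2fsck clear the entry, debugfs can do that on the unmounted device. A sketch, using one of the inode numbers reported above:
debugfs -R 'stat <24274320>' /dev/mapper/vg0-backup    # show the inode's metadata (angle brackets mean "by inode number")
debugfs -R 'ncheck 24274320' /dev/mapper/vg0-backup    # list path names that reference this inode
Here, though, it is just an old backup, so on with it. The fsck continues: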
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 939825 ref count is 10, should be 9. Fix
Inode 939826 ref count is 10, should be 9. Fix
Inode 939827 ref count is 10, should be 9. Fix
Inode 939843 ref count is 10, should be 9. Fix
Pass 4 produced a lot of inode ref count fix suggestions. An inode's ref count is nothing more than the number of hard links pointing to a particular file. rsnapshot users know that rsnapshot works with hard links to save disk space across the daily backups.
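As a quick illustration of the relationship between hard links and the link count (throw-away example files, obviously not to be run on the broken backup file system):
echo "hello" > original.txt
ln original.txt hardlink.txt                                # second directory entry pointing to the same inode
stat -c '%h links, inode %i: %n' original.txt hardlink.txt  # both names report 2 links and the same inode number
Back to the fsck output: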
Pass 5: Checking group summary information
Block bitmap differences: -(97030728--97030731) -(97033838--97033839) -97529271
Fix
Free blocks count wrong for group #2961 (3626, counted=3632).
Fix
Free blocks count wrong for group #2976 (8930, counted=8931).
Fix
Free blocks count wrong (50341877, counted=50341884).
Fix
Inode bitmap differences: -24273584 -24274320 -24400033
Fix
Free inodes count wrong for group #2963 (2017, counted=2019).
Fix
Free inodes count wrong for group #2978 (3021, counted=3022).
Fix
Directories count wrong for group #2978 (4051, counted=4050).
Fix
Free inodes count wrong (26902779, counted=26902782).
Fix
/dev/mapper/vg0-backup: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg0-backup: 5865218/32768000 files (0.1% non-contiguous), 80730116/131072000 blocks
Pass 5 fixed the free inode and free block counts on the file system. These discrepancies must also be artifacts of the failed RAID recovery. That was the end of the fsck.
Another check to see if everything is alright now:
fsck.ext4 /dev/mapper/vg0-backup
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/vg0-backup: clean, 5865218/32768000 files, 80730116/131072000 blocks
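One thing to be aware of: without options, e2fsck trusts the clean flag in the superblock, which is why this second run reports "clean" without re-running the five passes. To force a full re-check, and to bring the file system back into service afterwards, something like this does it (assuming /backup has an entry in /etc/fstab):
fsck.ext4 -f /dev/mapper/vg0-backup    # -f forces a complete check even though the fs is marked clean
mount /backup                          # remount it so rsnapshot can write its snapshots again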
So the file system should now be clean. The falsely referenced entries have been cleared, and rsnapshot should do its job correctly again. Let me check the rsnapshot logs:
echo 9385 > /var/run/rsnapshot.pid
/bin/rm -rf /backup/rsnapshot/daily.9/
mv /backup/rsnapshot/daily.8/ /backup/rsnapshot/daily.9/
[...]
Yep, rsnapshot continues its fine job once more. Over and out.