I usually rant about bad hardware. Given my personal "closeness" to physical servers, the failed parts are mostly hard drives, but also solid state drives.
But in fairness, there are also some very good products out there. This post is about one such surprisingly well-behaved SSD.
All our physical hard and solid state drives are constantly being monitored. With the check_smart.pl monitoring plugin, we collect metrics from the SMART attributes of each drive.
In a nice (and fancy!) dashboard we represent the data and we can go back years to see historical statistics. Besides keeping an eye on the typical pre-failure indications (such as Pending Sectors, Reassigned Sectors, etc.), we also keep track of the TBW (Terabytes Written) for SSDs and the Load Cycle Count for HDDs.
The TBW value can (usually) be calculated by taking attribute number 241 (Total LBAs Written) of the SSD and multiplying it by the drive's sector size (usually 512 bytes). On some SSD models, smartctl already does the calculation for you and the resulting attribute is then called Total_Writes_GiB or Host_Writes_GiB (or similar).
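The calculation described above can be sketched in a few lines of Python. This is only a minimal illustration, not what check_smart.pl does internally; the function names are made up for this example, and it assumes that "TB" in a vendor's TBW rating means decimal terabytes (10^12 bytes), while GiB means 2^30 bytes.

```python
def lba_to_terabytes(total_lbas_written: int, sector_size_bytes: int = 512) -> float:
    """Convert a raw 'Total LBAs Written' value (attribute 241) to decimal TB.

    Assumes the raw value counts logical sectors of sector_size_bytes each.
    """
    return total_lbas_written * sector_size_bytes / 10**12


def gib_to_terabytes(gib_written: int) -> float:
    """Convert a raw value that is already reported in GiB to decimal TB."""
    return gib_written * 2**30 / 10**12


if __name__ == "__main__":
    # Hypothetical drive reporting 2 billion LBAs written with 512-byte sectors
    print(f"{lba_to_terabytes(2_000_000_000):.3f} TB")
    # Raw Total_Writes_GiB value from the smartctl output at the end of this post
    print(f"{gib_to_terabytes(304778):.1f} TB")
```

Because a GiB is roughly 7% larger than a GB, a raw value in GiB translates to a slightly higher number in decimal TB. Reading the GiB value directly as "GB" therefore underestimates the data written, if anything.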
The screenshot above shows a particular SSD in a server that has reached over 300K "Total_Writes_GiB", which corresponds to a TBW value of over 300 TB.
Depending on the workload and purpose of a server, this is an impressive value. Even more so as there were never any faults or defective sectors on this SSD.
In general the warranty of SSDs is based on two different approaches:
Most SSDs are sold with a 5-year warranty nowadays. If the drive dies within this 5-year period (starting on the purchase date), you should be able to receive a free replacement under warranty.
On the other hand, if the TBW limit is reached within this 5-year warranty window, the warranty is void. Meaning: if you have an extremely write-heavy workload and you've already reached the TBW limit of the SSD model within one year, the whole warranty is void.
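To make the "whichever comes first" logic of the two approaches explicit, here is a small sketch. The function name, its parameters and the example values are hypothetical; real warranty terms differ per vendor and model, so treat this as an illustration of the rule described above and nothing more.

```python
from datetime import date


def warranty_valid(purchase: date, today: date, tb_written: float,
                   warranty_years: int, tbw_limit: float) -> bool:
    """Return True only if BOTH warranty conditions still hold.

    The warranty ends when the time window has passed or when the TBW
    limit has been reached, whichever happens first.
    """
    try:
        expiry = purchase.replace(year=purchase.year + warranty_years)
    except ValueError:
        # Purchased on Feb 29 and the expiry year is not a leap year
        expiry = purchase.replace(year=purchase.year + warranty_years, day=28)
    within_time_window = today <= expiry
    below_tbw_limit = tb_written < tbw_limit
    return within_time_window and below_tbw_limit


if __name__ == "__main__":
    # Hypothetical drive: one year old, but the write limit is already reached
    print(warranty_valid(date(2024, 6, 1), date(2025, 5, 23), 400.0, 5, 400.0))
    # Hypothetical drive: three years old with moderate writes
    print(warranty_valid(date(2022, 6, 1), date(2025, 5, 23), 120.0, 5, 400.0))
```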
I've created an SSD warranty overview for some very widely used SSDs on GitHub.
Hard and solid state drive vendors are very prudent about warranty. Replacing products is costly and damages a vendor's reputation. Therefore you can assume that the TBW warranty threshold is the last known "point of safe operation". Anything beyond it cannot be guaranteed.
For our SSD in question, we've just crossed the 300 TBW value. According to product specs and the SSD warranty overview mentioned above, the endurance of this SSD model is 400 TBW.
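For a rough feeling of how close a drive is to its rated endurance, the numbers can be put into a small helper. This is a sketch under strong assumptions: the function name is made up, and the time estimate assumes that the drive keeps writing at its lifetime average rate (total writes divided by Power_On_Hours), which a real workload will not do exactly.

```python
def endurance_report(tb_written: float, tbw_rating: float, power_on_hours: int) -> dict:
    """Summarise used and remaining endurance relative to the TBW rating.

    hours_left is a linear extrapolation from the lifetime average write
    rate, not a prediction of when the drive will actually fail.
    """
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    used_percent = tb_written / tbw_rating * 100
    remaining_tb = max(tbw_rating - tb_written, 0.0)
    avg_tb_per_hour = tb_written / power_on_hours
    hours_left = remaining_tb / avg_tb_per_hour if avg_tb_per_hour > 0 else float("inf")
    return {
        "used_percent": used_percent,
        "remaining_tb": remaining_tb,
        "hours_left": hours_left,
    }


if __name__ == "__main__":
    # 300 TB written, 400 TBW rating, Power_On_Hours from the smartctl output
    report = endurance_report(300.0, 400.0, 53669)
    print(f"{report['used_percent']:.1f}% of rated endurance used")
    print(f"{report['remaining_tb']:.0f} TB left until the TBW rating")
    print(f"about {report['hours_left'] / 24:.0f} days at the average write rate")
```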
Better safe than sorry is our advice in such a case: replace the drive. Especially with today's very low SSD prices, there shouldn't be much thinking about it. We've now replaced this SSD, which had been running in this server since 2018.
A good performance deserves praise. Therefore: Thank you, SanDisk Ultra 3D SSD (link to Amazon)!
You deserve your retirement for the remaining few TBW in a lab or test server :-).
For reference, here's the last smartctl output of this SSD:
ck@mint ~ $ sudo smartctl -a /dev/sdd -q noserial
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-60-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Marvell based SanDisk SSDs
Device Model: SanDisk SDSSDH31000G
Firmware Version: X61170RL
User Capacity: 1’000’204’886’016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri May 23 08:54:09 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 --- Old_age Always - 53669
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 11
165 Total_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 41496533834574
166 Min_W/E_Cycle 0x0032 100 100 --- Old_age Always - 587
167 Min_Bad_Block/Die 0x0032 100 100 --- Old_age Always - 43
168 Maximum_Erase_Cycle 0x0032 100 100 --- Old_age Always - 1669
169 Total_Bad_Block 0x0032 100 100 --- Old_age Always - 763
170 Unknown_Marvell_Attr 0x0032 100 100 --- Old_age Always - 0
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Avg_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 824
174 Unexpect_Power_Loss_Ct 0x0032 100 100 --- Old_age Always - 9
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 0
194 Temperature_Celsius 0x0022 078 040 --- Old_age Always - 22 (Min/Max 18/40)
199 SATA_CRC_Error 0x0032 100 100 --- Old_age Always - 0
230 Perc_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 20275 21032 21032
232 Perc_Avail_Resrvd_Space 0x0033 100 100 004 Pre-fail Always - 100
233 Total_NAND_Writes_GiB 0x0032 100 100 --- Old_age Always - 827149
234 Perc_Write/Erase_Ct_BC 0x0032 100 100 --- Old_age Always - 964744
241 Total_Writes_GiB 0x0030 253 253 --- Old_age Offline - 304778
242 Total_Reads_GiB 0x0030 253 253 --- Old_age Offline - 110787
244 Thermal_Throttle 0x0032 000 100 --- Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged