













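Hard disk drive failure can be catastrophic or gradual. A catastrophic failure typically presents as a drive that the motherboard BIOS can no longer detect, or that fails the POST, so that the operating system never sees the drive at all. Gradual failure is harder to diagnose, because its symptoms, such as occasional data corruption or a slowing computer (caused by failing sectors that need repeated read attempts), do not point unambiguously to the drive and can have many other causes, such as malware. A rising number of bad sectors is a sign that a drive may be failing. However, because the drive automatically adds bad sectors to its own remapping table,[4] they may not be evident to checking utilities such as ScanDisk unless the utility catches them before the drive's own defect management does. Once the spare sectors held in reserve by the drive's internal defect management system run out, the drive fails outright. A repetitive pattern of seek activity, such as recurring rapid or slower seek-to-end noises (clicking), can indicate that the drive has a problem.[5]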
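The remapping behaviour described above can be illustrated with a toy model. This is only a sketch and does not reflect how any real firmware works: the class `ToyDrive`, its method names, and the size of the spare pool are all hypothetical, chosen to show why bad sectors stay invisible to host-side tools until the reserve is used up.

```python
class ToyDrive:
    """Toy model of a drive's internal defect management (hypothetical)."""

    def __init__(self, spare_sectors):
        self.spare_sectors = spare_sectors  # size of the reserve pool (assumed)
        self.remapped = set()      # bad sectors hidden behind a spare
        self.visible_bad = set()   # bad sectors the host can actually see

    def sector_goes_bad(self, lba):
        """Remap a failing sector if a spare is left, else expose it."""
        if lba in self.remapped or lba in self.visible_bad:
            return  # already handled
        if len(self.remapped) < self.spare_sectors:
            self.remapped.add(lba)     # silently remapped
        else:
            self.visible_bad.add(lba)  # reserve exhausted

    def spares_remaining(self):
        return self.spare_sectors - len(self.remapped)

    def host_scan(self):
        """What a host-side surface scan would report in this toy model."""
        return sorted(self.visible_bad)


drive = ToyDrive(spare_sectors=2)
for lba in (10, 20, 30):
    drive.sector_goes_bad(lba)

print(drive.spares_remaining())  # 0
print(drive.host_scan())         # [30]
```

In this model the first two defects never reach the host; only the third, which arrives after the spares are gone, shows up in a scan.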













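To further improve shock resistance, IBM's ThinkPad laptop line introduced hard drives equipped with an "Active Protection System". When the computer's built-in accelerometer detects sudden, violent motion, the drive automatically unloads its heads to reduce the risk of data loss and of scratching the platters. Apple later introduced a similar technology, called the Sudden Motion Sensor, for its PowerBook, iBook, MacBook Pro, and MacBook lines. Sony,[10] HP (with its "HP 3D DriveGuard"),[11] and Toshiba,[12] among other major vendors, later adopted similar technology in their laptop lines.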
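The decision such a system makes can be sketched as a simple threshold test on accelerometer readings. The vendors' actual algorithms are proprietary and are not described here; the function name `should_park_heads` and both thresholds below are assumptions made purely for illustration. Readings are in units of g, so a laptop at rest reports a magnitude of about 1.

```python
import math

FREE_FALL_G = 0.3  # assumed: a magnitude this low suggests the laptop is falling
SHOCK_G = 2.5      # assumed: a magnitude this high suggests a sharp jolt


def should_park_heads(ax, ay, az):
    """Return True if the reading looks like free fall or a jolt.

    ax, ay, az are accelerometer readings in g along three axes.
    """
    magnitude = math.sqrt(ax * ax + ay * ay + az * az)
    return magnitude < FREE_FALL_G or magnitude > SHOCK_G


print(should_park_heads(0.0, 0.0, 1.0))   # at rest on a desk
print(should_park_heads(0.0, 0.0, 0.05))  # dropping
print(should_park_heads(3.0, 0.0, 1.0))   # knocked sideways
```

A real implementation would also have to debounce the signal and decide when it is safe to reload the heads; both are left out of this sketch.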






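  • Head crash: an external impact or similar cause brings a head into contact with the platter, causing irreversible mechanical damage and data loss in the contact area. In the worst case, debris thrown off the damaged area contaminates the heads and the entire platter surface, destroying the drive completely. Even if the damage is initially local, the damaged area keeps growing as the drive continues to run, until the drive is unusable.[14]
  • Bad sectors: the failure of some sectors does not necessarily make the whole drive inaccessible. The appearance of bad sectors is nonetheless a sign of impending failure: once a single bad sector appears, the probability that the drive will fail completely soon afterwards is much higher.
  • Stiction: the heads stick to the platter and the drive cannot start. Besides wear, this can have many causes, such as improper lubrication of the platter surface, design errors, or manufacturing defects. Some early drive designs were prone to the problem, which was not resolved until the early 1990s.
  • Circuit failure: the controller board or other electronics in the drive are damaged, making the drive inaccessible. This is typically the result of user error, such as electrostatic discharge.
  • Bearing and motor failure: the motor fails or burns out, or the bearings wear excessively, so that the drive cannot operate normally. Modern drives generally use fluid dynamic bearings (FDB), so this problem is no longer very common.[15]
  • Mechanical failure: mechanical components inside the drive, especially moving parts, break or are damaged. The resulting debris can cause further damage.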



Most major hard disk and motherboard vendors support S.M.A.R.T., which measures drive characteristics such as operating temperature, spin-up time, and data error rates. Certain trends and sudden changes in these parameters are thought to be associated with an increased likelihood of drive failure and data loss. However, S.M.A.R.T. parameters alone may not be useful for predicting individual drive failures.[16] While several S.M.A.R.T. parameters affect failure probability, a large fraction of failed drives do not produce predictive S.M.A.R.T. parameters.[16] Unpredictable breakdown may occur at any time in normal use, with potential loss of all data. Recovery of some or even all data from a damaged drive is sometimes, but not always, possible, and is normally costly.

A 2007 study published by Google suggested very little correlation between failure rates and either high temperature or activity level. Indeed, the Google study indicated that "one of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels".[17] Hard drives with S.M.A.R.T.-reported average temperatures below 27 °C (81 °F) had higher failure rates than hard drives with the highest reported average temperature of 50 °C (122 °F), with failure rates at least twice as high as those in the optimum S.M.A.R.T.-reported temperature range of 36 °C (97 °F) to 47 °C (117 °F).[16] The correlation between manufacturers, models and the failure rate was relatively strong. Most organizations keep statistics on this matter highly secret; Google did not relate manufacturers' names to failure rates,[16] though it has been revealed that Google uses Hitachi Deskstar drives in some of its servers.[18]

Google's 2007 study found, based on a large field sample of drives, that actual annualized failure rates (AFRs) for individual drives ranged from 1.7% for first-year drives to over 8.6% for three-year-old drives.[19] A similar 2007 study at CMU on enterprise drives showed that measured MTBF was 3–4 times lower than the manufacturer's specification, with an estimated 3% mean AFR over 1–5 years based on replacement logs for a large sample of drives, and that hard drive failures were highly correlated in time.[20]
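One common way to compute an AFR from field data is to divide the number of failures by the total drive-years of service. The sketch below illustrates that arithmetic only; it is not necessarily the method used in either study, and the fleet figures in the example are hypothetical.

```python
def annualized_failure_rate(failures, drive_days):
    """Return the AFR as a percentage.

    failures   -- number of drives that failed during the observation window
    drive_days -- total days of service summed over all drives observed
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


# Hypothetical fleet: 1,000 drives, each observed for a full year, 17 failures.
afr = annualized_failure_rate(failures=17, drive_days=1000 * 365)
print(f"{afr:.1f}%")  # 1.7%
```

Counting drive-days rather than drives means that units added or retired partway through the window contribute only the time they were actually in service.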

A 2007 study of latent sector errors (as opposed to the above studies of complete disk failures) showed that 3.45% of 1.5 million disks developed latent sector errors over 32 months (3.15% of nearline disks and 1.46% of enterprise-class disks developed at least one latent sector error within twelve months of their ship date), with the annual sector error rate increasing between the first and second years. Enterprise drives showed fewer sector errors than consumer drives. Background scrubbing was found to be effective in correcting these errors.[21]

SCSI, SAS, and FC drives are more expensive than consumer-grade SATA drives and are usually used in servers and disk arrays, whereas SATA drives were sold to the home computer, desktop, and near-line storage markets and were perceived to be less reliable. This distinction is now becoming blurred.

The mean time between failures (MTBF) of SATA drives is usually specified to be about 1 million hours (some drives, such as the Western Digital Raptor, are rated at 1.4 million hours MTBF),[22] while SAS/FC drives are rated for upwards of 1.6 million hours.[23] Modern helium-filled drives are completely sealed without a breather port, thus eliminating the risk of debris ingression, resulting in a typical MTBF of 2.5 million hours. However, independent research indicates that MTBF is not a reliable estimate of a drive's longevity (service life).[24] MTBF testing is conducted in laboratory environments in test chambers and is an important metric for determining the quality of a disk drive, but it is designed to measure only the relatively constant failure rate over the service life of the drive (the middle of the "bathtub curve") before the final wear-out phase.[20][25][26] A more interpretable, but equivalent, metric to MTBF is the annualized failure rate (AFR). AFR is the percentage of drive failures expected per year. Both AFR and MTBF tend to measure reliability only in the initial part of the life of a hard disk drive, thereby understating the real probability of failure of a used drive.[27]
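The equivalence between MTBF and AFR can be made concrete under the usual constant-failure-rate (exponential) assumption, which matches the flat middle of the bathtub curve. The sketch below is an illustration under that assumption and for a drive powered on continuously (8,760 hours per year); the function names are chosen here for the example.

```python
import math

HOURS_PER_YEAR = 8760  # assumes the drive is powered on all year


def mtbf_to_afr(mtbf_hours):
    """AFR (as a fraction) implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def afr_to_mtbf(afr):
    """MTBF in hours implied by an AFR (as a fraction), same assumption."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


print(f"{mtbf_to_afr(1_000_000):.2%}")  # AFR implied by a 1-million-hour rating
print(f"{afr_to_mtbf(0.03):,.0f} h")    # MTBF implied by a 3% AFR
```

Under these assumptions, a rated MTBF of 1 million hours corresponds to an AFR a little under 0.9%, while a 3% AFR corresponds to an MTBF of roughly 290,000 hours, about 3.5 times lower than the rating.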

The cloud storage company Backblaze produces an annual report on hard drive reliability. However, the company states that it mainly uses commodity consumer drives, which are deployed in enterprise conditions rather than in their representative conditions and for their intended use. Consumer drives are also not tested to work with enterprise RAID cards of the kind used in a datacenter, and may not respond in the time a RAID controller expects; such drives will be identified as having failed when they have not.[28] The results of tests of this kind may be relevant or irrelevant to different users, since they accurately represent the performance of consumer drives in the enterprise or under extreme stress, but may not accurately represent their performance in normal or intended use.[29]


  1. IBM 3380 DASD, ca. 1984[30]
  2. Computer Memories Inc. 20 MB HDD for PC/AT, ca. 1985[31]
  3. Fujitsu MPG3 and MPF3 series, ca. 2002[32]
  4. IBM Deskstar 75GXP, ca. 2001[33]
  5. Seagate ST3000DM001, ca. 2012[34]



In order to avoid the loss of data due to disk failure, common solutions include:

  • Data backup, to allow restoration of data after a failure
  • Data scrubbing, to detect and repair latent corruption
  • Data redundancy, to allow systems to tolerate failures of individual drives
  • Active hard-drive protection, to protect laptop drives from external mechanical forces
  • S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) included in hard drives, to provide early warning of predictable failure modes
  • Base isolation used under server racks in data centers
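Scrubbing and redundancy work together: a scrub pass needs a good copy to repair from. The following is a minimal in-memory sketch, assuming two mirrored copies of each block and a CRC-32 checksum recorded when the block was written. The function `scrub` and this data layout are hypothetical and do not correspond to any particular RAID or filesystem implementation.

```python
import zlib


def scrub(primary, mirror, checksums):
    """Verify both copies of every block and repair a bad copy from a good one.

    primary, mirror -- lists of bytes objects holding the two copies
    checksums       -- CRC-32 of each block as recorded at write time
    Returns the indices of blocks that were repaired.
    """
    repaired = []
    for i, expected in enumerate(checksums):
        primary_ok = zlib.crc32(primary[i]) == expected
        mirror_ok = zlib.crc32(mirror[i]) == expected
        if primary_ok and not mirror_ok:
            mirror[i] = primary[i]
            repaired.append(i)
        elif mirror_ok and not primary_ok:
            primary[i] = mirror[i]
            repaired.append(i)
        # If both copies are bad, scrubbing cannot help; a backup is needed.
    return repaired


blocks = [b"alpha", b"beta", b"gamma"]
primary, mirror = list(blocks), list(blocks)
checksums = [zlib.crc32(b) for b in blocks]

primary[1] = b"bexa"                      # simulate latent corruption
print(scrub(primary, mirror, checksums))  # [1]
print(primary[1])                         # b'beta'
```

The case where both copies fail verification is deliberately left unrepaired, which is why backups appear in the list alongside scrubbing and redundancy.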



Data from a failed drive can sometimes be partially or totally recovered if the platters' magnetic coating is not totally destroyed. Specialized companies carry out data recovery, at significant cost. It may be possible to recover data by opening the drives in a clean room and using appropriate equipment to replace or revitalize failed components.[35] If the electronics have failed, it is sometimes possible to replace the electronics board, though often drives of nominally exactly the same model manufactured at different times have different circuit boards that are incompatible. Moreover, electronics boards of modern drives usually contain drive-specific adaptation data required for accessing their system areas, so the related componentry needs to be either reprogrammed (if possible) or unsoldered and transferred between two electronics boards.[36][37][38]

Sometimes operation can be restored for long enough to recover data, perhaps requiring reconstruction techniques such as file carving. Risky techniques may be justifiable if the drive is otherwise dead. A drive that has been started once may run for a short or long time and then never start again, so as much data as possible should be recovered as soon as the drive starts.


