Draft: Hard disk drive failure
A hard disk drive failure occurs when a hard disk drive malfunctions and the stored information can no longer be accessed by the computer.
A drive may fail in the course of normal operation, or because of external causes such as fire, water, strong magnetic fields, impact, or contamination (which can lead to a head crash).
In addition, data corruption, damage to the master boot record (MBR), or malware, while not failures of the drive itself, also present as the computer being unable to access the drive normally.
Causes
Hard drives can fail for many reasons: human error, hardware damage, firmware corruption, media damage, heat, water damage, power problems, or simply chance[1]. Manufacturers typically specify a mean time between failures (MTBF) or an annualized failure rate (AFR); these are population statistics and cannot predict the failure of any individual drive[2]. They are calculated by running a sample of drives continuously for a short period, analysing the wear of their physical components, and extrapolating a reasonable estimate of service life. Hard drive failures tend to follow the bathtub curve[3]: if there is a defect introduced in manufacturing, failures should appear within a short time. A drive that proves reliable during its first few months of use is much more likely to remain reliable. Even after years of heavy use, a drive may show no obvious signs of wear, yet it can still fail suddenly at any time.
The most common immediate cause of drive failure is a head crash. The read/write heads normally fly just above the platter surface; if a head touches the platter or scratches the magnetic surface on which data are stored, severe data loss results. Because the inside of the drive is damaged in such a case, data recovery must be performed by specialists with appropriate equipment, or further damage may be caused. Platters are coated with an extremely thin layer of non-electrostatic lubricant, so the head may simply glance off the surface in a collision. However, since the heads normally fly only a few nanometres above the platter surface, a head crash remains a well-known risk.
Another possible cause of failure is a faulty air filter. Modern drives have air filters that equalize pressure and humidity between the drive enclosure and its outside environment. If the filter fails to capture airborne particles, dust can settle on the platter and cause a head crash as the head sweeps over it. After a crash, particles thrown off the damaged platter and head can in turn create bad sectors. These, together with the damage to the platter itself, can quickly render the drive unusable.
Besides the platters, a hard drive also contains electronics such as the controller, and these occasionally fail as well. In such cases, however, replacing the controller board is usually enough to recover all the data.
Signs of failure
Drive failure may be catastrophic or gradual. A catastrophic failure shows up as the drive not being detected by the motherboard BIOS or failing the POST self-test, leaving the operating system entirely unaware of the drive. Gradual failure is harder to diagnose, because its symptoms, such as occasional data corruption or the computer slowing down (caused by repeated read retries on bad sectors), do not point clearly to the drive and may have many other causes, such as malware. A growing number of bad sectors is a sign that a drive may be failing. However, because the drive automatically adds bad sectors to its own remap table,[4] these signs are not obvious to checking programs such as ScanDisk; they can only be exposed by utilities that catch the defects before the drive's own defect management does. Once the spare sectors reserved by the drive's internal defect management are exhausted, complete failure follows. Repetitive patterns of head seeking, such as recurring fast or slower seek-ending noises (clicking), can indicate a problem with the drive.[5]
These failure symptoms are not limited to hard drives; they also apply to other types of magnetic media. The 100 MB "Zip disks" used in Iomega's Zip drives of the late 1990s were affected by the "click of death", so called because the drive clicked endlessly when a disk failed. 3.5-inch floppy disks can fail in a similar way: if the drive or the medium is contaminated, users may encounter the "buzz of death" when trying to access the drive.
Head parking technology
During normal operation the heads fly above the platters. To prevent the heads from coming down directly on the data area when power is lost or another fault occurs, modern drives "land" or "unload" the heads. Contact start/stop drives park the heads on an area of the platter not used for data, which is called "landing". Drives with ramp load/unload technology move the heads onto a ramp outside the platters and lock them mechanically away from the platters, which is called "unloading". Some early drives could not land the heads safely when power was suddenly lost, so the heads could come down on the data area by mistake; some other early drives required the user to park the heads manually.
Contact start/stop
Contact start/stop drives have a data-free zone near the centre of the platter called the "landing zone". In modern designs, when power is lost the spindle motor briefly acts as a generator, using the inertia of the platters to power the head actuator and sweep the heads onto the landing zone; earlier designs relied on springs instead.
A spring on the head arm pushes the head slider toward the platter; once the platters spin up, the head is supported by an air bearing and neither touches nor wears the platter. The sliders in contact start/stop drives are designed to tolerate repeated contact with the platter surface, but microscopic wear accumulates over time and eventually causes damage. Most manufacturers design the sliders to survive at least 50,000 start/stop cycles before the failure rate exceeds 50%. Wear is not linear, however: because the slider of a well-used drive drags along the platter for a moment before the air bearing forms, each start-up carries a higher risk of damage in an old drive than in a new one. Manufacturers generally publish reliability figures after testing; the Seagate Barracuda 7200.10 series, for example, is rated for 50,000 start/stop cycles, meaning that no failures attributable to head-surface contact were seen during testing of at least 50,000 cycles.[6]
Around 1995, IBM pioneered a technique that treats the landing zone with a laser zone texture (LZT) process, machining a nanometre-scale textured surface into the landing zone[7] to improve stiction and wear performance. The technique is still in use today, though now mostly in low-capacity Seagate desktop drives[8]; in small form-factor (2.5-inch), high-capacity, NAS, and enterprise drives it has gradually been displaced by ramp load/unload technology. In general, contact start/stop drives are more affected by their environment: high humidity, for example, can cause the heads to stick to the platter, and the resulting friction can physically damage the platter, the slider, and the motor.
Ramp load/unload technology
Load/unload technology lifts the heads off the platters and moves them to a safe area, reducing wear while also avoiding the stiction risk of contact start/stop. RAMAC, the world's first hard drive, and most early drives of its era used a similar approach, although at the time it required a very complex mechanism. Modern drives instead use the ramp load technique invented by Memorex in 1967[9]: a plastic ramp sits beside the platters, and when the drive is idle the heads ride up the ramp into a parked position, a process called "unloading". At first only small form-factor laptop drives adopted it for shock resistance; it was later widely used in most desktop drives.
To further improve shock resistance, IBM introduced drives with an "Active Protection System" for its ThinkPad laptop line: when an accelerometer in the computer detects sudden, violent motion, the drive unloads its heads automatically to reduce the risk of data loss and platter scratches. Apple later introduced a similar feature, the Sudden Motion Sensor, for its PowerBook, iBook, MacBook Pro, and MacBook lines. Sony,[10] HP with its "HP 3D DriveGuard",[11] Toshiba,[12] and other major manufacturers subsequently applied similar technology in their laptop lines.
Failure modes
A hard drive can fail in several ways: suddenly, through gradual deterioration, or in a self-limiting fashion. The failure may cause total or partial data loss, or have no effect at all.
Early hard drives commonly had bad sectors when they left the factory and developed more in use; as long as large numbers did not appear suddenly over a short period, this was considered normal at the time. Such sectors could be masked by "remapping" so that the drive kept working. Some early drives even shipped with a printed defect table, from which the user performed the remapping manually[13]. Later drives remap bad sectors automatically, without user intervention. A remapped drive remains usable, but performance suffers because the heads must move to the remapped sector whenever a bad one is accessed. S.M.A.R.T. provides logs and statistics about remapping. Modern drives leave the factory with defects already masked and a reallocation count of zero, so any increase in reallocated sectors can be an early sign of impending failure.
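On systems with the smartmontools package, the reallocated-sector count can be read with `smartctl`. The snippet below is a minimal sketch rather than a definitive tool: it assumes `smartctl` is installed, uses the hypothetical device path `/dev/sda`, and relies on the common attribute name `Reallocated_Sector_Ct`, which can vary between vendors.

```python
# Minimal sketch: read the reallocated-sector count via smartmontools.
# Assumes smartctl is installed and /dev/sda is the drive to check
# (hypothetical device path); run with sufficient privileges.
import subprocess

def reallocated_sector_count(device="/dev/sda"):
    out = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False
    ).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            # The raw value is the last column of the attribute row.
            return int(line.split()[-1])
    return None  # attribute not reported by this drive

count = reallocated_sector_count()
if count is None:
    print("Reallocated-sector attribute not reported.")
elif count > 0:
    print(f"{count} reallocated sectors: possible sign of impending failure.")
else:
    print("No reallocated sectors reported.")
```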
Other kinds of failure may be gradual or self-limiting. In any case, once such symptoms appear the drive should be replaced promptly; the risk of losing data usually far outweighs the money saved by not replacing it. Possible symptoms include repeated read/write errors, unusual noise, and excessive heat.
- Head crash: an impact or other external cause brings a head into contact with the platter, causing irreversible mechanical damage and data loss in the contact area. In the worst case, debris thrown from the contact area contaminates the heads and the entire platter surface, destroying the drive. Even if the damage is initially local, it spreads as the drive keeps running until the drive is completely unusable.[14]
- Bad sectors: some sectors may fail without rendering the whole drive inaccessible. The appearance of bad sectors is a warning sign: once even one appears, the drive is considerably more likely to fail completely in the near future.
- Stiction: the heads stick to the platter and the drive cannot spin up. Besides wear, this can have many causes, including improper platter lubrication, poor design, and manufacturing defects. Some early drives were prone to it by design; the problem was not resolved until the early 1990s.
- Circuit failure: the drive's electronics, such as the controller board, are damaged and the drive becomes inaccessible, usually through user error such as electrostatic discharge.
- Bearing and motor failure: the motor fails or burns out, or the bearings wear out, and the drive can no longer operate properly. Modern drives generally use fluid dynamic bearings (FDB), so this has become less common.[15]
- Mechanical failure: mechanical components inside the drive, especially moving ones, break or are damaged; the resulting debris can spread the damage further.
Failure metrics
Most major hard disk and motherboard vendors support S.M.A.R.T., which measures drive characteristics such as operating temperature, spin-up time, data error rates, etc. Certain trends and sudden changes in these parameters are thought to be associated with increased likelihood of drive failure and data loss. However, S.M.A.R.T. parameters alone may not be useful for predicting individual drive failures.[16] While several S.M.A.R.T. parameters affect failure probability, a large fraction of failed drives do not produce predictive S.M.A.R.T. parameters.[16] Unpredictable breakdown may occur at any time in normal use, with potential loss of all data. Recovery of some or even all data from a damaged drive is sometimes, but not always, possible, and is normally costly.
A 2007 study published by Google suggested very little correlation between failure rates and either high temperature or activity level. Indeed, the Google study indicated that "one of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels."[17] Hard drives with S.M.A.R.T.-reported average temperatures below 27 °C (81 °F) had higher failure rates than hard drives with the highest reported average temperature of 50 °C (122 °F), failure rates at least twice as high as the optimum S.M.A.R.T.-reported temperature range of 36 °C (97 °F) to 47 °C (117 °F).[16] The correlation between manufacturers, models and the failure rate was relatively strong. Statistics in this matter are kept highly secret by most entities; Google did not relate manufacturers' names with failure rates,[16] though it has been revealed that Google uses Hitachi Deskstar drives in some of its servers.[18]
Google's 2007 study found, based on a large field sample of drives, that actual annualized failure rates (AFRs) for individual drives ranged from 1.7% for first year drives to over 8.6% for three-year-old drives.[19] A similar 2007 study at CMU on enterprise drives showed that measured MTBF was 3–4 times lower than the manufacturer's specification, with an estimated 3% mean AFR over 1–5 years based on replacement logs for a large sample of drives, and that hard drive failures were highly correlated in time.[20]
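Field AFR figures such as these are derived from simple exposure counting: the number of failures observed divided by the accumulated drive-years of service. The following is a small illustrative calculation with made-up numbers, not the studies' raw data:

```python
# Illustrative annualized failure rate (AFR) from field data.
# The counts below are made up for the example, not taken from any study.
failures = 172        # drives that failed during the observation window
drive_years = 10_000  # accumulated exposure: sum of each drive's time in service, in years

afr = failures / drive_years
print(f"Annualized failure rate: {afr:.1%}")  # -> 1.7%
```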
A 2007 study of latent sector errors (as opposed to the above studies of complete disk failures) showed that 3.45% of 1.5 million disks developed latent sector errors over 32 months (3.15% of nearline disks and 1.46% of enterprise-class disks developed at least one latent sector error within twelve months of their ship date), with the annual sector error rate increasing between the first and second years. Enterprise drives showed fewer sector errors than consumer drives. Background scrubbing was found to be effective in correcting these errors.[21]
SCSI, SAS, and FC drives are more expensive than consumer-grade SATA drives and are usually used in servers and disk arrays, whereas SATA drives were sold to the home computer, desktop, and near-line storage markets and were perceived to be less reliable. This distinction is now becoming blurred.
The mean time between failures (MTBF) of SATA drives is usually specified to be about 1 million hours (some drives, such as the Western Digital Raptor, are rated at 1.4 million hours MTBF),[22] while SAS/FC drives are rated for upwards of 1.6 million hours.[23] Modern helium-filled drives are completely sealed without a breather port, thus eliminating the risk of debris ingression, resulting in a typical MTBF of 2.5 million hours. However, independent research indicates that MTBF is not a reliable estimate of a drive's longevity (service life).[24] MTBF testing is conducted in laboratory test chambers and is an important metric for determining the quality of a disk drive, but it is designed to measure only the relatively constant failure rate over the service life of the drive (the middle of the "bathtub curve") before the final wear-out phase.[20][25][26] A more interpretable, but equivalent, metric to MTBF is annualized failure rate (AFR). AFR is the percentage of drive failures expected per year. Both AFR and MTBF tend to measure reliability only in the initial part of the life of a hard disk drive, thereby understating the real probability of failure of a used drive.[27]
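Under the constant-failure-rate (exponential) assumption that MTBF ratings rely on, MTBF and AFR are interchangeable. The sketch below converts the rated MTBF figures quoted above into the corresponding AFR; the 8,766 hours-per-year constant and the exponential model are standard assumptions rather than anything specified by the manufacturers.

```python
# Convert an MTBF rating into an annualized failure rate (AFR), assuming a
# constant failure rate (exponential model), which is how MTBF is defined.
import math

HOURS_PER_YEAR = 8766  # average year length in hours

def afr_from_mtbf(mtbf_hours: float) -> float:
    # Probability that a drive fails within one year of continuous operation.
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (1_000_000, 1_400_000, 1_600_000, 2_500_000):
    print(f"MTBF {mtbf:>9,} h -> AFR {afr_from_mtbf(mtbf):.2%}")
# 1,000,000 h gives roughly 0.87%; 2,500,000 h roughly 0.35%
```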
The cloud storage company Backblaze produces an annual report on hard drive reliability. However, the company states that it mainly uses commodity consumer drives, which are deployed in enterprise conditions rather than in their representative conditions and for their intended use. Consumer drives are also not tested to work with enterprise RAID cards of the kind used in a datacenter and may not respond in the time a RAID controller expects; such drives may be marked as having failed when they have not.[28] The results of tests of this kind may be relevant or irrelevant to different users, since they accurately represent the performance of consumer drives in the enterprise or under extreme stress, but may not accurately represent their performance in normal or intended use.[29]
Drives with known high failure rates
- IBM 3380 DASD, c. 1984[30]
- Computer Memories Inc. 20 MB HDD for PC/AT, c. 1985[31]
- Fujitsu MPG3 and MPF3 series, c. 2002[32]
- IBM Deskstar 75GXP, c. 2001[33]
- Seagate ST3000DM001, c. 2012[34]
Prevention
In order to avoid the loss of data due to disk failure, common solutions include:
- Data backup, to allow restoration of data after a failure
- Data scrubbing, to detect and repair latent corruption (a minimal checksum-based sketch follows this list)
- Data redundancy, to allow systems to tolerate failures of individual drives
- Active hard-drive protection, to protect laptop drives from external mechanical forces
- S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) included in hard-drives, to provide early warning of predictable failure modes
- Base isolation used under server racks in data centers
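As a rough illustration of the data-scrubbing entry in the list above, the sketch below re-reads files and compares their SHA-256 digests against values recorded earlier, so that latent corruption or unreadable sectors are noticed while a good copy still exists. The manifest file name and layout are assumptions made for this example, not part of any particular scrubbing tool.

```python
# Minimal data-scrubbing sketch: re-read files and compare SHA-256 digests
# against a previously recorded manifest to detect latent (silent) corruption.
# "checksums.json" maps file paths to hex digests; the format is assumed here.
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(manifest_path="checksums.json"):
    manifest = json.loads(Path(manifest_path).read_text())
    for name, recorded in manifest.items():
        try:
            current = sha256_of(name)
        except OSError as err:            # unreadable file: possible bad sector
            print(f"READ ERROR  {name}: {err}")
            continue
        if current != recorded:
            print(f"CORRUPTED   {name}")  # restore this file from backup or redundancy

if __name__ == "__main__":
    scrub()
```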
Data recovery
Data from a failed drive can sometimes be partially or totally recovered if the platters' magnetic coating is not totally destroyed. Specialized companies carry out data recovery, at significant cost. It may be possible to recover data by opening the drives in a clean room and using appropriate equipment to replace or revitalize failed components.[35] If the electronics have failed, it is sometimes possible to replace the electronics board, though often drives of nominally exactly the same model manufactured at different times have different circuit boards that are incompatible. Moreover, electronics boards of modern drives usually contain drive-specific adaptation data required for accessing their system areas, so the related componentry needs to be either reprogrammed (if possible) or unsoldered and transferred between two electronics boards.[36][37][38]
Sometimes operation can be restored for long enough to recover data, perhaps requiring reconstruction techniques such as file carving. Risky techniques may be justifiable if the drive is otherwise dead. A drive that has been started up once may keep running for a longer or shorter time but never start again, so as much data as possible should be recovered as soon as the drive starts.
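File carving recovers files from a raw image by searching for known content signatures instead of relying on file-system metadata that may have been destroyed. Below is a minimal sketch that carves JPEG candidates out of a disk image by their start and end markers; the image file name is hypothetical, and real carving tools handle fragmentation and many more formats.

```python
# Minimal file-carving sketch: extract JPEG candidates from a raw disk image
# by scanning for the JPEG start (FF D8 FF) and end (FF D9) markers.
# "disk.img" is a hypothetical image copied from the failing drive.

JPEG_START = b"\xff\xd8\xff"
JPEG_END = b"\xff\xd9"

def carve_jpegs(image_path="disk.img", out_prefix="carved"):
    with open(image_path, "rb") as f:
        data = f.read()          # fine for a sketch; real tools stream the image
    pos = count = 0
    while True:
        start = data.find(JPEG_START, pos)
        if start == -1:
            break
        end = data.find(JPEG_END, start)
        if end == -1:
            break
        with open(f"{out_prefix}_{count:04d}.jpg", "wb") as out:
            out.write(data[start:end + len(JPEG_END)])
        count += 1
        pos = end + len(JPEG_END)
    print(f"Carved {count} candidate JPEG files.")

if __name__ == "__main__":
    carve_jpegs()
```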
References
- ^ Top 7 Causes Of Hard Disk Failure. ADRECA. 2015-08-05 [December 23, 2019].
- ^ Scheier, Robert. Study: Hard Drive Failure Rates Much Higher Than Makers Estimate. PC World. 2007-03-02 [9 February 2016].
- ^ How long do hard drives actually live for?. ExtremeTech. [August 3, 2015].
- ^ Definition of:hard disk defect management. PC Mag.
- ^ Quirke, Chris. Hard Drive Data Corruption. (Archived from the original on 26 December 2014).
- ^ Barracuda 7200.10 Serial ATA Product Manual (PDF). [26 April 2012].
- ^ IEEE.org, Baumgart, P.; Krajnovich, D.J.; Nguyen, T.A.; Tam, A.G.; IEEE Trans. Magn.
- ^ Seagate Barracuda 3.5" Desktop HDD Datasheet (PDF).
- ^ Pugh et al.; "IBM's 360 and Early 370 Systems"; MIT Press, 1991, pp.270
- ^ Sony | For Business | VAIO SMB. B2b.sony.com. [13 March 2009].
- ^ HP.com (PDF). [26 April 2012].
- ^ Toshiba HDD Protection measures. (PDF). [26 April 2012].
- ^ Adaptec ACB-2072 XT to RLL Installation Guide. A defect list "may be put in from a file or entered from a keyboard."
- ^ Hard Drives. escotal.com. [16 July 2011].
- ^ How to Manage for Hard Drive Failures and Data Corruption. Backblaze Blog | Cloud Storage & Cloud Backup. 2019-07-11 [2021-10-12] (in American English).
- ^ Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso. Failure Trends in a Large Disk Drive Population (PDF). 5th USENIX Conference on File and Storage Technologies (FAST 2007). February 2007 [15 September 2008].
- ^ Conclusions: Failure Trends in Large Disk Drive Population, p. 12
- ^ Shankland, Stephen. CNet.com. News.cnet.com. 1 April 2009 [26 April 2012].
- ^ AFR broken down by age groups: Failure Trends in Large Disk Drive Population, p. 4, figure 2 and subsequent figures.
- ^ Bianca Schroeder and Garth A. Gibson. "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?". Proceedings 5th USENIX Conference on File and Storage Technologies. 2007.
- ^ L.N. Bairavasundaram, G.R. Goodson, S. Pasupathy, J. Schindler. "An analysis of latent sector errors in disk drives". Proceedings of SIGMETRICS'07, June 12–16, 2007 (PDF).
- ^ WD VelociRaptor Drive Specification Sheet (PDF) (PDF). [26 April 2012].
- ^ Jay White. Technical Report: Storage Subsystem Resiliency Guide (TR-3437) (PDF). NetApp: 5. May 2013 [6 January 2016].
- ^ Everything You Know About Disks Is Wrong. StorageMojo. 20 February 2007 [29 August 2007].
- ^ "One aspect of disk failures that single-value metrics such as MTTF and AFR cannot capture is that in real life failure rates are not constant. Failure rates of hardware products typically follow a "bathtub curve" with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle."(Schroeder et al. 2007)
- ^ David A. Patterson; John L. Hennessy. Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface. Section 6.12. Elsevier. 13 October 2011: 613–. ISBN 978-0-08-088613-8. – "...disk manufacturers argue that the calculation [of MTBF] corresponds to a user who buys a disk and keeps replacing the disk every five years- the planned lifetime of the disk."
- ^ Decrypting hard-drive failures – MTBF and AFR. snowark.com.
- ^ This is the case of Software RAID and desktop drives without ERC configured. The problem is known as timeout mismatch.
- ^ Brown, Cody. What Are the Most Dependable Hard Drives? Understanding the Backblaze Tests. March 25, 2022 [November 15, 2022]. – "So surely, the data they provide is invaluable to average consumers… right? Well, maybe not."
- ^ Henkel, Tom. IBM 3380 damage: Tip of a larger problem?. ComputerWorld. December 24, 1984: 41.
- ^ Burke, Steven. Drive Problems Continue in PC AT. InfoWorld. 18 November 1985.
- ^ Krazit, Tom. Fujitsu hard disk lawsuit settlement proposed. ComputerWorld. 22 October 2003 (in English).
- ^ IBM 75GXP: The infamous Deathstar (PDF). Computer History Museum. 2000.
- ^ Hruska, Joel. Seagate faces class-action lawsuit over 3TB hard drive failure rates. ExtremeTech. 2 February 2016 (in American English).
- ^ HddSurgery - Professional tools for data recovery and computer forensics experts. [April 10, 2020].
- ^ Hard Drive Circuit Board Replacement Guide or How To Swap HDD PCB. donordrives.com. [May 27, 2015]. (Archived from the original on May 27, 2015).
- ^ Firmware Adaptation Service – ROM Swap. pcb4you.com. [May 27, 2015]. (Archived from the original on April 18, 2015).
- ^ How to Maximize the Lifespan of Your Computer's Hard Drive.
External links
- Backblaze: Hard Drive Annual Failure Rates, 2019, Q2-2020
- Failure Trends in a Large Disk Drive Population – Google, Inc. February 2007
- A Clean-Slate Look at Disk Scrubbing
- Noises made by defective and failing hard disk drives
- Hard disk drive anatomy: Logical and physical failures