Draft: Hard disk drive failure
A hard disk drive failure occurs when a hard disk drive malfunctions and the information stored on it can no longer be accessed by the computer.
A drive may fail by chance in the course of normal operation, or because of an external factor such as fire, water, strong magnetic fields, impact, or contamination (which can lead to a head crash).
In addition, data corruption, damage to the master boot record (MBR), or malware are not failures of the drive itself, but they also present as the computer being unable to access the drive normally.
Causes
Hard drive failure has many possible causes, including human error, hardware damage, firmware corruption, media damage, heat, water damage, power problems, or simply chance.[1] Drive manufacturers typically specify a mean time between failures (MTBF) or an annualized failure rate (AFR); these are population statistics and cannot predict the failure of any individual drive.[2] They are calculated by running a sample of drives continuously for a short period, analysing the wear of their physical components, and extrapolating a reasonable estimate of service life. Hard drive failures tend to follow the bathtub curve:[3] if there is a manufacturing defect, the drive should start to fail within a short time. A drive that proves reliable during its first few months of use is much more likely to stay reliable. Even after years of heavy use a drive rarely shows obvious signs of wear, yet it can still fail suddenly at any time.
The most common direct cause of drive failure is a head crash. The read/write heads normally float just above the platter surface; if a head touches a platter or scratches the magnetic surface where data is stored, severe data loss results. In that case the inside of the drive is already damaged, and data recovery must be carried out by specialists with suitable equipment, otherwise further damage may be done. The platters are coated with an extremely thin layer of non-electrostatic lubricant, so on contact a head may simply glance off the surface. However, because the heads normally ride only a few nanometres above the platters, a head crash remains a well-known risk.
Another possible cause of failure is a faulty air filter. Modern drives have air filters that equalise air pressure and humidity between the inside and outside of the drive. If the filter fails to clean the incoming air, dust can settle on a platter and cause a head crash when a head sweeps over it. After a crash, particles thrown off the damaged platter and head can in turn create bad sectors. Together with the damage to the platter itself, this quickly renders the drive unusable.
Besides the platters, a drive contains electronics such as the controller, and these occasionally fail too. In that case it is often possible to recover all of the data by replacing the controller board.
Signs of drive failure
Failure of a hard disk drive can be catastrophic or gradual. A catastrophic failure shows up as the drive not being detected by the motherboard BIOS or failing the POST self-test, so the operating system never sees it. Gradual failure is harder to diagnose, because its symptoms, such as occasional data corruption or the computer becoming sluggish (caused by repeated retries over bad areas of the disk), do not point clearly to the drive and can be caused by many other things, such as malware. A growing number of bad sectors is a sign that a drive may be failing. However, because the drive adds them to its own remapping table automatically,[4] such signs are not obvious to checking utilities such as ScanDisk; only utilities that can catch problems before the drive's own defect management does are likely to expose them. Once the spare sectors reserved by the drive's internal defect management are exhausted, the failure becomes complete. Repetitive patterns of head seeking, such as recurring fast or slower seek-end noises (clicking), can also indicate a failing drive.[5]
This is not limited to hard disk drives; other kinds of magnetic media can fail in similar ways. The 100 MB "Zip disks" used in Iomega's Zip drives of the late 1990s were affected by the "click of death", so called because the drive clicks endlessly when such a disk fails. 3.5-inch floppy disks can fail similarly: if the drive or the medium is contaminated, users may encounter the "buzz of death" when trying to access the drive.
Head parking
During normal operation the heads fly above the platters. To prevent the heads from landing on the data area during a power loss or other fault, modern drives "park" or "unload" them. Contact start/stop drives park the heads on an area of the platter not used for data storage, which is called landing. Drives with ramp loading move the heads onto a ramp outside the platters and lock them in place mechanically, keeping them away from the platters, which is called unloading. Some early drives could not park the heads safely on a sudden power loss, so the heads could land on the data area by mistake. Some other early drives required the user to park the heads manually.
Contact start/stop
Contact start/stop (CSS) drives have a data-free area near the centre of the platter called the landing zone. In modern designs the spindle motor briefly acts as a generator that powers the head actuator, using the platters' inertia to pull the heads to the landing zone when power is lost; older designs relied on springs instead.
A spring on the head arm presses the slider toward the platter; once the platters spin up, the heads are carried on an air bearing and do not touch or wear the platter surface. CSS sliders are designed to tolerate repeated contact with the platter surface, but microscopic wear eventually causes damage. Most manufacturers design the sliders to survive at least 50,000 start/stop cycles before the failure rate exceeds 50%. The wear is not linear, however: the sliders of a well-used drive drag along the platter for longer before the air bearing builds up, so each start of an old drive carries a higher risk of damage than on a new one. Manufacturers publish reliability figures after testing; for example, Seagate's Barracuda 7200.10 desktop drives are rated for 50,000 start/stop cycles, meaning that no failures attributable to head-surface contact were seen before at least 50,000 cycles in testing.[6]
Around 1995 IBM pioneered laser zone texture (LZT), in which the landing zone is laser-processed to produce a nanometre-scale textured surface,[7] improving stiction and wear performance. The technique is still in use, now mostly only in low-capacity Seagate desktop drives,[8] and has been phased out in favour of ramp loading in small-form-factor (2.5-inch), high-capacity, NAS and enterprise drives. In general, CSS drives are more sensitive to their environment: high humidity, for example, can cause the heads to stick to the platter, and the resulting excessive friction can physically damage the platter, the slider, or the spindle motor.
Ramp loading
Load/unload technology lifts the heads off the platters and moves them to a safe area, both reducing wear and avoiding the stiction risk of contact start/stop. The first hard drive, the RAMAC, and most early drives of its era used comparable mechanisms, though at the time they were very complex. Modern drives use "ramp loading", invented by Memorex in 1967:[9] a plastic ramp sits beyond the platters, and when the drive is idle the heads are moved up the ramp and held there, a process called unloading. At first only small-form-factor notebook drives used it, for shock resistance; it later came into widespread use in most desktop drives.
To improve shock resistance further, IBM introduced drives with an "Active Protection System" for its ThinkPad notebook line. When an accelerometer built into the computer detects sudden, violent motion, the drive unloads its heads to reduce the risk of data loss and of scratching the platters. Apple later offered a similar feature, the Sudden Motion Sensor, across its PowerBook, iBook, MacBook Pro and MacBook lines. Sony,[10] HP with its "HP 3D DriveGuard",[11] Toshiba[12] and other major vendors have since applied similar technologies in their notebook lines.
Failure modes
Hard drives can fail in a number of ways. Failure may be sudden, progressive, or limited, and it may cause total or partial data loss, or none at all.
Early hard drives tended to ship with bad sectors and to develop more in use; as long as large numbers did not appear suddenly, this was considered normal at the time. Such sectors could be hidden by "remapping" so the drive could keep operating. Some early drives even shipped with a table attached, from which the user was expected to enter the remappings manually.[13] Later drives remap bad sectors automatically, without user intervention. A remapped drive can still be used, but performance suffers because the heads must move to the remapped sector whenever a bad one is accessed. S.M.A.R.T. provides logs and statistics about remapping. Modern drives ship with factory defects already mapped out, so the reallocated sector count is normally zero; any increase can be a sign of impending failure.
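As an illustration of the reallocated-sector statistic described above, the sketch below reads S.M.A.R.T. attribute 5 (Reallocated_Sector_Ct) using the smartmontools `smartctl` utility and flags any non-zero count. It is only a sketch: it assumes smartctl is installed and run with sufficient privileges, the device path `/dev/sda` is just an example, and raw-value formatting can vary between drive models.

```python
import re
import subprocess

def reallocated_sector_count(device="/dev/sda"):
    """Return the raw value of S.M.A.R.T. attribute 5, or None if not reported."""
    # 'smartctl -A' prints the ATA attribute table; non-zero exit codes are
    # common (they encode status bits), so we do not treat them as errors here.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        # Attribute rows look like: "  5 Reallocated_Sector_Ct ... RAW_VALUE"
        if re.match(r"\s*5\s+Reallocated_Sector_Ct\b", line):
            return int(line.split()[-1])  # assumes a plain integer raw value
    return None

if __name__ == "__main__":
    count = reallocated_sector_count()
    if count is None:
        print("Attribute 5 not reported by this drive.")
    elif count > 0:
        print(f"Warning: {count} reallocated sectors - back up and consider replacing the drive.")
    else:
        print("No reallocated sectors reported.")
```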
Other failure modes may be progressive or limited. In any case, once such symptoms appear the drive should be replaced promptly; the risk of losing data usually far outweighs the money saved by keeping it in service. Possible symptoms include repeated read/write errors, unusual noises, and excessive heat.
- Head crash: an impact or other external cause brings a head into contact with the platter, causing irreversible mechanical damage and data loss in the contact area. In the worst case, debris thrown off the contact area contaminates the heads and the whole platter surface, destroying the drive. Even if the damage is initially local, it spreads as the drive keeps running, until the drive is completely unusable.[14]
- Bad sectors: some sectors may fail without making the whole drive inaccessible. Their appearance is a warning sign: once even one bad sector appears, the probability that the drive will soon fail completely is much higher.
- Stiction: the heads stick to the platter and the drive cannot spin up. Besides wear, this can have many causes, such as improper platter lubrication, poor design, or manufacturing defects. Some early drives were prone to it by design; the problem was not resolved until the early 1990s.
- Circuit failure: components of the drive's electronics board fail, making the drive inaccessible; this is often the result of user error such as electrostatic discharge.
- Bearing and motor failure: the spindle motor fails or burns out, or its bearings wear out, preventing the drive from operating properly. Modern drives generally use fluid dynamic bearings (FDB), so this has become less common.[15]
- Miscellaneous mechanical failure: mechanical parts inside the drive, especially moving parts, break or are damaged, and the resulting debris can cause further damage.
Failure metrics
Most major hard disk and motherboard vendors support S.M.A.R.T., which measures drive characteristics such as operating temperature, spin-up time, and data error rates. Certain trends and sudden changes in these parameters are thought to be associated with an increased likelihood of drive failure and data loss. However, S.M.A.R.T. parameters alone may not be useful for predicting individual drive failures.[16] While several S.M.A.R.T. parameters affect failure probability, a large fraction of failed drives do not produce predictive S.M.A.R.T. parameters.[16] Unpredictable breakdown may occur at any time in normal use, with potential loss of all data. Recovery of some or even all data from a damaged drive is sometimes, but not always, possible, and is normally costly.
A 2007 study published by Google suggested very little correlation between failure rates and either high temperature or activity level; as the study put it, "one of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels."[17] Hard drives with S.M.A.R.T.-reported average temperatures below 27 °C (81 °F) had higher failure rates than hard drives with the highest reported average temperature of 50 °C (122 °F), with failure rates at least twice as high as in the optimum S.M.A.R.T.-reported temperature range of 36 °C (97 °F) to 47 °C (117 °F).[16] The correlation between manufacturer, model and failure rate was relatively strong. Statistics on this are kept highly secret by most entities; Google did not relate manufacturers' names to failure rates,[16] though it has been revealed that Google uses Hitachi Deskstar drives in some of its servers.[18]
Google's 2007 study found, based on a large field sample of drives, that actual annualized failure rates (AFRs) for individual drives ranged from 1.7% for first year drives to over 8.6% for three-year-old drives.[19] A similar 2007 study at CMU on enterprise drives showed that measured MTBF was 3–4 times lower than the manufacturer's specification, with an estimated 3% mean AFR over 1–5 years based on replacement logs for a large sample of drives, and that hard drive failures were highly correlated in time.[20]
A 2007 study of latent sector errors (as opposed to the above studies of complete disk failures) showed that 3.45% of 1.5 million disks developed latent sector errors over 32 months (3.15% of nearline disks and 1.46% of enterprise-class disks developed at least one latent sector error within twelve months of their ship date), with the annual sector error rate increasing between the first and second years. Enterprise drives showed fewer sector errors than consumer drives. Background scrubbing was found to be effective in correcting these errors.[21]
SCSI, SAS, and FC drives are more expensive than consumer-grade SATA drives and are usually used in servers and disk arrays, whereas SATA drives were sold into the home computer, desktop, and near-line storage markets and were perceived to be less reliable. This distinction is now becoming blurred.
The mean time between failures (MTBF) of SATA drives is usually specified to be about 1 million hours (some drives such as the Western Digital Raptor are rated at 1.4 million hours MTBF),[22] while SAS/FC drives are rated for upwards of 1.6 million hours.[23] Modern helium-filled drives are completely sealed, without a breather port, which eliminates the risk of debris ingression and results in a typical MTBF of 2.5 million hours. However, independent research indicates that MTBF is not a reliable estimate of a drive's longevity (service life).[24] MTBF testing is conducted in laboratory test chambers and is an important metric for determining the quality of a disk drive, but it is designed to measure only the relatively constant failure rate over the service life of the drive (the middle of the "bathtub curve") before the final wear-out phase.[20][25][26] A more interpretable, but equivalent, metric is the annualized failure rate (AFR), the percentage of drives expected to fail per year. Both AFR and MTBF tend to measure reliability only in the initial part of a hard drive's life, thereby understating the real probability of failure of a used drive.[27]
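To make the relationship between the two metrics concrete, the sketch below converts a specification MTBF into the equivalent AFR under the usual constant-failure-rate (exponential lifetime) assumption. The MTBF figures are the specification values quoted above, not measurements, and the conversion itself is a standard approximation rather than any manufacturer's method.

```python
import math

HOURS_PER_YEAR = 8766  # average calendar year, leap years included

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    # With exponentially distributed lifetimes, P(failure within one year) = 1 - exp(-t / MTBF).
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for label, mtbf in [("SATA (1.0M h)", 1_000_000),
                    ("SAS/FC (1.6M h)", 1_600_000),
                    ("Helium-filled (2.5M h)", 2_500_000)]:
    print(f"{label}: AFR ~ {afr_from_mtbf(mtbf):.2%}")
```

A 1-million-hour MTBF works out to an AFR of roughly 0.9% per year, which is why field-measured AFRs several times higher imply a much lower effective MTBF than the specification.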
The cloud storage company Backblaze produces an annual report on hard drive reliability. However, the company states that it mainly uses commodity consumer drives, deployed in enterprise conditions rather than in their representative conditions and for their intended use. Consumer drives are also not tested to work with the enterprise RAID cards used in datacenters and may not respond within the time a RAID controller expects; such drives will be marked as having failed when they have not.[28] Results of this kind may or may not be relevant to different users, since they accurately represent the performance of consumer drives in the enterprise or under extreme stress, but may not accurately represent their performance in normal or intended use.[29]
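Field reports of this kind typically annualize observed failures over accumulated drive-days rather than quoting an MTBF. The sketch below shows that calculation in its commonly described form; the figures are made up purely for illustration and are not taken from any report.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Observed failures scaled to a per-drive-year rate (drive-days method)."""
    drive_years = drive_days / 365.0
    return failures / drive_years

# Hypothetical numbers: 60 failures across 1,200 drives each running a full year.
print(f"AFR ~ {annualized_failure_rate(60, 1200 * 365):.2%}")  # ~5.00%
```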
Drive families with high failure rates
- IBM 3380 DASD, c. 1984[30]
- Computer Memories Inc. 20 MB HDD for PC/AT, c. 1985[31]
- Fujitsu MPG3 and MPF3 series, c. 2002[32]
- IBM Deskstar 75GXP, c. 2001[33]
- Seagate ST3000DM001, c. 2012[34]
Prevention
In order to avoid the loss of data due to disk failure, common solutions include:
- Data backup, to allow restoration of data after a failure
- Data scrubbing, to detect and repair latent corruption (a minimal sketch follows this list)
- Data redundancy, to allow systems to tolerate failures of individual drives
- Active hard-drive protection, to protect laptop drives from external mechanical forces
- S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) included in hard-drives, to provide early warning of predictable failure modes
- Base isolation used under server racks in data centers
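As a simplified illustration of the data-scrubbing item above, the sketch below records SHA-256 checksums for a directory tree and later re-reads every file to report silent changes. Real scrubbing is normally done at the block or filesystem layer (for example by RAID or checksumming filesystems); this user-level version only demonstrates the idea, and the file names and paths are examples.

```python
import hashlib
import json
import sys
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path, manifest: Path) -> None:
    """Record a checksum for every regular file under root."""
    sums = {str(p): checksum(p) for p in root.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(sums, indent=2))

def scrub(manifest: Path) -> int:
    """Re-read every recorded file and report mismatches (possible latent corruption)."""
    bad = 0
    for name, expected in json.loads(manifest.read_text()).items():
        p = Path(name)
        if not p.exists() or checksum(p) != expected:
            print(f"corruption or loss detected: {name}")
            bad += 1
    return bad

if __name__ == "__main__":
    # Usage: python scrub.py init /data   (record checksums)
    #        python scrub.py check        (re-verify them later)
    manifest = Path("manifest.json")
    if sys.argv[1] == "init":
        build_manifest(Path(sys.argv[2]), manifest)
    else:
        sys.exit(1 if scrub(manifest) else 0)
```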
Data recovery
Data from a failed drive can sometimes be partially or totally recovered if the platters' magnetic coating is not totally destroyed. Specialized companies carry out data recovery at significant cost. It may be possible to recover data by opening the drive in a clean room and using appropriate equipment to replace or revitalize failed components.[35] If the electronics have failed, it is sometimes possible to replace the electronics board, though drives of nominally the same model manufactured at different times often have different, incompatible circuit boards. Moreover, the electronics boards of modern drives usually contain drive-specific adaptation data required for accessing their system areas, so the related componentry needs to be either reprogrammed (if possible) or unsoldered and transferred between the two boards.[36][37][38]
Sometimes operation can be restored for long enough to recover data, perhaps requiring reconstruction techniques such as file carving. Risky techniques may be justifiable if the drive is otherwise dead. A drive that starts up once may run for a while but never start again, so as much data as possible should be recovered as soon as the drive starts.
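As a toy illustration of the file-carving technique mentioned above, the sketch below scans a raw disk image for JPEG start and end markers and writes out each contiguous candidate file. Real carving tools handle many formats, fragmentation and false positives; the image path `disk.img` is only an example.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8\xff"   # SOI marker plus the first byte of the following marker
JPEG_END = b"\xff\xd9"         # EOI marker

def carve_jpegs(image_path: str, out_dir: str = "carved") -> int:
    """Naively carve contiguous JPEGs out of a raw image; returns the number found."""
    data = Path(image_path).read_bytes()   # whole image in memory: fine for a sketch only
    Path(out_dir).mkdir(exist_ok=True)
    count, pos = 0, 0
    while (start := data.find(JPEG_START, pos)) != -1:
        end = data.find(JPEG_END, start)
        if end == -1:
            break
        Path(out_dir, f"file_{count:04d}.jpg").write_bytes(data[start:end + 2])
        count, pos = count + 1, end + 2
    return count

if __name__ == "__main__":
    print(f"Recovered {carve_jpegs('disk.img')} candidate JPEG files.")
```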
References
- ^ Top 7 Causes Of Hard Disk Failure. ADRECA. 2015-08-05 [December 23, 2019].
- ^ Scheier, Robert. Study: Hard Drive Failure Rates Much Higher Than Makers Estimate. PC World. 2007-03-02 [9 February 2016].
- ^ How long do hard drives actually live for?. ExtremeTech. [August 3, 2015].
- ^ Definition of:hard disk defect management. PC Mag.
- ^ Quirke, Chris. Hard Drive Data Corruption. (Archived from the original on 26 December 2014).
- ^ Barracuda 7200.10 Serial ATA Product Manual (PDF). [26 April 2012].
- ^ IEEE.org, Baumgart, P.; Krajnovich, D.J.; Nguyen, T.A.; Tam, A.G.; IEEE Trans. Magn.
- ^ Seagate Barracuda 3.5" Desktop HDD Datasheet (PDF).
- ^ Pugh et al.; "IBM's 360 and Early 370 Systems"; MIT Press, 1991, pp.270
- ^ Sony | For Business | VAIO SMB. B2b.sony.com. [13 March 2009].
- ^ HP.com (PDF). [26 April 2012].
- ^ Toshiba HDD Protection measures. (PDF). [26 April 2012].
- ^ Adaptec ACB-2072 XT to RLL Installation Guide A defect list "may be put in from a file or entered from a keyboard."
- ^ Hard Drives. escotal.com. [16 July 2011].
- ^ How to Manage for Hard Drive Failures and Data Corruption. Backblaze Blog | Cloud Storage & Cloud Backup. 2019-07-11 [2021-10-12].
- ^ Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso. Failure Trends in a Large Disk Drive Population (PDF). 5th USENIX Conference on File and Storage Technologies (FAST 2007). February 2007 [15 September 2008].
- ^ Conclusions: Failure Trends in Large Disk Drive Population, p. 12
- ^ Shankland, Stephen. CNet.com. News.cnet.com. 1 April 2009 [26 April 2012].
- ^ AFR broken down by age groups: Failure Trends in Large Disk Drive Population, p. 4, figure 2 and subsequent figures.
- ^ Bianca Schroeder and Garth A. Gibson. "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?". Proceedings 5th USENIX Conference on File and Storage Technologies. 2007.
- ^ L.N. Bairavasundaram, GR Goodson, S. Pasupathy, J.Schindler. "An analysis of latent sector errors in disk drives". Proceedings of SIGMETRICS'07, June 12-16,2007. (PDF).
- ^ WD VelociRaptor Drive Specification Sheet (PDF) (PDF). [26 April 2012].
- ^ Jay White. Technical Report: Storage Subsystem Resiliency Guide (TR-3437) (PDF). NetApp: 5. May 2013 [6 January 2016].
- ^ Everything You Know About Disks Is Wrong. StorageMojo. 20 February 2007 [29 August 2007].
- ^ "One aspect of disk failures that single-value metrics such as MTTF and AFR cannot capture is that in real life failure rates are not constant. Failure rates of hardware products typically follow a "bathtub curve" with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle."(Schroeder et al. 2007)
- ^ David A. Patterson; John L. Hennessy. Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface. Section 6.12. Elsevier. 13 October 2011: 613–. ISBN 978-0-08-088613-8. – "...disk manufacturers argue that the calculation [of MTBF] corresponds to a user who buys a disk and keeps replacing the disk every five years- the planned lifetime of the disk."
- ^ Decrypting hard-drive failures – MTBF and AFR. snowark.com.
- ^ This is the case of Software RAID and desktop drives without ERC configured. The problem is known as timeout mismatch.
- ^ Brown, Cody. What Are the Most Dependable Hard Drives? Understanding the Backblaze Tests. March 25, 2022 [November 15, 2022]. "So surely, the data they provide is invaluable to average consumers… right? Well, maybe not."
- ^ Henkel, Tom. IBM 3380 damage: Tip of a larger problem?. ComputerWorld. December 24, 1984: 41.
- ^ Burke, Steven. Drive Problems Continue in PC AT. InfoWorld. 18 November 1985.
- ^ Krazit, Tom. Fujitsu hard disk lawsuit settlement proposed. ComputerWorld. 22 October 2003.
- ^ IBM 75GXP: The infamous Deathstar (PDF). Computer History Museum. 2000.
- ^ Hruska, Joel. Seagate faces class-action lawsuit over 3TB hard drive failure rates. ExtremeTech. 2 February 2016.
- ^ HddSurgery - Professional tools for data recovery and computer forensics experts. [April 10, 2020].
- ^ Hard Drive Circuit Board Replacement Guide or How To Swap HDD PCB. donordrives.com. [May 27, 2015]. (Archived from the original on May 27, 2015).
- ^ Firmware Adaptation Service – ROM Swap. pcb4you.com. [May 27, 2015]. (Archived from the original on April 18, 2015).
- ^ How to Maximize the Lifespan of Your Computer's Hard Drive.
See also
External links
- Backblaze: Hard Drive Annual Failure Rates, 2019, Q2-2020
- Failure Trends in a Large Disk Drive Population – Google, Inc. February 2007
- A Clean-Slate Look at Disk Scrubbing
- Noises made by defective and failing hard disk drives
- Hard disk drive anatomy: Logical and physical failures