Read Error Severities and Error Management Logic

The hard disk's controller employs a sequence of sophisticated techniques to manage errors that occur when reading data from the disk. In a way, the system is something like a troubleshooting flowchart: when a problem occurs, the simplest techniques are tried first, and if they don't work, the problem is escalated to a higher level. Every manufacturer uses different techniques, so this is just a rough guideline of how a hard disk will approach error management:
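To make the "flowchart" idea concrete, here is a minimal sketch in Python of trying recovery stages in order, cheapest first, and escalating only when a stage fails. It is an illustration only, not any manufacturer's firmware logic: the function names (`read_with_escalation`, `make_retry_stage`), the stage names, and the simulated sector are all assumptions made up for the example.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, handler) pair. A handler returns the recovered data,
# or None if it could not recover the sector. All names are hypothetical;
# real drives differ by manufacturer and model.
Stage = Tuple[str, Callable[[], Optional[bytes]]]


def read_with_escalation(stages: List[Stage]) -> Tuple[Optional[bytes], List[str]]:
    """Try each recovery stage in order, simplest first.

    Returns the data (or None on failure) plus a log of the stages attempted.
    """
    log: List[str] = []
    for name, handler in stages:
        log.append(name)
        data = handler()
        if data is not None:
            return data, log
    # Every stage failed, so the error is reported as unrecoverable.
    log.append("unrecoverable read error")
    return None, log


def make_retry_stage(
    read_once: Callable[[], Optional[bytes]], attempts: int
) -> Callable[[], Optional[bytes]]:
    """Wrap a single read attempt so it is retried up to `attempts` times."""
    def handler() -> Optional[bytes]:
        for _ in range(attempts):
            data = read_once()
            if data is not None:
                return data
        return None
    return handler


# Simulated sector that only reads cleanly on the third attempt.
outcomes = iter([None, None, b"sector data"])
stages = [
    ("ECC correction", lambda: None),  # assume ECC alone cannot fix it
    ("retry", make_retry_stage(lambda: next(outcomes), 5)),
]
data, log = read_with_escalation(stages)
print(data, log)
```

The log is kept even when recovery succeeds, so the caller still has a record of every stage that had to be tried.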
Any problems that occur during a read, even if recovery is successful, are potentially cause for concern, and error notification or logging may be performed.

Even before the matter of actually reading the data comes up, drives can have problems with locating the track where the data is. Such a problem is called a seek error. In the event of a seek error, a management process similar to the one used for read errors is followed. Normally a series of retries is performed, and if the seek still cannot be performed, an unrecoverable seek error is generated. This is considered a drive failure, since the data may still be present but is inaccessible.

Every hard disk model has analysis done on it to determine the likelihood of these various errors. This is based on actual tests on the drive, on statistical analysis, and on the error history of prior models. Each drive is given a rating in terms of how often each error is likely to occur. Looking again at the Quantum Fireball TM, we see the following error rate specifications:
Drives also typically specify the rate of data miscorrection. This situation arises if the ECC algorithm detects and corrects an error but itself makes a mistake! Clearly this is a very bad situation, since incorrect data would be returned to the system and the fact that an error occurred would not even be known. Fortunately, it is very, very rare. A typical value for this occurrence is less than 1 bit in 10^21. That means a miscorrection occurs once in every trillion gigabits read from the disk--on average you could read the entire contents of a 40 GB drive over a million times before it happened!

I find the numbers above--even the "smaller" ones--pretty impressive. While your hard disk does a lot of reads and writes, 100,000 gigabits is a pretty enormous number! This is why the reliability of modern hard disks is so high. Interestingly, the error rates on drives haven't changed all that much in the last few years. Presumably, any improvements in error rates are "used up" by pushing the performance envelope. Meanwhile, the reliability concerns associated with individual drives are typically addressed through the use of multiple drive arrays.
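For anyone who wants to check the arithmetic behind these figures, here is a small Python sketch that converts a "1 bit in N" rating into gigabits and into full reads of a drive. The helper names are made up for the example, and it assumes decimal units (1 gigabit = 10^9 bits, 1 GB = 10^9 bytes of 8 bits each), which may not match how a given manufacturer counts capacity.

```python
BITS_PER_BYTE = 8
GIGA = 10**9  # assumption: decimal giga, not 2**30


def gigabits_per_error(bits_per_error: int) -> int:
    """Convert a '1 error per N bits' rating into gigabits read per error."""
    return bits_per_error // GIGA


def full_reads_per_error(bits_per_error: int, capacity_gb: int) -> float:
    """Average number of complete reads of the drive between errors."""
    drive_bits = capacity_gb * GIGA * BITS_PER_BYTE
    return bits_per_error / drive_bits


miscorrection_rate = 10**21  # 1 miscorrected bit per 10^21 bits read

print(gigabits_per_error(miscorrection_rate))        # 1000000000000 (a trillion)
print(full_reads_per_error(miscorrection_rate, 40))  # 3125000000.0
print(gigabits_per_error(10**14))                    # 100000
```

Under these assumptions a 40 GB drive could be read end to end about three billion times per miscorrection, so "over a million times" is, if anything, a conservative way to put it.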