Read Error Severities and Error Management Logic


The hard disk's controller employs a sequence of techniques to manage errors that occur when reading data from the disk. The system works much like a troubleshooting flowchart: when a problem occurs, the simplest techniques are tried first, and if they don't work, the problem is escalated to a higher level. Every manufacturer uses different techniques, so this is only a rough guide to how a hard disk approaches error management:

ECC Error Detection: The sector is read, and error detection is applied to check for any read errors. If there are no errors, the sector is passed on to the interface and the read is concluded successfully.
ECC Error Correction: If an error is detected, the controller will attempt to correct it using the ECC codes read for the sector. The data can be corrected very quickly using these codes, normally "on the fly" with no delay. If the correction succeeds, the data is fixed and the read is considered successful. Most drive manufacturers consider this occurrence common enough that it is not even counted as a "real" read error. An error corrected at this level can be considered "automatically corrected".
Automatic Retry: The next step is usually to wait for the disk to spin around again, and retry the read. Sometimes the first error can be caused by a stray magnetic field, physical shock or other non-repeating problem, and the retry will work. If it doesn't, more retries may be done. Most controllers are programmed to retry the sector a certain number of times before giving up. An error corrected after a straight retry is often considered "recovered" or "corrected after retry".
Advanced Error Correction: Many drives will, on subsequent retries after the first, invoke more advanced error correction algorithms that are slower and more complex than the regular correction protocols, but have an increased chance of success. These errors are "recovered after multiple reads" or "recovered after advanced correction".
Failure: If the sector still cannot be read, the drive will signal a read error to the system. These are "real", unrecoverable read errors, the kind that result in a dreaded error message on the screen.
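
The escalation described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any manufacturer's firmware: the names `read_sector`, `raw_read`, `advanced_read` and `max_retries` are hypothetical, and each read attempt is reduced to a status of "ok", "correctable" or "bad", where a real controller would work from the ECC data itself.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class ReadOutcome(Enum):
    CLEAN = "no error"
    AUTO_CORRECTED = "automatically corrected"
    RECOVERED_AFTER_RETRY = "recovered after retry"
    RECOVERED_ADVANCED = "recovered after advanced correction"
    UNRECOVERABLE = "unrecoverable read error"


# Hypothetical read attempt: returns (data, status), where status is
# "ok", "correctable" or "bad". This is an assumption made for the sketch.
Attempt = Callable[[], Tuple[Optional[bytes], str]]


def read_sector(
    raw_read: Attempt,
    advanced_read: Attempt,
    max_retries: int = 3,
) -> Tuple[Optional[bytes], ReadOutcome]:
    # ECC error detection and on-the-fly correction on the first read.
    data, status = raw_read()
    if status == "ok":
        return data, ReadOutcome.CLEAN
    if status == "correctable":
        return data, ReadOutcome.AUTO_CORRECTED

    for attempt in range(1, max_retries + 1):
        if attempt == 1:
            # Automatic retry: read again on the next revolution.
            data, status = raw_read()
            outcome = ReadOutcome.RECOVERED_AFTER_RETRY
        else:
            # Later retries use the slower, stronger correction algorithm.
            data, status = advanced_read()
            outcome = ReadOutcome.RECOVERED_ADVANCED
        if status != "bad":
            return data, outcome

    # Every level failed: report the error to the system.
    return None, ReadOutcome.UNRECOVERABLE
```

Returning the outcome alongside the data mirrors the point that even a successful recovery is worth recording, since the severity level reached says something about the health of the sector.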

Any problems that occur during a read, even if recovery is successful, are a potential cause for concern, and error notification or logging may be performed.

Even before the matter of actually reading the data comes up, drives can have problems with locating the track where the data is. Such a problem is called a seek error. In the event of a seek error, the drive follows a management procedure similar to the one used for read errors. Normally a series of retries is performed, and if the seek still cannot be performed, an unrecoverable seek error is generated. This is considered a drive failure, since the data may still be present but is inaccessible.

Every hard disk model has analysis done on it to determine the likelihood of these various errors. This is based on actual tests on the drive, on statistical analysis, and on the error history of prior models. Each drive is given a rating in terms of how often each error is likely to occur. Looking again at the Quantum Fireball TM, we see the following error rate specifications:

Error Severity	Worst-Case Frequency of Error (Number of Bits Read Between Occurrences)
Automatically Corrected	Not Specified
Recovered Read Errors	1 billion (1 Gb)
Recovered After Multiple Reads (Full Error Correction)	1 trillion (1,000 Gb)
Unrecoverable Read Errors	100 trillion (100,000 Gb)
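
To put the table in perspective, the figures can be turned into a rough error count for a given volume of reads. The short Python sketch below does this; the function name `expected_errors` and the one-terabyte workload are illustrative assumptions, and because the specifications are worst-case figures the results are upper bounds, not predictions.

```python
# Worst-case bits read between occurrences, taken from the table above.
BITS_BETWEEN_ERRORS = {
    "recovered": 10**9,
    "recovered after multiple reads": 10**12,
    "unrecoverable": 10**14,
}


def expected_errors(bytes_read: int, bits_between_errors: int) -> float:
    """Upper bound on the expected error count for a volume of reads."""
    return bytes_read * 8 / bits_between_errors


one_terabyte = 10**12  # decimal terabyte, an arbitrary example workload
for severity, interval in BITS_BETWEEN_ERRORS.items():
    count = expected_errors(one_terabyte, interval)
    print(f"{severity}: at most {count:g} expected")
```

For one terabyte read, this works out to at most about 8,000 recovered errors, 8 errors recovered after multiple reads, and 0.08 unrecoverable errors.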

Drives also typically specify the rate of data miscorrection. This situation arises if the ECC algorithm detects and corrects an error but itself makes a mistake! Clearly this is a very bad situation, since incorrect data would be returned to the system and the fact that an error occurred would not even be known. Fortunately, it is very, very rare. A typical value for this occurrence is less than 1 bit in 10²¹. That means a miscorrection occurs at most once in every trillion gigabits read from the disk. On average you could read the entire contents of a 40 GB drive roughly three billion times before it happened!
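
The arithmetic behind the miscorrection figure is easy to check. The Python sketch below assumes decimal gigabytes (40 GB = 40 × 10⁹ bytes); the variable names are illustrative only.

```python
MISCORRECTION_INTERVAL_BITS = 10**21  # "less than 1 bit in 10^21"
DRIVE_CAPACITY_BYTES = 40 * 10**9     # 40 GB, assuming decimal gigabytes

drive_capacity_bits = DRIVE_CAPACITY_BYTES * 8

# Gigabits read, on average, between miscorrections.
gigabits = MISCORRECTION_INTERVAL_BITS // 10**9

# Complete passes over the whole drive in that same interval.
full_drive_reads = MISCORRECTION_INTERVAL_BITS // drive_capacity_bits

print(f"{gigabits:,} gigabits between miscorrections")
print(f"{full_drive_reads:,} complete reads of the drive")
```

The first figure is one trillion gigabits, and the second comes to a little over three billion full reads of the drive.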

I find the numbers above, even the "smaller" ones, pretty impressive. While your hard disk does a lot of reads and writes, 100,000 gigabits is an enormous amount of data! This is why the reliability of modern hard disks is so high. Interestingly, the error rates on drives haven't changed all that much in the last few years. Presumably, any improvements in error rates are "used up" by pushing the performance envelope. Meanwhile, the reliability concerns associated with individual drives are typically addressed through the use of multiple drive arrays.
