Memory Errors

Memory is an electronic storage device, and all electronic storage devices have the potential to return information different from what was originally stored. Some technologies are more likely than others to do this. DRAM, because of its nature, is likely to return occasional memory errors. DRAM stores ones and zeros as charges on small capacitors that must be continually refreshed to ensure that the data is not lost. This is less reliable than the static storage used by SRAMs.

Every bit of memory is either a zero or a one, the standard in a digital system. This in itself helps to eliminate many errors, because slightly distorted values are usually recoverable. For example, in a 5 volt system, a "1" is +5V and a "0" is 0V. If the sensor that is reading the memory value sees +4.2V, it knows that this is really a "1", even though the value isn't +5V. Why? Because the only other choice would be a "0", and 4.2 is much closer to 5 than to 0. However, on rare occasions a +5V might be read as +1.9V and be considered a "0" instead of a "1". When this happens, a memory error has occurred.

There are two kinds of errors that can typically occur in a memory system. The first is called a repeatable or hard error. In this situation, a piece of hardware is broken and will consistently return incorrect results. A bit may be stuck so that it always returns "0", for example, no matter what is written to it. Hard errors usually indicate loose memory modules, blown chips, motherboard defects or other physical problems. They are relatively easy to diagnose and correct because they are consistent and repeatable.

The second kind of error is called a transient or soft error. This occurs when a bit reads back the wrong value once, but subsequently functions correctly. These problems are, understandably, much more difficult to diagnose!
They are also, unfortunately, more common. Eventually, a soft error will usually repeat itself, but it can take anywhere from minutes to years for this to happen. Soft errors are sometimes caused by memory that is physically bad, but at least as often they are the result of poor quality motherboards, memory system timings that are set too fast, static shocks, or other similar problems that are not related to the memory directly. In addition, stray radioactivity that is naturally present in materials used in PC systems can cause the occasional soft error. On a system that is not using error detection, transient errors are often written off as operating system bugs or random glitches.

The exact rate of errors returned by modern memory is a matter of some debate. It is agreed that the DRAMs used today are far more reliable than those of five to ten years ago. This has been the chief excuse used by system vendors who have dropped error detection support from their PCs. However, there are factors that make the problem worse in modern systems as well. First, more memory is being used: 10 years ago the typical system had 1 MB to 4 MB of memory; today's systems usually have 16 MB to 64 MB, or much more, since RAM prices have fallen dramatically in the last three years. Second, systems today are running much faster than they used to; the typical memory bus runs at 3 to 10 times the speed of those in older machines. Finally, the quality level of the average PC is way down from the levels of 10 years ago. Cheaply thrown-together PCs, made by assembly houses whose only concern is to get the price down and the machine out the door, often use RAM of very marginal quality.

Regardless of how often memory errors occur, they do occur. How much damage they create depends on when they happen and what it is that they get wrong.
If you are playing your favorite game and one of the bits controlling the color of the pixel at screen location (520, 277) is inverted from a one to a zero on one screen redraw, who cares, right? However, if you are defragmenting your hard disk and the memory location containing information to be written to the file allocation table is corrupted, it's a whole different ball game...

The only true protection from memory errors is to use some sort of memory error detection or correction protocol. (Well, that's not totally true. The other form of protection is prevention: buying quality components and not abusing or neglecting your system.) Some protocols can only detect errors in one bit of an eight-bit data byte; others can automatically detect errors in more than one bit. Still others can both detect and correct memory problems, seamlessly.
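
To make the idea of single-bit detection concrete, here is a minimal Python sketch of a simple parity check, one common scheme of the first kind described above. The function names (`parity_bit`, `store`, `check`, `flip_bit`) are invented for this illustration, and the even-parity convention is an assumption; real memory systems do this in hardware, not software.

```python
def parity_bit(byte: int) -> int:
    """Even parity: return 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte: int) -> tuple[int, int]:
    """Pretend to write a byte to memory along with its parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def check(stored: tuple[int, int]) -> bool:
    """Return True if the data still matches the parity bit saved with it."""
    data, parity = stored
    return parity_bit(data) == parity


def flip_bit(stored: tuple[int, int], position: int) -> tuple[int, int]:
    """Simulate a soft error by inverting one data bit (0 = lowest bit)."""
    data, parity = stored
    return (data ^ (1 << position), parity)


word = store(0b10110010)
one_flip = flip_bit(word, 3)       # a single bit goes bad
two_flips = flip_bit(one_flip, 6)  # a second bit goes bad as well

print("clean word passes:    ", check(word))
print("one bad bit passes:   ", check(one_flip))
print("two bad bits pass:    ", check(two_flips))
```

In this sketch the single flipped bit is caught, but the second flip cancels out the first as far as the parity count is concerned, so the doubly corrupted byte passes the check. That is why a scheme like this can detect a one-bit error but cannot be relied on for more, and why it cannot say which bit was wrong or fix it.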