The Reliability of Flash Drives
By Umesh Maheshwari – Co-founder and CTO

Despite the increasing popularity of flash drives in the data center, very little has been published on their failure characteristics. So, it was a welcome relief to see a paper on Flash Reliability in Production based on flash drives in Google’s data centers.

On the one hand, the paper confirmed what many of us have expected and some of us have observed in practice. For instance, flash drives fail differently from disk drives for each of the two major failure modes:

  • Whole drive failure (requiring drive replacement): Flash drives have a lower annual failure rate than disk drives.
  • Partial data loss: Flash drives have a higher rate of uncorrectable errors than disk drives. (Each sector, typically from 0.5KB to 4KB, is protected by an error correction code, or ECC. When there are too many bit errors within a sector, ECC is unable to correct the errors, resulting in the loss of that sector.)

Storage system vendors need to understand these failure modes because, when drives fail or lose data, the system must compensate for the failures so that it does not lose data.

On the other hand, the paper uncovered a mystery: that a high raw bit error rate (RBER) is not predictive of uncorrectable errors. Here, “raw” means before applying error correction. A high RBER is generally considered a harbinger of uncorrectable errors. Unfortunately, the paper did not offer much explanation for the apparent lack of correlation between RBER and uncorrectable errors.

Below I explain how Nimble Storage systems are designed to avoid data loss. I also offer a plausible explanation for the mysterious lack of correlation between RBER and data loss.

Whole Drive Failures

It is not surprising that flash drives have a lower replacement rate than disk drives. Disk drives have mechanical and magnetic components that are more likely to result in a whole-drive failure than the mostly solid-state components within a flash drive.

The paper reports the following replacement rates:

  • Flash drives: 4% to 10% over four years, or 1% to 2.5% annually on average;
  • Disk drives: 2% to 9% annually, based on a previous study.

Nimble Storage has been selling and monitoring systems with flash and disk drives for over 5 years. Our observed failure rates are lower than those reported by the paper, but the relative ratio is consistent with the reported studies: the replacement rate of flash drives is about a third of the replacement rate of disk drives.

But, there is a catch. While disk drives have more complex hardware with mechanical and magnetic components, flash drives have more complex firmware to conduct address translation and garbage collection. Google’s flash drives run proprietary firmware that is likely streamlined and ruggedized for their file system. Off-the-shelf flash drives need to support general-purpose applications, requiring more complex firmware.

So, while the average replacement rate of flash drives is low, it can vary greatly by the make and firmware version, despite rigorous testing by drive and system vendors.

Another factor that introduces risk for flash drives is that the industry is still in its infancy. Disk drives were invented 60 years ago, yet we are still learning new facts about their failure characteristics! For instance, another paper published at the same conference points to how relative humidity plays a big role in disk failures. In contrast, flash drives became popular only about 10 years ago, so we should expect to run into some surprises and road bumps over the next few years or even decades.

How can we deal with this uncertainty? At Nimble Storage, we are paranoid about reliability. While other systems employ dual parity RAID or triple mirroring (each of which is able to tolerate the failure of any two drives), Nimble systems employ triple parity RAID (which is able to tolerate the failure of any three drives).

In addition to triple parity, our All Flash arrays include a reserved spare. When a drive fails, the array is able to rebuild the failed drive without needing to wait for a replacement. This shrinks the window when reliability and performance are degraded. The spare is reserved so that it is always available, even when the system is full. (In some other systems, the spare space disappears as the system is filled with data. That is like tossing the spare tire when the truck is loaded.)

One might think that all this parity and sparing would hurt the usable capacity relative to a system that uses only dual parity without reserved spares. That would indeed be true if the RAID group in the two systems had the same number of drives. But we leverage the higher degree of protection from triple parity and reserved sparing to support a wider RAID group with more drives, which reduces the relative overhead. The net effect is a win-win: the system has higher reliability as well as higher usable capacity.
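The capacity argument can be checked with simple arithmetic. The sketch below is illustrative only: the group sizes (10 and 24 drives) are hypothetical numbers picked to show the effect, not actual Nimble configurations, and `usable_fraction` is a helper name introduced here.

```python
def usable_fraction(drives: int, parities: int, spares: int = 0) -> float:
    """Fraction of raw capacity left for data in one RAID group."""
    if drives <= parities + spares:
        raise ValueError("group must contain at least one data drive")
    return (drives - parities - spares) / drives


# Hypothetical group sizes, chosen only to illustrate the trade-off.
narrow_dual = usable_fraction(drives=10, parities=2)
wide_triple = usable_fraction(drives=24, parities=3, spares=1)

print(f"dual parity, 10 drives:           {narrow_dual:.1%}")
print(f"triple parity + spare, 24 drives: {wide_triple:.1%}")
```

With these assumed sizes, the wider group gives up four drives out of 24 while the narrower one gives up two out of 10, so the better-protected group also keeps a larger share of its raw capacity.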

Partial Data Loss

The biggest concern with flash drives is partial data loss from uncorrectable errors. The paper reports that a whopping 20% to 63% of flash drives lost some data in a four-year period, compared to only 3.5% of disk drives in a 2.5-year period based on a previous study.

An uncorrectable error happens when there are so many bit errors within a sector that ECC cannot recover it. Flash drives include a healthy dose of ECC within each sector, thus correcting many more bit errors than traditional ECC in disk drives. But the relentless drive towards smaller flash cell size and degradation from erase cycles and age result in a net increase in uncorrectable errors.

How can we compensate for uncorrectable errors? The same RAID parities that protect against whole drive failures also protect against partial data loss, because the lost data can be reconstructed from the other drives in the RAID group including the parities.

However, if we truly want to tolerate the failure of up to K drives and survive uncorrectable errors at the same time, the system needs K+1 parities. To see why, consider what happens when K drives have failed and the system has only K parities. The system rebuilds the failed drives by reading all of the data in the surviving drives. If any of the surviving drives harbors even a single uncorrectable error, the odds of which are high, the system cannot reconstruct the corresponding stripe.
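The K+1 argument comes down to counting missing pieces in a stripe. Here is a minimal sketch under two stated assumptions: that the parity code can rebuild a stripe as long as the number of missing pieces does not exceed the number of parities, and that drives develop unreadable sectors independently of each other. The per-drive probability used below is a made-up figure for illustration, not a number from the paper.

```python
def stripe_recoverable(parities: int, failed_drives: int, bad_sectors: int) -> bool:
    """A stripe can be rebuilt only if its missing pieces (from failed drives
    plus unreadable sectors on surviving drives) do not exceed the parities."""
    return failed_drives + bad_sectors <= parities


def p_rebuild_hits_error(surviving_drives: int, p_drive: float) -> float:
    """Chance that at least one surviving drive has an unreadable sector,
    assuming drives are independent (a simplification)."""
    return 1.0 - (1.0 - p_drive) ** surviving_drives


K = 3
print(stripe_recoverable(parities=K, failed_drives=K, bad_sectors=0))      # True
print(stripe_recoverable(parities=K, failed_drives=K, bad_sectors=1))      # False
print(stripe_recoverable(parities=K + 1, failed_drives=K, bad_sectors=1))  # True

# Hypothetical: 20 surviving drives, each with a 2% chance of a bad sector.
print(f"{p_rebuild_hits_error(20, 0.02):.1%}")
```

Even a small per-drive probability compounds quickly across a wide group, which is why the rebuild is the moment when a single uncorrectable error hurts most.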

An additional parity can solve the problem, but a RAID parity is an expensive way to reconstruct a few damaged sectors. Depending on the number of drives in the RAID group, each parity can reduce usable capacity by 5% to 15%.

A more efficient mechanism is to introduce what may be called “intra-drive” or “vertical” parity as opposed to the traditional “inter-drive” or “horizontal” parity. The system divides each drive into chunks, where each chunk includes, say, a hundred sectors. The system appends one or more parity sectors to each chunk, which can recover the loss of as many sectors within the chunk. This intra-drive parity uses much less space than inter-drive parity, typically about 1% of the raw capacity. The figure below shows a RAID stripe with both inter-drive and intra-drive parities.
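To make the idea concrete, here is a minimal sketch of intra-drive parity using a single XOR parity sector per chunk. This is an assumption made for illustration: the post does not say which code is used, and recovering more than one lost sector per chunk would need a stronger code (for example Reed-Solomon) than plain XOR. The tiny sector size and the function names are also hypothetical.

```python
from functools import reduce

SECTOR_SIZE = 16        # bytes; tiny for illustration, real sectors are 0.5KB to 4KB
SECTORS_PER_CHUNK = 100


def xor_sectors(sectors):
    """Byte-wise XOR of equal-length sectors."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))


def add_parity(chunk):
    """Return the chunk with one XOR parity sector appended."""
    return chunk + [xor_sectors(chunk)]


def recover(chunk_with_parity, lost_index):
    """Rebuild one lost sector by XOR-ing every other sector in the chunk."""
    survivors = [s for i, s in enumerate(chunk_with_parity) if i != lost_index]
    return xor_sectors(survivors)


chunk = [bytes([i] * SECTOR_SIZE) for i in range(SECTORS_PER_CHUNK)]
protected = add_parity(chunk)

assert recover(protected, 42) == chunk[42]   # sector 42 rebuilt from the rest
print(f"overhead: {1 / len(protected):.1%}")  # one parity sector out of 101
```

One parity sector per hundred data sectors costs roughly 1% of raw capacity, which matches the figure quoted above and is far cheaper than dedicating a whole extra drive to parity.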

Triple Plus

This is why we call our parity “triple plus”: there are three inter-drive parities and some intra-drive parity. There is also an additional reserved spare, which is not reflected in the name. Our marketing folks say we should call it “triple plus plus”. Really, we need industry-standard terminology for quantifying attributes such as inter-drive parities, intra-drive parities, and spares.

All this talk of parities presupposes that the loss of data is detectable in the first place. When a whole drive fails, the failure is relatively evident. When a sector is uncorrectable, the drive normally returns an error. But we know not to trust the drives to always catch the problem.

At Nimble, we have added system-level checksums to detect silent corruptions within drives. The system protects every block, whether data or metadata, whether on disk or flash, with a checksum and a self-id. Every time the system reads a block, it validates the checksum and the self-id. If either does not match, the system reconstructs the block from inter-drive and intra-drive parities. (The self-id catches a class of drive errors where reads or writes might be misdirected to a false address.)
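The read-path check can be sketched as follows. This is not Nimble's on-disk format: the header layout, the use of CRC32 as the checksum, and the function names `seal` and `verify` are all assumptions made for illustration. The point is only that a block carries both a checksum of its contents and the address it believes it lives at.

```python
import struct
import zlib

# Hypothetical header: 8-byte self-id (block address) + 4-byte CRC32 of the payload.
HEADER = struct.Struct("<QI")


def seal(address: int, payload: bytes) -> bytes:
    """Prefix the payload with its own address and checksum before writing."""
    return HEADER.pack(address, zlib.crc32(payload)) + payload


def verify(expected_address: int, stored: bytes) -> bytes:
    """Validate self-id and checksum on read; raise if either does not match."""
    address, crc = HEADER.unpack_from(stored)
    payload = stored[HEADER.size:]
    if address != expected_address:
        raise ValueError("self-id mismatch: misdirected read or write")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: silent corruption")
    return payload


block = seal(7, b"hello")
assert verify(7, block) == b"hello"
```

A caller that catches the `ValueError` would then fall back to rebuilding the block from parity. Note that the checksum alone would not catch a misdirected write, because a block written intact to the wrong address still has a valid checksum; that is the case the self-id covers.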

Together, the checksums and multiple parities provide strong reliability against whole drive failures and partial data loss.

Raw Bit Error Rates and Uncorrectable Errors

One would expect RBER and uncorrectable errors to be correlated. After all, an uncorrectable error happens when there are so many bit errors within a sector that ECC cannot recover it.

Yet, the study on Google drives found no correlation between the two! The authors sliced and diced the information in ten different ways and still found no correlation.

A plausible explanation lies in the way RBER is calculated, which is also how the study calculated it: sum up the number of bit errors across all sectors read and divide by the total number of bits read. The problem is that this gives the average number of bit errors across all sectors, when what matters more is how those bit errors are distributed. If the bit errors are distributed somewhat uniformly across all sectors, ECC has a good chance of correcting every sector. But if some sectors have a higher concentration of bit errors, ECC might not be able to correct them.
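A toy example shows how two drives can report the same RBER yet differ completely in uncorrectable errors. The per-sector error counts and the ECC limit of 40 bit errors are invented for illustration; real drives and real ECC strengths vary.

```python
BITS_PER_SECTOR = 8 * 1024  # 1KB sector
ECC_LIMIT = 40              # hypothetical: ECC corrects up to 40 bit errors per sector


def rber(errors_per_sector):
    """Total bit errors divided by total bits read."""
    return sum(errors_per_sector) / (len(errors_per_sector) * BITS_PER_SECTOR)


def uncorrectable(errors_per_sector):
    """Number of sectors whose bit errors exceed what ECC can correct."""
    return sum(1 for e in errors_per_sector if e > ECC_LIMIT)


uniform = [1] * 1000                 # 1000 bit errors spread evenly
clustered = [0] * 990 + [100] * 10   # the same 1000 bit errors packed into 10 sectors

assert rber(uniform) == rber(clustered)
print(uncorrectable(uniform), uncorrectable(clustered))  # 0 10
```

Both drives have an identical average, so RBER alone cannot tell them apart, yet only the second one loses data.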

Specifically, the paper reports that the 99th percentile RBER across all drives is on the order of 1/10^7 to 1/10^5. Even at the high end of the range, the average number of bit errors in a 1KB sector is only about 1/10 (roughly 8,000 bits times 1/10^5), which means that if the bit errors were evenly spread, only about one in 10 sectors would have a single bit error. On the other hand, the ECC in modern flash drives can correct many tens of bit errors within each sector. No wonder then that the average number of bit errors is not correlated with the rate of uncorrectable errors.
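The arithmetic can be pushed one step further. If bit errors really were spread evenly and independently, the number of errors in a sector would be roughly Poisson-distributed, and the chance of exceeding the ECC limit would be vanishingly small. The sketch below assumes that model and a hypothetical ECC limit of 40 bit errors per sector; neither comes from the paper.

```python
import math


def expected_errors(rber: float, sector_bytes: int) -> float:
    """Average number of bit errors per sector at a given RBER."""
    return rber * sector_bytes * 8


def p_more_than(limit: int, lam: float, terms: int = 200) -> float:
    """Poisson tail P(X > limit), summed term by term in log space so that
    tiny probabilities are not lost to round-off in 1 - cdf."""
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(limit + 1, limit + 1 + terms)
    )


lam = expected_errors(1e-5, 1024)  # high end of the reported RBER range, 1KB sector
print(f"average bit errors per sector: {lam:.3f}")
print(f"P(at least one bit error):     {p_more_than(0, lam):.3f}")
print(f"P(more than 40 bit errors):    {p_more_than(40, lam):.1e}")
```

Under this model an uncorrectable sector would essentially never occur, so the uncorrectable errors seen in practice must come from sectors whose error counts are far above the average.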

Here is an analogy. We expect places at higher altitude to have more snow. Now, imagine a world where each continent is mostly low and flat at some fixed altitude between, say, 0 and 100 meters (varying by the continent). The snowline is at about 1000 meters. The continents have some high mountains that rise above the snowline, but the mountains are so narrow and the continents so expansive that the mountains do not much alter the average altitude of the continent. One would find that there is not much correlation between the average altitude of the continent and the amount of snow it has. To uncover the correlation, one could measure the number of acres in different bands of altitude (e.g., 1-100m, 101-200m, 201-300m, etc.).

Similarly, in the world of flash drives, one could measure the number of sectors in different bands of bit errors (e.g., 1-10, 11-20, 21-30, etc.). Ideally, the drive should provide such a histogram as part of its S.M.A.R.T. interface. At the least, it should provide the number of sectors in the “danger zone”, the band just below the threshold number of bit errors that ECC is unable to recover from. Then we will be able to study how the distribution of bit errors changes with factors such as erase cycles and age, and thereby predict the incidence of uncorrectable errors. I discussed this possibility with the authors of the paper after it was presented at USENIX FAST 2016, and they seemed to agree.
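Such a histogram is easy to compute if per-sector bit error counts are available. The sketch below assumes they are, which is exactly what drives do not expose today; the function names, the band width of 10, and the ECC limit of 40 are hypothetical.

```python
from collections import Counter


def band_histogram(errors_per_sector, band_width=10):
    """Count sectors per band of bit errors: (1, 10), (11, 20), and so on.
    Sectors with no bit errors are counted under (0, 0)."""
    hist = Counter()
    for e in errors_per_sector:
        if e == 0:
            hist[(0, 0)] += 1
        else:
            low = ((e - 1) // band_width) * band_width + 1
            hist[(low, low + band_width - 1)] += 1
    return dict(sorted(hist.items()))


def danger_zone(errors_per_sector, ecc_limit, margin=10):
    """Sectors still correctable but within `margin` bit errors of the ECC limit."""
    return sum(1 for e in errors_per_sector if ecc_limit - margin < e <= ecc_limit)


sample = [0, 0, 1, 10, 11, 35, 40, 41]
print(band_histogram(sample))
print(danger_zone(sample, ecc_limit=40))  # 2 sectors (35 and 40 bit errors)
```

Tracking the danger-zone count over erase cycles and age would give the kind of leading indicator that the average RBER fails to provide.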

The paper is a huge step forward in understanding the failure characteristics of flash drives. Based on the history of disk drives, we should expect to learn new facts about flash for many years to come.