Storage Performance, Capacity, and Reliability

The Way We Were

When 500GB and 1TB drives appeared on the horizon a little over a decade ago, storage vendors realized they had a problem. Existing single-parity RAID schemes (such as RAID 4 or RAID 5) would no longer reliably protect a shelf of drives. Many good write-ups already explain the rationale and the math (links are given further down in this post), so I’ll stick to a layman’s version:

  • Drive bit error rates were not getting better, but drive capacities (i.e. the number of bits) were growing rapidly each year. Multiply the two, and you can see how the expected number of errors per drive (and in turn the probability of bad sectors per drive) was growing.
  • Compounding this was the fact that larger drives took longer to rebuild from parity, leaving the “survivors” unprotected for a longer window.
  • A combination of the two factors above meant that the risk of data loss was predicted to be unacceptably high. This could be quantified in multiple ways, including metrics such as MTTDL (Mean Time to Data Loss). A better metric for the average IT pro is the probability of data loss for a RAID group of drives, within 5 years of use (a reasonable estimate of the practical life for a storage array). With single-parity RAID schemes, this was expected to approach a fraction of a percent with 1TB HDDs. This may not seem high until you consider that tens of thousands of new shelves are shipped in the storage market each year, meaning that a significant number would encounter data loss.
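To make the "multiply the two" arithmetic in the first bullet concrete, here is a minimal Python sketch. It assumes a bit error rate of 1 in 10^14 bits read (a commonly quoted spec for SATA-class drives, not a figure from this post), decimal terabytes, and independent errors; the function names and the 7-survivor group are made up for illustration. It covers only the read-error piece, not the full MTTDL-style models mentioned above.

```python
import math

ASSUMED_BER = 1e-14  # assumption: unrecoverable errors per bit read


def ure_probability(capacity_tb, ber=ASSUMED_BER):
    """Chance of at least one unrecoverable read error when reading a whole drive."""
    bits = capacity_tb * 1e12 * 8  # decimal TB -> bits
    # 1 - (1 - ber)^bits, computed with log1p/expm1 to stay accurate for tiny ber
    return -math.expm1(bits * math.log1p(-ber))


def rebuild_ure_probability(capacity_tb, surviving_drives, ber=ASSUMED_BER):
    """Chance of hitting a read error while reading every survivor during a rebuild."""
    return ure_probability(capacity_tb * surviving_drives, ber)


if __name__ == "__main__":
    # 7 surviving drives is a hypothetical group size, used only for illustration
    for tb in (0.5, 1, 4, 6):
        print(f"{tb} TB drive: {ure_probability(tb):.1%} per full read, "
              f"{rebuild_ure_probability(tb, 7):.1%} across 7 survivors")
```

Under these assumptions the per-drive probability grows roughly in proportion to capacity while the bit error rate stays fixed, which is the trend the bullets describe.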

The Traditional Approach: Choose Your Poison

The storage market did produce at least one solution to the problem: dual-parity RAID, better known as RAID 6. Most vendors started to offer it in their arrays, and many made it the recommended option for larger (SATA or Nearline SAS) drives. There was one problem: for most conventional architectures, RAID 6 carries a much bigger write performance penalty than RAID 4 or RAID 5 (see this explanation). In fact, RAID schemes for traditional architectures generally come with painful tradeoffs – you can optimize for performance, capacity usage, or reliability, but not all of them simultaneously. Unfortunately, these tradeoffs pushed customers to compromise and get by with the less robust RAID schemes. Anyone who has been in storage long enough has seen examples of customers losing data due to inadequate RAID protection (e.g. using RAID 5 on 1TB HDDs), a wholly preventable misfortune.
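The write penalty comes from the read-modify-write cycle that conventional parity RAID performs on a small (sub-stripe) write: read the old data block and each old parity block, then write the new data block and each new parity block. A rough sketch of that textbook I/O count follows; it ignores caching and controller optimizations, and the function name is illustrative.

```python
def small_write_ios(parity_blocks):
    """Disk I/Os for one small write using read-modify-write.

    Reads:  old data block + each old parity block.
    Writes: new data block + each new parity block.
    """
    if parity_blocks < 1:
        raise ValueError("parity RAID needs at least one parity block")
    reads = 1 + parity_blocks
    writes = 1 + parity_blocks
    return reads + writes


if __name__ == "__main__":
    for name, parity in (("RAID 4/5", 1), ("RAID 6", 2)):
        print(f"{name}: {small_write_ios(parity)} I/Os per small write")
```

By this count a single-parity small write costs 4 disk I/Os and a dual-parity one costs 6, which is why the extra parity block hurts write-heavy workloads on architectures that update stripes in place.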

Nimble Solution: The No-Compromise Solution

By contrast, Nimble has never required compromises. One of the unique attributes of our Cache Accelerated Sequential Layout (CASL™) architecture is that parity calculations are virtually free (no read-modify-write penalty). This has allowed us to offer a single (dual-parity) RAID scheme that is simultaneously optimized for performance, reliability, and capacity usage – no menus, no tradeoffs, and no pretense of gift wrapping ugly tradeoffs as “flexibility.”

Looking ahead, we anticipate widespread deployment of 4TB HDDs (starting later this year) and growing shipments of 6TB HDDs in 2015. When we started planning for these larger drives, we redid the math on the probability of data loss. Several good write-ups and models explain how to predict this (sadly, one is no longer active but, thanks to web archives, can be accessed here and here). We extended these models and concluded that we had to do something to strengthen parity protection; otherwise shelves with larger drives would experience an unacceptably high risk of data loss (approaching a tenth of a percent).

So we did. Starting with Nimble OS 2.1 (available now), every new system will ship with triple-parity RAID as the default RAID protection (anyone upgrading to the equivalent software release will be able to take advantage of it, too). CASL’s unique sequential layout allowed us to do this without adding any read-modify-write performance penalty (see earlier blog for performance numbers). Moreover, this higher level of protection still takes up only 20 percent of raw capacity in a disk expansion shelf, and maintains the fast rebuild times characteristic of CASL.
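Two pieces of arithmetic sit behind this paragraph. A layout that writes whole stripes at once can compute parity from data it already holds, so nothing has to be read back first; and the capacity cost of parity is simply parity drives divided by total drives. The sketch below illustrates both under stated assumptions: the 12-data-plus-3-parity group is hypothetical (chosen because it yields 20 percent; this post does not state the actual layout), and the functions are illustrative rather than a description of CASL internals.

```python
def parity_overhead(data_drives, parity_drives):
    """Fraction of raw capacity consumed by parity."""
    return parity_drives / (data_drives + parity_drives)


def full_stripe_writes_per_data_block(data_drives, parity_drives):
    """Disk writes per data block when a whole stripe is written at once.

    No reads are needed: parity is computed from the new data in memory.
    """
    return (data_drives + parity_drives) / data_drives


if __name__ == "__main__":
    data, parity = 12, 3  # hypothetical group width, not taken from the post
    print(f"parity overhead: {parity_overhead(data, parity):.0%}")
    print(f"writes per data block: "
          f"{full_stripe_writes_per_data_block(data, parity):.2f}")
```

The point of the comparison is that a third parity block adds a little write volume per stripe in a full-stripe layout, rather than extra read and write round trips on every small update.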

How reliable is triple-parity RAID? Although it’s hard to demonstrate reliability in real time, the improvement can be modeled statistically, just as it was for RAID 6. For 6TB HDDs, the risk of a RAID 6 shelf losing data (due to multiple correlated failures) is projected to approach 0.1 percent within 5 years of use. If that doesn’t sound high to you, think of it this way: roughly one shelf in every thousand would likely experience a data loss event within 5 years. Given that tens of thousands of shelves are sold each year, those are not great odds.
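To put a 0.1 percent per-shelf risk in fleet terms, here is a small sketch. The shelf count of 10,000 is a hypothetical stand-in for "tens of thousands," shelves are assumed to fail independently, and the function names are illustrative.

```python
def expected_loss_events(shelves, p_loss):
    """Expected number of shelves that lose data over the period."""
    return shelves * p_loss


def prob_any_loss(shelves, p_loss):
    """Probability that at least one shelf in the fleet loses data."""
    return 1 - (1 - p_loss) ** shelves


if __name__ == "__main__":
    shelves = 10_000   # hypothetical fleet size
    p_loss = 0.001     # 0.1 percent within 5 years, as projected above
    print(f"expected shelves with data loss: "
          f"{expected_loss_events(shelves, p_loss):.1f}")
    print(f"chance that at least one shelf loses data: "
          f"{prob_any_loss(shelves, p_loss):.4f}")
```

Under these assumptions a single year's shipments would be expected to produce around ten data loss events within five years, so at fleet scale some loss is close to certain.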

Triple-parity RAID reduces that risk by over 3 orders of magnitude, to well below the level of risk with 1TB HDDs and RAID 6, allowing you to safely take advantage of the falling cost per GB offered by upcoming generations of larger HDDs. Done right (as with CASL), this can be implemented without burdensome performance or space penalties. Safe, simple, and efficient – that’s the kind of compromise-free reliability everyone likes.