Understanding Storage Capacity
by Umesh Maheshwari – Co-founder and CTO
Storage capacity is an important factor when deploying infrastructure in the data center. Storage vendors describe capacity in different ways, and it’s essential that you know how to interpret them properly – both for sizing the infrastructure your organization really needs and for comparing the cost of capacity ($/GB) across different storage systems.
However, interpreting the capacity of modern storage systems can be surprisingly tricky. Different vendors use wildly different metrics to advertise the capacity of their systems, including:
- Raw capacity. This is the total capacity of the storage media in the system. For example, if the system contains 20 drives of 5TB each, the raw capacity is 100TB.
- Usable capacity. This is how much data can be stored in the system in the absence of any data reduction. Usable capacity is lower than raw capacity because of overheads such as RAID and flash over-provisioning.
- Effective capacity. This is how much data can be stored in the system assuming some amount of data reduction using techniques such as compression and deduplication. Effective capacity is generally higher than usable capacity.
Using the term “usable capacity” to mean capacity before data reduction and “effective capacity” to mean capacity after data reduction is a useful convention for storage vendors as well as users. It’s important to differentiate between the two metrics, because usable capacity is highly predictable and can be relied upon, while effective capacity varies based on the data and should be used with caution.
So, how should one use these metrics?
Raw capacity is of little use. Vendors sometimes price their systems based on raw capacity, but users should focus on usable capacity instead because usable capacity can be significantly lower than raw capacity.
A minor note: raw capacity is generally reported in decimal terabytes (10^12 bytes), while usable capacity may be reported in either decimal or binary terabytes (2^40 bytes). A binary terabyte, sometimes written as “TiB”, is about 10% larger than a decimal terabyte. Reporting usable capacity in binary terabytes can therefore make it look smaller, but that drop is an artifact of units; it is small compared to the real drop in usable capacity caused by actual system overheads.
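To see the unit effect concretely, here is a small Python sketch converting the same capacity between decimal and binary terabytes (the 100TB figure is just a hypothetical example):

```python
# The same number of bytes reads differently in decimal vs. binary
# terabytes (100 TB of raw capacity is a hypothetical example).
DECIMAL_TB = 10**12   # 1 TB
BINARY_TB = 2**40     # 1 TiB, about 10% larger than 1 TB

capacity_bytes = 100 * DECIMAL_TB

in_tb = capacity_bytes / DECIMAL_TB    # 100.0
in_tib = capacity_bytes / BINARY_TB    # about 90.9 -- smaller, but only on paper

print(f"{in_tb:.1f} TB = {in_tib:.1f} TiB")
```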
Usable Capacity (Without Data Reduction)
Usable capacity is lower than raw capacity because of system overheads such as RAID, flash over-provisioning, metadata, system software, system logs, and so on. For concreteness, let’s refer to the ratio of the usable capacity to the raw capacity as the “usable ratio”, which by definition is less than or equal to one.
usable_ratio = usable_capacity / raw_capacity
Typically, the usable ratio is most impacted by RAID and flash over-provisioning. (Over-provisioning is the opposite of thin-provisioning; it means that the underlying capacity is more than the advertised capacity.) Storage vendors over-provision flash capacity to reduce internal write amplification from garbage collection, and different vendors use different amounts of over-provisioning based on the needs of their system architecture. In the rest of this post, I will ignore flash over-provisioning and focus on RAID, because RAID overhead is more overt and predictable. But, whenever possible, users should consider the system-reported usable capacity that includes all system overheads.
The impact of RAID varies significantly based on whether it employs mirroring or parity. For example, triple-mirroring RAID reduces the usable ratio to 0.33. On the other hand, parity RAID provides a much higher usable ratio. For example, if there are 10 drives in a RAID group with dual parity, the usable ratio is (10-2)/10 or 0.8. Note that triple-mirroring RAID and dual-parity RAID are similar in fault tolerance because each can tolerate the failure of any two drives.
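The RAID arithmetic above can be sketched as two small helper functions (hypothetical names, written for illustration; real systems subtract further overheads such as flash over-provisioning and metadata):

```python
# Rough usable-ratio estimates from RAID geometry alone. Illustrative
# only: real systems have additional overheads beyond RAID.
def mirror_usable_ratio(copies: int) -> float:
    """N-way mirroring keeps `copies` replicas of every block."""
    return 1.0 / copies

def parity_usable_ratio(group_size: int, parity_drives: int) -> float:
    """Parity RAID devotes `parity_drives` of each `group_size`-drive group to parity."""
    return (group_size - parity_drives) / group_size

print(round(mirror_usable_ratio(3), 2))   # triple mirroring: 0.33
print(parity_usable_ratio(10, 2))         # 10-drive dual parity: 0.8
```

Both configurations tolerate any two drive failures, yet their usable ratios differ by more than a factor of two.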
Some storage systems expose a choice of different RAID levels to the user because their internal design causes parity RAID to slow down random writes. Sometimes the slowdown is so severe that, while the system supports parity RAID in concept, users are effectively forced to choose mirroring RAID for most applications. Worse, the system might advertise high usable capacity (assuming parity RAID) and high performance (assuming mirroring RAID), even though applications cannot get both at the same time.
Some systems employ RAID within a node, some employ it across multiple nodes, and some employ it both within a node and across nodes. When employed across nodes, RAID is often called by other names: mirroring RAID may be called “remote mirroring”, and parity RAID may be called “erasure coding”. Regardless of whether RAID is employed within a node or across nodes, parity RAID provides a higher usable ratio than mirroring RAID. However, the performance impact of parity is amplified when used across nodes, so most systems employing RAID across nodes stick with mirroring, except possibly for archiving-style applications that perform mostly sequential writes.
Effective Capacity (With Data Reduction)
Data reduction refers to techniques such as compression and deduplication that reduce the amount of space used by a dataset. The ratio of the space used by a dataset without any reduction to the space used by the dataset after reduction is called the “data reduction rate” or “data reduction ratio”, which by definition is greater than or equal to one. If a system employs multiple techniques for data reduction, the total reduction ratio is the product of the technique-specific reduction ratios. For example, if compression reduces space usage by 2x and deduplication reduces space usage by 2.5x, the total reduction ratio is 2×2.5 or 5.
While RAID and other system overheads reduce the usable capacity, data reduction techniques increase the effective capacity. Specifically, the effective capacity of a system is the product of its raw capacity, usable ratio, and data reduction ratio.
effective_capacity = raw_capacity * usable_ratio * data_reduction_ratio
The overall storage efficiency of a system can be quantified as the ratio of its effective capacity to the raw capacity. We can call it the “effective ratio”, and it is equal to the product of the usable ratio and the data reduction ratio. For example, if the usable ratio is 0.8 and the data reduction ratio is 5, the effective ratio is 0.8×5 or 4.
effective_ratio = effective_capacity / raw_capacity = usable_ratio * data_reduction_ratio
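The formulas above can be tied together in a short sketch using hypothetical numbers: 100TB raw, dual-parity RAID, 2x compression, and 2.5x deduplication.

```python
# Capacity formulas combined, with hypothetical numbers for illustration.
raw_capacity = 100.0               # TB

usable_ratio = 0.8                 # e.g., 10-drive dual-parity RAID
compression_ratio = 2.0
dedup_ratio = 2.5
data_reduction_ratio = compression_ratio * dedup_ratio       # 5.0

usable_capacity = raw_capacity * usable_ratio                # 80 TB
effective_capacity = usable_capacity * data_reduction_ratio  # 400 TB
effective_ratio = effective_capacity / raw_capacity          # 4.0
```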
While the usable capacity of a system is generally well known, its effective capacity is generally unpredictable. Most systems report the usable capacity out of the box; otherwise the usable capacity can be predicted reasonably accurately as the product of the raw capacity and the usable ratio, and the usable ratio can be predicted reasonably accurately based on the RAID configuration. On the other hand, the data reduction ratio is far less predictable and also less intuitive than the usable ratio, and it makes the effective ratio unpredictable. There are several reasons for this.
First, the data reduction ratio depends largely on the reducibility of the specific datasets stored on the system. Both compressibility and dedupability vary significantly across datasets, but dedupability is the more fickle of the two. This is because compression is a “local” function, generally working over a window of less than 1MB at a time. Therefore, compressibility can often be predicted based on the type of data: executables generally compress by 1.5x, English text by 2x, and relational databases by 2x to 6x. Audio, images, and videos typically don’t compress because they are already compressed by the application.
On the other hand, deduplication is a more “global” function, generally working over a large set of blocks or objects. Thus, it depends not only on the type of data, but the actual set of data, varying from instance to instance for the same application and also varying over time. For instance, in a system storing many virtual machine images, the deduplication ratio depends on the number of similar images, and this ratio can change over time if the images are updated differently.
Second, if multiple datasets are stored on the same system, the overall reduction ratio is not a simple average of the individual reduction ratios. Instead, it is dominated by datasets that are larger and datasets that do not reduce well. Consider a system that stores 10TB of virtual machine images with a reduction ratio of 20, and 100TB of databases with a reduction ratio of 2. The overall reduction ratio will be dominated by the databases—specifically, it will be (10+100)/(10/20+100/2) or 2.2.
Even if two datasets are the same size but have different reduction ratios, the overall reduction ratio is closer to the lower reduction ratio. Consider a system that stores 10TB of virtual machine images with a reduction ratio of 20 and 10TB of databases with a reduction ratio of 2. One might intuit that the overall reduction ratio will be the mean of 20 and 2, or 11. In reality, the overall reduction ratio will be much lower—at (10+10)/(10/20+10/2), or 3.6. Mathematically, the overall reduction ratio is the harmonic mean, not the arithmetic mean, of the dataset-specific reduction ratios weighted by the dataset size. Thus, having a very high reduction ratio on one dataset has limited benefit to the overall reduction ratio in the presence of other datasets that do not reduce so well. This effect is an example of Amdahl’s Law, which predicts the improvement to a multi-part system when different parts are improved to different levels. (Here’s another example. If my commute involves 5 miles of city streets where I drive at 20 miles per hour and 5 miles of highway where I drive at 50 miles per hour, increasing my speed to 100 miles per hour on the highway is not going to make a big improvement to my commute time.)
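The size-weighted harmonic mean above can be sketched as a small Python function, using the hypothetical datasets from the two examples:

```python
# Overall reduction ratio = size-weighted harmonic mean of per-dataset ratios.
def overall_reduction_ratio(datasets):
    """`datasets` holds (logical_size, reduction_ratio) pairs.

    Total logical size divided by total physical size after reduction.
    """
    logical = sum(size for size, _ in datasets)
    physical = sum(size / ratio for size, ratio in datasets)
    return logical / physical

# 10 TB of VM images at 20x plus 100 TB of databases at 2x: ~2.2 overall.
print(round(overall_reduction_ratio([(10, 20), (100, 2)]), 1))
# Equal sizes, 20x and 2x: ~3.6 overall, far below the arithmetic mean of 11.
print(round(overall_reduction_ratio([(10, 20), (10, 2)]), 1))
```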
Third, storage systems vary in their ability to realize data reduction with sustainable performance. For instance, some systems are able to compress data, but suffer such a big drop in random-write performance that their vendors recommend turning compression off for applications such as transactional databases – which is unfortunate, because databases compress very well. Similarly, some systems are able to deduplicate data inline, but are unable to sustain performance over long ingests of data, causing the system to either stop or defer deduplication. Other systems attempt to sustain performance by keeping the index needed for deduplication in DRAM, but that puts a limit on the storage capacity they can support.
Finally, there are additional techniques, besides compression and deduplication, that can reduce space usage significantly and thereby increase the effective capacity. Different storage systems support these techniques with different effectiveness. Even systems that support them well might or might not count them as “data reduction”, making it difficult to compare the data reduction ratios reported by different systems. These techniques include the following:
- Zero-block pruning – the system does not store blocks that are filled with zeroes. This technique can be seen as an extreme case of either compression or deduplication. Also, some systems generalize this technique to avoid storing blocks that are filled with any repetitive byte pattern.
- Thin clones – the system shares blocks between clones of a data volume. Thin clones can be implemented through either address-based sharing or content-based sharing, where the latter is similar to deduplication. (Address-based sharing is generally more efficient, while deduplication is more flexible.) Systems implementing thin clones through content-based sharing typically include the savings from thin clones in the deduplication ratio. Systems implementing thin clones through address-based sharing might or might not include the savings from thin clones in the reported data reduction ratio. Such systems might appear to have a lower data reduction ratio, even though they use no more space than systems implementing thin clones through deduplication.
- Thin provisioning – the system does not consume space for blocks that were never written or that have been explicitly unmapped. Some systems include the savings in the reported data reduction ratio, some don’t.
- Thin snapshots – the system shares blocks between multiple versions of a data volume. Most storage systems do not count savings from thin snapshots in the reported data reduction ratio, even though the savings are significant and can vary greatly from system to system.
Modern storage systems involve a complex tradeoff between capacity, reliability, and performance, so in addition to storage capacity one must also be aware of the reliability and performance characteristics of a system.
Here are my recommendations:
1. Ignore the raw capacity, because it is of little use to you.
2. Start with the usable capacity. It forms a reliable basis for how much data you can store on the system regardless of data reduction. This is an important step because the benefit of some data reduction techniques can vary wildly across datasets and over time.
   a. Look at the usable capacity reported by the live system, because the live system is likely to account for all system overheads whereas a datasheet might not.
   b. Some storage systems report only raw and effective capacities, avoiding any mention of usable capacity, perhaps because it is the lowest of the three. 🙂 In this case, you can estimate the usable capacity in one of two ways. First, you can multiply the raw capacity by the usable ratio, which you can estimate based on the RAID configuration, although you would miss other overheads. Second, you can divide the effective capacity by the assumed data reduction ratio, which the vendor may have provided as a footnote.
   c. If the storage system supports different kinds of RAID, ask the vendor about the usable capacity with the specific kind of RAID you would be using for your applications and performance goals. Be wary of systems whose usable capacity is advertised assuming parity RAID and whose performance is advertised assuming mirroring RAID.
   d. You can now compute the cost of usable capacity ($/GB) by dividing the system cost by this usable capacity. Why is this a useful metric? See step 3d below.
3. Next, estimate the savings from data reduction to determine the potential upside for effective capacity.
   a. Take any “assumed” or “average” reduction ratio advertised by the vendor with a grain of salt. Assumed ratios might be based on datasets that do not represent your datasets, and average ratios might be inflated if they are not weighted in accordance with your dataset sizes or if they are calculated as the arithmetic mean instead of the harmonic mean.
   b. Better: estimate data reduction on your real datasets. Some vendors provide tools for such estimation. If you have multiple datasets, focus on the ones that are larger and the ones that do not reduce well, because the overall reduction ratio will be dominated by those datasets. If in doubt, estimate the overall reduction ratio as the size-weighted harmonic mean of the individual ratios. (Ideally, the vendor-provided tool would do this for you.)
   c. You can now compute the effective capacity of the system for your datasets by multiplying the usable capacity by your overall data reduction ratio. You can also compute the cost of your effective capacity ($/GB) by dividing the system cost by this effective capacity.
   d. Best: try storing your real datasets on the storage system. If you are comparing multiple storage systems, compare the amount of space used, not the data reduction ratio reported by each system. Reason: some systems might include savings from techniques such as cloning and thin provisioning while other systems do not. Also, the ratio does not account for space used by snapshots. What matters in the end is the amount of space actually used. To compute the actual cost of storing your datasets, multiply the cost of usable capacity (as computed in step 2d) by the amount of space actually used.
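As a rough sketch of backing out usable capacity and its cost when a vendor advertises only raw and effective capacities (all numbers hypothetical, chosen to be self-consistent):

```python
# Estimating usable capacity from vendor-advertised numbers.
raw_tb = 100.0
advertised_effective_tb = 400.0
footnote_reduction_ratio = 5.0   # the vendor's assumed data reduction
raid_usable_ratio = 0.8          # e.g., 10-drive dual-parity RAID

# Way 1: raw capacity times the RAID-based usable ratio
# (misses non-RAID overheads, so it is an upper bound).
usable_from_raid_tb = raw_tb * raid_usable_ratio                              # 80 TB

# Way 2: advertised effective capacity divided by the assumed reduction ratio.
usable_from_effective_tb = advertised_effective_tb / footnote_reduction_ratio # 80 TB

# Cost of usable capacity in $/GB (hypothetical $200,000 system price).
system_cost_usd = 200_000.0
cost_per_usable_gb = system_cost_usd / (usable_from_raid_tb * 1000)           # $2.50/GB
```

If the two estimates disagree substantially, that gap is itself worth asking the vendor about.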
As you can see, storage capacity is a nuanced subject. But paying close attention in this area will enable you to select storage systems and size them optimally to match your requirements and budget.
- Umesh Maheshwari