Running Performance Benchmarks on Dedupe Capable Storage
By Stephen Daniel – Product Management

Increasingly, IT architects and leaders are evaluating and using storage arrays that are capable of data deduplication. For some workloads, dedupe technology can provide a substantial reduction in physical storage usage. However, dedupe technology comes at a cost. Depending on the implementation, dedupe may increase storage controller cost, reduce storage controller throughput, or limit the volume of storage a controller can manage. Furthermore, arrays vary in whether they reduce duplicate data in-line, as a post-processing step, or both. Some storage arrays are not able to catch and reduce all duplicate data.

For all of these reasons, IT teams need reliable ways of testing storage systems while utilizing dedupe technology. While there is a bewildering array of tools for measuring storage system performance, few, if any, do a good job of benchmarking while providing a controlled amount of data duplication. Providing controlled amounts of duplication turns out to be a much harder problem than most people would expect.

In fact, a close inspection of storage benchmarking tools shows that they’re useful for comparing some performance metrics, but when you do the math they simply can’t model dedupe effectively. For that you must rely on real-world data.

Problems Using Duplicated Data for Benchmarks

Designing a benchmark to deliver a controlled read/write ratio is pretty easy. Typically a random number generator is used to control the fraction of operations that are reads or writes. The most straightforward way to add data duplication to a benchmark extends this approach; a parameter controls the fraction of writes that should be duplicates.
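As a minimal sketch of that approach (the function name, parameters, and seed are my own illustration, not taken from any particular tool), one random draw decides read versus write, and a second draw flags a write as a duplicate:

```python
import random

def generate_ops(n_ops, write_fraction, dup_fraction, seed=0):
    """Yield (op, is_duplicate) tuples for a synthetic request stream."""
    rng = random.Random(seed)
    for _ in range(n_ops):
        if rng.random() < write_fraction:
            # A second draw decides whether this write is flagged as a duplicate.
            yield ("write", rng.random() < dup_fraction)
        else:
            yield ("read", False)

ops = list(generate_ops(100_000, write_fraction=0.5, dup_fraction=0.5))
dup_flags = [is_dup for op, is_dup in ops if op == "write"]
write_share = len(dup_flags) / len(ops)
dup_share = sum(dup_flags) / len(dup_flags)
print(f"write fraction:     {write_share:.3f}")
print(f"duplicate fraction: {dup_share:.3f}")
```

Both fractions land close to the requested values, but notice that the flag says nothing about which earlier contents a duplicate write should repeat.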

On reflection, this doesn’t fully specify the behavior of the benchmark. At one extreme, suppose every duplicate block that is written carries the same contents. Requesting 50% duplicate blocks would then generate approximately a 2:1 dedupe ratio, with half of the blocks having unique content and the other half all identical. At the other extreme, suppose every duplicated content is used exactly twice. Now half of the blocks contain unique content and the rest contain contents that appear exactly twice, so N blocks hold N/2 + N/4 distinct contents and the dedupe ratio is only about 1.33:1.

Worse, the distribution of keys on the storage array for these two examples is dramatically different. In the first case there is only one hash key that is ever used repeatedly. In the second case there are many, many hash keys used more than once. For any array that attempts to optimize deduplication by trading off perfect dedupe for smaller data structures, the difference between these two cases is profound.
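The two extremes can be counted directly. The sketch below is only an illustration: the block count and the content labels are hypothetical, and the labels stand in for the hash keys an array would compute. Counting "half unique, the rest in pairs" literally gives N / (N/2 + N/4), or about 1.33:1.

```python
from collections import Counter

def dedupe_ratio(contents):
    """Logical blocks divided by distinct block contents."""
    return len(contents) / len(set(contents))

def keys_used_repeatedly(contents):
    """Number of distinct contents (hash keys) that occur more than once."""
    return sum(1 for count in Counter(contents).values() if count > 1)

n = 100_000  # hypothetical number of blocks on the volume

# Case 1: half the blocks are unique, the other half all share one content.
case1 = [f"unique-{i}" for i in range(n // 2)] + ["same"] * (n // 2)

# Case 2: half the blocks are unique, the rest hold contents used exactly twice.
case2 = [f"unique-{i}" for i in range(n // 2)] + [f"pair-{i // 2}" for i in range(n // 2)]

for name, blocks in (("case 1", case1), ("case 2", case2)):
    print(f"{name}: ratio {dedupe_ratio(blocks):.2f}:1, "
          f"{keys_used_repeatedly(blocks)} keys used more than once")
```

The same 50% duplicate setting produces one repeated key in the first case and n/4 repeated keys in the second.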

The key difference between designing a benchmark to drive controlled data duplication and one to drive a controlled read/write ratio is this: the read/write ratio is a property of the stream of requests flowing to the storage array. The dedupe ratio is a property of the data on the array, and represents the sum of requests seen since the storage was initialized. As the example above shows, controlling the data duplication ratio on storage is far more complex than controlling the data duplication ratio in the stream of writes given to the storage.
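To make the distinction concrete, here is a toy simulation. It assumes random overwrites and assumes that duplicate writes repeat a uniformly chosen earlier content; neither assumption comes from a real benchmark tool. It compares the duplication ratio of the write stream with the ratio of what is actually left on the volume:

```python
import random

def simulate(volume_blocks, total_writes, dup_fraction, seed=0):
    """Return (stream_ratio, stored_ratio) for a randomly overwritten volume."""
    rng = random.Random(seed)
    volume = [None] * volume_blocks  # content currently stored at each address
    seen = []                        # every distinct content the stream produced
    for _ in range(total_writes):
        if seen and rng.random() < dup_fraction:
            content = rng.choice(seen)   # repeat something written earlier
        else:
            content = len(seen)          # brand-new content
            seen.append(content)
        volume[rng.randrange(volume_blocks)] = content  # random overwrite
    stored = [c for c in volume if c is not None]
    stream_ratio = total_writes / len(seen)
    stored_ratio = len(stored) / len(set(stored))
    return stream_ratio, stored_ratio

stream_ratio, stored_ratio = simulate(1_000, 20_000, dup_fraction=0.5)
print(f"ratio in the write stream: {stream_ratio:.2f}:1")
print(f"ratio on the volume:       {stored_ratio:.2f}:1")
```

In this model the stream is about 2:1 duplicated, yet the volume ends up far less duplicated, because most of the earlier copies that a duplicate write refers to have already been overwritten.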

Using Vdbench to Generate Duplicated Data

Perhaps the most popular benchmark tool that attempts to control dedupe ratios is Vdbench, available from Oracle. Vdbench is a command line utility that lets you generate disk I/O workloads to be used for validating storage performance and storage data integrity. Vdbench execution parameters may also be specified via an input text file.

Vdbench allows its users to configure a dedupe ratio. Unfortunately, under most circumstances this ratio is an upper bound on the delivered duplicate data ratio, not an exact specification.

Consider this graph:


Figure 1

This graph was generated as follows:

  • Create an uninitialized (empty) volume.
  • Initialize it using Vdbench, requesting a sequential write and a data duplication ratio of 4. This step generates the first point on the graph, where the data duplication ratio is exactly the requested 4x.
  • Run Vdbench at a controlled rate, performing 100% random writes, requesting a data duplication ratio of 4.
  • While Vdbench is running, periodically take point-in-time snapshots of the volume.
  • After the run, scan each snapshot and measure the data duplication ratio.
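The final scanning step can be sketched as follows. This is not the tool I used for the graph; the 4096-byte block size and the choice of SHA-256 fingerprints are assumptions made for illustration, and a real scan should use the array's own dedupe unit.

```python
import hashlib
import io

def measure_dedupe_ratio(stream, block_size=4096):
    """Return total blocks / distinct blocks for a readable binary stream."""
    total = 0
    fingerprints = set()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total += 1
        fingerprints.add(hashlib.sha256(block).digest())
    return total / len(fingerprints) if total else 1.0

# Tiny in-memory stand-in for a snapshot: 8 blocks, only 2 distinct contents.
snapshot = io.BytesIO((b"A" * 4096 + b"B" * 4096) * 4)
ratio = measure_dedupe_ratio(snapshot)
print(f"data duplication ratio: {ratio:.2f}:1")
```

For a real snapshot, the same function can be given a file or block device opened in binary mode in place of the in-memory buffer.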

To keep the experiment to a manageable length, I ran it on a 16 GiB volume at an I/O rate of 16K IOPS, capturing a snapshot every 16 seconds. A more realistic test might use a 2 TiB data set, run at 128K IOPS, and use a 50/50 read/write mixture.

I believe that a benchmark dataset will age at a rate proportional to the ratio of write-rate to dataset size. By this metric my experiment runs 32 times faster than the “realistic” example cited above. For this reason the time scale in the graph has been expanded by 32x.
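The 32x figure follows from the numbers above. The short calculation below uses a helper name of my own and treats the aging rate as writes per second per GiB of data set:

```python
def aging_rate(iops, write_fraction, dataset_gib):
    """Writes per second per GiB of data set."""
    return iops * write_fraction / dataset_gib

experiment = aging_rate(16_000, 1.0, 16)        # 16 GiB volume, 100% writes
realistic = aging_rate(128_000, 0.5, 2 * 1024)  # 2 TiB data set, 50/50 mix
speedup = experiment / realistic

print(f"experiment ages {speedup:.0f}x faster")
print(f"one 16 s snapshot interval = {16 * speedup:.0f} s at the realistic rate")
```

By this measure, each 16-second snapshot interval in my experiment stands for 512 seconds of the realistic workload.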

As we can see, configuring Vdbench to generate a 4x data duplication ratio doesn’t actually deliver the requested 4x data duplication.

One reason for this is straightforward. The first instance of Vdbench laid down a pattern using contents known to it, and the second used block contents known to the second run. As the contents delivered by the second run begin to replace the contents from the first run, the data duplication ratio falls. Once the dataset is largely overwritten by the second run the data duplication ratio begins to climb again.
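The dip and recovery can be reproduced with a toy model. It assumes that each run draws block contents from its own private pool whose size is the volume size divided by the requested ratio. That is a simplification of my own, not Vdbench's actual content generator, and it does not attempt to reproduce the asymptote Vdbench reaches.

```python
import random

def stored_ratio(volume):
    """Blocks on the volume divided by distinct contents on the volume."""
    return len(volume) / len(set(volume))

def two_run_model(n_blocks=4_000, requested=4, samples=8, seed=0):
    """Ratio on the volume after initialization and after each sample."""
    rng = random.Random(seed)
    pool = n_blocks // requested
    # Run A: sequential fill, cycling through its own pool of contents.
    volume = [("A", i % pool) for i in range(n_blocks)]
    ratios = [stored_ratio(volume)]
    # Run B: random overwrites using contents from a different pool.
    for _ in range(samples):
        for _ in range(n_blocks // 2):
            volume[rng.randrange(n_blocks)] = ("B", rng.randrange(pool))
        ratios.append(stored_ratio(volume))
    return ratios

ratios = two_run_model()
for i, r in enumerate(ratios):
    print(f"after {i * 0.5:.1f} volume-sizes of random writes: {r:.2f}:1")
```

While the volume holds a mix of both runs' contents, neither pool dedupes against the other, so the measured ratio sags and only recovers once the second run has overwritten most of the first.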

The time to get back to a reasonable data duplication ratio is equivalent to the time it takes to largely overwrite the entire data set using random writes. In short, we gained no savings by initializing the data set in the first place. Furthermore, every time we wish to run a new performance test we must either run it until the entire dataset has been overwritten, or give up on accurately maintaining any particular data duplication ratio.

The reason the run asymptotically approaches a data duplication ratio of 3.33:1 rather than the requested ratio of 4:1 is unique to Vdbench, and covered, if sparsely, in the Vdbench Users’ Guide.

Other Data-Duplicating Benchmarks

The only other benchmark I know of that produces controlled data duplication is the Storage Performance Council’s new version of SPC-1. SPC-1’s generated data duplication rate varies by less than 3% over long runs, proving the problem is solvable.

Unfortunately, SPC-1 is available to SPC members only, and is not easily tunable to data duplication ratios other than the built-in 1.5:1.

Using Synthetic Data to Measure Data Reduction

In addition to the operational problems that make it very difficult for a benchmark to deliver the desired data duplication ratio, there are other fundamental problems with using synthetic data to measure data reduction.

In real-world data sets, duplicate data appears in runs. If block X is a duplicate of block Y, then there is a high probability that block X+1 will be a duplicate of block Y+1. Furthermore, blocks X and X+1 are likely to have been written at about the same time, as are blocks Y and Y+1. We summarize this behavior by saying that duplicates “flock”, meaning they have good locality in both space and time.

This behavior makes sense. Duplicate data arrives on a storage system not by chance, but because objects or fragments of objects are stored multiple times. These objects are rarely a single block in size.
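A rough way to see the difference is to generate two synthetic streams, one where duplicates arrive as copied runs and one where each duplicate is chosen independently, and then measure how often the block after a duplicate is also a duplicate of the next block in the original. The generators, the run length of 8, and the scoring function below are all hypothetical illustrations, not a model of any real application.

```python
import random

def flocked_stream(n_blocks, dup_fraction, run_length, seed=0):
    """Contents where duplicates arrive as whole runs copied from earlier data."""
    rng = random.Random(seed)
    contents, next_id = [], 0
    while len(contents) < n_blocks:
        if len(contents) > run_length and rng.random() < dup_fraction:
            start = rng.randrange(len(contents) - run_length)
            contents.extend(contents[start:start + run_length])  # copy a run
        else:
            contents.extend(range(next_id, next_id + run_length))  # new object
            next_id += run_length
    return contents[:n_blocks]

def scattered_stream(n_blocks, dup_fraction, seed=0):
    """Contents where each duplicate is picked independently of its neighbors."""
    rng = random.Random(seed)
    contents, next_id = [], 0
    for _ in range(n_blocks):
        if contents and rng.random() < dup_fraction:
            contents.append(rng.choice(contents))
        else:
            contents.append(next_id)
            next_id += 1
    return contents

def flocking_score(contents):
    """Of blocks X duplicating an earlier Y, the share where X+1 matches Y+1."""
    first_seen = {}
    dup_blocks = followed = 0
    for x in range(len(contents) - 1):
        y = first_seen.setdefault(contents[x], x)
        if y != x:  # block x duplicates the earlier block y
            dup_blocks += 1
            if contents[x + 1] == contents[y + 1]:
                followed += 1
    return followed / dup_blocks if dup_blocks else 0.0

flocked = flocking_score(flocked_stream(20_000, 0.5, run_length=8))
scattered = flocking_score(scattered_stream(20_000, 0.5))
print(f"flocked duplicates:   {flocked:.2f}")
print(f"scattered duplicates: {scattered:.2f}")
```

Both streams are built with the same duplicate fraction, yet only the first has the neighbor-to-neighbor locality that a storage system could exploit.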

A clever storage system designer will take advantage of flocking to make the system more efficient. However, a user who attempts to measure how effectively a storage system can reduce data will be unable to reproduce this effect. The available tools for generating duplicate data do not generate flocks of duplicates. Optimizing for duplicates that flock is one of the key technologies that allow Nimble Storage all-flash arrays to scale dramatically deeper than other competing products.

Lacking a tool that generates appropriate flocking behavior, it is almost impossible to accurately measure the data reduction ability of a storage array with anything other than real-world datasets created and updated by real-world applications.

Key Lessons

Neither Vdbench nor any other publicly available tool can reliably generate an I/O stream that holds the storage system’s data duplication ratio constant across multiple runs. Users should not expect more than very approximate control over data duplication ratios, and should use caution in believing they can measure the impact of data duplication on performance.

Benchmarks do not write duplicate data to a storage array in a way that even approximately mimics real-world applications. For this reason users should expect that synthetic workloads will not generate a meaningful cross-platform comparison of the ability to detect and reduce duplicated data. Such comparisons will require real-world workloads.