Busting the Myth of Storage Block Size
By David Adamson, PhD – Data Scientist
Determining the needs of a business application can be a daunting process. Different applications (and even different deployments of the same application) can vary significantly in terms of the demands they place on their underlying hardware. How can you reconcile real-world storage performance with conflicting guidelines and marketing claims?
Nimble’s InfoSight platform provides a unique vantage point into the storage demands of real-world applications. Other data storage companies may claim to have copied its capabilities, but they cannot match either the scope or the volume of the data we have collected. With InfoSight, we have analyzed hundreds of trillions of anonymized customer data points from thousands of customer applications in the past year alone. Our sensors not only monitor the operational health of our arrays, but also characterize real-world application needs at very fine granularity.
In 2015, we used InfoSight to analyze, in ensemble, the IO requests of the applications running across our installed base and presented our results. That post has become one of our most widely read and referenced blogs. We have now updated and extended the analysis so that we not only show how common applications differ from one another in aggregate, but also how the IO requests from the same application vary from deployment to deployment.
We will be publishing comprehensive results of this analysis in an upcoming white paper, which includes deep dives into the needs of several common business applications (e.g. VDI, SQL Server, Oracle, Exchange, SharePoint, and others). Here we have excerpted some key highlights.
Don’t be Fooled by an Average
Part of the motivation for our original analysis was to disprove an inaccurate claim made by some vendors in the storage industry: that small transactional IO block sizes (e.g. 4 KB or 8 KB) are not representative of real-world applications. In fact, 4 KB remains the most prevalent IO block size in real-world workloads – so don’t be fooled by an average.
The flaw in their analysis is a misleading aggregation. It is true that production arrays whose overall average IO block size falls in the 4 KB to 8 KB range are rare (Fig 1 right), but concluding from this that applications do not perform a large amount of IO in the 4 KB to 8 KB range is quite wrong (Fig 1 left).
Figure 1: Left: Cumulative Distribution of IO Block Sizes from Real-World Applications. Right: A Misleading Histogram of Averages. Left: A month’s worth of user IOs from >7500 Nimble customers is sorted and counted by block size. The majority of the IO performed fell into the less-than-16 KB range (52% of all read operations, 74% of all write operations). A second, smaller population of IO fell into the 64 KB+ range (31% of read operations, 14% of write operations), but it should be noted that those large-block operations carried the vast majority of the data (84% of all data read and 72% of all data written). Right: When the same month’s user IO is averaged for each array, the detailed behavior of each array is lost, consolidated into a single number per system. With every array doing some mixture of small and large IO, this average smears the small together with the large and gives a value somewhere in between. A distribution of these per-array averages is easy to misinterpret. On its own, it might be taken to suggest that there isn’t much activity in the [4-8) KB range, but we know from the left panel that this just isn’t true; 23% of reads and 43% of writes are performed in that interval. Note that square brackets [] indicate inclusive interval boundaries while parentheses () indicate exclusive interval boundaries; as an example, the interval [4-8) KB includes all IO sized greater than or equal to 4 KB and strictly less than 8 KB.
So what happened? The problem here is one that anyone working with data encounters all the time: the average of a distribution does not necessarily look anything like the values in the distribution itself. This disparity is greatest when:
- the distribution contains a small set of values that are very large, and/or
- the distribution is bimodal (i.e. contains one concentration of small values and a second concentration of larger values that are well separated from one another).
As we see in Fig 1 left, the distribution of real world IO block sizes exhibits both of these properties.
As an analogy, one can imagine a city block with 10 single-story buildings and a single 100-story skyscraper. The average height of a building on the block is 10 stories – even though no building on the block comes within a factor of ten of that average. Rather than give the average, it would be more informative to say that >90% of the buildings are one story tall but >90% of the floors are in buildings with at least 100 stories.
Situations like this one are why we frequently hear about median home prices rather than average home prices: a few very expensive houses skew the average, making it unrepresentative of the distribution. That skewed average will be significantly greater than what the typical homebuyer would pay.
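The building analogy above can be checked in a few lines. This is a toy illustration (the numbers come straight from the analogy, not from any array data) of how the mean of a skewed, bimodal set of values can land far from every value in it, while the median stays representative:

```python
# Ten single-story buildings and one 100-story skyscraper,
# as in the city-block analogy.
from statistics import mean, median

heights = [1] * 10 + [100]

avg = mean(heights)    # 110 floors / 11 buildings = 10.0
med = median(heights)  # the typical building is 1 story

print(f"average height: {avg} stories")  # 10.0
print(f"median height:  {med} stories")  # 1
# No building on the block is anywhere close to the 10-story average.
```

The same arithmetic is what makes per-array average block sizes land in the sparsely populated middle of the real IO distribution.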
Why a Bimodal Block Size Distribution Makes Sense
Our analysis shows that user IO contains separate small block and large block groups – confirming that benchmark testing at small block sizes and very large ones is an important part of performance testing. To many this is likely unsurprising, because from first principles, there are really two fundamental ways of characterizing performance:
- When transaction efficiency is the priority – we measure performance by the number of transactions that can be completed in a period of time (e.g. in IO/sec) and the latency of those individual transactions (e.g. in milliseconds).
- In contrast, when data transfer efficiency is the priority – we measure performance in throughput (e.g. in MB/sec).
Increasing transaction efficiency requires a trade-off with transfer efficiency and vice versa (Fig 2). As an example, more transactions are possible when the transaction payload is smaller (i.e. IO/sec is higher when IO size is smaller) while, in contrast, aggregating data into fewer transactions improves data throughput (i.e. MB/sec is higher when IO size is larger).
Figure 2: Two limiting cases for performance. Left: Transaction efficiency (IO/sec) improves as IO size decreases. Right: Transfer efficiency (MB/sec) improves as IO size increases.
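The tradeoff in Fig 2 can be sketched with a simple service-time model. The model below (a fixed per-operation overhead plus a size-proportional transfer time) and its constants are illustrative assumptions, not measurements from any real array, but they reproduce the two limiting behaviors: IO/sec falls and MB/sec rises as block size grows.

```python
# Toy service-time model: each IO pays a fixed overhead plus transfer time.
# Both constants below are assumed values for illustration only.
PER_IO_OVERHEAD_S = 100e-6   # 100 microseconds of fixed cost per operation
LINK_BYTES_PER_S = 1e9       # 1 GB/s of raw transfer bandwidth

def iops_and_throughput(block_bytes: int) -> tuple[float, float]:
    """Return (IO/sec, MB/sec) for a given block size under the toy model."""
    latency_s = PER_IO_OVERHEAD_S + block_bytes / LINK_BYTES_PER_S
    iops = 1.0 / latency_s
    mb_per_s = iops * block_bytes / 1e6
    return iops, mb_per_s

for kb in (4, 64, 1024):
    iops, mbps = iops_and_throughput(kb * 1024)
    print(f"{kb:5d} KB blocks: {iops:9.0f} IO/s, {mbps:7.1f} MB/s")
# Smaller blocks maximize transactions per second;
# larger blocks maximize data transferred per second.
```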
“Split the Difference” vs. “Divide and Conquer”
Because of this tradeoff, it is easy to understand why IO block sizes might generally divide into two camps. Applications can be thought of as following a “divide and conquer” principle: splitting their interactions with storage into two categories, each separately optimized for either transaction performance or data transfer performance. A distribution of average IO sizes, in contrast, paints a very different picture: misread as a distribution of actual IO block sizes, it tells a story in which application developers “split the difference” between transaction-centric and data-transfer-centric performance, performing few operations optimized specifically for either.
The big picture: block size distributions from individual deployments
While it is clear that, overall, the “divide and conquer” principle has won out, we don’t want to assume that just because the overall block size distribution divides strongly into two categories, the block size distribution of each deployed application does as well. To determine whether that is the case, we need to look at the block size distributions from distinct customer deployments individually.
Density Maps 101 (To Explain the Charts Below)
In order to informatively summarize the block size distribution from individual customer deployments, we visually represent each deployment as a point on the planes below (Fig 3). Because many points tend to cluster – and we don’t want to give excessive weight to outliers – we create a density map of the deployments to visually show where they congregate with respect to the chosen axes.
Figure 3: Interpreting Density Maps: A Summary of Deployment-to-Deployment Variability. Left: An individual point indicates the aggregate x-axis and y-axis values from a single application deployment. In this view, points can cluster on top of one another and make it difficult to determine where deployments concentrate most. Center: By converting the leftmost plot to a density map, the ambiguity resulting from overlaying points is removed and darker regions show where deployments are concentrated. Histograms on the top and right side of the panel show how deployments are distributed relative to the x- and y-axis dimensions respectively. Right: to allow the density map to be interpreted quantitatively, we add three contour lines to each density map; the innermost contour contains 25% of the observed deployments, the middle contour contains 50%, and the outermost contains 75%. We note that for some applications, one or more contours may be split across multiple regions of the density map; because the contours are drawn at locations of equal density – this splitting occurs when the deployments for that application cluster into distinct regions.
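One way to compute contour levels like those in Fig 3 is the “highest-density region” approach: bin the points, sort the bins from densest to sparsest, and find the density threshold at which the cumulative share of points first reaches 25%, 50%, or 75%. The sketch below uses synthetic two-cluster data (not real deployment coordinates) to illustrate the idea; this is one plausible construction, not necessarily the exact method used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic deployment coordinates: two clusters, mimicking a bimodal map.
x = np.concatenate([rng.normal(0.2, 0.05, 500), rng.normal(0.8, 0.05, 500)])
y = np.concatenate([rng.normal(0.7, 0.05, 500), rng.normal(0.3, 0.05, 500)])

counts, _, _ = np.histogram2d(x, y, bins=50, range=[[0, 1], [0, 1]])

def hdr_threshold(counts: np.ndarray, fraction: float) -> float:
    """Smallest density level whose super-level set holds `fraction` of points."""
    flat = np.sort(counts.ravel())[::-1]          # densest bins first
    cum = np.cumsum(flat) / flat.sum()            # cumulative share of points
    return flat[np.searchsorted(cum, fraction)]   # density at the cutoff

for frac in (0.25, 0.50, 0.75):
    level = hdr_threshold(counts, frac)
    inside = counts[counts >= level].sum() / counts.sum()
    print(f"contour at density {level:.0f} encloses {inside:.0%} of deployments")
```

Because the contours are drawn at equal-density levels, a single level can enclose two separate regions when the points cluster bimodally – exactly the splitting described in the caption above.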
Don’t Benchmark in the Wrong Part of the Map
Having established this framework for visualizing individual deployments, we still need to meaningfully summarize the IO distribution for each deployment. We can quantify how optimized a workload is for transactional efficiency by the fraction of operations performed at small block sizes (e.g. <= 8 KB); likewise, we can quantify how optimized a workload is for data transfer efficiency by the proportion of data throughput performed at large block sizes (e.g. >= 64 KB). We have therefore chosen these two metrics as the coordinates used to map the location of each deployment we observe. With this mapping, workloads focused on data transfer sit near the upper left corner while workloads focused on transactions reside near the lower right (Fig 4 left). Similarly, workloads that divide their IO and throughput activities into small and large block IO respectively reside in the upper right quadrant, while workloads that concentrate both their IO and throughput activity at intermediate block sizes appear closer to the lower left. We have created two separate maps of our customer deployments: one for reads and one for writes. The intensity of the green color in Fig 4 indicates where observed deployments are most concentrated.
Figure 4: Real-World Application Deployments – Mapped. X-axes: the proportion of all operations (i.e. IOs) performed at block sizes less than 8 KB. Y-axes: the proportion of all data transferred (i.e. throughput) at block sizes larger than 64 KB. Left: indication of the significance of each quadrant. Center & Right: read & write data respectively. Green (real world case) density & contours indicate the proportion of deployments observed with the corresponding X- and Y- axis values as described in Fig 3. Orange (non-real world case) circles indicate where a deployment would appear if it were to exhibit an IO distribution corresponding to the histogram shown in Fig 1 right.
Given the bimodality of the cumulative block size distribution (Fig 1 left) it isn’t too surprising that most systems fall into the “transfer specialized”, “transaction specialized” or “divide & conquer” quadrants (Fig 4 center, right). The orange circles (Fig 4 center, right) show a hypothetical deployment with the incorrect block size distribution claimed by other vendors (Fig 1 right). As can be seen, these hypothetical deployments fall on the map at a location far away from where real-world deployments (green contours) congregate. In other words, not only does the average block size provide a distorted picture of how real-world deployments behave in general – there are also virtually no individual deployments that behave in the way that global average block size would describe.
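Given a per-deployment block-size histogram, the two map coordinates are straightforward to compute. The sketch below uses a made-up histogram (the counts are hypothetical, and the 8 KB / 64 KB cutoffs are the ones named above) to show how a single deployment lands in a quadrant of the map:

```python
# Hypothetical per-deployment histogram: block size in KB -> operation count.
histogram = {4: 60_000, 8: 20_000, 32: 5_000, 64: 10_000, 256: 5_000}

total_ops = sum(histogram.values())
total_kb = sum(size * n for size, n in histogram.items())

# X-axis: fraction of operations at small block sizes (<= 8 KB).
small_op_fraction = sum(
    n for size, n in histogram.items() if size <= 8) / total_ops

# Y-axis: fraction of data transferred at large block sizes (>= 64 KB).
large_data_fraction = sum(
    size * n for size, n in histogram.items() if size >= 64) / total_kb

print(f"x = {small_op_fraction:.2f}, y = {large_data_fraction:.2f}")
# High x AND high y places this deployment in the
# "divide and conquer" quadrant of the map.
```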
The Importance of Thorough Data Collection
It takes only four distinct measurements (read count, write count, MB read, MB written) from each array to compose the average – and ultimately incorrect – picture of IO block sizes (Fig 1 right). In contrast, constructing an accurate distribution (Fig 1 left) takes at least 24 distinct measurements (one for each histogram bin, separately for reads and writes). On top of that, our array software includes additional sensors to quantify the proportion of operations that align precisely with the histogram bin edges (information necessary to produce Fig 2). Properly assessing how real-world workloads behave takes a platform designed to collect a very thorough ensemble of well-selected sensor types. As this analysis has shown, a world view composed improperly from incomplete data is not just incomplete – it can be completely incorrect and misleading.
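The difference between the four-counter view and the histogram view can be made concrete. In the sketch below (all numbers are illustrative, not customer data), the four-counter aggregates produce an average block size that falls in a region where the array performs essentially no IO at all:

```python
# Per-bin histogram (block size KB -> operation count) for a
# hypothetical bimodal array: mostly small reads plus some large ones.
reads = {4: 50_000, 8: 20_000, 256: 8_000}

# The four-counter view keeps only these aggregates (per IO direction):
read_count = sum(reads.values())
kb_read = sum(size * n for size, n in reads.items())

avg_block_kb = kb_read / read_count
print(f"average read block size: {avg_block_kb:.1f} KB")

# The average lands where this array performs almost no IO:
ops_near_avg = sum(n for size, n in reads.items()
                   if avg_block_kb / 2 <= size <= avg_block_kb * 2)
print(f"ops within 2x of the average size: {ops_near_avg}")
```

The per-bin histogram preserves the two real modes; the four counters collapse them into a single number that describes neither.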
So What Should I Do?
In conclusion, real-world deployments do not “split the difference” – that is, they do not focus their transactional and data-transfer activity into intermediate block size operations. Instead, real-world applications perform transactional work using small block operations and transfer data using much larger blocks. Performance benchmarks should therefore reflect the way the work is actually done.
Since ~59% of operations have block sizes less than or equal to 8 KB, transactional performance should be quantified at those block sizes. Similarly, since 81% of data is transferred using block sizes of 64 KB or more, throughput performance should be measured using 64 KB+ operations. Because real-world applications are mixtures of both of these activities, it is important to take both small block (<=8 KB) and large block (>=64 KB) performance into account when comparing storage arrays.