Whenever an enterprise or a cloud service provider deploys a Nimble Storage array, one of the first questions they ask is usually: How much cache do I need? Often followed by: And how do I figure out how much cache I need, when different workloads put different requirements on the cache?
Great questions, because it’s true: you really do need to characterize each workload in order to properly size a cache. Our research in this area has revealed some interesting things about how customers use their storage. As Nimble’s chief data scientist, I’ll share one of the many insights we’ve gained, but first, a little background.
Nimble Storage’s new InfoSight cloud-based service tells customers, as a percentage of their current cache, how much they really need in order to maintain good performance. For each installed array, a customer can pull up a chart in InfoSight that looks like this:
The left side of the chart shows that this example array had been running on empty through September and October of 2012 – it already needed a bit more cache, as cache utilization was hovering around the 100% mark, and well into the yellow. To make matters worse, around the end of November, the customer doubled their workload against this array, as seen on the right side of the chart. As a result, their cache was no longer delivering the performance benefits it should.
InfoSight now lets them know what kind of scale-to-fit upgrade would accommodate their new workload, so that they can reclaim the performance they had enjoyed around the beginning of 2012. This example illustrates the importance of dynamic cache resizing: with scale-to-fit, customers can continue to grow workload while maintaining the price-performance benefit of the Nimble solution.
How do we compute this figure? We need to understand the customer’s dynamic environment well enough to put a number to how much more cache is needed, or how much headroom is available.
To accomplish this feat, we have developed a model of how our cache works. We take the set of sensor data we’ve collected from an array – more than 30 million values per array per day, on average – and from this dataset we estimate values for the model parameters. Then we plug these estimates into the model and compute the necessary cache.
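To make the last step concrete, here is a toy sketch of what “plug the estimates into the model” could look like. This is an illustration under assumed parameters, not Nimble’s actual model: it supposes that estimation has yielded a random-access working-set size plus a footprint and period for each periodic component, and that the cache must hold the random working set plus every periodic footprint whose period fits within the target cache age.

```python
# Toy cache-sizing sketch (assumed model, not Nimble's): required cache
# is the random working set plus the footprint of each periodic
# component whose period fits inside the target cache age.

def required_cache_gb(random_ws_gb, periodic_components, target_cache_age_hours):
    """periodic_components: list of (footprint_gb, period_hours) tuples."""
    periodic_gb = sum(footprint for footprint, period in periodic_components
                      if period <= target_cache_age_hours)
    return random_ws_gb + periodic_gb

# Example: a 100 GB random working set, a daily job touching 40 GB,
# and a weekly job touching 20 GB. Sizing for a 24-hour cache age
# keeps the daily job's blocks cached, but not the weekly job's.
need = required_cache_gb(100, [(40, 24), (20, 168)], target_cache_age_hours=24)
current = 120  # hypothetical installed cache, GB
print(f"needed: {need} GB = {100 * need / current:.0f}% of current cache")
```

The ratio on the last line is the kind of “percentage of current cache” figure InfoSight reports.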
How detailed is this model? Well, for starters we recognize that some workload components are pretty much random, such as a large working set of blocks being touched over and over again in no particular order. We model a number of facets of this component.
Then there are the periodic components – workflows, cron and batch jobs, which deterministically touch the same blocks at particular times: at 8:00 a.m. each day, for example, a user logs in; at midnight, a data extract is run, and so on.
As it turns out, when it comes to sizing the cache, these periodic access patterns are as important to understand as the random working-set size. If your cache age (the average time data stays in cache before being evicted in favor of hotter data) is 24 hours, these periodic components may give you significantly more benefit than they would if your cache age were 23 hours and 59 minutes.
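A minimal simulation makes this cliff visible. The sketch below is my own illustration, not Nimble’s implementation: it assumes a simple age-based eviction rule (a block is gone once it has sat untouched longer than the cache age) and replays a nightly job that touches the same blocks every 24 hours.

```python
# Minimal sketch (assumed eviction rule, not Nimble's implementation):
# a re-access is a hit only if the block was last touched within the
# cache age; otherwise it has aged out of the cache.

def simulate_hits(accesses, cache_age_hours):
    """accesses: list of (time_hours, block_id), in time order."""
    last_seen = {}
    hits = 0
    for t, block in accesses:
        if block in last_seen and t - last_seen[block] <= cache_age_hours:
            hits += 1
        last_seen[block] = t
    return hits

# A nightly job touching blocks 0-9 at midnight, for a week:
trace = [(day * 24.0, b) for day in range(7) for b in range(10)]

print(simulate_hits(trace, 24.0))        # cache age 24h: all repeats hit
print(simulate_hits(trace, 23 + 59/60))  # cache age 23h59m: every access misses
```

One minute of cache age is the difference between the job running entirely from cache and entirely from disk.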
Among other things, our model extracts how much of a given workload’s IO goes against periodic components. For that IO, we can compute the incremental benefit of increasing cache age at each such periodicity. The process looks something like this:
On the right side of this figure, you’ll notice a spiky chart with peaks at 24, 48, and 72 hours. It shows that, when we isolate the periodic components of a typical workload, we generally want a cache age of at least 72 hours.
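A back-of-the-envelope way to see where such spikes come from (a hedged sketch with synthetic data, not our production analysis): for each candidate cache age, count the extra hits gained over a cache age one hour smaller. With re-access gaps clustered at one, two, and three days, the incremental benefit is zero everywhere except at those periods.

```python
# Hedged sketch of an "incremental benefit by cache age" curve, using
# synthetic inter-access gaps; not the production InfoSight analysis.

def hits_at_age(gaps_hours, cache_age_hours):
    """Repeat accesses whose gap fits within the cache age are hits."""
    return sum(1 for gap in gaps_hours if gap <= cache_age_hours)

# Synthetic gaps: many daily re-accesses, fewer at 48h and 72h.
gaps = [24.0] * 50 + [48.0] * 30 + [72.0] * 20

# Incremental benefit: extra hits from one more hour of cache age.
benefit = {age: hits_at_age(gaps, age) - hits_at_age(gaps, age - 1)
           for age in range(1, 97)}
peaks = [age for age, b in benefit.items() if b > 0]
print(peaks)  # → [24, 48, 72]: benefit spikes only at the periods
```

Real workloads mix these periodic gaps with random re-accesses, so the measured curve has a smooth floor under the spikes, but the peaks at whole-day boundaries survive the averaging.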
In fact, the chart you’re looking at is the average incremental benefit by cache age across the periodic components of all our customers’ workloads combined. So it really is a universal lesson: all other things being equal, it pays to size your cache for an honest day’s workload – or two or three.