Disaster preparedness: using data change rates for capacity and protection planning
by Shannon Loomis, PhD – Data Scientist
Knowing how fast your data changes is essential for many facets of data storage management, from designing a data protection scheme to planning future capacity needs. As part of a disaster recovery plan, data protection is based on application-consistent snapshots that are written on Nimble arrays and can be restored locally or replicated to a partner array at another location. These snapshots capture previous points in time, giving IT administrators access to historical data.
Despite extensive engineering optimizations used to minimize the data footprint, there are still capacity and bandwidth costs associated with storing and replicating these snapshots. In order to better understand the drivers behind snapshot growth and its impact on capacity and replication, we used InfoSight analytics from over 8,100 customers to explore how various system parameters affect data change rates.
Data change rates – and therefore snapshot sizes – vary greatly with write workload and frequency of snapshots, but the rate at which data changes with respect to these parameters is very application specific. Below, I dive further into these observations in order to help you better plan out future capacity and data-protection infrastructure needs.
How write workloads affect snapshot sizes
Unsurprisingly, the more data you write, the larger your snapshots are. However, this is by no means a 1:1 mapping, due to a combination of compression, deletes, and overwrites. Multiple edits to the same data within a given snapshot interval register as only a single change.
The relative size of a snapshot compared to the amount of data written also decreases as you write more data. For example, if you were to write 10 GiB of data, the typical snapshot would be 1.4 GiB, or 14 percent of the data written. If you were to write 1000 GiB of data, the snapshot would average 96 GiB, or 9.6 percent of the data written.
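This sublinear scaling can be sketched with a simple power-law model. The functional form is an assumption on my part (the underlying regressions are application specific and not published here); the coefficient and exponent below are back-calculated from the two typical figures quoted above, purely for illustration.

```python
import math

# Two typical data points quoted in the text: (GiB written, GiB snapshot)
points = [(10, 1.4), (1000, 96)]

# Assume snapshot = a * written**b (an assumed form, not the article's
# actual regression), and solve for a and b from the two points.
(w1, s1), (w2, s2) = points
b = math.log(s2 / s1) / math.log(w2 / w1)  # exponent (~0.92, i.e. sublinear)
a = s1 / w1**b                             # coefficient

print(f"snapshot ~= {a:.3f} * written^{b:.3f}")
for written in (10, 100, 1000):
    snap = a * written**b
    print(f"{written:5d} GiB written -> {snap:6.1f} GiB snapshot "
          f"({100 * snap / written:.1f}% of data written)")
```

Because the fitted exponent is below 1, the snapshot-to-write percentage shrinks as more data is written between snapshots, matching the 14 percent vs. 9.6 percent figures above.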
This varying relationship between snapshot size and data written is also extremely application specific (Figure 1). Oracle, Virtual Desktop, and SharePoint volumes have the largest snapshot sizes per unit of data written, while SQL Server and Exchange have the smallest. The driver is how quickly the ratio of snapshot size to data written changes: snapshot size grows almost linearly with writes for Oracle, Virtual Desktop, and SharePoint volumes, while the ratio falls off quickly for Exchange and SQL Server as the amount of data written increases.
This means you’ll likely get much more bang for your buck by taking Exchange and SQL Server snapshots after conducting a lot of writes: one snapshot taken after writing 100 GiB of data will save a significant amount of space compared to ten snapshots taken after every 10 GiB. On the other hand, the ratio of snapshot size to data written changes very little for Oracle, Virtual Desktop, and SharePoint volumes, so taking several snapshots after small amounts of writing costs little more space than taking a single snapshot that captures a larger amount of data.
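The trade-off can be made concrete with a quick back-of-the-envelope comparison. This uses the typical all-application figures quoted earlier and the same assumed power-law interpolation to estimate the 100 GiB case; for Exchange and SQL Server, whose ratios fall faster than the typical curve, the savings would be larger.

```python
import math

# Typical figures quoted in the text: 10 GiB written -> 1.4 GiB snapshot,
# 1000 GiB written -> 96 GiB snapshot. Interpolate to 100 GiB with an
# assumed power law snapshot = a * written**b (illustrative only).
b = math.log(96 / 1.4) / math.log(1000 / 10)
a = 1.4 / 10**b

ten_small = 10 * 1.4    # ten snapshots, each taken after 10 GiB written
one_large = a * 100**b  # one snapshot taken after 100 GiB written

print(f"ten snapshots @ 10 GiB written each: {ten_small:.1f} GiB total")
print(f"one snapshot  @ 100 GiB written:     {one_large:.1f} GiB total")
print(f"estimated space savings:             {ten_small - one_large:.1f} GiB")
```

Even on the typical curve, batching writes into fewer snapshots comes out ahead; the steeper the ratio falls for an application, the bigger that gap gets.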
Figure 1: Snapshot size vs. data written by application. Solid lines show the best estimate for application specific regressions, and shaded regions show 68% confidence intervals on the regressions.
How snapshot frequency affects size
While it is interesting to know how data change rates vary with the amount of data written, capacity and snapshot planning are tied to a temporal cadence, so it is arguably more useful to know how change rates vary across different time scales. Since the amount of data written increases with time, it is again unsurprising that snapshot size as a whole also increases with time and that this relationship is very application specific.
Across our entire installed base, Virtual Desktop, Virtual Server, and Exchange volumes all tend to acquire 0.1-0.2 GiB of new data over the course of an hour, while volumes running other applications tend to accumulate under 0.05 GiB (Figure 2, left). Extrapolated out to a week, snapshot sizes for Virtual Desktop volumes tend to be four times larger than those of any other application type (Figure 2, right), thanks to their near-linear change rate with respect to both time and amount of data written. As discussed above, this near-linear growth also applies to Oracle volumes, so while Oracle snapshots tend to be relatively small over the course of an hour, they are second only to Virtual Desktop snapshots on a weekly scale.
As noted for writes, this means there is little space savings in taking many frequent snapshots of Virtual Desktop and Oracle volumes; more frequent snapshotting of these volumes will therefore provide better temporal backup coverage without adding much overhead. Conversely, SQL Server, SharePoint, and File Server volumes tend to accumulate new data at a fairly slow rate, and their snapshot growth over time is sublinear, so longer snapshot intervals will help save both space and replication bandwidth. Picking snapshot intervals for Virtual Server and Exchange is a little trickier: data on volumes running these applications tends to change a lot in a short period of time, but there are significant space and bandwidth savings in snapshotting over longer periods of time.
Figure 2: Snapshot size vs. snapshot interval by application. The left chart limits the snapshot interval to one hour to highlight data change rates on minute-level time scales, while the right chart extends the regression out to one week to highlight change rates over longer intervals. Solid lines show the best estimate from application-specific regressions, and shaded regions show 68% confidence intervals on the regressions.
Untangling write/size codependency
The plots above show that snapshot sizes increase with both time and write activity, but how do we disentangle the two, given that the amount of data written also tends to increase with time? And what should you expect your snapshot sizes to be if the amount of data you write differs from the norm? To better understand this, we look at how the ratio of snapshot size to data written changes across different time intervals.
Overall, snapshot size per data written tends to decrease as the snapshot interval increases. Over the course of an hour, Exchange and SQL Server tend to have a ratio of snapshot size to data written over 0.25, while this ratio is closer to 0.20 for all other applications (Figure 3, left).
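This normalization is straightforward to sketch. The measurements below are hypothetical values for a single volume, invented purely to show the shape of the calculation; they are not drawn from the InfoSight data set.

```python
# Hypothetical measurements for one volume at increasing snapshot intervals:
# (interval in hours, GiB written over the interval, GiB snapshot size).
# Values are illustrative, not from the article's data.
samples = [
    (1,     2.0,  0.50),
    (6,    11.0,  2.40),
    (24,   40.0,  7.20),
    (168, 250.0, 37.50),
]

# Dividing snapshot size by data written removes the effect of writes
# simply accumulating over time, isolating how the interval itself
# changes the fraction of written data that survives into the snapshot.
for hours, written, snap in samples:
    ratio = snap / written
    print(f"{hours:4d} h interval: snapshot size / data written = {ratio:.2f}")
```

In this made-up example the ratio falls as the interval grows, the typical pattern described above: over longer intervals, more of the written data is deleted or overwritten before the snapshot is taken.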
What is most interesting, however, is that File Server is the one application where snapshot size per data written is almost invariant, holding a near-constant ratio of 0.22 regardless of the snapshot interval (Figure 3, right). This means users are fundamentally using File Server volumes differently than volumes running other applications: while data on other volume types tends to change via deletes and overwrites as the snapshot interval increases, data added to a File Server volume tends to be kept in its original state.
Figure 3: Snapshot size per data written vs. snapshot interval by application. The left chart limits the snapshot interval to one hour to highlight change rates per write on minute-level time scales, while the right chart extends the regression out to one week to highlight change rates per write over longer intervals. Solid lines show the best estimate from application-specific regressions, and shaded regions show 68% confidence intervals on the regressions.
Optimizing data protection
The plots and discussion above provide insight into how to optimize data backup and disaster recovery planning, but they cannot account for the cost of data loss on your particular array. For example, the results show that you can reduce replication overhead by increasing the snapshot interval on your Virtual Server volumes, but can you really afford to lose a week’s worth of data in the event of a disaster in your server room? As a data scientist, I believe in using conclusions drawn from hard data to improve processes wherever possible, but be cautious – after all, caution is the basis for data protection.