In a traditional storage environment, primary and backup storage are separate, and backups are based on copying data. Typically, the whole volume is copied from primary storage to backup storage every week or every day. If stored on backup storage without any capacity optimization, these backups can easily use up many times the space used on primary storage. Capacity-optimized backup storage systems overcome this problem using various techniques:
- Deduplication, aka dedupe. Successive full backups have mostly the same content because the change rate is generally small. Dedupe removes this duplication in content by sharing blocks across backups. Global dedupe goes a step further and enables sharing of identical blocks regardless of where they are, including identical blocks at different locations within a backup.
- Compression. Compression works on an individual block of data (generally less than 1 MB) at a time, and crunches it down based on commonality within the block. Examples include the gzip utility, among others.
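To make the two techniques concrete, here is a minimal sketch of a toy backup store that dedupes blocks globally by content fingerprint and compresses each unique block. The 4 KB block size, the `DedupeStore` class, and the use of SHA-256 and zlib are illustrative assumptions, not a description of any particular product.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a backup image into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupeStore:
    """Toy global-dedupe store: each unique block is kept once, compressed."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # fingerprint -> compressed block

    def add_backup(self, data: bytes) -> list[str]:
        """Store a backup and return its recipe (ordered fingerprints)."""
        recipe = []
        for block in split_blocks(data):
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:  # identical content is stored only once
                self.blocks[fp] = zlib.compress(block)
            recipe.append(fp)
        return recipe

    def restore(self, recipe: list[str]) -> bytes:
        """Rebuild a backup from its recipe."""
        return b"".join(zlib.decompress(self.blocks[fp]) for fp in recipe)

    def stored_bytes(self) -> int:
        """Physical bytes used by all unique compressed blocks."""
        return sum(len(b) for b in self.blocks.values())


store = DedupeStore()
monday = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
tuesday = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE + b"A" * BLOCK_SIZE
monday_recipe = store.add_backup(monday)
tuesday_recipe = store.add_backup(tuesday)
print(len(store.blocks), store.stored_bytes())
```

The two backups contain six blocks in total, but the store keeps only the distinct ones: the repeated "A" block is shared both across the two backups and within each backup, which is the global-dedupe case described above.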
Ajay wrote about the reasons for moving towards converged primary and backup storage. With converged storage, backups are based on volume snapshots. A snapshot is logically a point-in-time copy of the volume, but physically it shares all unchanged blocks with the primary state and other snapshots. There is no copying or duplication of data to begin with, so there is no need to deduplicate. This provides huge savings in CPU, network, disk, and memory utilization compared with first copying the whole volume and then deduping it back down. One might say that snapshot-based backups are not duped in the first place and don’t need dedupe: they are unduped.
In addition, in Nimble’s converged storage model, all data is compressed, including the primary state and backups. This provides a huge advantage compared to most primary storage systems, which do not compress randomly accessed application data at all.
Next, I will focus on space usage—not because it is the most important difference, but because many interesting questions arise around it.
Proponents of deduping might assume that dedupe is more space optimized than unduped, because global dedupe is able to share identical blocks across backups as well as within a single backup at different locations, while unduped snapshots only share blocks at the same location. The intra-backup sharing does provide a small advantage for dedupe. However, unduped storage benefits from a bigger advantage: the sharing of blocks between the primary state and backups! In essence, unduped converged storage keeps only one baseline copy of the volume, while separate deduped storage keeps two—one on primary storage and one on backup storage. As we will see, the primary-backup sharing outweighs the intra-backup sharing. Therefore, compared to the total space used with separate primary and deduped backup storage, converged storage uses even less space.
Below I present a mathematical comparison of the total space usage (including the primary state) between the following four types of storage:
- Unoptimized daily incremental and weekly full backups
- Global dedupe with compression (as in optimized backup storage)
- Unduped without compression (as in optimized primary storage)
- Unduped with compression (as in Nimble converged storage)
The following chart plots the capacity optimization ratio for each of the three optimized storage types. Capacity optimization is computed as the ratio of the total space used in unoptimized storage over the total space used in the specific optimized storage type. Higher values are better. (This ratio ignores the higher cost of primary storage compared to backup storage, and therefore significantly understates the advantage of converged storage, which uses less expensive storage.) The x-axis indicates the days of backup retention. In general, capacity optimization improves with retention.
The chart shows the following:
- Deduping is a fine and necessary optimization for separate backup storage.
- Unduped converged storage without compression is not as effective as deduped storage with compression.
- Unduped converged storage with compression saves significantly more space than deduped storage with compression for typical backup retention periods of 30–90 days. In fact, dedupe would catch up with unduped in terms of capacity savings only if backup retention is longer than 8 months.
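A comparison of this kind can be approximated with a small model. The sketch below is an assumption-laden illustration, not the model behind the chart: the parameter names and default values (`daily_change`, `compression`, `intra_dup`), the one-backup-per-day schedule with every seventh backup being a full, and the assumption that primary storage in the deduped setup is uncompressed are all hypothetical. The numbers it produces depend entirely on these choices and are not expected to reproduce the chart.

```python
import math


def total_space(retention_days: int,
                daily_change: float = 0.03,   # assumed fraction changed per day
                compression: float = 2.0,     # assumed compression ratio
                intra_dup: float = 0.10       # assumed extra sharing found by global dedupe
                ) -> dict[str, float]:
    """Total space (primary plus backups), in units of one volume size."""
    fulls = math.ceil(retention_days / 7)     # weekly full backups retained
    incrementals = retention_days - fulls     # daily incrementals retained
    unique = 1.0 + daily_change * retention_days  # baseline plus changed blocks
    return {
        # Uncompressed primary, plus full copies and incrementals.
        "unoptimized": 1.0 + fulls + incrementals * daily_change,
        # Uncompressed primary, plus a deduped and compressed backup store.
        "dedupe+compression": 1.0 + (1.0 - intra_dup) * unique / compression,
        # One shared baseline for primary state and snapshots.
        "unduped": unique,
        "unduped+compression": unique / compression,
    }


def optimization_ratios(retention_days: int, **params: float) -> dict[str, float]:
    """Capacity optimization: unoptimized space over optimized space."""
    space = total_space(retention_days, **params)
    base = space["unoptimized"]
    return {name: base / used
            for name, used in space.items() if name != "unoptimized"}


for days in (30, 90):
    print(days, {k: round(v, 2) for k, v in optimization_ratios(days).items()})
```

In this model the deduped setup always pays for a second baseline (the uncompressed primary copy), while its only structural advantage is the `intra_dup` sharing, so where the curves cross depends on how those two terms trade off as retention grows.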
Of course, data protection is not complete without provision for disaster recovery, which requires an off-site replica. Comparisons similar to the one above can easily be made that include the space used on the replica. Unduped converged storage with replica retains a lead over separate deduped storage with replica, regardless of whether the primary or the backup storage is replicated. This is because unduped storage with replica has two baseline copies of data (one on converged storage and the other on replica), while deduped storage with replica has three (one on primary storage, one on backup storage, and one on replica).
Interestingly, matching the space saving of dedupe was not our top motivation for building converged storage. The major motivations were the following:
- Directly using backups and replicas without having to convert the data from backup to primary format.
- Avoiding massive transfers of data from primary to backup storage.
- Enabling significant space savings without performance impact for randomly accessed data, such as databases.
Nevertheless, it is good to demonstrate that unduped storage is not merely as good as deduped storage at saving space; it is even better!