by Umesh Maheshwari – Co-founder and CTO

Back in 1991, Mendel Rosenblum, who later founded VMware, made a remarkably far-sighted observation in his PhD thesis:

“Increasing memory sizes will make the caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes.”

After 20 years of modest growth in cache sizes, that prediction is poised to hit the storage industry. This is largely due to the advent of caches based on flash memory, which are generally 10 times larger than the traditional DRAM-based caches. It is also due to the advent of data reduction technologies such as compression and deduplication, which further increase the effective size of the cache. I will refer to such caches as “hyper caches” to differentiate them from the much smaller, traditional caches.

Hyper caches may be integrated into server hosts and also into storage systems. A hyper cache within a server host may consist of SSDs managed by the host OS/hypervisor, or it may consist of special-purpose hardware such as EMC VFCache, Marvell DragonFly, etc. Recently, VMware announced a feature called View Storage Accelerator, previously code-named Content-Based Read Cache (CBRC). “Content-based” refers to content-based deduplication, which works really well for VDI boot storms. Today CBRC is implemented with DRAM, but it will be a small leap for VMware to evolve it to use commodity SSDs.

A common theme among host-attached hyper caches is that they are focused on accelerating reads, not writes. In theory, a hyper cache could buffer writes, which is also known as write-back caching. However, most implementations do not buffer writes at all, and even those that do make it optional and attach enough warnings that the average user is likely to leave it turned off. In particular, CBRC, as the name suggests, does not accelerate writes. VM blogger Brian Madden wrote, “It does nothing to help with writes. (Well, other than the fact that taking a lot of these reads off your primary storage might free up some IOs for more writes. This also means that you might be able to tune your storage for writes.)”

The major reason for not buffering writes within the host is that it can jeopardize storage reliability and consistency. What if the host fails? Would a snapshot or backup taken on the storage be point-in-time consistent? The impact on consistency is even more grave when write buffering is optimized to re-sort the writes.

This leaves the burden of optimizing writes on storage. So, how does one build a write-optimized storage system?

Most disk-based storage systems employ a write buffer internally to reduce latency and absorb bursts of writes. But it does not help as much with sustained throughput, because buffered writes need to be drained to the underlying disk subsystem at some point. This is different from reads serviced from a cache, which never go to disk! Therefore, sustained write throughput is limited by the ingest rate of the disk subsystem.
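To see why a buffer absorbs bursts but cannot raise sustained throughput, consider a minimal simulation. The numbers here are purely illustrative, not measurements of any particular system:

```python
# Minimal model of a write buffer: bursts are absorbed until the buffer
# fills, after which throughput falls to the disk's drain rate.
# All figures (MB/s, buffer size) are illustrative assumptions.

def simulate(incoming_mb_s, drain_mb_s, buffer_mb, seconds):
    """Return MB accepted in each second; acceptance stalls once the buffer is full."""
    buffered = 0.0
    accepted = []
    for _ in range(seconds):
        space = buffer_mb - buffered
        # We can accept whatever fits in the buffer plus whatever drains this second.
        take = min(incoming_mb_s, space + drain_mb_s)
        buffered = max(0.0, buffered + take - drain_mb_s)
        accepted.append(take)
    return accepted

burst = simulate(incoming_mb_s=500, drain_mb_s=100, buffer_mb=1000, seconds=60)
print(burst[0])    # 500.0 -- early seconds run at full burst speed
print(burst[-1])   # 100.0 -- once the buffer fills, only the drain rate survives
```

However large you make the buffer, the tail of the curve always settles at the disk subsystem's ingest rate.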

Another solution is to use flash, either as a large write buffer (which still suffers from the fundamental limitations of write buffers), or as an endpoint of storage by itself. But the benefit of using flash for write optimization is small relative to its high cost. The fundamental reason for this is that flash memory is nowhere near as good at accelerating writes as it is at accelerating reads. This is apparent in its lower write performance (especially for sustained random writes), limited write endurance, and need for overprovisioning to limit write amplification.
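A back-of-the-envelope calculation shows how write amplification erodes flash endurance. The figures below are illustrative assumptions (typical for MLC-era flash), not specs for any particular SSD:

```python
# Why write amplification matters: the host gets only a fraction of the
# drive's raw internal write budget. All numbers are illustrative.

capacity_gb = 400
pe_cycles = 3000            # assumed program/erase cycles per cell
write_amplification = 4.0   # assumed internal writes per host write under random workloads

raw_writes_tb = capacity_gb * pe_cycles / 1000        # total internal write budget, in TB
host_writes_tb = raw_writes_tb / write_amplification  # what the host can actually write
print(raw_writes_tb)   # 1200.0 TB of internal writes
print(host_writes_tb)  # 300.0 TB of host writes before wear-out
```

Overprovisioning lowers the write-amplification factor, but only by setting aside capacity the customer has paid for, which is part of why flash is an expensive way to absorb writes.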

Therefore, the burden of optimizing writes goes all the way to disk. Fortunately, Mendel suggested a solution in his thesis:

“We have devised a new disk storage management technique called a log-structured file system, which uses disks an order of magnitude more efficiently than current file systems.”

Indeed, most flash SSDs use this same technique, renamed “write coalescing”, to make random writes more palatable for flash. However, write coalescing hasn’t been very successful for disk-based file systems. File systems such as NetApp’s WAFL and Sun’s ZFS do write coalescing opportunistically, which works well initially when the disk space is largely free, but degrades to random writes over time. Coalescing all writes all the time, as suggested by Mendel, requires an efficient process to defragment the free space.
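The idea Mendel describes can be sketched in a few lines: random block writes become sequential appends to the current log segment, and a cleaner relocates live blocks out of old segments so free space stays contiguous. This is a hypothetical, heavily simplified toy; real implementations add on-disk indexes, checksums, crash recovery, and careful cleaner scheduling:

```python
# Toy log-structured store: writes are coalesced into sequential segment
# appends; a cleaner reclaims whole segments by copying out live blocks.
# Simplified illustration, not production code.

SEGMENT_BLOCKS = 4

class LogStructuredStore:
    def __init__(self):
        self.segments = [[]]   # each segment is a list of (block_no, data) records
        self.index = {}        # block_no -> (segment, offset) of the live copy

    def write(self, block_no, data):
        seg = len(self.segments) - 1
        if len(self.segments[seg]) == SEGMENT_BLOCKS:  # segment full: open a new one
            self.segments.append([])
            seg += 1
        self.segments[seg].append((block_no, data))    # always a sequential append
        self.index[block_no] = (seg, len(self.segments[seg]) - 1)

    def read(self, block_no):
        seg, off = self.index[block_no]
        return self.segments[seg][off][1]

    def clean(self, seg):
        """Relocate live blocks so the whole segment becomes free space."""
        for off, (block_no, data) in enumerate(self.segments[seg]):
            if self.index.get(block_no) == (seg, off):  # is this copy still live?
                self.write(block_no, data)              # move it to the log head
        self.segments[seg] = []                         # segment reclaimed whole

store = LogStructuredStore()
for block, data in [(7, "a"), (3, "b"), (7, "c"), (9, "d"), (3, "e")]:
    store.write(block, data)
print(store.read(7))  # "c" -- the latest copy wins
store.clean(0)        # segment 0 held only stale or relocated copies afterwards
print(store.read(7))  # still "c" after cleaning
```

The hard engineering problem is exactly the cleaner: doing this relocation continuously and cheaply enough that coalescing never degrades, which is the challenge opportunistic schemes run into as free space fragments.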

This is the engineering challenge that Nimble has addressed to finally deliver a disk-based file system that is truly optimized for writes. You can read more about it here.