When first introduced to the Nimble Adaptive Flash platform, people sometimes miss a key element that differentiates it from others in the market. CASL™ (Nimble’s Cache-Accelerated Sequential Layout) architecture, which accelerates write performance using a (deterministically) sequential layout, is often confused with architectures that use a write cache to accelerate writes. At first glance, a write cache and a sequential layout look similar; dig deeper, and they turn out to be fundamentally different approaches with very different outcomes.

Here is a set of illustrations that should clarify the differences. On the left is a traditional storage architecture where logical block addresses (LBAs) map to fixed physical locations on disk (numbered 1 to 100 for simplicity); on the right is Nimble’s CASL file system with no predetermined mapping of LBAs to physical locations. You’ll notice that there aren’t even any block boundaries because CASL supports variable block sizes. The little rectangle at the top is meant to represent the write cache or non-volatile memory. In each case, imagine that there are 10 incoming random writes into the system (with logical addresses in random order).

[Illustration: Incoming writes held in non-volatile memory/cache]

The two systems now behave differently. The write cache looks for opportunities to re-order blocks so that the logical addresses are in sequence. CASL, on the other hand, doesn’t bother with logical addresses; it just assembles a large set of blocks into a large physical stripe.

[Illustration: CASL assembles a large set of blocks into a large physical stripe.]
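To make the stripe idea concrete, here is a minimal sketch of a layout with no predetermined mapping of logical addresses to physical locations. It is not Nimble's implementation; the `Stripe` class, its `index` dictionary, and the sample payloads are all hypothetical names and values chosen for illustration. Blocks are appended in arrival order, and a small index remembers where each logical address ended up.

```python
from dataclasses import dataclass, field


@dataclass
class Stripe:
    """Hypothetical in-memory stripe: one contiguous buffer plus a lookup index."""
    data: bytearray = field(default_factory=bytearray)
    index: dict = field(default_factory=dict)  # LBA -> (offset, length)

    def append(self, lba: int, payload: bytes) -> None:
        # Placement depends only on arrival order; the LBA plays no role.
        self.index[lba] = (len(self.data), len(payload))
        self.data += payload

    def read(self, lba: int) -> bytes:
        # Reads go through the index instead of a fixed LBA-to-location formula.
        offset, length = self.index[lba]
        return bytes(self.data[offset:offset + length])


stripe = Stripe()
# Incoming writes with logical addresses in random order and variable sizes.
for lba, payload in [(57, b"aaaa"), (12, b"bb"), (88, b"cccccc")]:
    stripe.append(lba, payload)
```

Calling `stripe.read(12)` returns `b"bb"` even though block 12 sits between blocks 57 and 88 in the buffer: the logical order is jumbled, but the whole buffer can go to disk as one contiguous write.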

Here’s the final outcome: Notice that the block addresses on the left have been re-ordered as well as possible based on the cache contents. The stripe on the right is still in a jumbled logical order, but as we’ll see in the next step, that doesn’t matter. As a bonus, it has been compressed to save space on disk (and flash) using inline compression.
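Inline compression is also why variable block sizes matter: equally sized incoming blocks shrink by different amounts. The sketch below illustrates this with Python's standard `zlib` module. The source does not say which compression algorithm CASL uses, so `zlib` is a stand-in, and `pack_compressed`, `read_block`, and the sample blocks are hypothetical.

```python
import zlib


def pack_compressed(blocks):
    """Compress each (lba, payload) pair and pack the results back to back."""
    stripe = bytearray()
    index = {}  # LBA -> (offset, compressed length)
    for lba, payload in blocks:
        compressed = zlib.compress(payload)
        index[lba] = (len(stripe), len(compressed))
        stripe += compressed
    return bytes(stripe), index


def read_block(stripe, index, lba):
    """Locate a block through the index and decompress it."""
    offset, length = index[lba]
    return zlib.decompress(stripe[offset:offset + length])


# Three sample blocks of similar size but very different compressibility.
blocks = [
    (57, b"A" * 4096),
    (12, bytes(range(256)) * 16),
    (88, b"log line\n" * 400),
]
stripe, index = pack_compressed(blocks)
```

Because each compressed block has its own length, a layout with fixed block boundaries would waste the space that compression frees up; packing the blocks end to end keeps the stripe dense.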

Afterwards, it’s time to drain the cache/NVRAM to the disk layer. Thanks to the re-sorting, the cache on the left has managed to reduce what would have been 10 random writes down to 7 random writes (because some blocks are now grouped together and can be written as one operation) – in this example, a 30 percent improvement. CASL, on the other hand, has managed to reduce all random write operations into one (regardless of the logical address sequencing) – a 10x improvement.

[Illustration: CASL reduces all random write operations into one.]
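The arithmetic can be checked with a small simulation. The ten logical addresses below are made up (the original illustration's values are not reproduced here) and were picked so that re-sorting yields the same 7 operations as the example; `coalesced_write_ops` and `stripe_write_ops` are hypothetical helper names.

```python
def coalesced_write_ops(lbas):
    """Count disk writes after sorting cached LBAs and merging adjacent ones."""
    ordered = sorted(set(lbas))
    if not ordered:
        return 0
    ops = 1
    for prev, cur in zip(ordered, ordered[1:]):
        # A gap in the logical addresses forces a separate write operation.
        if cur != prev + 1:
            ops += 1
    return ops


def stripe_write_ops(lbas):
    """A full stripe goes to disk as one write, whatever the LBA order."""
    return 1 if lbas else 0


# Hypothetical incoming random writes (logical block addresses 1 to 100).
incoming = [57, 12, 88, 13, 34, 89, 71, 5, 35, 96]

cache_ops = coalesced_write_ops(incoming)   # 7: three adjacent pairs merge
stripe_ops = stripe_write_ops(incoming)     # 1
```

With a fully scattered workload and no adjacent addresses, `coalesced_write_ops` stays at 10 while `stripe_write_ops` stays at 1. That is the sense in which the write cache's benefit depends on the workload and the sequential layout's does not.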

And here’s how it all looks once it’s on disk; you can see how the illustration on the left uses more write operations to update random physical locations.

[Illustration: Traditional storage layout uses more write operations to update random physical locations.]

Now that we know how they work, the table below summarizes the key attributes of the two illustrations. It should be pretty clear by now that they are very different approaches. A write cache attempts to reduce load on the back-end storage layer by absorbing some of the “randomness” in a front-end cache. CASL, on the other hand, fundamentally speeds up the back-end storage layer by exploiting the asymmetry in its physical characteristics (random vs sequential).


| | Write Cache | Sequential Layout |
|---|---|---|
| Goal | Attempt to re-sort block addresses in cache to reduce load on back-end disks. | Accelerate back-end disks by turning random IO into sequential IO (low RPM disks are about 25,000 times faster at ingesting sequential IO compared with random IO). |
| Cache size impact | The larger the better (to create more opportunities for re-sorting), but practically limited by cost because write caches are far more expensive than back-end disk. | Not a big factor. What matters is how fast you can drain writes to the disk layer. |
| Benefit | Depends on cache size and workload, but the cache usually comprises a tiny fraction of the disk back end and so fills up quickly, limiting re-sorting opportunities. The benefit is generally small enough that performance is still dictated by the number of disks on the back end. | Large (30x-200x) and deterministic (doesn’t depend on workload). Exact gain is limited by CPU horsepower to drain large stripes to disk. Refer to the CPU blog for examples. |
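The random-versus-sequential asymmetry behind these numbers can be sketched with a deliberately simple cost model. This is a back-of-the-envelope assumption, not a measurement: every random write pays a positioning cost plus a transfer cost, while a sequential stripe pays the positioning cost once. The function names and the parameters `seek_ms` and `transfer_ms` are hypothetical, and the model ignores CPU limits, RAID overhead, and anything else a real array has to deal with.

```python
def random_write_ms(n_blocks, seek_ms, transfer_ms):
    """Each block pays its own positioning cost plus its transfer time."""
    return n_blocks * (seek_ms + transfer_ms)


def sequential_write_ms(n_blocks, seek_ms, transfer_ms):
    """One positioning cost for the stripe, then back-to-back transfers."""
    return seek_ms + n_blocks * transfer_ms


def speedup(n_blocks, seek_ms, transfer_ms):
    """How many times faster the sequential stripe is under this model."""
    return (random_write_ms(n_blocks, seek_ms, transfer_ms)
            / sequential_write_ms(n_blocks, seek_ms, transfer_ms))


# Illustrative inputs only: 8 ms to position, 0.1 ms to transfer one block.
small_stripe = speedup(10, seek_ms=8.0, transfer_ms=0.1)
large_stripe = speedup(1000, seek_ms=8.0, transfer_ms=0.1)
```

Under this model the gain grows with the number of blocks per stripe and levels off near `1 + seek_ms / transfer_ms`, which is one way to see why larger stripes help and why, past a certain point, something other than the disks becomes the limit.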