In a previous blog on storage efficiency, I had suggested that mainstream enterprise applications that need both performance and capacity are best addressed by hybrid storage systems, and that the effectiveness of blending is what distinguishes various storage systems in terms of their ability to cost-effectively deliver performance AND capacity.
This raises the question of what criteria one would use to judge the effectiveness of blending. At this juncture it is really important to point out that our entire focus is on efficiency. There are many ways to deliver absolute high performance, but our focus is on delivering performance at the lowest cost. Similarly, when it comes to capacity optimization, our focus is on delivering very cost-effective usable capacity for Enterprise applications, but not on matching JBOD price-points.
Optimization starts with playing to the strengths of flash and disks
Let us briefly revisit some core properties of low-cost, high-density, near-line drives (“Fat HDDs”) and Flash SSDs, since those properties are key to assessing how to maximize the benefits of both.
- Fat HDDs have the lowest cost per GB, are not good at random I/O, but perform fairly well at sequential I/O.
- Flash SSDs are very good at random reads. They are better than Fat HDDs at random writes, but random writes degrade Flash SSD life. Lastly, they are not that different from Fat HDDs for sequential I/O performance.
- SLC versus MLC versus eMLC SSDs. SLC flash SSDs are 4-6 times more expensive than MLC flash SSDs, and they provide a much higher number (~10X) of write cycles compared to MLC SSDs (more on this below). eMLC SSDs fall somewhere between the two on both write endurance and price.
The core optimization parameters we used to maximize efficiency
With the above characteristics in mind, we designed our system from the ground up to optimize the following parameters, automatically with no user intervention: (i) real-time decision making about data placement; (ii) achieving the maximum performance acceleration for every $ spent on flash; (iii) achieving the maximum usable capacity for every $ spent on disks; and (iv) ensuring that the system optimizes writes (“sequential layout”) to disks, so as to complement flash perfectly.
How real-time are system decisions about data placement?
We make real-time decisions about whether to place data on flash with every read and write I/O, so that cold data does not needlessly go to flash and crowd out “hotter” data. We also considered approaches that optimize placement on a less frequent basis (e.g., once a day or once every few hours). While easier to implement, we concluded they would be less efficient: critical data often would not make it into the flash tier in time, or the system would need a larger amount of flash as a buffer between decision points.
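To make this concrete, here is a minimal sketch of what per-I/O placement logic can look like, assuming a simple access-frequency (“heat”) counter per block. The names (`PlacementEngine`, `HEAT_THRESHOLD`, etc.) and the threshold policy are illustrative assumptions, not the actual algorithm described in the post.

```python
# Hypothetical sketch of per-I/O data-placement logic. A real system
# would use richer heuristics; this only illustrates deciding on EVERY
# I/O rather than on a periodic (e.g., daily) schedule.
from collections import defaultdict

BLOCK_SIZE = 4096      # fine-grained placement unit (4KB), per the post
HEAT_THRESHOLD = 3     # assumed: accesses before a block counts as "hot"

class PlacementEngine:
    def __init__(self):
        self.heat = defaultdict(int)   # block id -> access count
        self.flash = set()             # block ids currently held in flash

    def on_io(self, block_id):
        """Called on every read/write: decide placement in real time."""
        self.heat[block_id] += 1
        if self.heat[block_id] >= HEAT_THRESHOLD:
            self.flash.add(block_id)   # promote a hot block to flash
        return block_id in self.flash  # True -> block lives in flash
```

Because the decision runs on every I/O, a block that suddenly turns hot is promoted immediately instead of waiting for the next scheduled migration pass.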
How can we maximize the performance benefits of flash while minimizing cost?
We maximize performance acceleration for every $ spent on flash by focusing on the following parameters:
- Ensure fine-grained blending. Our system is able to make decisions about whether to place data on flash on a very granular basis – units as small as 4KB in size. Had we chosen a unit of, say, 1MB, we would need a larger amount of flash, because even a single 4KB block that becomes “hot” would force the placement of 1MB of data into flash.
- Leverage inline compression / de-duplication. Since we have designed our system to achieve inline compression without a performance penalty, our flash capacity effectively is double that of the physical flash capacity. Furthermore, in instances where cloned images are being accessed (e.g., virtual desktop boot images), our ability to have de-duplication (block sharing) across clones multiplies the effective flash capacity as well.
- Use inexpensive MLC flash SSDs. When SSDs receive random writes, the actual write activity within the SSD itself is higher than the number of writes issued to the SSD (a.k.a. write amplification), which eats into the number of write cycles that the SSD can endure. Traditional storage systems deal with this problem by using SLC SSDs (and eMLC SSDs soon), which provide a much larger number of write cycles, but this raises the cost of flash 4-6X. We approached the problem differently. Our file-system is optimized to aggregate a large number of random writes into a sequential I/O, and we only write to flash in multiples of the full erase-block size. Sequential writes minimize write amplification, allowing us to achieve the desired endurance while still using MLC SSDs.
- Minimize overhead (RAID, etc.) in how flash SSDs are configured. In our system, all data that we write to flash also lives on disks. This is in contrast to most other storage systems, which hold data on their SSD tier that has not yet been written to the disk tier. Consequently, while such systems have to protect the SSD tier using RAID 10, which consumes 50% of the capacity as overhead, we do not have to RAID-protect our SSDs at all. In our system, if an SSD fails, you lose a certain percentage of performance acceleration until you replace it, but your data is never lost.
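The combined effect of the choices above can be shown with a back-of-envelope calculation. The dollar figures below are hypothetical placeholders; only the ratios (~2x inline compression, 50% RAID-10 overhead, ~4-6x SLC price premium) come from the discussion above.

```python
# Back-of-envelope comparison of effective flash $/GB.
# Assumed raw MLC price; the absolute number is a placeholder.
mlc_cost_per_gb = 2.0
slc_cost_per_gb = mlc_cost_per_gb * 5   # SLC at ~4-6x MLC (midpoint used)

# Traditional approach: SLC flash protected by RAID 10,
# so only 50% of raw capacity is usable.
slc_effective = slc_cost_per_gb / 0.5    # $/usable GB

# Approach described here: MLC flash, no RAID overhead on the flash
# tier, and ~2x effective capacity from inline compression.
mlc_effective = mlc_cost_per_gb / 2.0    # $/effective GB

ratio = slc_effective / mlc_effective
print(ratio)  # ~20x difference in effective $/GB under these assumptions
```

Even if the individual ratios shift, the point stands: compression, the absence of flash RAID, and cheaper MLC media multiply together rather than add.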
Can the system be designed to always write serially to disks, to perfectly complement flash?
Given that flash as a disruptive technology started making inroads only in the last couple of years, most storage systems had already designed their disk data layout schemes without the benefit of knowing that flash would be around. We had the luxury of asking ourselves how we would optimize data placement on Fat HDDs given the availability of flash.
Our system is optimized to drastically lower the cost of flash in our systems, and to ensure that most random reads are served out of flash. Therefore, the task that the Fat HDDs have to perform very well is to efficiently handle all the writes in our hybrid storage system.
Disks are not very good at random I/O, but they are as good as flash at sequential I/O. We therefore use the same file-system optimization I highlighted earlier – aggregating over a thousand random writes into a single sequential write – so that our system issues only sequential write requests to the disks. This allows us to achieve very good write performance from Nearline drives that have traditionally been regarded as incapable of good random write I/O.
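The aggregation idea can be sketched in a few lines, in the spirit of a log-structured layout. The class and parameter names below are illustrative assumptions, not the actual file-system design; a real implementation would also maintain an index mapping each logical offset to its location in the stripe.

```python
# Minimal sketch: buffer many small random writes, then flush them as
# ONE large sequential write to the backing store.

STRIPE_SIZE = 4 * 1024 * 1024      # assumed flush unit: one 4MB stripe

class WriteAggregator:
    def __init__(self, backing):
        self.backing = backing      # anything exposing append(bytes)
        self.buffer = []            # pending (logical_offset, data) writes
        self.buffered_bytes = 0

    def write(self, offset, data):
        """Accept a small random write; flush sequentially when full."""
        self.buffer.append((offset, data))
        self.buffered_bytes += len(data)
        if self.buffered_bytes >= STRIPE_SIZE:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Lay the buffered blocks out contiguously and issue a single
        # sequential write, regardless of their logical offsets.
        stripe = b"".join(data for _, data in self.buffer)
        self.backing.append(stripe)
        self.buffer.clear()
        self.buffered_bytes = 0
```

The disks only ever see large sequential stripes, which is exactly the workload that high-density Nearline drives handle well.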
The result: A system that optimizes read performance *and* write performance *and* capacity efficiency simultaneously
Putting together all of the architectural choices described above, we believe we have addressed a thorny problem in storage: how do you simultaneously deliver high, cost-efficient read and write performance, while also optimizing cost-effectiveness in terms of $/GB?
It is often difficult for customers to compare storage choices on $/IOPS ($/GB is easier, although apples-to-apples comparisons can be hard even there). An alternative approach is to understand how flash SSDs and HDDs are being used in a system, and what factors drove those architectural choices.
- Suresh Vasudevan