By Amirul Islam – Technical Director at NG-IT
When building out or refreshing a data center, IT leaders grapple with a lot of complexity. One of the most challenging problems is sizing the storage system – buying enough performance and capacity to meet actual current requirements and growth, but not spending on more than you need.
I’ve had the opportunity to configure and install many storage arrays over the years, from a variety of vendors, and have learned first-hand the importance of getting storage sizing right. I’m sharing some of my observations in the hope it will make life easier for my colleagues in IT shops everywhere.
As I see it, there have been four main stages in the art and science of storage sizing – the RAID era, the virtualization era, the tiering era, and the modern era.
The RAID Era
When I started out as a sysadmin, I learned that the storage system had to be sized appropriately for the host, the network, and the applications. But sizing was done a little differently back then: it was largely a matter of satisfying capacity requirements, with less concern for performance.
Much of the interaction with server and database administrators revolved around how RAID sets should be configured. This was at a time when we had to create RAID groups from a number of disks, using a particular RAID type – typically RAID5 or RAID10. RAID5 provided adequate protection for RAID sets of five to ten drives, but RAID10 offered much better write performance, and thus was the choice for write-intensive requirements like database log volumes. However, RAID10 came at a higher capacity cost. So immediately there was a trade-off that had to be understood and sized against three axes: performance, cost and reliability.
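The trade-off above is easy to put into numbers. Here's a minimal sketch, using hypothetical example figures (eight 2 TB drives), of the usable capacity each RAID type yields from the same disks, plus the classic backend write penalties that drove the choice of RAID10 for log volumes:

```python
# Illustrative arithmetic only: compare usable capacity and write penalty
# for a RAID5 set vs a RAID10 set built from the same physical disks.
# Disk size and count are hypothetical example values, not from any vendor.

def raid5_usable(disks, disk_tb):
    # RAID5 loses one disk's worth of capacity to distributed parity.
    return (disks - 1) * disk_tb

def raid10_usable(disks, disk_tb):
    # RAID10 mirrors every disk, so only half the raw capacity is usable.
    return disks // 2 * disk_tb

DISKS, DISK_TB = 8, 2  # eight 2 TB drives (example values)
print(f"RAID5 usable:  {raid5_usable(DISKS, DISK_TB)} TB")   # 14 TB
print(f"RAID10 usable: {raid10_usable(DISKS, DISK_TB)} TB")  # 8 TB

# Classic write penalties: a RAID5 small write costs 4 backend IOs
# (read data, read parity, write data, write parity); RAID10 costs 2.
WRITE_PENALTY = {"RAID5": 4, "RAID10": 2}
```

The same eight drives yield almost twice the usable capacity under RAID5, but each host write costs twice as many backend IOs – which is exactly the three-way tension between performance, cost and reliability described above.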
Back then we were creating a RAID set per volume for a particular host, and lots of volumes per server translated to lots of physical spindles on a storage array for its own use. Disk capacities were much smaller in those days, so many disks were required for even the most basic applications. The result was that, even in a modest-sized environment, there were soon many storage objects to manage, maintain and expand.
These systems required careful planning of how to lay out the data in the most cost-effective and optimised manner, something that's still necessary with some legacy systems today. It effectively siloed the storage array: if plenty of capacity was available in the RAID1 pool but my requirement was for RAID5, I was left with trapped utilisation, even behind a single controller. Services such as encryption only exacerbated this dilemma, as I would now have even more pools to manage.
The problem was that using disks to create lots of discrete RAID sets was very inefficient: it led to wasted space and potentially wasted backend IOPS, and it made capacity expansion more difficult. It also meant that every time a new project was spun up, the infrastructure admins had to carefully plan the configuration of the array, paying particular attention to how the data was striped and protected so it would be resilient to hardware failures.
The Virtualization Era
To overcome these constraints, vendors introduced storage virtualisation, which allowed RAID sets to be aggregated to create large pools of capacity, from which volumes were carved out and presented to servers. This allowed us to create volumes of the actual size we required, without needing to consider headroom. Volumes could be dynamically expanded as and when required, leading to more efficient use of disk capacity in the array and providing more volumes to more servers.
Whilst previously we had specific disks assigned to servers, and thus could guarantee a certain level of IOPS to that server, with storage pools every server now had volumes derived from all of the disks in the storage array, thus sharing the IOPS capability of the entire array. This was not necessarily a bad thing since every server could now benefit from the aggregated IOPS capability of all of the disks; however, it did introduce disk contention, since all applications were now competing for disk access. This improvement in capacity efficiency had the effect of making sizing for performance much more important.
The proliferation of virtual machines, now a de facto standard in modern data centers, means that there are potentially hundreds (or thousands) of virtual machines all competing for disk access in a random fashion – a phenomenon commonly referred to as the IO blender effect. This creates a major challenge for legacy disk vendors in adequately satisfying host IOPS requirements. Spinning disk might deliver good performance when reading and writing sequentially, but it's pretty inefficient at delivering performance for random read and write access.
As disk capacities increased significantly, their performance did not keep pace: the fastest magnetic disks have spun at 15,000 RPM for many years now. In fact, many storage vendors are seeing per-spindle performance fall, with the industry largely consolidating on 10,000 RPM and 7,200 RPM drives. Since capacity requirements could now be satisfied with so few disks, it became much more important to size arrays for performance. We now had to provide lots of physical disks simply to meet the IO demands, even if that meant providing lots of unused storage capacity as a consequence (wasteful, costly, and inefficient).
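A quick back-of-the-envelope calculation shows why performance, not capacity, came to dominate spindle counts. The workload figures below are illustrative, and the ~150 IOPS per 10K RPM drive is a common rule-of-thumb value rather than a measured number:

```python
import math

# Sizing sketch: how many spindles does a workload need for capacity
# vs for performance? All figures here are illustrative examples.

def disks_for_capacity(required_tb, disk_tb):
    # Round up: you can't buy a fraction of a disk.
    return math.ceil(required_tb / disk_tb)

def disks_for_iops(required_iops, iops_per_disk):
    return math.ceil(required_iops / iops_per_disk)

required_tb, required_iops = 20, 10_000
disk_tb, iops_per_disk = 4, 150   # 4 TB 10K RPM drive, ~150 IOPS (rule of thumb)

by_capacity = disks_for_capacity(required_tb, disk_tb)   # 5 disks
by_iops = disks_for_iops(required_iops, iops_per_disk)   # 67 disks
print(max(by_capacity, by_iops), "disks needed")
```

Five disks satisfy the capacity requirement, but 67 are needed to satisfy the IOPS requirement – leaving roughly 248 TB of raw capacity deployed for a 20 TB need. That gap is the waste the paragraph above describes.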
Nevertheless, this was the only way of providing performance, which led to the plethora of considerations that now had to be taken into account when designing a storage system. These include: the number of disks for capacity and performance, how the disks will be assigned, what pools they will form, whether data ought to be encrypted, and which logical volumes will reside in which pools. This required a clear understanding of the applications, with accurate performance profiles, to place volumes properly, reduce disk contention and thus deliver good application performance. That was not an easy task when application performance metrics were not readily available, and it could lead to time-consuming application analysis and profiling.
The Tiering Era
As storage virtualization proliferated, flash storage also grew quickly, migrating from consumer devices to the enterprise. With the use of enterprise-grade solid state drives (SSDs), legacy storage vendors could now deliver random IO much more effectively from solid state storage than spinning magnetic drives.
But integrating SSDs into a legacy architecture leads to design compromises; it's not easy to modify 20 million lines of code in a 20-year-old filesystem. When some vendors started integrating SSDs, there were limitations on how they could be used, and they typically had to reside in their own pool. This high-performance pool would deliver great performance but offer limited capacity, so now the administrator had to decide which volumes would reside in this pool and benefit from its performance, and which remained on the slower disks. Introducing expensive SSDs didn't improve performance for all the applications running on the system, just the select few fortunate enough to reside in the SSD pool.
Some vendors addressed this by creating pools of mixed disk types and introducing tiering technology. Others modified the way data flows through their storage systems. However, data optimisation techniques such as automated tiering bring their own challenges, since the data movement between tiers is often post-process. While analysis of data activity is performed in real time to identify hot and cold blocks, the actual data movement to the relevant tier is scheduled for a quiet time of day (typically overnight). Not only does this introduce additional disk activity to move the data blocks between disks (which can interfere with backup activity), but ironically, it is unable to deliver performance when the applications actually need it.
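The real-time analysis vs post-process movement split can be sketched in a few lines. This is a toy model with hypothetical names and thresholds, not any vendor's actual implementation: access counts are tracked as IOs arrive, but blocks only change tier when the scheduled migration job runs.

```python
# Toy sketch of post-process tiering (illustrative model, not a real product):
# real-time IO counting, with tier changes deferred to a scheduled window.
from collections import Counter

access_counts = Counter()   # block_id -> IO count since last migration
block_tier = {}             # block_id -> "ssd" or "hdd"

def record_io(block_id):
    """Real-time part: just count the access; no data moves yet."""
    access_counts[block_id] += 1
    block_tier.setdefault(block_id, "hdd")  # new blocks land on HDD

def nightly_migration(hot_threshold=100):
    """Post-process part: promote hot blocks, demote cold ones.
    Until this runs, a newly hot block keeps being served from HDD."""
    for block, count in access_counts.items():
        block_tier[block] = "ssd" if count >= hot_threshold else "hdd"
    access_counts.clear()   # start a fresh observation window
```

The gap between `record_io` and `nightly_migration` is exactly the irony described above: a block can be hammered with IO all day and still be served from spinning disk until the overnight job promotes it – by which time the burst may be over.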
The Modern Era
Today, my job is much easier. Whereas sizing a storage solution used to be a time consuming process that required a clear understanding of the applications that would reside on it, today it’s much simpler. In fact, I typically need to ask just four main questions:
- How much performance do you need?
The response doesn’t even have to be exact. There’s no more clarifying whether they need 5,000 IOPS or 7,000 IOPS, as that could fundamentally change the solution. Nimble simply provides a maximum performance figure per device, starting at 15,000 IOPS.
- How much capacity do you need?
Depending on how much capacity is needed, I configure the Nimble with the relevant disks to provide a single pool of capacity. There is no need to think about RAID set sizes, disk assignment, pool sizes, and so on.
- What connectivity do you require?
Is it Fibre Channel or iSCSI? If it’s iSCSI, do you want 1Gb or 10Gb Ethernet?
- What is the latency of your current applications?
Are there any applications that can benefit from lower latency? In some cases, this can return huge benefits to a business.
The storage industry sometimes fixates on how many IOPS a platform can deliver. However, reducing latency from 40ms to 1ms will often not just speed up batch processing and queries, but also create a much more compelling experience for the business’s end customers.
Without all the time spent on storage sizing, I’m free to work with customers on their overall infrastructure strategy, including virtualisation, backup, disaster recovery, and application optimisation. The Nimble solution doesn’t just make my life easier, it also simplifies operations for the end customer, giving them full visibility into the array’s performance.