Blog

Blog Directory

  • The Growing Need for Write-Optimized Storage »
  • Notes from the First-Ever Nimble User Group »
  • Getting Up to 12x Better Efficiency Using Nimble with Exchange »
  • Guest Post: Ron Kanter, Berkeley Research Group – Our Experiences Using Nimble Storage for Exchange 2010 »
  • Guest Post: Using Exchange on Nimble – The Foster Pepper Story »
  • Guest Post: Benjamin Craig SVP and CTO of Northrim Bank, on Nimble Technology »
  • Are SSD-based Arrays a Bad Idea? »
  • Nimble Storage Certified for VMware VDI »
  • The Emperor’s New (Flashy) Clothes »
  • Our First Full Fiscal Year of Operations »
  • Snapshots + Backup Management = the Best of Both Worlds »
  • M.C. Escher and Storage: True Efficiency »
  • Write Caching in Flash: A Dubious Distinction »
  • Solving the VDI Storage Paradox »
  • Non-Disruptive Software Upgrades: Statistics from Our Latest OS Version »
  • Can young companies deliver better business critical support than large companies? »
  • Evaluating Effectiveness of Hybrid Flash and Disk Storage Systems »
  • Storage Efficiency: $/IOPS, and Not Just $/GB »
  • My First Few Weeks as CEO »
  • Extended Snapshots and Replication As Backup »
  • The Nightmare of Incremental Backup is Over »
  • A Comparison of Filesystem Architectures »
  • How Snappy and Skinny Are Your Snapshots? »
  • What Defines Converged Storage and Backup? »
  • Why Does Enterprise Storage Cost So Much? »
  • Better Than Dedupe: Unduped! »
  • Why Converge Primary and Backup Storage? »
  • A Clean-Slate Approach to Converging Primary and Backup Storage »
  • CEO’s Introduction to Nimble Storage »

    by Radhika Krishnan
    Head of Solutions and Alliances

    By all indications, today’s data centers are actively embracing snapshots for backup.  While most are using snapshots to augment traditional backups, several are replacing their traditional backup solutions outright with snapshot-based backup and recovery.

    Consider the results of a survey conducted by Gartner. Nearly two-thirds of the respondents stated that they plan to augment backup software with snapshot and replication solutions, and one-third of those polled indicated that they plan to replace backup software with snapshots and replication.

     

    Source: “The Future of Backup May Not Be Backup”
    Gartner ID Number: G00218917 Author: Dave Russell Date: 22 Sep, 2011

    What is driving this trend?  Simply put, snapshots can be generated quickly and frequently, without the overhead and resource requirements typical of traditional backup. Most of all, recoveries are instantaneous and painless.  Combining snapshots with bandwidth-efficient replication provides additional protection, safeguarding against local site outages. For a detailed discussion of why traditional backup methods are failing to meet recovery needs, refer to this discussion.

    What’s not to like?  Well, here’s the big caveat. NOT ALL SNAPSHOTS ARE CREATED EQUAL. Moving the onus of backups to a storage system that isn’t optimized for this foundational capability would be like leaping from the frying pan into the fire.

    First, to allow frequent recovery points, snapshots should be inexpensive in terms of both storage capacity and performance.  Otherwise, you could end up consuming expensive storage capacity, not to mention bogging down the performance of the array.  Second, the implementation needs to support a large number of snapshots to truly support frequent recovery points. (See here for further discussion on snapshot implementations.) So make sure to read the fine print and ask the hard questions.

    Rounding out the Solution

    While snapshots can greatly improve backup efficiency, there is tangible value in combining this foundational capability with an intelligent backup management system for an end-to-end solution.

    This is what the Nimble-Commvault partnership is centered around.

    Commvault has been one of the first vendors to recognize and embrace the industry trend towards snapshots. The Commvault IntelliSnap Connect program facilitates a data and information management approach that leverages hardware based snapshot and replication technologies.

    The joint solution would deliver the following to end customers:

    • Centralized management that allows admins to configure, execute, and monitor backup and restore operations, as well as perform monitoring and auditing from a single console
    • Support for long-term retention, compliance and e-discovery through archival to tape and virtual tape media
    • Integrated catalog that allows indexing and tracking of snapshot copies across primary, secondary, and other forms of media
    • Granular recoveries of snapshots, application objects, files, VMs, and volumes
    • Reduced cost and complexity through elimination of redundant server, networking and storage resources required by traditional backup architectures
    • Frequent recovery points irrespective of the volume of data to be protected
    • Rapid restores that require no data movement
    • Support for application-consistent and VM-consistent recoveries, which speeds up application and VM restore times

    Stay tuned for more on this front.


    Ajay Singh
    Vice President, Product Management

    The Escher Stairs of Efficiency Claims

    An end user bombarded by the many efficiency claims made by storage vendors might be forgiven for being confused, skeptical, or both. How is it possible for so many vendors to claim they deliver storage with X% lower cost than other vendors? For all these claims to be true, the storage world would have to be the real world equivalent of M.C. Escher’s  mind-bending Penrose Stairs. What’s really going on here?

    Comparing Storage Efficiency

    Well, the problem is that most such claims are based on simplistic comparisons, such as comparing only capacity efficiency (usable capacity/raw capacity) or only raw performance.  And even these are often inflated with unrealistic assumptions.

    While interesting, such one-dimensional comparisons are typically only useful for niche applications such as archiving or HPC.  For mainstream applications you typically care about multiple dimensions of a storage solution such as price/performance, data protection, availability and capacity efficiency. Knowing this, the question then is – how does one construct more meaningful comparisons?

    A Better Comparison

    Assuming many solutions meet your threshold of reliability and availability, here are some dimensions of storage efficiency you might consider in comparing them:

    · Capacity AND Performance Efficiency

    A basic definition of capacity efficiency (usable capacity/raw capacity) can be too simplistic for a couple of reasons. Often it ignores capacity savings techniques like inline compression and cloning. More importantly, it ignores the inherent performance differences between architectures. If you could get 50% compression without a performance impact, that’s certainly nice. But if you could get the performance of high performance drives (15K rpm disks, or better, flash SSDs) and the capacity of high density drives (7.2K rpm disks) in a single tier of storage – that’s HUGE! When you consider 15K RPM drives cost 500% more per GB than 7.2K RPM drives, the above example translates to a 500% capacity advantage from the get go!  To capture such differences, a meaningful comparison of efficiency ought to consider both $/GB AND $/IOPS.

    · Data Protection Efficiency

    The most visible elements of efficient data protection are the capacity efficiency of backup storage (e.g. dedupe ratios), and the bandwidth and capacity efficiency of DR storage. It’s less common to see quantitative comparisons of the level of data protection, namely the RPOs and RTOs enabled by the system, although these translate to very real and potentially big costs. And then there’s another part which is sometimes overlooked and typically harder to quantify: operational efficiency, in other words how easy it is to set up and manage backups and DR on a day-to-day basis. More on this topic next.

    · Operational Efficiency (i.e. Simplicity)

    This is the dimension that is hardest to measure, but no less important to consider. Operational Efficiency encompasses qualitative attributes like simplicity – can an admin just install and start using a storage technology without days of training, professional services and years of experience? Does the performance adapt quickly to changing workloads? Quantitative measures might be the time (or number of steps) required for common tasks.

    There’s another reason to pay close attention to operational efficiency – it helps you distinguish truly efficiently designed storage solutions from less efficiently “bundled” ones. Here’s a hypothetical example to illustrate:

    What if you had a shrink-wrapped solution that bundled a small amount of expensive but fast storage together with a lot of cheap but slow storage? Suppose it also threw in some software to slowly move data back and forth, to place the right data on the right tier, and some more software to do the same for backup purposes. On paper such a solution can appear to have it all: good $/IOPS, good $/GB and automation to simplify management. So what could be missing? Potentially a LOT!

    If the data transfer process is slow and heavy duty – it might take hours to complete and impact performance while it’s happening. And since application workloads change dynamically, you’d be constantly monitoring workloads and over-allocating performance tiers to ensure bursty applications don’t experience bad performance for extended periods. Despite this, it’s virtually certain that some applications would experience poor performance. As for backups/restores – you’d be constantly battling backup windows and dealing with poor recovery points and slow, painful restores. So in reality, such a package would deliver much less than the sum of its parts.

    What This Means for You

    Not every application needs a multi-dimensional, well-balanced storage solution. Perhaps for an archive tier $/GB is the one overriding concern. Or maybe for a critical application you’re willing to pay a lot for performance, even if it means compromising on capacity and efficient data protection.  However, the vast majority of mainstream applications need more versatile storage solutions.

    One approach to picking the right one is to assign explicit weights to your criteria: for example capacity efficiency, performance efficiency, data protection efficiency and operational efficiency might be all equally important in your environment and deserve equal weights. You can then compare storage solutions under each of these four criteria and rate each on a scale of 1-5. The overall weighted rating would give you a much better measure of storage efficiency for your applications than anything vendor marketing materials could. In upcoming blogs we will share real world data on how Nimble does on each of these criteria.
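    The weighted rating described above can be sketched in a few lines of Python. This is only an illustration: the function name, the two candidate solutions and their 1-5 ratings are made up for the example and are not measurements of any real product.

```python
def weighted_rating(ratings, weights):
    """Return the weighted average of 1-5 ratings; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(ratings[c] * weights[c] for c in weights) / total_weight


# Equal weights across the four criteria named in the text.
weights = {"capacity": 1, "performance": 1, "data_protection": 1, "operations": 1}

# Hypothetical ratings (1-5) for two made-up candidates.
solution_a = {"capacity": 5, "performance": 2, "data_protection": 3, "operations": 2}
solution_b = {"capacity": 4, "performance": 4, "data_protection": 4, "operations": 4}

score_a = weighted_rating(solution_a, weights)
score_b = weighted_rating(solution_b, weights)
print(score_a, score_b)
```

    In this made-up case the solution that wins on capacity alone ends up with the lower overall rating. Changing the weights (say, doubling "capacity" for an archive tier) shifts the result accordingly.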


    Umesh Maheshwari
    Co-Founder and CTO

    Flash memory shines on reads: it reads 100 times faster than a disk. But its performance advantage is much weaker on writes, and its write endurance is much lower than disk’s. Therefore, Nimble OS uses flash only for accelerating reads, aka “read caching”. It uses NVRAM (a DRAM-based device) for accelerating writes, aka “write caching”.

    On the other hand, a few storage systems use flash memory for write caching. Here I describe what compels these systems to use flash in this manner and the cost-benefit tradeoff it entails.

    In general, storage systems implement write caching using a non-volatile “write buffer.” On a write request, the system stores the data into the write buffer and acknowledges the request. In the background, as the buffer fills up, the system drains the buffer to the underlying storage. The speed at which the write buffer can be drained to underlying storage constrains the sustainable write throughput.

    The write buffer helps in the following ways:

    1. It enables the storage system to acknowledge a write request with very low latency.
    2. It can absorb a high-throughput burst of writes, while it drains less speedily to disk-based storage over a longer period of time.
    3. It absorbs overwrites (multiple writes to the same blocks), thereby reducing the amount of drainage, which may support a higher write throughput.
    4. It allows the data being drained to be sorted by logical addresses, thereby improving the sequentiality of drainage, which may improve the speed of draining and support a higher write throughput.
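
    As a rough illustration of points 3 and 4, here is a toy write buffer in Python. It is a sketch under simplifying assumptions, not how any particular array implements buffering: the class name, the dictionary standing in for disk, and the block-count capacity are all hypothetical, and non-volatility is not modeled.

```python
class WriteBuffer:
    """Toy write buffer keyed by logical block address (LBA)."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.pending = {}  # LBA -> latest data; an overwrite replaces the entry

    def write(self, lba, data, backing_store):
        """Buffer a write and acknowledge it; drain first if the buffer is full."""
        if lba not in self.pending and len(self.pending) >= self.capacity_blocks:
            self.drain(backing_store)
        self.pending[lba] = data
        return "ack"

    def drain(self, backing_store):
        """Flush buffered blocks in LBA order and return the order written."""
        order = sorted(self.pending)
        for lba in order:
            backing_store[lba] = self.pending[lba]
        self.pending.clear()
        return order


disk = {}  # stands in for the underlying storage
buf = WriteBuffer(capacity_blocks=4)
for lba, data in [(7, "a"), (2, "b"), (7, "c"), (5, "d")]:
    buf.write(lba, data, disk)

buffered_before_drain = len(buf.pending)  # the second write to LBA 7 replaced the first
drain_order = buf.drain(disk)             # blocks leave the buffer sorted by LBA
```

    Four writes arrive but only three blocks are drained, and they reach the backing store in address order rather than arrival order.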

    The latency advantage depends on the buffering medium. NVRAM (DRAM made non-volatile with battery backup or flash backup) provides latency of a few tens of microseconds, flash a few hundred microseconds, and disk a few milliseconds. Most storage systems use NVRAM for write buffering. However, file systems that are not tied to a hardware platform cannot assume the availability of NVRAM, and may buffer writes on flash or even on disk. E.g., the write buffer in ZFS, called ZFS Intent Log (ZIL), is generally stored on flash or disk.

    A few storage systems now use flash as a secondary write buffer in addition to using NVRAM. E.g., EMC “FAST cache” uses flash as both a read cache and a write buffer. In such systems, written data is staged through the NVRAM-based buffer, the flash-based buffer, and finally to disk. The flash-based buffer is much bigger than the NVRAM-based buffer, and therefore provides higher levels of burst absorption, overwrite absorption, and sequentiality improvement, which in turn may support a higher write throughput. These advantages are based on the assumption that the NVRAM-based buffer cannot be drained directly to disk-based storage at high throughput.

    Most storage systems employ a simplistic disk layout such that draining the write buffer results in random writes on disk. Furthermore, these systems amplify the IO load in order to support parity RAID and copy-on-write snapshots. The resulting load cripples the speed at which data can be drained to disk. (NetApp’s WAFL performs better by concatenating random data blocks and writing them into free space, but it too degenerates gradually as the free space becomes fragmented.) Because these systems cannot drain to disk at high speed, they stand to benefit from adding a larger write buffer. Even so, this benefit is limited because it does not eliminate random writes to disk—it only reduces them by some modest amount.

    Furthermore, many of these storage systems could instead use a disk-based write buffer, which would be similar to a write-ahead log used in database systems. The log is written sequentially, and disks handle sequential writes just as well as flash drives do (about 100MB/s per drive). One advantage of a flash-based buffer over a disk-based buffer is that it also serves as a read cache for newly written data. However, as described later, there are cheaper ways of building a read cache. Another advantage is that the draining process can read the flash-based buffer in random order, so it supports a more thorough sorting of the data, thereby extracting more sequentiality.

    Now consider the cost of write buffering. A flash-based buffer is expensive. First, because it holds the only copy of newly written data, it must employ the more expensive forms of flash and controllers, and also some RAID-like redundancy in the form of parity or mirroring. (In fact, a flash-based buffer needs to be even more reliable than an NVRAM-based buffer, because it is larger and the overwrite-absorption and re-sorting might make it difficult to recover the system to a consistent state upon loss.) On the other hand, a read cache does not ever store the only copy of any data, so it can be constructed inexpensively without sacrificing reliability: add a checksum to every block, verify the checksum on every read, and toss the cached block if the checksum does not match. Second, pushing the writes through flash burns through its limited write endurance, again requiring expensive, high-endurance, flash. Third, to obtain a significant edge over NVRAM-based log, the flash-based log must be much bigger. E.g., it may need to be large enough to absorb all writes during a busy period lasting hours.
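
    The inexpensive read-cache recipe mentioned above (checksum every block, verify on every read, toss on mismatch) can be sketched as follows. This is a minimal illustration, not Nimble’s implementation: the class, the dictionary standing in for disk, and the choice of CRC32 from Python’s standard `zlib` module are assumptions made for the example.

```python
import zlib


class ChecksummedReadCache:
    """Toy read cache that stores a CRC32 alongside each cached block."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # authoritative copy (stands in for disk)
        self.cache = {}                     # LBA -> (data bytes, checksum)

    def read(self, lba):
        entry = self.cache.get(lba)
        if entry is not None:
            data, checksum = entry
            if zlib.crc32(data) == checksum:
                return data              # cache hit, checksum verified
            del self.cache[lba]          # corrupted entry: toss it
        data = self.backing_store[lba]   # fall back to the authoritative copy
        self.cache[lba] = (data, zlib.crc32(data))
        return data


disk = {10: b"hello"}
cache = ChecksummedReadCache(disk)
first = cache.read(10)                            # miss: populates the cache
cache.cache[10] = (b"hellx", cache.cache[10][1])  # simulate corruption in the cache
second = cache.read(10)                           # mismatch: block is re-read from disk
```

    Because the cache never holds the only copy of a block, a corrupted entry costs one extra disk read rather than lost data, which is why the cache medium itself does not need RAID-style redundancy.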

    The questionable value of using flash as a write cache for disk is epitomized by a research paper, Extending SSD Lifetimes with Disk-Based Write Caches, which states the following:

    “We present Griffin, a hybrid storage device that uses a hard disk drive (HDD) as a write cache for a Solid State Device (SSD).”

    In other words, the authors are proposing just the opposite of using a flash-based write cache for disk! These authors are reputable researchers from academia and Microsoft Research, and they exhibit a deep understanding of flash characteristics as a storage medium. There are practical issues with following their proposal, but the mere existence of this proposal calls into question the wisdom of using flash for write caching.

    Nimble’s CASL™ filesystem uses the entire disk storage as a log, and always writes data to disk in large sequential chunks. This enables it to drain data from NVRAM buffer to disk storage at high throughput. This avoids the need for a secondary write buffer. It is as if the entire disk subsystem is at once a write buffer and the end point of storage.

    In summary, flash-based write caching addresses burst throughput but only partially improves sustained throughput, while a write-optimized disk layout addresses both with little cost. However, systems with legacy disk layouts are forced to cache writes in flash as a costly fix to improve their write performance partially.


    by Radhika Krishnan
    Head of Solutions and Alliances

    Nimble recently completed a survey of 599 respondents regarding business drivers and challenges in deploying VDI. While the interest in VDI continues to grow, costs and performance were flagged as the biggest storage-related challenges impeding VDI deployments.

    Those of us who are familiar with VDI would agree that this is not all that surprising. Storage performance heavily determines the responsiveness of virtual desktops and if the user experience is diminished, users will not accept VDI. And in these times of tightening budgets, costs are always subject to close scrutiny.

    On closer analysis though, you realize that addressing both cost and performance together is a paradoxical problem to overcome with traditional storage solutions. Let’s see why.

    VDI has some unique workload characteristics. At steady state, VDI behaves predictably with IOPS tied closely to the profile of desktop workloads being run. However, in the course of a normal day, VDI infrastructure also goes through boot storms and login storms (periods when multiple desktop users try to boot or log in at the same time), which cause a peak in read IOs. There are also virus scanning and OS upgrade operations that occur from time to time, which trigger a spike in writes.


    Traditional storage wisdom recommends throwing flash and expensive high-RPM drives at the problem to provision for these peak scenarios.

    Wouldn’t that then cause storage costs to shoot up? So how does one get around this conundrum?

    It would seem the crux of the problem comes down to efficiency. The Oxford dictionary defines efficiency as achieving maximum productivity with minimum wasted effort or expense. In other words if we can meet the performance demands of VDI “efficiently,” that would by definition mitigate the cost challenge.

    While efficiency has been used in the storage industry predominantly in the context of capacity efficiency (i.e. $/GB), it is equally critical to focus on performance efficiency as well (i.e. $/IOPS). A combination of those two sets of efficiencies would result in “affordable performance.” Unfortunately, most storage systems today tend to be optimized for one or the other, not both.

    But wait, that’s not all. We just talked about the tendency of VDI IOs to fluctuate throughout the day. Clearly there is more to performance efficiency than $/IOPS alone. A truly efficient solution needs to be able to deliver performance when needed without incurring high overheads, i.e. “adaptive” performance. In essence, what you really need is “affordable” and “adaptive” performance.

    Unfortunately, traditional tiered solutions fall short in this regard owing to the complexity and overhead associated with data classification, data movement, and the granularity of data movement. One could construct a system that is highly optimized around $/GB and $/IOPS, but has data trapped in a low-performance tier, rendering the system unresponsive to fluctuating workload demands.

    Can you architect a system that truly delivers “affordable” and “adaptive” performance? The answer is yes and it comes down to not just what resources the storage solution leverages for performance, but how it leverages those resources.

    Let’s shift gears and look at real-world examples of customers who have successfully deployed VDI, and what has worked for them. One approach is to start out with VDI by consolidating virtual desktop workloads with other workloads over the same storage infrastructure.

    No longer do you have to purchase silo’d infrastructure that has to be managed separately, thus cutting down on both capex and opex costs.

    Of course, the storage array needs to be able to deliver “adaptive, affordable performance” and simplified management to effectively handle the consolidated workload.

    And once you are ready to scale up to higher desktop numbers, rinse, repeat.


    by Rod Bagg
    Director of Support

    One of our customers called the other day and asked me to explain our Non-Disruptive Software Updates. I gave him the usual basics about our two-click process of download and update directly from the Array management GUI; you know, no downloading to your PC and copying files here or there – just a couple clicks.

    Then I continued with a technical explanation of the update process itself. First, the update process runs a set of high-availability health checks to ensure the system and networking are in proper working order. Next, the standby controller unpacks the software image in a new location and reboots into the new version in standby mode. Then the active controller unpacks the image and reboots, and the standby takes over. All of this happens unbeknownst to the applications.
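
    For readers who think in code, the sequence above can be sketched as a small Python function. This is purely illustrative and is not the actual update logic: the function name, the controller dictionaries and the version strings are hypothetical, and real health checks and failover involve far more than a flag and a role swap.

```python
def rolling_update(controllers, new_version, health_checks_pass):
    """Return the ordered steps of a two-controller update, stopping if checks fail."""
    steps = ["run high-availability health checks"]
    if not health_checks_pass:
        steps.append("stop: fix the reported issues and retry")
        return steps

    standby = next(c for c in controllers if c["role"] == "standby")
    active = next(c for c in controllers if c["role"] == "active")

    # The standby goes first, so the active controller keeps serving I/O.
    standby["version"] = new_version
    steps.append(f"{standby['name']}: unpack image, reboot into standby on {new_version}")

    # The active controller then updates and reboots; the updated standby takes over.
    active["version"] = new_version
    active["role"], standby["role"] = "standby", "active"
    steps.append(f"{active['name']}: unpack image, reboot; {standby['name']} takes over")
    return steps


controllers = [
    {"name": "A", "role": "active", "version": "old"},
    {"name": "B", "role": "standby", "version": "old"},
]
steps = rolling_update(controllers, "new", health_checks_pass=True)
blocked = rolling_update(
    [{"name": "A", "role": "active", "version": "old"},
     {"name": "B", "role": "standby", "version": "old"}],
    "new",
    health_checks_pass=False,
)
```

    At every point in the sketch exactly one controller holds the active role, which is the property that keeps the update invisible to applications.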

    Since he was a new customer, I felt compelled to give him a few stats on our latest version of GA software to help put him at ease. We announced our latest version of Nimble OS two weeks earlier and had 170 systems already updated when the customer called me. Of those systems, 55% were updated during their prime time while serving production data. And 100% of all systems had no service disruption to any application.

    Now, I had to be completely honest and let him know there were 10 systems that took an extra step to update. A built-in high-availability health-check process had determined that network connectivity mismatches on the standby controller could have caused interruption or degradation of service to applications after failover. A quick reconfiguration of the customer’s network and the updates were once again off to the races.

    By the way, the customer hit the update button about halfway through that last sentence…


    Suresh Vasudevan
    CEO

    As the CEO of a young company that is growing rapidly, half my time is spent in the field with customers, channel partners and prospects. The initial discussion is primarily on our technology and value proposition. We have been fortunate in that many of those conversations turn quickly into strong interest from our prospective customers.

    The discussion then focuses on our ability to support the customer, given our smaller size. In many instances, it is at this stage of the engagement that our competitors start creating FUD (fear, uncertainty, and doubt) about our size. For instance, a few days ago, an account executive at a large competitor sent our prospect (his customer) an email explaining in great depth why we would be unable to match their support expectations. He was basing this claim in part on the fact that even they, a much larger corporation, were hard pressed to satisfy the customer despite their size and resources.

    I think that every customer should rightly inspect why we think we can match or exceed his or her support expectations, and why we can perform better than much larger companies in this regard.

    Customer references are great proof points

    Ultimately, the best proof point comes from what our customers say about our support, and our sales teams draw upon a large base of happy customers as references. So, let me start with some recent examples of unsolicited customer emails as proof points before I describe how we deliver support:

    • “Thank you, team, for the detailed technical report as well as your assistance in proactively addressing the issue. You have a truly exceptional team and I can only wish that all of our vendors/partners were as responsive and effective as Nimble.”
    • “Dan, Got your vmail. Thanks for the update. I do appreciate Nimble finding potential issues and proactively resolving them before it affects our systems. Thank you!”
    • “This isn’t hard enough! I need thick manuals, on-site field engineers, complicated support web site, a-la-carte licensing charges for using the product’s advertised features, complicated spreadsheets to document configurations, maybe even a storage specialist, etc. Something is very wrong here.”
    • “Last Friday night around 8:30pm, one of our employees accidentally deleted some very important documents and I was with my family in a restaurant, can’t get to a computer right away. I told him to call Nimble support, he received help right away and the data was restored. Now that is customer service!!!”
    • “Chris, Just wanted to say that the below email from Nimble is very impressive. Never would I have seen the email below come from any other vendor that includes specific information pertaining to our device and letting us know what would fail and how to fix it before we have even performed the upgrade. All I can say is WOW.”

    Core philosophy driving our support approach

    We started off with a deeply held belief that product supportability needs to be thought of as an aspect of the product, and an area that is as ripe for innovation as any other aspect of the product. We hired support architects a year before the product shipped, and assembled a support team that has built and fine-tuned remote support automation multiple times in prior companies. There are some core beliefs that underpin our approach to delivering support:

    1. The lifetime value of a customer trumps any single deal, and support is crucial to a long-term relationship.
    2. In a wired world, we can be always connected to our systems. This should allow us to track events affecting our systems at the same time or sooner than our customers. Furthermore, being connected to our systems should allow us to directly access the systems and fix issues much faster by leveraging our deeper knowledge of our systems.
    3. As support teams grow, training and expertise cannot keep pace unless we automate as much as possible.

    These beliefs have led us to deliver a model of support that is dramatically ahead of the storage industry.

    Our approach to support

    I will start off by saying that while we have already done a lot, as described below, what excites me more is that we have laid a foundation for scaling support, and every release and every month allows us to do more.

    1. Real-time telemetry: Can we know everything possible about our systems at our customers’ sites?

    We have all had instances where a support person starts by asking if the equipment has been powered on, and then proceeds to ask hundreds of mundane questions. We wanted to avoid this. Many vendors in the industry are able to receive daily logs from their systems. We have dramatically enhanced this capability: our systems send diagnostic, configuration-related and operational health information every few minutes! We have architected this so that, in most cases, our systems can send information even when they are unable to serve data. All of this information is parsed into a database infrastructure that maintains a detailed history for every system in the field.

    2. Comprehensive background health checks: How are our systems doing?

    We then run a variety of health checks that are constantly measuring the state of every system. We are always updating and increasing the set of checks to make them more comprehensive, as we discover any new issues. For example, during a recent release upgrade, one of our customers found that certain wild card characters in an encoded password would cause issues with authentication. We were able to query our database and rapidly identify the specific customers and the exact systems and volumes within those systems that would be affected and alert them with the corrective action before they encountered the issue!
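
    To give a flavor of what such a targeted query might look like, here is a small sketch using Python’s built-in `sqlite3` module. Everything in it is hypothetical: the table layout, the `password_has_wildcard` column, and the customer, system and volume names are invented for illustration and do not describe our actual telemetry database.

```python
import sqlite3

# In-memory database standing in for a store of parsed telemetry.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE volumes ("
    "customer TEXT, system TEXT, volume TEXT, password_has_wildcard INTEGER)"
)
conn.executemany(
    "INSERT INTO volumes VALUES (?, ?, ?, ?)",
    [
        ("acme", "array-1", "vol-a", 1),
        ("acme", "array-1", "vol-b", 0),
        ("globex", "array-7", "vol-x", 1),
        ("initech", "array-3", "vol-q", 0),
    ],
)

# Find every customer, system and volume that matches the newly discovered condition.
affected = conn.execute(
    "SELECT customer, system, volume FROM volumes "
    "WHERE password_has_wildcard = 1 "
    "ORDER BY customer, system, volume"
).fetchall()

for customer, system, volume in affected:
    print(f"alert {customer}: {system}/{volume} needs corrective action before upgrading")
```

    The point of the sketch is that once configuration data is already in a queryable store, turning a newly discovered issue into a list of affected systems is a single query rather than a round of phone calls.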

    3. Automated alerting and case creation

    The main goal of the constant health checks is to let us react rapidly to issues with the least amount of lost time and disruption. To this end, our support team is immediately notified of any error that is detected by the health checks, and they proactively call the customer when that happens. Our case management system is able to auto-create a case when the health check can definitively point to a known error or defect or disruption. Some of my best customer anecdotes are when a customer describes a call from our support organization enquiring about the health of a system when the customer had just been reconfiguring a test system for some internal tests.

    4. Automated case resolution and prevention

    Beyond being able to react quickly, we are increasingly focused on automated case resolution. To this end, our engineering and support teams are focused on correlating a health check alert to a definitive error and beyond that to known solutions that can correct the errors. For example, we can send the customer a knowledge base article if we can definitively correlate a health check alert to a known issue with a known resolution. As another example, recognizing that configuration changes can prevent many problems from occurring in the first place, we have the ability to target specific arrays or groups of arrays and remotely send specific instructions that reconfigure the arrays to avoid an issue.

    5. Secure remote access: Can we mimic onsite presence?

    Despite all of the above, there will always be instances where a problem has been escalated and needs engineering involvement. In those instances, our approach has been to try to recreate an experience that mimics us being onsite. To that end, our arrays are pre-configured (upon customer delegation) to create a VPN tunnel that lets our engineers work on the remote system directly, without first asking the customer to collect logs and other diagnostic information. On numerous occasions, the customer completely delegates the array to us while we diagnose and fix the issue remotely.

    People and culture: The first and last line of defense

    Ultimately, the support that a customer experiences when faced with a problem is perhaps more “people-dependent” than any other interaction between a company and its customers. While technology, tools and training can help, the quality of the support interaction depends very much on how the person delivering support feels about the company he or she is working for. In that respect, we at Nimble are fortunate in that we are intensely proud of what we are building and have a strong desire to make every customer successful. Maybe that is the real answer to why we can support our customers better than our larger competitors!


    Suresh Vasudevan
    CEO

    In a previous blog on storage efficiency, I had suggested that mainstream enterprise applications that need both performance and capacity are best addressed by hybrid storage systems, and that the effectiveness of blending is what distinguishes various storage systems in terms of their ability to cost-effectively deliver performance AND capacity.

    This raises the question of what criteria one would use to judge the effectiveness of blending.  At this juncture it is really important to point out that our entire focus is on efficiency.  There are many ways to deliver absolute high performance, but our focus is on delivering performance at the lowest cost.  Similarly, when it comes to capacity optimization, our focus is on delivering very cost-effective usable capacity for Enterprise applications, but not on matching JBOD price-points.

    Optimization starts with playing to the strengths of flash and disks

    Let us briefly revisit some core properties of low-cost, high-density, near-line drives (“Fat HDDs”) and Flash SSDs, since these are key to assessing how to maximize the benefits of both.

    • Fat HDDs have the lowest cost per GB, are not good at random I/O, but perform fairly well at sequential I/O.
    • Flash SSDs are very good at random reads. They are better than Fat HDDs at random writes, but random writes degrade Flash SSD life.  Lastly, they are not that different from Fat HDDs for sequential I/O performance.
    • SLC versus MLC versus eMLC SSDs. SLC flash SSDs are 4-6 times more expensive than MLC flash SSDs, and they provide a much higher number (~10X) of write cycles compared to MLC SSDs (more on this below). eMLC SSDs are somewhere between the two on write endurance and on prices.

    The core optimization parameters we used to maximize efficiency

    With the above characteristics in mind, we designed our system from the ground up to optimize the following parameters, automatically with no user intervention: (i) real-time decision making about data placement; (ii) achieving the maximum performance acceleration for every $ spent on flash; (iii) achieving the maximum usable capacity for every $ spent on disks; and (iv) ensuring that the system optimizes writes (“sequential layout”) to disks, so as to complement flash perfectly.

    How real-time are system decisions about data placement?

    We make real-time decisions about whether to place data on flash, with every read and write I/O, so that cold data does not needlessly go to flash and crowd out “hotter” data. We also considered approaches that optimize placement less frequently (e.g., once a day or once every few hours). They are easier to implement, but we concluded that they would be less efficient: critical data often cannot make it into the flash tier in time, or the system needs a larger amount of flash as a buffer between decision points.
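
    To make the idea of a per-I/O decision concrete, here is a toy placement policy in Python. It is a sketch under stated assumptions, not our actual algorithm: the `FlashCache` class, the "promote on second access" rule, and the LRU eviction are all invented for illustration.

```python
from collections import OrderedDict


class FlashCache:
    """Toy per-I/O placement policy (illustrative only)."""

    def __init__(self, capacity_blocks: int, promote_after: int = 2):
        self.capacity = capacity_blocks
        self.promote_after = promote_after
        self.flash = OrderedDict()   # block id -> True, ordered by recency
        self.hits = {}               # access counts for blocks not yet in flash

    def access(self, block: int) -> str:
        """Handle one I/O and return where it was served from."""
        if block in self.flash:
            self.flash.move_to_end(block)        # refresh recency
            return "flash"
        count = self.hits.get(block, 0) + 1
        self.hits[block] = count
        if count >= self.promote_after:          # block has proven to be hot
            self.flash[block] = True
            del self.hits[block]
            if len(self.flash) > self.capacity:
                self.flash.popitem(last=False)   # evict least recently used
        return "disk"
```

    Because the decision runs on every access, a block that turns hot is in flash by its next I/O, while a block touched once never displaces anything. A policy that runs once a day would have to wait for the next pass.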

    How can we maximize the performance benefits of flash while minimizing cost?

    We maximize performance acceleration for every $ of spend on flash by focusing on the following parameters:

    1. Ensure fine-grained blending.  Our system is able to make decisions about whether to place data on flash on a very granular basis – units as small as 4KB in size.  Had we chosen a unit of say, 1MB, we would need a larger amount of flash because even a small 4KB block that becomes “hot” would force the placement of 1MB of data into flash.
    2. Leverage inline compression / de-duplication.  Since we have designed our system to achieve inline compression without a performance penalty, our flash capacity effectively is double that of the physical flash capacity.  Furthermore, in instances where cloned images are being accessed (e.g., virtual desktop boot images), our ability to have de-duplication (block sharing) across clones multiplies the effective flash capacity as well.
    3. Use inexpensive MLC flash SSDs. When SSDs receive random writes, the actual write activity within the SSD itself is higher than the number of writes issued to the SSD (a.k.a. write amplification), which eats into the number of write cycles that the SSD can endure. Traditional storage systems deal with this problem by using SLC SSDs (and eMLC SSDs soon), which provide a much larger number of write cycles, but this raises the cost of flash 4-6X. We approached the problem differently. Our file-system is optimized to aggregate a large number of random writes into a sequential I/O, and we only write to flash in multiples of the full erase block width. Sequential writes minimize write amplification, allowing us to achieve the desired endurance while still using MLC SSDs.
    4. Minimize overhead (RAID, etc.) in how flash SSDs are configured. In our system, all data that we write to flash also lives on disks. This is in contrast to most other storage systems, which have data on their SSD tier that has not yet been written to the disk tier. Consequently, while such storage systems have to protect the SSD tier using RAID 10, which consumes 50% of the capacity as overhead, we do not have to worry about RAID-protecting our SSDs. In our system, if an SSD fails, you lose a certain percentage of performance acceleration until you replace that SSD, but your data is never lost.
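
    The arithmetic behind points 1, 2 and 4 can be sketched in a few lines of Python. The 4KB and 1MB units, the 2x compression factor and the 50% RAID 10 overhead come from the list above; the count of hot blocks and the 600 GB of raw flash are assumed figures chosen only to make the numbers concrete.

```python
KB = 1024
MB = 1024 * KB


def flash_needed_bytes(hot_blocks: int, placement_unit: int) -> int:
    """Worst case: every hot 4 KB block sits in a different placement unit."""
    return hot_blocks * placement_unit


def effective_flash_gb(raw_gb: float, compression_ratio: float,
                       raid_overhead: float) -> float:
    """Logical data that fits in flash after protection overhead and compression."""
    return raw_gb * (1.0 - raid_overhead) * compression_ratio


hot = 100_000                                  # assumed number of scattered hot blocks
fine = flash_needed_bytes(hot, 4 * KB)         # placement in 4 KB units
coarse = flash_needed_bytes(hot, 1 * MB)       # placement in 1 MB units
granularity_penalty = coarse // fine           # 256x more flash in the worst case

# 600 GB of raw flash is an assumed figure.
as_cache = effective_flash_gb(600, compression_ratio=2.0, raid_overhead=0.0)
as_raid10_tier = effective_flash_gb(600, compression_ratio=1.0, raid_overhead=0.5)
```

    With these inputs the same raw flash holds 1200 GB of logical data when used as a compressed, unprotected cache and 300 GB when used as an uncompressed RAID 10 tier. The 256x figure is a worst case; real workloads have some locality, so the actual penalty for coarse placement would be smaller.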

    Can the system be designed to always write serially to disks, to perfectly complement flash?

    Given that flash as a disruptive technology started making inroads only in the last couple of years, most storage systems had already designed their disk data layout schemes without the benefit of knowing that flash would be around.  We had the luxury of asking ourselves how we would optimize data placement on Fat HDDs given the availability of flash.

    Our system is optimized to drastically lower the cost of flash in our systems, and to ensure that most random reads are served out of flash.  Therefore, the task that the Fat HDDs have to perform very well is to efficiently handle all the writes in our hybrid storage system.

    Disks are not very good at random I/O, but they are as good as flash at sequential I/O. We therefore use the same file-system optimization I highlighted earlier, aggregating over 1,000 random writes into a single sequential write, so that our system issues only sequential write requests to the disks. This allows us to achieve very good write performance from Nearline drives that have traditionally been regarded as incapable of good random write I/O.
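
    The aggregation described above resembles a log-structured write buffer. The following Python sketch shows the general technique under simplifying assumptions; the `WriteCoalescer` class is hypothetical, a list stands in for the disk, and none of this reflects our actual on-disk format.

```python
class WriteCoalescer:
    """Toy log-structured write buffer (illustrative only)."""

    def __init__(self, stripe_blocks: int):
        self.stripe_blocks = stripe_blocks
        self.buffer = []        # pending (logical_block, data) pairs
        self.log = []           # stands in for the disk: append-only list of blocks
        self.index = {}         # logical block -> offset of its latest copy in the log
        self.disk_writes = 0    # number of sequential write requests issued

    def write(self, logical_block: int, data: bytes) -> None:
        self.buffer.append((logical_block, data))
        if len(self.buffer) >= self.stripe_blocks:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        start = len(self.log)
        for i, (logical_block, data) in enumerate(self.buffer):
            self.log.append(data)
            self.index[logical_block] = start + i   # newest copy wins
        self.buffer.clear()
        self.disk_writes += 1                       # one sequential I/O per stripe

    def read(self, logical_block: int) -> bytes:
        # Check unflushed writes first, newest to oldest.
        for block, data in reversed(self.buffer):
            if block == logical_block:
                return data
        return self.log[self.index[logical_block]]
```

    With `stripe_blocks=1000`, a thousand writes to scattered logical blocks result in a single append to the log. The price is the index that maps each logical block to its current location, which a real system also has to persist and garbage-collect.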

    The result: A system that optimizes read performance *and* write performance *and* capacity efficiency simultaneously

    Taken together, the architectural choices we made address, we believe, a thorny problem in storage: how do you simultaneously deliver high, cost-efficient read and write performance while also optimizing cost-effectiveness in terms of $/GB?

    It is often difficult for customers to compare storage choices on $/IOPS. (Comparing $/GB is easier, although even there an apples-to-apples comparison is sometimes hard.) An alternative approach is to understand how flash SSDs and HDDs are being used in the system and what factors drove those architectural choices.


    Suresh Vasudevan
    CEO

    Over the course of numerous customer meetings, I started noticing a pattern of comments on how their storage environments fared when it came to delivering the performance needs of applications and users (NOTE: I am referring to mainstream Enterprise applications and not to high-performance computing applications). Many customers chose Nimble Storage for how well we converge primary storage and data protection on the same array, but they were really excited by the fact that their applications are visibly faster and they are able to support many more demanding workloads – despite our use of inline compression and low-cost, high-capacity drives (SATA).

    Virtualization causes I/O to become more random

    Server virtualization has made performance of storage stand out in sharp contrast. For IT professionals who have implemented server virtualization initiatives, it is well understood that networked storage deployments are a significant part of the capital spend and complexity. As they start measuring success based on how many virtual servers are being consolidated within a single host, customers realize that the BIG storage bottleneck is not capacity related as much as a performance bottleneck. As multiple applications are consolidated on to fewer physical servers, the resulting I/O pattern that is manifested on the network from server to storage becomes a blend of various application-specific patterns and increasingly taxes the storage system in terms of delivering IOPS rather than serving up GBs of data.

    Lack of disk drive performance gains compounds this problem

    Against this backdrop of increasing need for random IO performance from storage, disk drives have done a very poor job of keeping pace with advances in the rest of the infrastructure. The table below shows the pace of evolution for mainstream environments, and not high-end environments:

                                           2001                2011                    10-year Improvement
    Compute: CPU                           1.3 GHz x 2 cores   3.8 GHz x 16-24 cores   ~20-30 times better
    Compute: Memory                        0.25-0.5 GB         24-48 GB                ~50-100 times better
    Network                                0.1 – 1 GbE         1 – 10 GbE              ~10 times better
    Disk drive density                     36 – 137 GB         600 – 3000 GB           ~20 times better
    Disk drive access time (performance)   ~6 ms               ~3 ms                   ONLY 2 TIMES BETTER!

    Therefore, when an application needs good performance, customers are unable to take advantage of low-cost, high-capacity SATA drives. The typical approach to delivering IOPS has been to deploy as many high RPM drives as necessary to get to the needed performance, even if the capacity of these drives was under-utilized in many instances.
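
    A rough way to see why access time dominates is to convert it into random IOPS per drive. The Python sketch below uses the simplification of one random I/O per access time, which ignores queueing, caching and transfer time. The ~6 ms and ~3 ms figures come from the table above; the 10,000 IOPS target is an assumed workload.

```python
import math


def iops_per_drive(access_time_ms: float) -> float:
    """Approximate random IOPS of one drive as one I/O per access time."""
    return 1000.0 / access_time_ms


def drives_for_iops(target_iops: float, access_time_ms: float) -> int:
    """Drives needed to reach a target, regardless of how much capacity they add."""
    return math.ceil(target_iops * access_time_ms / 1000.0)


drive_2001 = iops_per_drive(6.0)          # roughly 167 IOPS
drive_2011 = iops_per_drive(3.0)          # roughly 333 IOPS
needed = drives_for_iops(10_000, 3.0)     # 30 drives for the assumed workload
```

    Under these assumptions, a 10,000 IOPS workload needs about 30 of the faster drives even if a handful of them would already hold all the data, which is the under-utilization described above.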

    Flash can solve the problem, but presents a cost barrier

    Flash SSDs are ideally suited to address this issue. Flash delivers 50-100 times better IO performance than the fastest disk drive. Therefore, a single flash drive can replace 50-100 high-RPM, SAS hard disk drives. Using one or a few flash drives is indeed the right answer for those applications that need extreme performance, but only a few GBs of storage.

    However, mainstream applications such as Exchange, SQL databases running business applications, virtual servers and virtual desktops require adequate performance as well as a significant amount of storage capacity. For such applications, at 25 times the $/GB compared to multi-TB drives, storage solutions that rely solely on flash over-deliver on IOPS but become too expensive given the Terabytes of capacity.

    A blended model yields the best outcome, but efficient blending is key!

    If using multi-TB drives alone will not yield adequate performance, and using flash SSDs alone is too expensive for mainstream Enterprise applications, how then should we go about addressing the need for adequate performance AND cost-effective capacity?

    Given that flash SSDs deliver the best performance and low-cost, high-density multi-TB drives deliver the best cost of capacity, the ideal system would be one that can flexibly blend the right proportion of each to optimize cost of performance and cost of capacity. Industry participants have recognized this and have introduced tiering solutions as a way to mix SSDs and disk drives.


     

    What Nimble has been able to do far more effectively, by designing a file-system from the ground up to optimally leverage SSDs and low-cost disk drives, is to deliver the best blended system – our use of flash SSDs is over 10 times more cost-effective than competitive approaches and delivers high performance, while at the same time we use low-cost multi-TB drives with inline compression to optimize cost/GB. I will discuss the uniqueness of our “blending” approach in an upcoming blog.

    Implications for customers: Evaluate solutions on $/GB AND on $/IOPS

    Many RFPs and evaluation processes focus on $/GB. The risk with this approach is that the decision to upgrade or augment your storage system may be forced upon you because you have run out of performance headroom, even though you have plenty of capacity headroom. Therefore, customers need to ensure that they are equally concerned about $/IOPS when comparing alternatives.
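
    As a sketch of what evaluating on both metrics looks like, the Python snippet below computes $/GB and $/IOPS for two quotes. The `Option` class and both quotes are hypothetical, invented only to show the calculation; they are not real prices or measurements.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    price: float        # total price in dollars
    usable_gb: float
    iops: float

    @property
    def dollars_per_gb(self) -> float:
        return self.price / self.usable_gb

    @property
    def dollars_per_iops(self) -> float:
        return self.price / self.iops


# Hypothetical quotes for the same usable capacity.
options = [
    Option("A: capacity-heavy", price=40_000, usable_gb=20_000, iops=4_000),
    Option("B: hybrid", price=50_000, usable_gb=20_000, iops=20_000),
]

for o in options:
    print(f"{o.name}: ${o.dollars_per_gb:.2f}/GB, ${o.dollars_per_iops:.2f}/IOPS")

cheapest_per_gb = min(options, key=lambda o: o.dollars_per_gb).name
cheapest_per_iops = min(options, key=lambda o: o.dollars_per_iops).name
```

    In this made-up case option A wins on $/GB ($2.00 versus $2.50) while option B wins on $/IOPS ($2.50 versus $10.00), so an evaluation that looks only at $/GB would pick the system that runs out of performance headroom first.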


    My First Few Weeks as CEO »

    April 27th, 2011

    Suresh Vasudevan
    CEO

    Having been on Nimble Storage’s board of directors since 2009, I have had ample time to window shop, and cannot quite claim the same thrill and anxiety of new discoveries that inevitably comes along when you take on a new CEO role. What I can claim in complete honesty is that the last several weeks have been among the happiest of my professional career.

    There are three things that I deeply care about when it comes to how I feel about my job. Do I feel pride when I talk to my customers? Do I feel intellectually stimulated and stretched by the team that I am working with? Am I dealing mostly with what opportunities to pursue or what problems to address? On all three dimensions, I feel remarkably fortunate – let me tell you why.

    Do I feel pride when I talk to customers?

    I have met with over 20 customers and over a dozen channel partners in the last month, and in every meeting I ask them to tell me what their criteria were for choosing a product at the time of purchase, and what our value proposition turned out to be after they deployed us. In their words, I have heard a mix of 4 themes that come up consistently:

    1. Performance: In order to match the performance that they saw from a single Nimble array of 4 SSDs+12 SATA drives, it took as many as 30-40 SAS drives from our competitors.
    2. Storage efficiency: Our inline data optimization allowed them to store between 1.5 and 2.5 times the amount of data (useable storage) compared to their experience of other vendors’ solutions, for a given amount of raw storage.
    3. Converged backup: They were able to use our extended snapshots to replace or avoid having to upgrade their disk-based backup systems and tape libraries.
    4. Ease of use: The user interface absolutely met the goal of being designed for an IT generalist rather than for a storage specialist, and every customer was impressed with the remote support capability built into our product.

    As someone that came from a product management background, I had always been excited about the underlying product architecture at Nimble, but I walked away from these conversations with a strong conviction that our architecture delivered a truly compelling economic value proposition.

    Do I feel intellectually stimulated and stretched by the team that I am working with?

    I had spent nearly a decade at NetApp, my last role having been as leader of all product management and engineering. The aspect of NetApp that I most enjoyed was the caliber of the executive team and my product operations team. I have the same excitement about the team I am working with at Nimble – this is one of the most analytical, deliberate and goal-oriented teams that I have encountered. It is also a team that is maniacal about maintaining a high bar on the people who become part of Nimble – in terms of horsepower but equally importantly in terms of cultural fit.

    I am very familiar with storage, and I must have asked over 50 questions on the market, product, pricing, positioning and so on in my first few weeks. 8 out of 10 times, the team had thought about the question and had carefully considered the alternatives. What’s more, this team is fastidious about documenting its hypotheses, thought process and decisions – I could literally track all the discussions that led up to the countless decisions that were made over the last 3 years!!

    Within our corporate location in San Jose, the culture is very much “down-to-earth”: non-hierarchical, and concerned mostly with what you have to say rather than who you are. When I traveled with the field teams, every single field team struck me as being successful in their prior roles, convinced that they can find similar or greater success at Nimble, and eager to induct other high-performing former colleagues into Nimble. The one philosophy that permeates the company is that we all want to work with colleagues that we respect and can hang out with.

    Am I mostly dealing with what opportunities to pursue or what problems to address?

    I have weekly “deep-dive” meetings with my team on a variety of topics. A recent deep-dive meeting was on the topic of accelerating our international logistics plans since our sales teams are finding strong interest from Global “Enterprise” customers, and some of our customers are already planning roll-outs to multiple countries. My last deep-dive meeting was to align everybody on what needed to happen in the rest of the company since we had made a decision to significantly accelerate sales hiring. We have prioritized our major product development projects for next year but one of our upcoming deep-dive topics is around whether we should take on more in parallel, given that we are seeing multiple exciting opportunities to extend our product.

    I love these discussions. We have our share of discussions on how to address a specific customer support situation or why hiring is slow in some function. Having said that, the overwhelming majority of our discussions are about picking the right things to focus on, of the many opportunities we have.

    Maybe this is my honeymoon phase, but after my first couple of months here, I get up excited every morning and eager to get things done!!


    Ajay Singh
    Sr. Director, Product Management

    There’s a quiet shift underway in the IT landscape. No, not cloud computing – few would call that a quiet shift. It’s the trend away from traditional backup and DR to something faster, simpler and lower cost: Extended Snapshots and Replication (ESR). IT practitioners talk about it. Analysts see a trend; for example, ESG found (table below) that small-to-mid-sized environments already commonly use this for VMs. Industry experts take some flak for calling it as they see it. Even folks historically linked with traditional backup acknowledge the shift. Naturally, vendors not best served by this trend vehemently argue against it. When you hear someone argue, “we could have offered this for years, but it’s just not the right approach,” make sure the real reason isn’t an inherent weakness of their underlying technology.

    So what’s the fuss about? Let’s review a typical form of traditional backup and DR seen in a mid-sized enterprise, and contrast it with the ESR approach. We’ll skip archiving requirements, which have different solutions, and acknowledge some organizations have more specialized needs.

    Traditional Backup and DR – Repeated Copying of Redundant Data

    Backup software scans servers nightly for new data, and bulk copies changed data to a dedicated backup device, today likely to be disk based (although tape still rules for archives). Scanning and copying are resource hogs, impacting servers, storage and networks, so they’re done during designated backup windows. Because of restore performance and reliability issues, incremental backups are supplemented with massive weekly full copies, which usually consume the weekends. Backup dedupe makes it more affordable to retain the 30-90 days of backups most organizations need. However, the bulky upfront copy means you can’t afford to back up too often, so Recovery Points are sparse – typical RPO is one day. And restores still take hours to reconstitute data from the full and incremental backups. Deduped disk backups do have the benefit of enabling WAN-efficient offsite replication. Once again though, Recovery Points are spread far apart, and restore times are long. Nor is there an option to run an application right off the DR copy – you need restores to primary storage.

    Extended Snapshots and Replication Approach

    The primary storage device captures (app consistent) near instant snapshots based on a predefined schedule (every few minutes, or once an hour) without affecting application performance. Efficient snapshot implementations are “un-duped and compressed” and reside on low cost disk, so you can afford the extended retention you need (say 30-90 days). A subset of these snapshots is replicated (say every hour) using very efficient replication to an offsite DR array, where they are retained for say 60 days. When needed, the entire application or a subset can be restored from snapshots within minutes. Applications can also run directly off the backup/DR copies without any format conversion. There are no backup windows to manage.
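
    The difference in recovery point density follows directly from the schedule. The short Python calculation below is only a sketch: the helper name `recovery_points` is made up, and it uses the hourly and nightly intervals and the 30-day retention mentioned above as example inputs.

```python
def recovery_points(interval_minutes: int, retention_days: int) -> int:
    """Number of recovery points kept for a fixed interval and retention period."""
    return retention_days * 24 * 60 // interval_minutes


# Worst-case data loss is roughly one interval: whatever was written
# just after the most recent recovery point.
hourly_snapshots = recovery_points(interval_minutes=60, retention_days=30)
nightly_backups = recovery_points(interval_minutes=24 * 60, retention_days=30)
```

    Over the same 30 days, hourly snapshots give 720 recovery points against 30 for nightly backups, and the worst-case RPO shrinks from about a day to about an hour.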

    Comparing the Approaches

    Here’s how each approach handles common failure scenarios:

    Traditional backup has had the advantage of incumbency. IT shops are familiar with it. Backup software has supported this approach longer. However, IT shops hate traditional backup, and many are looking to change. And software vendors are catching up in terms of managing snapshots. Finally, newer approaches have so dramatically improved the cost and simplicity of ESR that the contrast is more striking than ever:

    In the one case you have multiple devices juggling data, 3 data copies, and a lot of daily heavy lifting to get a barely acceptable level of SLAs for recovery. With the other approach, you have 2 devices, 2 data copies (unsurprisingly at a lower cost), no daily backup windows or pain, and much faster, better recovery options.

    Which would you choose?
