InfoSight’s Discerning Eye for Good Performance
By Karthik Krishnaswamy, Nimble Storage Product Management
Broadly speaking, we can all agree that latency is bad. Storage admins often lose sleep over latency issues, and the anxiety only grows when end users are impacted and start complaining.
But not all latency events are equal. Two events with the same average latency can have very different impacts on applications and end users, depending on (1) what data was accessed and (2) how that data is being used. To address this, we have introduced a “Potential Impact” score for latency in InfoSight. With this metric, InfoSight aims to better discern when latency is likely to require urgent attention – and when it is less critical. Why do some latency events require more urgent attention than others? For most organizations, latency during a late-night backup job is far less critical than latency experienced by virtual desktop users or within an OLTP database. Being able to distinguish between workloads is essential for determining the impact of the latency associated with them.
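The idea above can be sketched in a few lines of Python. This is an illustrative toy model, not InfoSight's actual algorithm: the workload names and weights are assumptions chosen only to show how the same average latency can yield very different impact depending on context.

```python
# Toy illustration: identical average latency, different potential impact.
# Workload weights are hypothetical assumptions, not InfoSight's real model.
WORKLOAD_WEIGHTS = {
    "nightly_backup": 0.1,   # latency here rarely affects end users
    "vdi": 0.9,              # virtual desktop users feel latency directly
    "oltp_database": 1.0,    # transactional apps are highly latency-sensitive
}

def potential_impact(avg_latency_ms: float, workload: str) -> float:
    """Scale raw latency by how latency-sensitive the workload is."""
    return avg_latency_ms * WORKLOAD_WEIGHTS[workload]

# The same 5 ms average latency carries very different urgency:
backup_impact = potential_impact(5.0, "nightly_backup")
oltp_impact = potential_impact(5.0, "oltp_database")
```

Here a 5 ms event during a backup window scores an order of magnitude lower than the same 5 ms inside an OLTP database, which is exactly the distinction a raw average hides.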
How is this accomplished? First, InfoSight does not rely solely on the simple read-and-write average latency values that most storage arrays report. Instead, InfoSight collects many distinct latency measurements as operations vary in size, sequentiality, and originating application. By segmenting operation types, InfoSight can contextualize every workload and develop a more nuanced view of latency. A good analogy is how students are graded. Typically, a student's grade is a composite weighted score consisting of homework assignments, quizzes, class participation, and exams. Students are also typically graded on a curve: each student receives a score relative to their classmates. In a similar fashion, different segments of IO can contribute to the overall impact score.
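The grading analogy maps directly to a weighted composite. The sketch below is an assumption-laden illustration: the segment names, latencies, and weights are invented for the example, not taken from InfoSight's telemetry.

```python
# Illustrative composite score over IO segments, built like a course grade
# from weighted components. All segment names and weights are hypothetical.
segments = [
    # (segment, latency_ms, weight reflecting its contribution to impact)
    ("small_random_read", 2.0, 0.5),
    ("large_sequential_read", 8.0, 0.1),
    ("small_random_write", 1.5, 0.3),
    ("large_sequential_write", 6.0, 0.1),
]

def composite_score(segments):
    """Weighted average of per-segment latency, like a weighted grade."""
    total_weight = sum(w for _, _, w in segments)
    return sum(lat * w for _, lat, w in segments) / total_weight

score = composite_score(segments)
```

Note how the large sequential segments, despite their higher raw latency, pull the composite down less than the heavily weighted small random IO pulls it up; the weighting is what lets the score reflect what users actually feel.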
But how do we know it works? While we collect this detailed latency profile from our installed base of 9,000+ customers, a large telemetry dataset alone can’t assess user sentiment. To validate that our scoring method predicts how customers actually experience latency, we correlated our results with real-world customer feedback collected by our support department. This analysis showed that our impact score was a far better predictor of how customers would characterize performance than a simple latency average. With this metric, storage admins can focus on latency events with high potential impact – and sleep well at night! And when a rare problem does arise that impacts latency, InfoSight provides automated root-cause analysis (see array name for link).