By Rod Bagg – Vice President, Analytics and Customer Support
We often describe the InfoSight analytics engine that’s built into Nimble Storage arrays as being predictive, but as a colleague pointed out the other day, it’s also preventive. Both are valuable, but there’s a big difference between the two, which got me thinking about the classic Robert Palmer rock song, Bad Case of Loving You – “Doctor Doctor, gimme the news, I’ve got a bad case of lovin’ you.”
Doctors and scientists do an astonishing amount of basic and applied research every year in order to cure and prevent disease. Indeed, cancer research alone now accounts for more than $100 billion in spending each year. These advances require a lot of very smart doctors and scientists, a deep understanding of our DNA, and tons of computing and other advanced technologies. Just like Nimble Storage.
Wait a minute, what’s the connection? Well, Nimble has:
- Very smart doctors – we have several PhD data scientists on the InfoSight team;
- Very smart scientists – dozens of computer scientists / engineers on the team, backed by hundreds of tenacious storage, networking and platform product engineers and architects;
- Tons of computers – we have a super-cool analytics database and all the compute horsepower needed to process trillions of data points without breaking a sweat;
- Nifty technology – in today’s data storage market, the InfoSight engine is second to none.
It looks like the only thing we’re missing is the DNA. Nimble’s leaders have “customer experience” embedded into their DNA – that’s why we’ve tenaciously poured our hearts into creating a world-class support organization around InfoSight and its world-class big-data analytics engine.
But right from the start, Nimble has leveraged another kind of DNA – the “Diagnostics for Nimble Analytics” that are constantly sent from every Nimble Storage array. That one little strand of DNA contains unique, irreplaceable information.
DNA can be simply described as “that which makes everyone almost the same but perfectly unique”. That’s a great mental model for thinking about all the data we get from every Nimble array. Buried in those strands of DNA are the fundamental and distinctive characteristics of every moment of every array ever deployed. Unlike real DNA, an array obviously cannot be understood in a pure chemistry sense. However, the techniques those smart scientists use to study the vast amounts of information stored within DNA can be applied to array data as well. And that’s exactly why having a team of data scientists on staff in our support organization is essential to the power of InfoSight, and to anyone trying to replicate our DNA.
Let me give you a couple of examples:
Our Nimble DNA
Every single support case that is touched by a Technical Support Engineer (TSE) is analyzed and categorized into root-cause buckets by our Support PEAK team working with their product engineering counterparts. This collaborative team has three goals:
- Drive product change so that the support case is never encountered in the first place.
- If a case is necessary, work with the data scientists and the InfoSight engineering team to create an automated, prescriptive case that delivers a full solution to the customer, so the issue is avoided altogether.
- If neither of the first two goals is immediately achievable, define, create and deploy tools that make the TSE and case resolution as efficient as possible (this allows us to solve the most complex issues in minutes).
This sense of urgency and commitment around Customer Support forms the real fabric of our DNA as a company.
Data Science in Action
Often, our data scientists look for outliers and correlating factors that provide clues to the root cause of an issue. In other words, they are looking for a needle in a haystack.
We recently encountered a case like this that involved a kernel “out-of-memory” panic. The challenge in these cases is determining which process is consuming or, worse, leaking memory among dozens of processes and millions of memory allocations in a complex computer system. The data scientists began by examining all the sensors that track various memory allocations to find one that was an outlier. That is, they looked for one sensor that behaved differently from all other similar sensors on all other arrays in the installed base.
Once that sensor popped out in their analysis and was graphed, we could easily see that memory was being consumed steadily over a long period of time. This was likely due to a leak, since the normal case among the large population of arrays was a relatively flat, steady-state level of memory consumption on that same sensor. So we had our outlier: a sensor that indicated a likely memory leak, somewhere.
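For readers who like to see the idea in code, here is a minimal sketch of that kind of outlier hunt. It is not the InfoSight implementation. It assumes each array reports one memory sensor as evenly spaced samples, fits a straight-line trend per array, and flags arrays whose trend sits far from the population median using a robust z-score based on the median absolute deviation. The function names, the 3.5 threshold and the data layout are all illustrative assumptions.

```python
from statistics import median


def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def find_outliers(readings, threshold=3.5):
    """Return the IDs of arrays whose memory trend differs from the population.

    readings maps a (hypothetical) array ID to a list of at least two
    samples from the same memory sensor. The threshold is an assumed
    cutoff on the robust z-score, not a value taken from the article.
    """
    slopes = {array_id: slope(s) for array_id, s in readings.items()}
    med = median(slopes.values())
    # Median absolute deviation: a spread estimate that a single leaking
    # array cannot inflate the way it would inflate a standard deviation.
    mad = median(abs(v - med) for v in slopes.values())
    if mad == 0:
        # Every typical array has an identical trend, so any deviation stands out.
        return [a for a, v in slopes.items() if v != med]
    return [a for a, v in slopes.items()
            if 0.6745 * abs(v - med) / mad > threshold]
```

Given a population of mostly flat-lining arrays and one whose readings climb steadily, `find_outliers` returns only the climbing one. A robust statistic is used here because the whole point is that a few arrays are abnormal, and they should not be allowed to define what normal looks like.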
The next step was to find any correlating factors. Our scientists queried the event logs for messages that appeared at the same frequency as the memory was leaking. That query yielded one seemingly benign log event that correlated with application-synchronized snapshots. The data scientists informed engineering of their findings, and our engineers were able to identify the memory leak directly in the same code area where the event was being logged. This entire process was completed within a couple of hours.
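The correlation step can be sketched in a similar spirit. The example below assumes the event log has already been reduced to a count per time interval for each event type, and that memory growth has been measured over the same intervals. It then ranks event types by their Pearson correlation with memory growth. The event names in the usage note are made up for illustration, and this is one plausible way to run such a query rather than a description of how InfoSight does it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no information about the trend
    return cov / sqrt(var_x * var_y)


def rank_events(memory_growth, event_counts):
    """Rank event types by how closely their counts track memory growth.

    memory_growth: memory consumed in each interval.
    event_counts: maps an event name to its count in each of those intervals.
    Returns (name, correlation) pairs, strongest positive correlation first.
    """
    scores = {name: pearson(counts, memory_growth)
              for name, counts in event_counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Called with a hypothetical `snapshot_sync_info` event whose counts rise and fall with memory growth, alongside unrelated events, `rank_events` puts the snapshot event at the top of the list. Correlation alone does not prove cause, which is why the hand-off to engineering to inspect the code around that log message matters.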
The final step was to identify a small set of arrays that were exhibiting the same issue. Based on our deep InfoSight data, we were able to predict the exact day each array would exhaust memory. We then created prescriptive cases based on those predictions and provided the fix proactively.
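As a rough illustration of how such a prediction can be made, the sketch below assumes the leak grows at a roughly constant rate. It fits a straight line to one array's daily memory readings and extrapolates to the day the line crosses the available capacity. The function name, the megabyte units and the linear model are assumptions chosen for the example; the model behind InfoSight's actual predictions is not described here.

```python
from datetime import date, timedelta
from math import ceil


def predict_exhaustion(daily_used_mb, capacity_mb, last_sample_day):
    """Estimate the date on which memory use reaches capacity.

    daily_used_mb: one reading per day (at least two), oldest first.
    last_sample_day: the date of the final reading.
    Returns a date, or None if usage is flat or shrinking.
    """
    n = len(daily_used_mb)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_mb) / n
    den = sum((i - mean_x) ** 2 for i in range(n))
    growth = sum((i - mean_x) * (y - mean_y)
                 for i, y in enumerate(daily_used_mb)) / den
    if growth <= 0:
        return None  # no upward trend, so nothing to predict
    intercept = mean_y - growth * mean_x
    # Day index (0 = first reading) at which the fitted line hits capacity.
    crossing = (capacity_mb - intercept) / growth
    days_left = max(0, ceil(crossing - (n - 1)))
    return last_sample_day + timedelta(days=days_left)
```

For example, readings of 1000, 1100, 1200, 1300 and 1400 MB ending on 5 January 2024, against a 2000 MB capacity, give a predicted exhaustion date of 11 January 2024. Running this per affected array is what turns a single diagnosis into a schedule for proactive fixes.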
And that’s how prediction becomes prevention.
“Shake my fist, knock on wood
I’ve got it bad, and I’ve got it good.”