ConvergeOne Blog

Monitoring Success in a Complex IT Environment

Written by Mark Langanki | Nov 8, 2018 3:00:00 PM

Monitoring has proven to be one of the more difficult components of IT for many years, perhaps because we make too many assumptions about how monitoring works and what the monitoring tools we use can actually do. We’ve all been mesmerized by demos at trade shows that boldly claim to have finally “cracked the code”! Unfortunately, after we purchase the tool, take it home, and install it, we begin the long, winding road to disappointment when it doesn’t do what we thought it would do. Alas, the tool just showed well on the trade show floor.

Why is it such a struggle to find the holy grail of monitoring systems that can fix all of our problems? It’s simple: no single tool can fully monitor today’s complex IT environments. On top of that, monitoring doesn’t work unless we fully understand what’s really going on within our environment.

In this blog post, I will review the monitoring issues that stem from the data our systems and applications present, along with the steps to reaching a state of monitoring nirvana.

Step #1: Receive Comprehensive Data

One of the main issues with monitoring comes from a lack of understanding about the flow and value of data. First, let’s think about the initial source of the event we wish to monitor. Where does it come from? What does it say? Most tools are betting on the developers of software, hardware, and operating systems being able to provide clear, concise data that describes exactly what is going on with the code when something goes wrong.

Problems arise when we are unable to identify the full definition of the issue as soon as it appears. Developers often overlook the fact that the error generated by their code will not be read by a fellow developer, but by a managed services engineer who is trying to find the needle in the haystack. Developers should therefore ensure they are providing comprehensive and clearly understandable data.

Step #2: Remove the Noise

Even if we have perfect data that defines exactly what’s going on, we need to take the next step and remove the noise. What matters is defining, understanding, and resolving an issue. Anything that doesn’t contribute to one of those three things is noise.

In the course of developing code, programmers dedicate a large percentage of their code base to exception handling. Some of the exceptions that are encountered and reported are not truly problems that a monitoring application needs to further act upon. The true art is being able to identify which data points require further action within the context of the environment.
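As a rough illustration of separating actionable events from noise, here is a minimal Python sketch. The `Event` fields, the severity names, and the noise patterns are all hypothetical; a real environment would define its own rules based on what actually requires action in context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    severity: str  # hypothetical levels: "info", "warning", "error", "critical"
    message: str


# Assumed rule set for illustration only; tune these to your environment.
ACTIONABLE_SEVERITIES = {"error", "critical"}
KNOWN_NOISE_PATTERNS = ("retry succeeded", "handled exception")


def is_noise(event: Event) -> bool:
    """Return True when the event does not help define, understand, or resolve an issue."""
    if event.severity not in ACTIONABLE_SEVERITIES:
        return True
    text = event.message.lower()
    # Exceptions the application already handled are reported but need no action.
    return any(pattern in text for pattern in KNOWN_NOISE_PATTERNS)


def actionable(events):
    """Keep only the events that a monitoring team should act upon."""
    return [e for e in events if not is_noise(e)]


events = [
    Event("ivr-01", "info", "Heartbeat OK"),
    Event("ivr-01", "error", "Handled exception: retry succeeded"),
    Event("db-02", "critical", "Disk write failure on volume /data"),
]
remaining = actionable(events)
```

In this sketch only the disk write failure survives the filter; the heartbeat and the already-handled exception are dropped as noise.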

Step #3: Diagnose the Asymptomatic

Imagine that we have great data that tells us everything we need to know with no noise. Have we solved all of our issues? Not quite—and the next issue is a doozy.

Monitoring that’s based on waiting for data to be sent, captured, and analyzed accounts for only a small percentage of the problems that actually occur. In other words, many types of issues can’t be caught by the standard monitoring method of what we call “traps.”

Here are a few examples of issues that standard Simple Network Management Protocol (SNMP) tools can’t catch:

  • Logic errors: We configured our voice response system to be closed on Thanksgiving by date, but we forgot to update that configuration for the new year, and now we will be closed when we need to be open and vice versa. No monitoring application can monitor this since technically nothing is broken.

  • Bugs: Oh, bugs, you get us every time! We can’t monitor for something when we don’t know how it will perform. This is not referring to application failures that have known errors. We are talking about bugs that result in things just not working as expected.

  • Configuration errors: Yes, we could look for these, but they would not show up as a failure in an error log or an SNMP trap.

  • Compatibility: Whoa, now we are talking outside of a single system. How can we know if two or more things are compatible just by waiting for errors to be generated from an application?

The list keeps going, but you get the point. There is no single easy way to identify that something has failed and determine exactly what should be done about it.
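Issues like the Thanksgiving example generate no error at all, so one way to find them is to check the configuration proactively rather than wait for a trap. The Python sketch below is one possible approach, not a description of any particular product: the function names are hypothetical, and it assumes the closure is stored as a fixed calendar date and that the holiday is U.S. Thanksgiving (the fourth Thursday of November).

```python
from datetime import date, timedelta
from typing import Optional


def thanksgiving(year: int) -> date:
    """Return U.S. Thanksgiving: the fourth Thursday of November."""
    nov1 = date(year, 11, 1)
    # date.weekday(): Monday is 0, so Thursday is 3.
    days_to_first_thursday = (3 - nov1.weekday()) % 7
    return nov1 + timedelta(days=days_to_first_thursday + 21)


def audit_closure(configured: date, year: int) -> Optional[str]:
    """Return a warning if the configured closure date is stale for the given year."""
    expected = thanksgiving(year)
    if configured != expected:
        return (
            f"Closure is configured for {configured.isoformat()}, "
            f"but Thanksgiving {year} falls on {expected.isoformat()}."
        )
    return None


# A closure date carried over from 2018 is flagged when audited for 2019.
warning = audit_closure(date(2018, 11, 22), 2019)
```

Nothing is technically broken in the stale case, which is why the check has to compare the configuration against the intent rather than listen for a failure.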

I know what you’re thinking: Great, is there anything we can do to achieve true monitoring success (which we were fooled into thinking was such an easy feat when we witnessed that awesome demo on the trade show floor)? The answer is yes, but to get to that point, you need to think about monitoring differently.

We are all searching for ways to make the complicated easier to understand. You can achieve your monitoring vision—but you first need to understand the reasons why your current monitoring system doesn’t work and identify the steps to move into a mode that allows you to find the issues in your systems quickly and effectively.

 
Full article: 3 Crucial Steps to Achieving Success in a Complex IT Environment

 

Download the full article by Mark Langanki, ConvergeOne Chief Technology Officer, for a more in-depth look at the three crucial steps to overcoming monitoring challenges. The article also includes six ways to think about monitoring differently and set your organization up to achieve true monitoring success.