Nagios and Service Monitoring proofs of concept
When it comes to service management, either for internal use or as a Service Provider, you really should not underestimate the impact of service monitoring.
System administration nowadays usually comes with big headaches. Servers, network devices, Internet access, Cloud services, phone calls over IP, VPNs, clients, and of course the best one: users.
System and network administration is a terrible task, and dealing with problems is pretty intensive. When there's a problem you can't rely on anything but your knowledge, skill, and calm. Staying calm is perhaps the most difficult part, since when there's a problem, people tend to make it worse.
A network monitoring tool is not just something that verifies whether a host is online (I've seen System Administrators using "Look at LAN" as a monitoring tool). It's the best friend you can have when it comes to management. Here I will be writing about Nagios, an Open Source solution I have been using myself for the past 5 years, but the concepts apply to almost any kind of monitoring solution, whether free or vendor-specific.
Nagios comes in two flavors: the Community edition and the Enterprise edition. It's a reliable and very scalable system that can monitor virtually any kind of system and feature (you can find and download the source code from their website). It has an interesting set of controls you can use as-is, or you can build custom ones.
The monitoring system runs 24/7 and keeps hosts under constant control, either by running a standard "ping" command or by querying network services over TCP/IP, through SNMP, or through custom agents. Agents can be installed on Linux, Unix or even Windows hosts. If a failure condition occurs, the system notifies the System Administrator via e-mail, text message, Twitter, Skype or even phone. The monitoring system can also store performance data for later evaluation or statistics.
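To make the idea of a TCP service query concrete, here is a minimal sketch in Python of what such a check could look like. It follows the usual Nagios plugin convention of mapping results to exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), and it appends performance data after a pipe character. The function name `check_tcp` and the threshold values are assumptions made for illustration, not something shipped with Nagios.

```python
import socket
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=1.0, timeout=5.0):
    """Open a TCP connection and return (exit_code, message).

    warn_seconds and timeout are illustrative defaults (assumptions).
    """
    start = time.monotonic()
    try:
        # Connection refused, timeouts and name resolution failures
        # are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"

    # Text after the pipe is performance data the monitoring system can store.
    perfdata = f"time={elapsed:.3f}s"
    if elapsed > warn_seconds:
        return WARNING, f"TCP WARNING - {host}:{port} answered slowly|{perfdata}"
    return OK, f"TCP OK - {host}:{port} answered|{perfdata}"
```

A real plugin would print the message and end with `sys.exit(code)`, so that the monitoring system can read the status from the process exit code.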
There is no standard monitoring procedure. The only one who can determine the baseline for system monitoring is you. A monitoring system does not rely on information written in books or on vendor sites; it is based on your network, users and business. You should certainly monitor the common elements of a system: disk usage to avoid storage exhaustion, CPU usage of network switches, bandwidth, network availability and so on. Consider this the most common baseline of system monitoring.
What you might want to consider is monitoring what I call the "Network Behavior". This is built on the business model you work in, and on reported incidents. You monitor what usually happens in your network, and try to build a predictive mechanism to avoid lock-down situations, where your network fails either partially or totally.
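As an illustration of a baseline control, the following Python sketch checks disk usage against a warning and a critical threshold. The 80% and 90% thresholds and the names `classify` and `check_disk` are assumptions chosen for the example; as said above, the right values depend on your own network and business.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, message) for the filesystem containing path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path} ({exc})"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"

    percent_used = usage.used / usage.total * 100
    code = classify(percent_used, warn, crit)
    # Performance data format: label=value[unit];warn;crit
    perfdata = f"used_pct={percent_used:.1f}%;{warn};{crit}"
    return code, f"DISK {LABELS[code]} - {path} is {percent_used:.1f}% full|{perfdata}"
```

Storing the percentage as performance data on every run is what later allows you to look at the trend and anticipate storage exhaustion, rather than only react to it.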
Monitoring with basic controls (plugins) is fair and satisfies most requirements. But you should really put your knowledge of how the various technologies work to the test, starting with SNMP. SNMP is a powerful and scalable protocol that allows you to verify the status of most devices and even software. If a device supports SNMP but exposes non-standard components, there should be a MIB library that allows you to monitor them. For example, WatchGuard XTM ships its own MIB libraries for device-specific controls, and APC does the same for its UPSs. Most interestingly, SNMP is widely implemented in network devices and printers.
When SNMP is not enough, scripting comes in. In Nagios it's possible to write your own scripts in practically any language you like. I am well aware that my own choice is not the best one (in terms of plugin efficiency), but I use PHP for my own controls. The language you script in, though, is not a real concern. How you use it is.
Suppose you need to verify that your Active Directory domain controller is working. You should check that DNS is working correctly by querying the zone of interest and seeing if it replies. You should verify that LDAP is responding, perhaps by querying it for some data. You might also want to check that Kerberos is answering authentication requests. You might have to write your own scripts or find some on the Web, but you really need to understand what lies at the base of the various technologies.
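A very rough sketch of such a combined control, in Python, could look like the one below. Note the simplifications, which are all assumptions of this example: DNS is tested through the system resolver rather than by querying the domain controller directly, and LDAP and Kerberos are only probed by opening a TCP connection to their standard ports (389 and 88) instead of performing a real query or authentication. A production check should go further, but the structure (several probes, one combined status) stays the same.

```python
import socket

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def dns_resolves(name):
    """True if the system resolver returns at least one address for name."""
    try:
        return bool(socket.getaddrinfo(name, None))
    except socket.gaierror:
        return False


def port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(results):
    """Combine {probe_name: passed} into one (exit_code, message)."""
    failed = [name for name, passed in results.items() if not passed]
    if failed:
        return CRITICAL, "DC CRITICAL - not responding: " + ", ".join(failed)
    return OK, "DC OK - " + ", ".join(results) + " respond"


def check_domain_controller(dc_host, domain):
    """Probe DNS, LDAP and Kerberos for a domain controller.

    dc_host and domain are placeholders supplied by the caller.
    """
    return summarize({
        "DNS": dns_resolves(domain),
        "LDAP": port_open(dc_host, 389),
        "Kerberos": port_open(dc_host, 88),
    })
```

Reporting which probe failed in the message is the useful part: the technician immediately knows whether to look at name resolution, the directory or authentication.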
Each error is an opportunity
Your work is based on inefficiency or inaccuracy of events. If a problem happens once, it can happen again. If you are able to reproduce a problem, you are also able to control it. This is, I think, a fundamental basis of service monitoring. Always try to determine how the problem can be reproduced.
I once had a user who was unable to write to a disk share on a NAS because, due to a firmware bug, the system remounted the shares in read-only mode when it ran out of memory. The problem is that the system may run out of memory and then recover, but the shares remain in read-only mode. If you monitor the ability to create a file (and delete it) on the network share, you have a predictive mechanism that allows you to keep the situation under control.
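The create-and-delete test described above can be sketched in a few lines of Python. This example assumes the network share is mounted on the monitoring host (or on a host running an agent) and is reachable as a local directory; the function name and the probe file name are made up for the illustration.

```python
import os
import uuid

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_share_writable(directory):
    """Create and delete a probe file in directory; return (exit_code, message)."""
    # A random name avoids clashing with real files or with parallel checks.
    probe = os.path.join(directory, f".monitor_probe_{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as handle:
            handle.write("probe")
        os.remove(probe)
    except OSError as exc:
        # A read-only mount, a missing mount point or a permission problem
        # all end up here.
        return CRITICAL, f"SHARE CRITICAL - cannot write to {directory} ({exc})"
    return OK, f"SHARE OK - created and deleted a file in {directory}"
```

The check exercises the same operation the user performs, so it catches the read-only condition even when the NAS itself reports that it is healthy.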
Nagios allows you to script quite complex operations. It not only monitors with custom scripts, but it can also act. You can write a script that proactively performs diagnostic operations when a problem occurs, or even issues remote system commands. The availability of APIs massively extends the capabilities of the system and allows you to perform initial actions on behalf of your Service Desk before the technician even starts evaluating the problem.
The level of integration of Nagios can be really impressive. You might want to allow a script to access your database and gather information about your customers, so you can inform them about a problem or build custom messages about what is going on in their systems. You can trigger automatic ticket opening in your trouble ticketing system. There are virtually no limits to the capabilities this system can offer you.
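In Nagios this kind of reaction is typically implemented as an event handler, a command that receives macros such as $SERVICESTATE$ and $SERVICESTATETYPE$ as arguments. The Python sketch below shows one possible shape for such a handler: it only acts when the problem is confirmed (state CRITICAL, state type HARD) and then runs a diagnostic command and returns its output. The decision rule and the diagnostic command are assumptions for illustration; what you run depends entirely on your environment.

```python
import subprocess


def should_act(state, state_type):
    """Act only on confirmed failures, not on the first failed attempt."""
    return state == "CRITICAL" and state_type == "HARD"


def handle_event(state, state_type, diagnostic_command):
    """Run diagnostic_command when the event warrants it.

    diagnostic_command is a list such as ["df", "-h"] (hypothetical example).
    Returns the command output, or None when no action was taken.
    """
    if not should_act(state, state_type):
        return None
    completed = subprocess.run(
        diagnostic_command,
        capture_output=True,
        text=True,
        timeout=30,  # never let a diagnostic hang the monitoring system
    )
    return completed.stdout
```

The returned output could then be attached to the notification or to the ticket, so that the technician starts with the first diagnostics already in hand.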
The real paradigm change
Whether you are a Service Provider, a professional or an internal System Administrator, you still have to deliver services. The main change the world (I mean the users' world) needs nowadays is a Service Desk that delivers the service proactively. In my experience there's a psychological change in users when you switch from the old-fashioned HelpDesk that users have to call to a Service Desk that informs users of a problem and works on it. Information, awareness, service quality.
An efficient monitoring system doesn't only give the System Administrator tools to predict or be aware of problems. It is an incredibly useful tool for routine activities, such as service activation, or even system maintenance. When working as a Service Provider, you might want to automate most of your activities or jobs. The monitoring platform can easily be used as your checklist to see if everything you have done is fine, complete and works as expected.
A little scenario
In a recent discussion on LinkedIn, a user described an idea of how to provide the system status of their customers' devices. It is quite a common idea among Service Providers (sadly, in Italy there are very few of them).
The scripting features described above allow you to build a near-perfect monitoring system that not only informs you of what is going on, but can also send a pre-analysis of the problem to the Service Desk operators (wherever they are) and even notify the end user of a problem that is in progress or predicted.
As a Service Provider, having a clear idea of what is happening at a Customer's location, or even with a Customer's service in our Data Center, is extremely important for service quality and for reputation as well. For me, Nagios has proven to be the best solution among others, allowing us to integrate the system with the datacenter infrastructure and to monitor remote Customer sites (and services) or even Cloud-based services. We were able to resolve DNS issues when zone files were not loaded correctly (or at all) at the registrars, we could stop a mail system from sending spam once its queue grew too big, and we could even predict a DoS or intrusion attack when web services started to perform poorly. Also, when working on the entire Data Center infrastructure (counting hundreds of servers), having a checklist that helps determine whether all services are up and running is a must.
This content is provided as-is and is the result of my own experience of over 5 years with Nagios. I am neither affiliated with nor involved in Nagios development.