Monitoring Concepts

Monitoring Process

This diagram displays the basic monitoring process. The MONITORING ENGINE, shown in the center of the diagram, has four functions:

The scheduler, which schedules and runs monitors
The alarm engine, which generates alarms on problems
Notification management, which handles notifications and escalations
Reporting, which generates reports

The DATA INPUT to the Monitoring Engine comes from Plugins and Agents, which run from a command line to check the status of a host or service. The CONTROL INPUT comes from configuration files (directives that affect how Nagios operates) and from command line input supplied by external applications. The CONTROL OUTPUT uses Event Handlers, which run during host or service events to trigger other system actions for proactive problem resolution. Then, based on the service check and host check logic, notifications are sent as DATA OUTPUT via email, pager, or user-defined methods so that problems can be resolved. In addition to notifications, output can also take the form of reports and dashboards.

GroundWork Architecture

GroundWork's integrated open source architecture includes three tiers of functionality. This architecture is completely open and extensible for developers seeking to incorporate additional data sources or build presentation layer applications, views, and reports.

Presentation Tier

AJAX PHP Framework - The Presentation tier features a PHP Framework as the GroundWork Web user interface, which displays data needed by Operators and managers to monitor and manage their IT infrastructure. Data is presented via status screens, performance graphs, an event console, reports and dashboards.

Service Tier

GroundWork Foundation - This tier features GroundWork Foundation, which normalizes and stores IT monitoring and management data (from the Monitor Data Collector) in a database. This tier includes application programming interfaces (APIs) and Web Services for writing presentation layer applications and views.

Monitor Data Collection Tier

Open Source Tools, Add Ons, Applications Integration - This component extracts data from IT monitoring and management tools such as Nagios, Syslog-NG, SNMPTT and third-party commercial monitoring systems, and prepares messages for delivery to GroundWork Foundation. The Application Integration component is an abstraction for the inclusion of arbitrary third-party applications.

Key Definitions and Concepts

Here we'll discuss some key definitions and concepts used in the monitoring environment.

Hosts - An element (physical server, workstation, network device, etc.) for which the availability status is to be tracked or mapped.
Services - A monitor of a particular parameter or status associated with a Host. This can be an actual service that runs on the Host (POP, SMTP, HTTP, etc.) or some other type of metric associated with the Host (response to a ping, number of logged in users, etc.).
Host Groups - An arbitrary collection of Hosts into named sets. The uses for Host Groups include access control, drawing layering, status displays, Notifications, scheduling maintenance, multi-server commands, and reports. A simple example of Host Groups might be collections of Hosts based on their physical location, e.g. london, newyork, and sanfrancisco. Grouping Hosts in this way allows staff responsible for just one location to display only those Hosts. A key concept is that every Host must be a member of at least one Host Group and can belong to several. Of particular importance is the use of Host Groups for controlling Notifications and Escalations. Administrators are usually responsible for groups of Hosts rather than individual Hosts, so defining the Host Groups for which Notifications are sent to a particular collection of Administrators is more efficient than doing so for individual hosts and services. Escalations can be configured against the same Host Groups.
Plugins - External (to Nagios) programs that are executed whenever there is a need to check a host or service that is being monitored.
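The following is a minimal sketch of the decision a plugin makes, using the "number of logged in users" metric mentioned under Services. It relies on the conventional Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name, thresholds, and message wording are hypothetical and chosen only for illustration.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_users(count, warn, crit):
    """Return (exit_code, status_line) for a user count where higher is worse.

    evaluate_users, warn, and crit are illustrative names, not part of any
    real plugin.
    """
    if count is None:
        # The metric could not be collected at all.
        return UNKNOWN, "USERS UNKNOWN - could not read user count"
    if count >= crit:
        code = CRITICAL
    elif count >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"USERS {LABELS[code]} - {count} users currently logged in"


code, line = evaluate_users(7, warn=5, crit=10)
print(code, line)
```

A real plugin would print the status line and then exit with the code (for example with sys.exit(code)) so that the monitoring engine can read both.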
Flapping - Flapping occurs when a host or service changes state too frequently, resulting in a storm of problem and recovery Notifications. Flapping can be indicative of configuration problems (e.g. thresholds set too low) or real network problems.
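As a rough illustration of "changes state too frequently", the sketch below measures the percentage of consecutive check results that differ and compares it with a high and a low threshold. This is a simplified model, not the exact Nagios calculation. The function names and the 20 and 50 percent thresholds are assumptions made for the example.

```python
def percent_state_change(states):
    """Percentage of consecutive check results that differ from each other."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return 100.0 * changes / (len(states) - 1)


def update_flapping(is_flapping, states, low=20.0, high=50.0):
    """Return the new flapping flag using two thresholds (illustrative values).

    Flapping starts when the change rate rises above `high` and stops only
    when it falls below `low`, so the flag itself does not flap.
    """
    pct = percent_state_change(states)
    if not is_flapping and pct > high:
        return True
    if is_flapping and pct < low:
        return False
    return is_flapping


noisy = ["OK", "CRITICAL"] * 5 + ["OK"]   # state changes on every check
steady = ["OK"] * 11                      # state never changes
print(percent_state_change(noisy), update_flapping(False, noisy))
print(percent_state_change(steady), update_flapping(True, steady))
```

Using two thresholds rather than one means a host or service whose change rate sits between them keeps its current flapping status.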
Notifications - Communications to contacts or contact groups about the status of a monitored element. Notifications can be configured for circumstances including any hard state change, a host or service remaining in a non-OK state, and acknowledgments.
Event Handlers - Event handlers are optional commands that are executed whenever a host or service state change occurs. An obvious use for event handlers (especially with services) is the ability for Nagios to proactively fix problems before anyone is notified. Another potential use for event handlers might be to log service or host events to an external database.
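The sketch below shows the kind of decision an event handler might make before attempting a fix. It assumes the handler is given the service state, the state type (SOFT or HARD), and the current check attempt. The policy itself (act on the third soft attempt, or on any hard CRITICAL state) and the function name are hypothetical.

```python
def should_attempt_restart(state, state_type, attempt, soft_trigger=3):
    """Decide whether a (hypothetical) handler should try to restart a service.

    Acting on a late SOFT attempt gives the fix a chance to work before the
    state turns HARD and notifications are sent.
    """
    if state != "CRITICAL":
        return False            # nothing to fix for OK, WARNING, or UNKNOWN
    if state_type == "HARD":
        return True             # last resort once the problem is confirmed
    return state_type == "SOFT" and attempt == soft_trigger


print(should_attempt_restart("CRITICAL", "SOFT", 3))
```

A real handler would follow a positive decision with the actual corrective action, such as restarting the affected service.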
Time Periods - A time period definition identifies a list of times during various days that are considered to be valid times for notifications and service checks.
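A time period can be modeled as a set of time ranges per weekday. The sketch below checks whether a given moment falls inside such a period. The WORKHOURS table and the function name are illustrative assumptions, and real time period definitions are written in the Nagios configuration rather than in Python.

```python
from datetime import datetime, time

# Hypothetical period: Monday (0) through Friday (4), 09:00 to 17:00.
WORKHOURS = {day: [(time(9, 0), time(17, 0))] for day in range(5)}


def in_time_period(period, when):
    """Return True if `when` falls inside one of the period's ranges."""
    ranges = period.get(when.weekday(), [])
    return any(start <= when.time() < end for start, end in ranges)


print(in_time_period(WORKHOURS, datetime(2012, 8, 6, 10, 0)))   # a Monday morning
print(in_time_period(WORKHOURS, datetime(2012, 8, 11, 10, 0)))  # a Saturday morning
```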
Commands - Commands that can be defined and executed by Nagios include host and service checks, host and service notifications, and host and service event handlers. Command definitions can contain macros, which make it easier to write generic commands.
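To illustrate how macros keep a command generic, the sketch below substitutes values for names written between dollar signs, in the style of Nagios macros such as $HOSTADDRESS$ and $ARG1$. The expand_macros function is a hypothetical stand-in for what the monitoring engine does internally, and the addresses and thresholds are sample values.

```python
import re

# A macro is an upper-case name wrapped in dollar signs, e.g. $HOSTADDRESS$.
MACRO = re.compile(r"\$([A-Z0-9_]+)\$")


def expand_macros(command_line, values):
    """Replace each known macro with its value; leave unknown macros untouched."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return MACRO.sub(substitute, command_line)


generic = "check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$"
expanded = expand_macros(
    generic,
    {"HOSTADDRESS": "10.0.0.10", "ARG1": "100.0,20%", "ARG2": "500.0,60%"},
)
print(expanded)
```

The same generic command line can then be reused for every host, with only the macro values changing.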
Contacts - A contact identifies someone who should be contacted in the event of a problem on your network. A contact group is one or more contacts grouped together for the purpose of sending out alert and recovery notifications. All contacts in a contact group are notified upon a host or service problem or recovery.
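The relationship between Host Groups, contact groups, and contacts can be sketched as a simple lookup. All group names, host names, and contact names below are invented for illustration, and the three tables stand in for what would normally be object definitions in the monitoring configuration.

```python
# Hypothetical configuration data.
HOST_GROUPS = {
    "london": {"lon-web1", "lon-db1"},
    "newyork": {"ny-web1"},
    "databases": {"lon-db1"},
}
HOSTGROUP_CONTACT_GROUPS = {
    "london": ["london-admins"],
    "newyork": ["ny-admins"],
    "databases": ["dba-team"],
}
CONTACT_GROUPS = {
    "london-admins": ["alice", "bob"],
    "ny-admins": ["carol"],
    "dba-team": ["bob", "dave"],
}


def recipients_for(host):
    """Return every contact reachable through the host's Host Groups."""
    recipients = set()
    for group, members in HOST_GROUPS.items():
        if host in members:
            for contact_group in HOSTGROUP_CONTACT_GROUPS.get(group, []):
                # Every contact in the contact group is notified.
                recipients.update(CONTACT_GROUPS.get(contact_group, []))
    return sorted(recipients)


print(recipients_for("lon-db1"))
```

Because a host can belong to several Host Groups, the set removes duplicates so that a contact in two groups is notified once.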
Profiles - A Profile is a collection of multiple services or hosts. Configuration uses device-specific profiles that contain both pre-defined and user-definable monitoring parameter settings. Using profiles, Administrators can quickly configure GroundWork Monitor to monitor groups of similar devices and benefit from GroundWork's recommended practices for monitoring design.
Dependencies - When a monitored item is not on the same subnet as the monitoring server, monitoring is dependent upon the intervening switches and routers. GroundWork's standard dependency relationships include:
- Monitoring on upstream switches and routers
- Status of services on hosts
- Availability of port-based monitoring agents
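The first relationship in the list, dependence on upstream switches and routers, can be sketched as follows. The example assumes each host records the upstream devices it depends on as its parents, which the text above does not spell out. The function name is hypothetical. The UP, DOWN, and UNREACHABLE statuses are the host statuses used by Nagios.

```python
def host_status(check_succeeded, parent_statuses):
    """Classify a host from its own check result and its upstream devices.

    parent_statuses lists the current status of each upstream switch or
    router (an assumed data model for this sketch).
    """
    if check_succeeded:
        return "UP"
    if parent_statuses and all(status != "UP" for status in parent_statuses):
        # Every path to the host is broken, so the host itself may be fine.
        return "UNREACHABLE"
    return "DOWN"


print(host_status(False, ["UP"]))
print(host_status(False, ["DOWN", "UNREACHABLE"]))
```

Separating DOWN from UNREACHABLE lets operators concentrate on the failed upstream device instead of on every host behind it.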
State and State Changes - To prevent false alarms, Nagios lets you define how many times a host or service check is retried before the host or service is considered to have a real problem. This maximum number of retries is controlled by the <max_check_attempts> option in the host and service definitions, respectively. The attempt that a host or service check is currently on determines its state type, and state types determine when event handlers are executed and when notifications are sent out. The current state of a host or service therefore has two components: 1) its status, which for a host can be UP, DOWN, UNREACHABLE, or PENDING, and for a service can be OK, WARNING, CRITICAL, UNKNOWN, or PENDING; and 2) its state type, which in Nagios is either soft or hard. A problem is in a soft state while its check is still being retried, and it becomes a hard state once the retries are exhausted.
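The retry logic can be sketched as a small function that labels each service check result with a state type. This is a simplified model under stated assumptions: every OK result is treated as a hard state, and changes between different non-OK statuses are not handled specially. The function name is hypothetical.

```python
def classify_checks(results, max_check_attempts=3):
    """Label each service check result as (status, state_type, attempt).

    A non-OK result stays SOFT until it has been seen max_check_attempts
    times in a row, at which point it becomes HARD.
    """
    classified = []
    failures = 0
    for status in results:
        if status == "OK":
            failures = 0
            classified.append((status, "HARD", 1))
        else:
            # Cap the counter: checks are not retried again after a hard state.
            failures = min(failures + 1, max_check_attempts)
            state_type = "HARD" if failures >= max_check_attempts else "SOFT"
            classified.append((status, state_type, failures))
    return classified


for entry in classify_checks(["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]):
    print(entry)
```

With max_check_attempts set to 3, the first two CRITICAL results are soft and only the third is hard, so a single failed check does not raise an alarm.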
Monitoring GUIs - Various CGIs are distributed with Nagios. By default, the CGIs require that you have authenticated to the web server and are authorized to view the information you request. You will need to set up the web interface and CGI authorization. GroundWork Monitor has a replacement for the entire Nagios web interface called Status. Status provides the complete status of all hosts and services that are being monitored. Status is a user-friendly way of peering into Nagios data for troubleshooting and problem resolution.
Basic Authentication - A username and password are required for access to GroundWork Monitor. They are independent of the operating system's login accounts and unique to the monitoring system. The system is role-based, and the login ID determines which applications a user has access to. An authenticated user is someone who has authenticated to the GroundWork web-based framework using either local authentication or LDAP-based authentication. The web-based framework supports a single sign-on capability for both system access and access to the underlying Nagios GUIs. User IDs are configured by the Administrator.
Notification Alarm Message Format - In the text below you can see the format for a notification alarm. The host and service are displayed along with the IP Address. The state is clearly displayed: here it is CRITICAL. The date and time of the alarm, any additional information, and a link to the location where the Operator can acknowledge the alarm are also included. Notification commands are typically found in /usr/local/groundwork/nagios/etc/misccommands.cfg. This file can be edited to change the content of the notification. Following is an example of a service email notification:
```
Service: myapp_url_port Host: myapp Address: 10.0.0.10 State: CRITICAL
Date/Time: Mon Aug 6 3:54:07 PDT 2012
Additional Info: Socket timeout after 10 seconds
```
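The layout of the notification text shown above can be reproduced with a small formatting function. This is only a sketch of the message structure. The actual notification commands live in misccommands.cfg, and the function and parameter names here are hypothetical.

```python
def format_service_notification(service, host, address, state, date_time, info):
    """Render the three-line service notification layout shown above."""
    return (
        f"Service: {service} Host: {host} Address: {address} State: {state}\n"
        f"Date/Time: {date_time}\n"
        f"Additional Info: {info}"
    )


message = format_service_notification(
    "myapp_url_port", "myapp", "10.0.0.10", "CRITICAL",
    "Mon Aug 6 3:54:07 PDT 2012", "Socket timeout after 10 seconds",
)
print(message)
```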
Downtime or Maintenance - Nagios allows you to schedule periods of planned downtime for hosts and services that you are monitoring. This is useful when you know in advance that you are going to take a server down for an upgrade, etc. When a host or service is in a period of scheduled downtime, notifications for that host or service are suppressed. Why use scheduled downtime?
- Avoids alarm fatigue
- Provides more accurate reports
- Reinforces change control discipline
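Notification suppression during scheduled downtime can be sketched as a check against a list of downtime windows. The function names and the maintenance window below are illustrative assumptions, not part of the Nagios or GroundWork interfaces.

```python
from datetime import datetime


def in_scheduled_downtime(when, windows):
    """Return True if `when` falls inside any (start, end) downtime window."""
    return any(start <= when < end for start, end in windows)


def should_notify(when, windows):
    """Notifications are suppressed while a downtime window is active."""
    return not in_scheduled_downtime(when, windows)


# Hypothetical maintenance window for a server upgrade.
upgrade = [(datetime(2012, 8, 6, 2, 0), datetime(2012, 8, 6, 4, 0))]
print(should_notify(datetime(2012, 8, 6, 3, 54, 7), upgrade))  # inside the window
print(should_notify(datetime(2012, 8, 6, 5, 0), upgrade))      # after the window
```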