How to troubleshoot performance graphing

If you have set up a performance graph and it is not working, this page will guide you through troubleshooting it.

You can also use the procedures and information here to prepare debugging data for opening a support case, or to try to determine exactly why your graphs do not function normally. If all of your graphs stopped working at once, please see "My performance graphs all stopped working. What's wrong?".

I want to prepare data to open a support case, since I can't figure out what's wrong:

Support usually needs the same debug data you would gather to diagnose the problem yourself. Here are the steps to gather it quickly for support to analyze.

  1. At the command line, edit the perfdata.properties file to set the debug_level to 3:
    sed -e '/debug_level/s/1/3/' -i /usr/local/groundwork/config/perfdata.properties
  2. To make this active, you can either:
    1. Do a Commit from the User Interface (Configuration -> Control -> Commit) OR:
    2. Kill the process_service_perfdata_file daemon (it will restart within a few minutes):
      pkill -f process_service_perfdata_file
      
  3. Wait at least 15 minutes for data to accumulate in the log file /usr/local/groundwork/nagios/var/log/process_service_perfdata_file.log. To confirm that debugging data is being written, you can follow the log with tail -f:
    tail -f /usr/local/groundwork/nagios/var/log/process_service_perfdata_file.log
    
  4. Compress the log file for transfer to your workstation. You can use tar and gzip to compress it:
    tar czvf performance_log.tar.gz /usr/local/groundwork/nagios/var/log/process_service_perfdata_file.log
    
  5. Copy the compressed file to your workstation and attach it to a support case.
  6. Change the debug setting back to 1, and kill the process again. Do not leave debugging on all the time, as it uses a lot of disk space and slows down your system:
    sed -e '/debug_level/s/3/1/' -i /usr/local/groundwork/config/perfdata.properties
    pkill -f process_service_perfdata_file
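
Before and after each sed edit in the steps above, it can help to confirm what the file actually contains. The following is a minimal sketch, assuming the setting sits on a single line containing the text debug_level; the helper name show_debug_level is hypothetical and not part of the product.

```shell
# Hypothetical helper: print the debug_level line(s) from a properties file.
# With no argument it uses the perfdata.properties path from the steps above.
show_debug_level() {
    grep 'debug_level' "${1:-/usr/local/groundwork/config/perfdata.properties}"
}

# Only query the default path when the file is readable on this machine.
if [ -r /usr/local/groundwork/config/perfdata.properties ]; then
    show_debug_level
fi
```

If the printed value is not what you expect after running sed, the substitution probably did not match the line as written in your file.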
    
I want to try to find out why my graph does not work by myself:

The data you need can be collected with the procedure above. Once debug output is being generated, you can analyze it yourself.
There are several types of entries in the debug log, but the typical entry for processing performance data for a service on a host looks like this:

[]
Host: localhost
Svcdesc: local_users
Lastcheck: 1281471736
Statustext: USERS OK - 1 users currently logged in
Perfdata:users=1;5;20;0
Adding label=users,value=1,warn=5,crit=20,min=0,max=0
Table host_service, host=localhost, service=local_users already has an existing entry for location /usr/local/groundwork/rrd/localhost_local_users.rrd. New entry not added.
Graph RRD command: rrdtool graph - --imgformat=PNG --slope-mode DEF:a=/usr/local/groundwork/rrd/localhost_local_users.rrd:users:AVERAGE CDEF:cdefa=a AREA:cdefa#0033CC:"Number of logged in users" -c BACK#FFFFFF -c CANVAS#FFFFFF -c GRID#C0C0C0 -c MGRID#404040 -c ARROW#FFFFFF-Y --height 120
Nothing changed for localhost local_users
Update RRD command: /usr/local/groundwork/common/bin/rrdtool update /usr/local/groundwork/rrd/localhost_local_users.rrd 1281471736:1 2>&1
Posting data to Foundation
performancedatalabel=users
performancevalue=1
Elapsed Execution Time = 3914.383 seconds

In this example, everything is working normally. The key fields are:

Host: localhost
Svcdesc: local_users

These uniquely identify the host and service.
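
Because every entry is keyed by these two fields, you can pull out only the entries for the service you are debugging. The following is a rough sketch, assuming each entry begins with a "Host:" line immediately followed by a "Svcdesc:" line, as in the sample above. The function name show_entries is hypothetical, and any summary lines that follow an entry in the log will be printed along with it.

```shell
# Hypothetical helper: print log entries for one host/service pair.
# $1 = log file, $2 = host name, $3 = service description
show_entries() {
    awk -v host="$2" -v svc="$3" '
        # A "Host:" line starts a new entry; remember it only if the host matches.
        /^Host: / {
            printing = 0
            pending = ($0 == ("Host: " host)) ? $0 : ""
            next
        }
        # The "Svcdesc:" line decides whether the remembered entry is printed.
        pending != "" && /^Svcdesc: / {
            if ($0 == ("Svcdesc: " svc)) { print pending; printing = 1 }
            pending = ""
        }
        printing { print }
    ' "$1"
}

LOG=/usr/local/groundwork/nagios/var/log/process_service_perfdata_file.log
if [ -r "$LOG" ]; then
    show_entries "$LOG" localhost local_users
fi
```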

Lastcheck: 1281471736

This is the Unix timestamp at which the data was produced. The data will be graphed as occurring at this time.
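
To check that a Lastcheck value corresponds to the time you expect, you can convert it to a readable date. This sketch assumes GNU date, as found on most Linux systems; the "-d @N" form is a GNU extension.

```shell
# Lastcheck value copied from the sample log entry.
lastcheck=1281471736

# GNU date: "-d @N" interprets N as seconds since the epoch; -u prints UTC.
date -u -d "@$lastcheck" '+%Y-%m-%d %H:%M:%S UTC'
# prints: 2010-08-10 20:22:16 UTC
```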

Statustext: USERS OK - 1 users currently logged in

This is the actual plugin output before the performance data. Note that it can be parsed with the Status Text Regex in Configuration -> Performance to extract perf data (e.g. the number "1" here).

Perfdata:users=1;5;20;0

This is the perfdata from the plugin, in standard Nagios plugin format. See http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN201

Adding label=users,value=1,warn=5,crit=20,min=0,max=0

This is the result of parsing the perf data. Loosely, the label and value arrays are used in the RRD create command as $LABEL1$, $LABEL2$, ... and $VALUE1$, $VALUE2$, ..., to create the RRD with the appropriate DS names (typically the labels) and to populate the RRD with data (typically the values). The $WARN1$... and $CRIT1$... series are also available and can be useful for drawing the thresholds on the graphs. The $MAX1$... and $MIN1$... series can be used similarly.
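
To see how a perfdata string maps onto these fields, you can split one by hand. This is only an illustrative sketch, not the product's own parser. It handles a single label=value;warn;crit;min;max item with no unit suffix, and it shows missing fields as 0 purely because the sample log line does so (an assumption about the real behavior).

```shell
# Perfdata string copied from the sample log entry.
perfdata='users=1;5;20;0'

label=${perfdata%%=*}     # text before the first "="
fields=${perfdata#*=}     # text after the first "="

# Split the remainder on ";" into value, warn, crit, min, max.
IFS=';' read -r value warn crit min max <<EOF
$fields
EOF

# ASSUMPTION: missing fields are displayed as 0, mirroring the sample log line.
parsed="label=$label,value=$value,warn=${warn:-0},crit=${crit:-0},min=${min:-0},max=${max:-0}"
echo "$parsed"
# prints: label=users,value=1,warn=5,crit=20,min=0,max=0
```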

Table host_service, host=localhost, service=local_users already has an existing entry for location /usr/local/groundwork/rrd/localhost_local_users.rrd. New entry not added.

This just tells you the result of looking up the performance definition, which matches the host and service to a particular configuration.

Graph RRD command: rrdtool graph - --imgformat=PNG --slope-mode DEF:a=/usr/local/groundwork/rrd/localhost_local_users.rrd:users:AVERAGE CDEF:cdefa=a AREA:cdefa#0033CC:"Number of logged in users" -c BACK#FFFFFF -c CANVAS#FFFFFF -c GRID#C0C0C0 -c MGRID#404040 -c ARROW#FFFFFF-Y --height 120

This is the result of substituting the values from the service and perf data into the RRD Graph Command entered in the performance configuration. If you type this command at a command line and redirect its output to a .png file, you can test the actual graph generation for this host and service. The command is stored in the Foundation database for this host and service and is updated only when it changes, so the message
"Nothing changed for localhost local_users"
appears when no update to the stored command is needed.
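
Testing the graph command by hand might look like the sketch below. It uses a trimmed version of the logged command (the color options are left out for brevity) and the paths from the sample entry. The helper name render_test_graph and the output path are hypothetical, so substitute the command from your own log.

```shell
# Hypothetical helper: render a test graph to a PNG file.
# $1 = rrdtool binary, $2 = RRD file, $3 = output PNG
render_test_graph() {
    if [ ! -x "$1" ] || [ ! -r "$2" ]; then
        echo "rrdtool or RRD file not found: $1 $2" >&2
        return 1
    fi
    # "graph -" writes the image to standard output, so redirect it to a file.
    "$1" graph - --imgformat=PNG --slope-mode \
        "DEF:a=$2:users:AVERAGE" CDEF:cdefa=a \
        'AREA:cdefa#0033CC:Number of logged in users' \
        --height 120 > "$3"
}

# Example call with the paths from the sample log entry:
# render_test_graph /usr/local/groundwork/common/bin/rrdtool \
#     /usr/local/groundwork/rrd/localhost_local_users.rrd \
#     /tmp/localhost_local_users.png
```

If rrdtool prints an error instead of producing an image, that error usually points at the part of the RRD Graph Command that needs correcting.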

Update RRD command: /usr/local/groundwork/common/bin/rrdtool update /usr/local/groundwork/rrd/localhost_local_users.rrd 1281471736:1 2>&1

This is the actual command used to insert data into the RRD for this run of the check of this service on this host. Note that you will probably get an error if you run this exact command at the command line, because data cannot be inserted into an RRD twice for the same timestamp. If no data is reaching your RRD, the command may be malformed, so running it at the command line can still help you diagnose why it fails. To change it, modify the entry for this service in Configuration -> Performance.
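
To tell a timestamp error apart from a genuinely malformed command, you can compare the timestamp in the logged command with the RRD's last update time, which "rrdtool last" prints. This is a sketch under the assumption that the paths match the sample entry; the helper name is_newer_than_last_update is hypothetical.

```shell
# Hypothetical helper: succeeds only if the candidate timestamp is newer
# than the last update. $1 = last update time, $2 = candidate time
is_newer_than_last_update() {
    [ "$2" -gt "$1" ]
}

RRDTOOL=/usr/local/groundwork/common/bin/rrdtool
RRD=/usr/local/groundwork/rrd/localhost_local_users.rrd

if [ -x "$RRDTOOL" ] && [ -r "$RRD" ]; then
    last=$("$RRDTOOL" last "$RRD")   # timestamp of the most recent update
    if is_newer_than_last_update "$last" 1281471736; then
        echo "timestamp is newer than the last update; a timestamp error is not expected"
    else
        echo "timestamp is not newer than the last update; expect an error when re-running"
    fi
fi
```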

Posting data to Foundation
performancedatalabel=users
performancevalue=1

This is the record of posting data for this service to Foundation for use in the enterprise performance reports. Note that as of 6.2 this data is batched and sent as XML, so you will not see it at this position in the file after upgrading to 6.2 or above.

The first time a service check's performance data is processed, the RRD is created automatically (or creation is at least attempted). In that case, the RRD Create command echoed in this file is often informative. If you do not see the RRD for your host and service, that command may be malformed. You can run it at the command line to see what error message it produces, then correct it in the Performance configuration for that service.
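
One way to locate echoed create commands in the debug log is a simple text search. This sketch assumes the echoed command contains the words "rrdtool create"; the exact wording of the log line is not shown on this page, so adjust the pattern if nothing matches. The helper name find_create_commands is hypothetical.

```shell
# Hypothetical helper: list log lines that mention an rrdtool create command.
# ASSUMPTION: the echoed command contains the text "rrdtool create".
find_create_commands() {
    grep -i 'rrdtool create' "$1"
}

LOG=/usr/local/groundwork/nagios/var/log/process_service_perfdata_file.log
if [ -r "$LOG" ]; then
    find_create_commands "$LOG" || echo "no create commands found in $LOG"
fi
```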

Other information in the debug log lists the attempts to post the graph commands to Foundation, as well as summary data for each run of the process.
The default is to run the process every 5 minutes: the data is placed in a file called /usr/local/groundwork/nagios/var/service-perfdata.dat.being_processed, which is picked up by the process_service_perfdata_file daemon. These processes are launched by Nagios with the launch_perfdata_process script, which is called as the "Service performance data file processing command" defined in Configuration -> Control -> Nagios Main Configuration -> Page 3. See "My performance graphs all stopped working. What's wrong?" for an example.