Integrating Nagios XI alerts to a Slack Channel

Recently, I have been exploring Slack for messaging and collaboration, and in particular its integration with other tools. In this post I will discuss integrating Slack with Nagios XI so that alerts generated by Nagios are sent to a Slack channel. Firstly, you need a Slack login and must be an owner or a member of a team. You can then browse to the Slack App Directory and search for Nagios to enable and configure the integration.

To integrate Nagios and configure the plugin you need to install the required Perl modules, download the plugin to the Nagios XI server, place it in the directory ‘/usr/local/bin’ and change the file’s access permissions so that it is executable.

sudo yum install perl-libwww-perl
sudo yum install perl-Net-SSLeay
cd /tmp
wget https://raw.github.com/tinyspeck/services-examples/master/nagios.pl
sudo cp nagios.pl /usr/local/bin/slack_nagios.pl
sudo chmod 755 /usr/local/bin/slack_nagios.pl

We will need to edit ‘/usr/local/bin/slack_nagios.pl’ and set the $opt_domain and $opt_token variables to match your Slack configuration. In the below example, I am using the Slack team domain ‘dean.slack.com’ and the token generated for the Nagios integration is ‘BIRQpEaFMixAi6LsMMj80bcC’.

my $opt_domain = "dean.slack.com";
my $opt_token = "BIRQpEaFMixAi6LsMMj80bcC"; 
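
If you would rather script this edit than make it by hand, the following is a minimal sketch. The function name set_slack_credentials is my own and not part of the plugin, and it assumes the two assignments each start a line with ‘my $opt_domain =’ and ‘my $opt_token =’ as in the example above.

```shell
# Hypothetical helper (not part of the Slack plugin): rewrite the
# $opt_domain and $opt_token assignments in a copy of the plugin.
# Assumes the domain and token contain no '|' or '&' characters.
set_slack_credentials() (
    file=$1
    domain=$2
    token=$3
    tmp=$(mktemp) || exit 1
    if sed -e "s|^my [$]opt_domain = .*|my \$opt_domain = \"$domain\";|" \
           -e "s|^my [$]opt_token = .*|my \$opt_token = \"$token\";|" \
           "$file" > "$tmp"; then
        # Write back through the original file so its permissions are kept.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    exit "$status"
)

# Usage (run with sufficient privileges to write to /usr/local/bin):
# set_slack_credentials /usr/local/bin/slack_nagios.pl dean.slack.com BIRQpEaFMixAi6LsMMj80bcC
```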

Now we will configure Nagios to define a contact and the commands used by the plugin. In this example, I will use the Slack channel ‘nagiosalerts’ for both the host and service notification commands. Firstly, I will modify the file ‘/usr/local/nagios/etc/contacts.cfg’ to define the contact for Slack and specify the host/service notification periods, options and commands.

define contact {
      contact_name                             slack
      alias                                    Slack
      service_notification_period              24x7
      host_notification_period                 24x7
      service_notification_options             w,u,c,r
      host_notification_options                d,r
      service_notification_commands            notify-service-by-slack
      host_notification_commands               notify-host-by-slack
}

Now we will define the notification commands by modifying the file ‘/usr/local/nagiosxi/tmp/nagiosxi/subcomponents/nagioscore/mods/cfg/objects/commands.cfg’ and including the following in the notification settings section. As per my example, I am configuring the command_line to use the Slack channel ‘nagiosalerts’ for both the ‘notify-service-by-slack’ and ‘notify-host-by-slack’ commands.

define command {
      command_name     notify-service-by-slack
      command_line     /usr/local/bin/slack_nagios.pl -field slack_channel=#nagiosalerts
}

define command {
      command_name     notify-host-by-slack
      command_line     /usr/local/bin/slack_nagios.pl -field slack_channel=#nagiosalerts
}

We can also define contact group membership for the contact defined for Slack. In this example, I have modified an existing contact group for the ‘Nagios Administrators’ to include the slack contact as a member by modifying the file ‘/usr/local/nagios/etc/contactgroups.cfg’.

define contactgroup {
      contactgroup_name     admins
      alias                 Nagios Administrators
      members               nagiosadmin, slack
}

We also need to ensure that the file ‘/usr/local/nagiosxi/tmp/nagiosxi/subcomponents/nagioscore/mods/cfg/nagios.cfg’ contains the following configuration value to enable environment macros.

enable_environment_macros=1
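
A quick way to confirm the setting before restarting is to search the file for it. This is only a sketch: the function name environment_macros_enabled is my own, and it assumes the directive appears on a line of its own and is not commented out.

```shell
# Succeeds only if the given nagios.cfg contains an uncommented
# enable_environment_macros=1 line (hypothetical helper name).
environment_macros_enabled() {
    grep -Eq '^[[:space:]]*enable_environment_macros[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1"
}

# Usage:
# environment_macros_enabled /path/to/nagios.cfg && echo "environment macros enabled"
```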

Finally, to apply the configuration and enable the plugin we need to restart Nagios. Following the restart, alerts should be sent to the Slack channel as they are generated.

sudo service nagios restart

Nagios XI: Automating Host Management

I was recently looking at how to automate adding and removing managed hosts and services in Nagios XI, which can be particularly useful in cloud computing and large environments where configuration management solutions have been implemented for provisioning. In these environments we typically use configuration files based on the attributes of a server role during the provisioning and configuration cycle.

Nagios XI contains a number of scripts in the directory /usr/local/nagiosxi/scripts that allow for automated host management, as below:

Script                        Description
reconfigure_nagios.sh         Imports configuration files from the import directory, verifies the configuration and restarts Nagios if verification succeeds. If verification fails, the configuration is rolled back to the last working checkpoint. This is the command invoked from the web interface when selecting ‘Apply Configuration’.
nagiosql_delete_host.php      Removes a host from the configuration database and removes the configuration file.
nagiosql_delete_service.php   Removes services from the configuration database and removes the configuration file.

To automate adding managed hosts and services, I create a single configuration file for each host and its services, in which the service definitions are applied only to that host and not to a host list or host group, and name the configuration file according to the hostname. In the below example, I have created a single configuration file which defines the host and a managed service for CPU Usage and saved the configuration file as ‘server1.dean.local.cfg’.

define host {
 host_name server1.dean.local
 use xiwizard_windowsserver_host
 address server1.dean.local
 max_check_attempts 5
 check_interval 5
 retry_interval 1
 check_period xi_timeperiod_24x7
 notification_interval 60
 notification_period xi_timeperiod_24x7
 icon_image win_server.png
 statusmap_image win_server.png
 _xiwizard windowsserver
 register 1
 } 

define service {
 host_name server1.dean.local
 service_description CPU Usage
 use xiwizard_windowsserver_nsclient_service
 check_command check_xi_service_nsclient!!CPULOAD!-l 5,80,90
 max_check_attempts 5
 check_interval 5
 retry_interval 1
 check_period xi_timeperiod_24x7
 notification_interval 60
 notification_period xi_timeperiod_24x7
 _xiwizard windowsserver
 register 1
 }
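
Because only the hostname changes between files, the definitions above lend themselves to a template. The following is a minimal sketch of that idea; the function name render_windows_host_cfg is hypothetical, and the directive values are simply copied from the example above.

```shell
# Hypothetical helper: print the host and CPU Usage service definitions
# for one Windows host, with the hostname supplied as the first argument.
render_windows_host_cfg() {
    cat <<EOF
define host {
 host_name $1
 use xiwizard_windowsserver_host
 address $1
 max_check_attempts 5
 check_interval 5
 retry_interval 1
 check_period xi_timeperiod_24x7
 notification_interval 60
 notification_period xi_timeperiod_24x7
 icon_image win_server.png
 statusmap_image win_server.png
 _xiwizard windowsserver
 register 1
 }

define service {
 host_name $1
 service_description CPU Usage
 use xiwizard_windowsserver_nsclient_service
 check_command check_xi_service_nsclient!!CPULOAD!-l 5,80,90
 max_check_attempts 5
 check_interval 5
 retry_interval 1
 check_period xi_timeperiod_24x7
 notification_interval 60
 notification_period xi_timeperiod_24x7
 _xiwizard windowsserver
 register 1
 }
EOF
}

# Usage: one file per host, named after the hostname.
# render_windows_host_cfg server1.dean.local > server1.dean.local.cfg
```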

Once the configuration file has been created we can place it in the import directory located at ‘/usr/local/nagios/etc/import’ and invoke the script reconfigure_nagios.sh from the directory ‘/usr/local/nagiosxi/scripts’ to import the configuration file, verify the configuration and restart Nagios if successful. If verification of the configuration fails, Nagios XI will restore the configuration files to the last working checkpoint, but the imported configuration will remain in the configuration database. To detect failures, the script returns one of the following exit codes, where an exit code of ‘0’ confirms that the configuration has been verified as a working configuration and Nagios has been restarted.

Exit Code   Description
0           no problems detected
1           config verification failed
2           nagiosql login failed
3           nagiosql import failed
4           reset_config_perms failed
5           nagiosql_exportall.php failed (write configs failed)
6           /etc/init.d/nagios restart failed
7           db_connect failed
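
When calling the script from a provisioning tool it helps to turn the exit code into a readable message. A minimal sketch, using the codes listed above; the function name describe_reconfigure_exit is my own.

```shell
# Map an exit code from reconfigure_nagios.sh to its description
# (hypothetical helper; descriptions are taken from the table above).
describe_reconfigure_exit() {
    case $1 in
        0) echo "no problems detected" ;;
        1) echo "config verification failed" ;;
        2) echo "nagiosql login failed" ;;
        3) echo "nagiosql import failed" ;;
        4) echo "reset_config_perms failed" ;;
        5) echo "nagiosql_exportall.php failed (write configs failed)" ;;
        6) echo "/etc/init.d/nagios restart failed" ;;
        7) echo "db_connect failed" ;;
        *) echo "unrecognised exit code $1" ;;
    esac
}

# Usage:
# /usr/local/nagiosxi/scripts/reconfigure_nagios.sh
# rc=$?
# echo "reconfigure_nagios.sh: $(describe_reconfigure_exit "$rc")"
```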

Now that we have added a managed host and services, how do we remove them from the configuration database and delete the configuration file once the host is terminated? Provided the host has no dependent relationships, we can firstly remove the services using the configuration name, which matches the configuration file name of the managed host (this is why it is important to name the configuration file according to the hostname), by invoking ‘nagiosql_delete_service.php’ from the directory ‘/usr/local/nagiosxi/scripts’ as in the below example:

./nagiosql_delete_service.php --config=server1.dean.local

After the services have been successfully deleted we can remove the host by invoking the ‘nagiosql_delete_host.php’ script:

./nagiosql_delete_host.php --host=server1.dean.local

Once the host has been successfully removed, we can apply the new configuration as previously by invoking the ‘reconfigure_nagios.sh’ script. This method can also be applied to remove an imported configuration from the configuration database if verification of the configuration has failed during an import.
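
The three removal steps can be chained so that each one runs only if the previous step succeeded. This is a sketch rather than a tested tool: the names run and remove_managed_host and the DRY_RUN and SCRIPTS_DIR variables are my own, and it assumes the scripts are executable as in the examples above.

```shell
# Directory holding the Nagios XI scripts (overridable for testing).
SCRIPTS_DIR=${SCRIPTS_DIR:-/usr/local/nagiosxi/scripts}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Delete the services, then the host, then apply the configuration.
# The chain stops at the first step that fails.
remove_managed_host() {
    run "$SCRIPTS_DIR/nagiosql_delete_service.php" "--config=$1" &&
    run "$SCRIPTS_DIR/nagiosql_delete_host.php" "--host=$1" &&
    run "$SCRIPTS_DIR/reconfigure_nagios.sh"
}

# Usage: preview the commands first, then run them for real.
# DRY_RUN=1; remove_managed_host server1.dean.local
# DRY_RUN=0; remove_managed_host server1.dean.local
```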

The above describes how to automate adding and removing hosts and services using Nagios XI and can be applied to your configuration management solutions during the provisioning and configuration cycle. In my scenario, I created a number of configuration files based on the attributes of server roles, which can be used as cookbook templates in Chef, using the ‘{node[‘fqdn’]}’ pattern to specify the host name in both the definition file and the configuration file name. I have also written PowerShell functions to perform the above, which I will discuss in a later post.

Nagios: Starting NSClient++ service fails with the error ‘NSClient++ (x64) is not a valid Win32 application.’

I was recently investigating an issue where the Nagios monitoring agent (NSClient++) service failed to start with the following error message:

The NSClient++ (x64) service failed to start due to the following error:
NSClient++ (x64) is not a valid Win32 application.

In the first instance I attempted to resolve the issue by uninstalling and reinstalling the monitoring agent on the impacted host, but the same behaviour was experienced. On investigation, I found a knowledge base article which describes the above symptom and gives the cause of the issue as below:

  • The path of a service’s executable file contains spaces.
  • There is a file or folder on your computer’s hard disk that has the same name as a file or folder in the path to the service’s executable file.

The first cause above describes the condition that was causing the issue, so in order to resolve it I wrapped the image path filename in quotation marks, as follows:

1) Browse to the registry key ‘HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NSClientpp’.

2) Modify the REG_EXPAND_SZ value data for ImagePath to be "C:\Program Files\NSClient++\nsclient++.exe", including the quotation marks.
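
As I understand it, the quotation marks matter because an unquoted path is cut at each space and every prefix is tried in turn as the executable (with ‘.exe’ appended where needed), so anything named ‘C:\Program’ would be tried before the real file. The sketch below only illustrates that splitting; the function name unquoted_path_candidates is hypothetical and it does not query Windows in any way.

```shell
# Print each space-delimited prefix of an unquoted path, in the order
# they would be considered, ending with the full path.
unquoted_path_candidates() (
    rest=$1
    prefix=""
    while :; do
        case $rest in
            *" "*)
                head=${rest%% *}     # text before the first space
                rest=${rest#* }      # text after the first space
                prefix="$prefix$head"
                printf '%s\n' "$prefix"
                prefix="$prefix "
                ;;
            *)
                printf '%s\n' "$prefix$rest"
                break
                ;;
        esac
    done
)

# Usage:
# unquoted_path_candidates 'C:\Program Files\NSClient++\nsclient++.exe'
```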

Following the modification I was successfully able to start the NSClient++ service and continue monitoring the host.

Nagios XI: Host and Service details not being displayed

Recently, I was troubleshooting an issue with Nagios XI where host and service details were not being displayed in the web management console.

On investigating the log file at ‘/var/log/messages’ there were a number of errors identifying that a table in the MySQL database was marked as crashed and needed to be repaired.

ndo2db: mysql_error: 'Table './nagios/nagios_timedeventqueue' is marked as crashed and should be repaired'

In order to repair the table marked as crashed, I ran the below on the Nagios XI monitoring server and reconnected to the web management console, after which both host and service details were displayed as expected.

myisamchk --safe-recover /var/lib/mysql/*/*.MYI
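
To see which tables are affected, the table names can be pulled out of the log messages. A minimal sketch, assuming the messages follow the format of the error shown above; the function name crashed_tables is my own.

```shell
# List the unique table paths reported as crashed, reading log lines
# from the named files or from standard input.
crashed_tables() {
    sed -n "s/.*Table '\([^']*\)' is marked as crashed.*/\1/p" "$@" | sort -u
}

# Usage:
# crashed_tables /var/log/messages
```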

 

Monitoring vCenter privilege reassignment with Nagios XI

During a restart of the ‘VMware VirtualCenter Server’ service, if a user or group assigned to the Administrator role at the root folder level cannot be verified, the privileges of that user or group are revoked.

As part of security hardening on the vCenter server, I created a Nagios Remote Plugin Executor (NRPE) check that searches the application log for the resulting event and sets a service status accordingly.

Firstly, we only need to query the application log for events logged after the ‘VMware VirtualCenter Server’ service started. We can retrieve this as a date value by using the Get-Process cmdlet to return the ‘StartTime’ value of the process ‘vpxd’.

$Start= (Get-Process vpxd).StartTime

Now that we have a date value from which to query the application log, we need to filter the log further using the ‘Get-EventLog’ cmdlet to retrieve events similar to the below:

Log Name: Application
Source: VMware VirtualCenter Server
Date: M/DD/YYYY H:MM:SS PM
Event ID: 1000
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: [vCenter Server]
Description: Removing permission for entity "<group name>", group "DOMAIN\Account", role -1. Reason: User or group not found.

We will now create a filter to pass to the ‘Get-EventLog’ cmdlet to retrieve any results like the above and store them in a variable so that we can use the number of results as a count. The below will filter for the Source ‘VMware VirtualCenter Server’, the EntryType ‘Warning’ and where the message text is like ‘Removing permission*User or group not found*’ (the trailing wildcard allows for the closing full stop in the message).

The ‘ErrorAction’ preference is required because, if the filter returns no results, an error would otherwise be written to the console output. The pipeline is wrapped in @() so that the result is always an array and the Count property is available even when only one event, or none, is returned.

$Query = @(Get-EventLog -LogName Application -Source "VMware VirtualCenter Server" -EntryType "Warning" -After $Start -ErrorAction SilentlyContinue | Where-Object {$_.Message -like "Removing permission*User or group not found*"})

Conditional logic is then used to create a service status message based on the count of results returned by the above query. If zero results are returned the service status is set to ‘OK’, with status information stating that no instances of privilege reassignment have been found since the process start time.

If one or more results are returned, the service status is set to ‘Critical’, with status information giving the number of instances of privilege reassignment found since the process start time.

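# The count is compared as an integer; exit codes follow the Nagios plugin convention
If ($Query.Count -eq 0)
    {
    # No matching events: service status OK
    "No instances of privilege reassignment since " + ($Start).ToString("dd/MM/yyyy HH:mm")
    $returncode = 0
    }
ElseIf ($Query.Count -ge 1)
    {
    # One or more matching events: service status Critical
    "" + $Query.Count + " instances of privilege reassignment since " + ($Start).ToString("dd/MM/yyyy HH:mm")
    $returncode = 2
    }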

The PowerShell session will now exit and return the exit code.

exit $returncode
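
The exit code is what Nagios uses to set the service status: 0 is OK, 1 is Warning, 2 is Critical and 3 is Unknown. To show the same mapping outside of PowerShell, here is a minimal shell sketch; the function name privilege_check_status is my own, and the Unknown branch for a non-numeric count is an addition that the script above does not have.

```shell
# Print a status line for the given event count and return the matching
# Nagios plugin exit code (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).
privilege_check_status() {
    case $1 in
        ''|*[!0-9]*)
            echo "UNKNOWN: event count is not a number"
            return 3
            ;;
    esac
    if [ "$1" -eq 0 ]; then
        echo "OK: no instances of privilege reassignment"
        return 0
    fi
    echo "CRITICAL: $1 instances of privilege reassignment"
    return 2
}

# Usage:
# privilege_check_status 3; echo "exit code: $?"
```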

Once you have configured the external script to run within Nagios (http://wp.me/p15Mdc-eC), for a service status of ‘OK’ you should receive something similar to the below:

[Image: CountVMUPR]

CPU Stats service reports “UNKNOWN: iostat not found or is not executable by the nagios user.”

When monitoring Linux hosts in Nagios XI, the following error was being reported for the CPU Stats service:

"UNKNOWN: iostat not found or is not executable by the nagios user."

This is due to the check command having a dependency on the performance monitoring package sysstat, which provides iostat. To resolve the issue, install the package as below:

apt-get install sysstat
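
Before waiting for the next service check, you can confirm that the command is now available. A minimal sketch; the function names are my own, and to reflect what the check sees it should be run as the nagios user.

```shell
# Succeeds when the named command can be found on PATH by the current user.
command_available() {
    command -v "$1" >/dev/null 2>&1
}

# Print a short status line for a dependency of a check command.
report_dependency() {
    if command_available "$1"; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# Usage on the monitored host:
# report_dependency iostat
```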

On the next service check you should receive a valid service status and information message.

[Image: cpustatusCapture]

Monitor WUInstall status in Nagios XI

I recently wrote about checking the last success time of Windows Update and reporting this to Nagios (http://wp.me/p15Mdc-mj). But what happens if you do not use Windows Update as your patch management solution? In my case I have been managing the installation of updates using WUInstall (http://www.wuinstall.com).

When my updates are installed the registry is not updated to reflect the last success time, so how can I monitor the last time updates were run and their status? The command line tool WUInstall can write its console output to a log file; in the below example each log file is written to a shared folder.

As per the previous example, I want to report whether any updates have been installed in a particular time-span and whether the process was successful.

The time-span depends on the host being monitored, so it is specified as a mandatory parameter when invoking the PowerShell script that is used as the check plugin within Nagios.

Param ([parameter(Mandatory = $true)][int] $Days)

Now I want to return the most recent log file for the host being monitored from the shared folder, where the log file name is that of the host and is contained in a parent folder based on the date the process was invoked. By using the Get-ChildItem cmdlet with the recurse option I am able to retrieve the most recent log file by sorting by the LastWriteTime in descending order and selecting the first file. I will return the full name of the file to a variable to pass to the Get-Content cmdlet for reading.

$LogFile = (Get-ChildItem "\\Server\Share\Logs" -Recurse | Where-Object {$_.Name -like "$env:computername*"} | Sort LastWriteTime -Descending | Select -First 1).FullName

To read the content of the log file we need to specify Unicode encoding, and then search for the string ‘Overall’. To retrieve the overall result code we return the line following the match and store it in a variable.

$Log = Get-Content $LogFile -Encoding Unicode | Select-String "Overall" -Context 0,1 | % {$_.Context.PostContext}

To get the last run time, the first ten characters of the string, which contain the date, are parsed into a DateTime value using the ParseExact method. This value is kept for the date comparisons, and a copy formatted as ‘dd/MM/yyyy’ is stored for the status message.

# Parse the leading date (yyyy/MM/dd) and keep it as a DateTime for comparisons
$LastRunDate = [datetime]::ParseExact($Log.Substring(0,10) , "yyyy/MM/dd", $null)
# Formatted copy used in the status message
$LastRun = $LastRunDate.ToString("dd/MM/yyyy")

Now that we have the last run time and the result code in the log variable we can use conditional logic to set the status of the service. Firstly, we check whether the last run date is greater than or equal to the current date minus the number of days specified in the mandatory Days parameter and, if the result code is like ‘Succeeded’, return the service status as ‘OK’.

If ($LastRunDate -ge (Get-Date).AddDays(-$Days) -and $Log -like "*Succeeded*")
   { 
   $resultcode = "0"
   }

If the last run date is older than the time-span value but the result code is like ‘Succeeded’, we return the service status as ‘Warning’.

ElseIf ($LastRunDate -lt (Get-Date).AddDays(-$Days) -and $Log -like "*Succeeded*")
   { 
   $resultcode = "1"
   }

If the result code is not like ‘Succeeded’ we will return the service status as ‘Critical’.

ElseIf ($Log -notlike "*Succeeded*")
   { 
   $resultcode = "2"
   }

If we are unable to retrieve any information required in the conditional logic the service status will be returned as ‘Unknown’.

Else
   { 
   $resultcode = "3"
   }
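
The conditional logic above amounts to a small decision table over the age of the last run and the result text. The shell sketch below restates it for illustration only. The function name wuinstall_status and its arguments are my own, and treating an empty result as ‘Unknown’ is an assumption on my part.

```shell
# Return the Nagios exit code for a WUInstall run.
#   $1 = age of the last run in days
#   $2 = maximum acceptable age in days
#   $3 = overall result text from the log
wuinstall_status() {
    case $3 in
        *Succeeded*)
            if [ "$1" -le "$2" ]; then
                return 0    # OK: recent and successful
            fi
            return 1        # WARNING: successful but too old
            ;;
        '')
            return 3        # UNKNOWN: no result text (assumption)
            ;;
        *)
            return 2        # CRITICAL: last run did not succeed
            ;;
    esac
}

# Usage:
# wuinstall_status 2 7 "$result_text"; echo "exit code: $?"
```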

Finally, we output a status information message, where the Substring method is invoked on the log variable to remove the first twenty characters of the string, which contain timestamp information, and the last run time is appended. The PowerShell session then terminates and returns the exit code for the service status.

$Log.Substring(20,$Log.Length-20) + " at " + $LastRun
exit $resultcode

Once a check_nrpe command has been configured in Nagios (http://wp.me/p15Mdc-eC) and you begin to monitor your host(s), you should see a service check as below:

[Image: WindowsUpdateStatusNagios]

You can also run the above in Windows PowerShell to return the status information, as below:

[Image: PowershellStatusWindowsUpdate]

The full PowerShell script can be downloaded from: https://app.box.com/s/cc2pkvx9a10g5ww6ha0i