Triggering OpenNMS notifications when patterns occur in a log file

A common problem with OpenNMS is how to monitor a log file and trigger alerts when certain conditions are met. Let me clarify with an example: you have a mission-critical app that sometimes experiences internal errors. The application keeps running and still responds to requests, but the errors slow down the system or delay further processing. Process monitoring and network polling cannot detect this kind of issue; the only way is to tail the application log file and look for certain messages.

The problem can usually be solved by forwarding the log file to OpenNMS through syslog, but what about logs generated by applications that don't speak syslog, or cases where you don't want to configure syslog forwarding?

The Collectd tail plugin comes to the rescue. Collectd is an interesting monitoring agent that can be integrated with almost anything, even though I think it is primarily used together with Graphite.
Since Collectd does not natively speak any of the protocols supported by OpenNMS, the integration has to be done through some sort of scripting.

Solution Overview

I installed Collectd (5.2, custom built rpm, thanks fpm!) on the host running the application and configured it to tail the log file and look for lines matching certain patterns. Whenever a line matches, a counter is incremented, and if the value exceeds a threshold an external notification script is invoked. In my case I want to be notified of every single occurrence, so the threshold condition is: value != 0.
The notification script then forks out a call to OpenNMS' own send-event.pl. In OpenNMS I have configured a notification connected to the event UEI, which sends out alerts to our support personnel.

Shown below are the Collectd configuration file and the notification script. send-event.pl can simply be copied over from the OpenNMS host.

File: collectd.conf
-------------------

    Interval 10

    LoadPlugin logfile
    #LoadPlugin write_graphite
    LoadPlugin csv
    LoadPlugin threshold
    LoadPlugin exec
    LoadPlugin tail

    <Plugin "logfile">
      LogLevel "debug"
      File "stdout"
      Timestamp true
    </Plugin>

    <Plugin exec>
      NotificationExec "me" "/opt/collectd/bin/notif.pl"
    </Plugin>

    <Plugin "csv">
      DataDir "/tmp"
      StoreRates true
    </Plugin>

    <Plugin "tail">
      <File "/tmp/scp.log">
        Instance "scp"
        <Match>
          Regex "ERROR"
          DSType "CounterInc"
          Type "counter"
          Instance "hi_error"
        </Match>
      </File>
    </Plugin>

    # Load required matches:
    #LoadPlugin match_empty_counter
    #LoadPlugin match_hashed
    LoadPlugin match_regex
    LoadPlugin match_value
    #LoadPlugin match_timediff

    # Load required targets:
    LoadPlugin target_notification
    #LoadPlugin target_replace
    #LoadPlugin target_scale
    #LoadPlugin target_set
    #LoadPlugin target_v5upgrade

    PostCacheChain "SelectHiErrors"
    <Chain "SelectHiErrors">
      <Rule "selecthi">
        <Match "regex">
          TypeInstance "^hi_error$"
        </Match>
        <Target "jump">
          Chain "CheckHiErrors"
        </Target>
      </Rule>
      <Target "write">
      </Target>
    </Chain>
    <Chain "CheckHiErrors">
      <Rule "checkhivalue">
        <Match "value">
          Min 0
          Max 0
          Invert true
        </Match>
        <Target "notification">
          Message "%{type_instance}"
          Severity "WARNING"
        </Target>
      </Rule>
    </Chain>

File: notif.pl
--------------

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Sys::Syslog qw(:standard :macros);

    openlog('collectd_notif', "ndelay,pid", LOG_USER);

    # Collectd passes the notification on STDIN as "Field: value" lines.
    while (<>) {
        chomp;
        my ($key, $val) = split(/:/, $_, 2);
        next unless defined $val;    # skip blank lines and the message body
        if ($key =~ /TypeInstance/ && $val =~ /hi_error/) {
            my @args = ("/opt/collectd/bin/send-event.pl",
                        "uei.my.org/collectd/scp/HiError",
                        "-i", "192.168.123.123", "opennms.my.org");
            # The syslog priority macro is LOG_ERR (LOG_ERROR does not exist).
            system(@args) == 0
                or syslog(LOG_ERR|LOG_USER, "Error sending UEI uei.my.org/collectd/scp/HiError");
        }
    }
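If you want to check the parsing logic without a running Collectd, the snippet below is a minimal sketch that feeds a hand-written notification to a small parser. It assumes the exec plugin's documented input format ("Field: value" header lines, an empty line, then the message body). The parse_notification helper and all sample values (host name, timestamp) are made up for illustration and are not part of my setup. It only prints what it would do instead of calling send-event.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a collectd exec-plugin notification read from a filehandle:
# header lines of the form "Field: value", an empty line, then the body.
# Returns a hash reference with the header fields and the body as a string.
sub parse_notification {
    my ($fh) = @_;
    my (%header, @body);
    my $in_body = 0;
    while (my $line = <$fh>) {
        chomp $line;
        if ($in_body)    { push @body, $line; next; }
        if ($line eq '') { $in_body = 1;      next; }
        # Split on the first colon only, so values containing ':' survive.
        my ($key, $val) = split /:\s*/, $line, 2;
        $header{$key} = defined $val ? $val : '';
    }
    return (\%header, join("\n", @body));
}

# Hand-written sample notification; every value here is hypothetical.
my $sample = join "\n",
    'Severity: WARNING',
    'Time: 1360000000',
    'Host: apphost.my.org',
    'Plugin: tail',
    'PluginInstance: scp',
    'Type: counter',
    'TypeInstance: hi_error',
    '',
    'hi_error',
    '';

# Read the sample through an in-memory filehandle instead of STDIN.
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my ($header, $message) = parse_notification($fh);
close $fh;

if (($header->{TypeInstance} // '') eq 'hi_error') {
    print "would send uei.my.org/collectd/scp/HiError ",
          "($header->{Severity}) for $header->{Host}\n";
}
```

Parsing the whole header into a hash first makes it easy to pass other fields, such as the severity or the host, on to OpenNMS later if you need them.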

Notes

To accept events from other hosts, eventd has to be configured to listen on all IP addresses (by default it binds only to 127.0.0.1). Since this can pose a security risk, iptables should be used to restrict access.
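A quick way to verify the bind address and the firewall rules from the application host is to try opening a TCP connection to eventd. The following is a rough sketch, not something from my setup: eventd_reachable is a hypothetical helper, and the port you pass is an assumption (5817 is, as far as I know, the default eventd TCP port, so check your eventd configuration).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Return 1 if a TCP connection to $host:$port can be opened within
# $timeout seconds, 0 otherwise.
sub eventd_reachable {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# Usage: check-eventd.pl HOST [PORT]
# The script name is arbitrary; the port falls back to the assumed default 5817.
my ($host, $port) = @ARGV;
$port = 5817 unless defined $port;

if (!defined $host) {
    print "usage: $0 HOST [PORT]\n";
} elsif (eventd_reachable($host, $port, 5)) {
    print "eventd on ${host}:${port} accepts connections\n";
} else {
    print "cannot connect to ${host}:${port}, check the eventd bind address and iptables\n";
}
```

Run it from the host where Collectd is installed, for example with opennms.my.org as the argument. A successful connection only proves that the port is open, not that events are processed, so still check the OpenNMS event list after a test run.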

The configuration file in the example above instructs Collectd to use standard output for logging and to write values out to a csv file in /tmp. I left both in so that those unfamiliar with Collectd can run it in the foreground to figure things out, but you should disable them in production.

unicolet.org
