Introduction
============

Ok folks. You asked for it and you got it, about three weeks late,
but... There is still one major bug that I can't seem to get a handle
on. Search for EAGAIN below for the details. However, I think this is
in some semblance of sanity for you all to try out.

If you don't understand something, ask. If you are trying to solve a
particular issue of yours using this patch but can't quite figure out
how to do it, ask.

Sadly this isn't as polished as I was hoping for, and I apologize in
advance for anything that is broken or unclear.

Now without further delay:

Overview
========

These patches apply to nagios 2.6, and possibly other 2.x or
development 3.x series, using the Nagios Event Broker (NEB). They
allow nagios to supply its active event stream to an external process
for advanced correlation. This allows you to write rules in SEC or
another correlation engine to change the way nagios displays and
responds to errors.

The control settings are assigned to a service object and allow
controlling the processing of an event based on:

  host running the service
  service
  source of event: active or passive
  status of event: ok, warning, critical and unknown

One of four things can happen to the event (depending on processing
mode 1, 2, 3, or 4):

 1) it is passed to the nagios core and not made available to the
    external correlation engine.

 2) it is passed unchanged to the nagios core and also passed to the
    external correlation engine. In this mode nagios sees the result
    of the active event even when it causes a change in state.

 3) a modified result of the active event is passed to the nagios
    core and also passed to the external correlation engine. In this
    mode nagios sees a modified event that does not change the
    current status of the service. This allows the external
    correlation engine to determine if the status should change and
    submit a passive event to change the status and trigger
    notifications etc.

 4) (NOTE: This is not yet implemented.)
    the active event is passed to the external correlation engine and
    the event is ignored by nagios as though it had never been
    received. This leaves the decision to the external correlation
    engine and is preferred to mode 3, as it doesn't cause a
    temporary update of the service data that may be overwritten by
    the correlation engine.

Once the data arrives at the external correlation engine you can
perform many different operations. Using a text based event stream
should allow the use of different engines including:

  SEC - Simple Event Correlator: http://www.estpak.ee/~risto/sec/
  Esper - a java 1.5 based correlation engine: http://esper.codehaus.org/

With SEC I have successfully:

  Required 4 consecutive ok active polls before changing the state of
  the service back to ok. This replaces the flap detection by making
  sure that the service is in an ok state 'long enough' before
  asserting it is ok.

  Suppressed the warning state when 2 cron processes are running if
  it is during the time window when backups would run. This is an
  example of changing the thresholds for a check depending on what
  time it is.

  Changed plugin output from a less useful error message to a message
  that describes the actual cause of the error. This allows mapping
  of plugin output from one message to another.

Manifest
========

  sec.cfg - configuration file for nagios that creates three services
            for interacting with the patches and the SEC (or other)
            engine. Needs editing for your site.
  corr.sr - core correlation rules for use with SEC. Needs editing for
            your site.
  examples.sr - sample rules for different correlation techniques.
  nagios_log.sr - rules for analyzing the nagios log file.
  start_sec - init.d style file for controlling the sec process. Needs
            editing for your site.
  nagios_sec_manual.txt - this file
  4 patches described below: patches/neb.patch, patches/control.patch,
            patches/module.patch, patches/buffer_slots.patch

Building with the patch
=======================

There are three required patches and one optional but recommended
patch that increases the number of slots in the circular ring buffer
used to handle the external command pipe. See:

  http://permalink.gmane.org/gmane.network.nagios.devel/3216

A permanent fix that allows setting the ring buffer size in
nagios.cfg has been committed to some branch of a future nagios,
which makes this patch obsolete. See:

  http://permalink.gmane.org/gmane.network.nagios.devel/3239

for the proper fix.

Use the same build directory that you used to build and install your
current test nagios setup. Save the 4 patch files. From the top of
the nagios build tree use "patch -p1 < patchfile" for each one. The
patch files are:

  neb.patch - adds a new Nagios Event Broker callback needed by the
              module.
  control.patch - adds the control attributes for the module to the
              service object.
  module.patch - creates the NEB module.
  buffer_slots.patch - increases the circular ring buffer size. May
              not be needed depending on your version of Nagios. See
              the comments at the beginning of this section.

All the files should patch without issue. Once this is done follow
the normal nagios build/install instructions. Run:

  make clean
  make nagios
  make modules

Then install base/nagios where you want your nagios binary, or run
'make install-base'. There is no nagios install target for modules,
so copy module/ext_corr.o to some location by hand. I use
/usr/lib/nagios/modules/ext_corr.o, but
/etc/nagios/modules/ext_corr.o could also work. You will have to
reference this path in your nagios.cfg file.

Using the patch
===============

Modify nagios.cfg and add (all on a single line):

  broker_module=/usr/lib/nagios/modules/ext_corr.o --file /tmp/sec --tag module1 --control ops01;ZSecControl --output standard

The arguments to the ext_corr.o module are:

  --file - (required) the path to the file where the events are
     written. It must be a file and not a named pipe for anything
     except the smallest nagios installation, as the buffer size for
     named pipes is too small to buffer sufficient data. If you are
     not using a named pipe, the directory must be writable by the
     nagios user so that the file can be rotated and a new file
     created by nagios. For performance, using a RAM disk or memory
     based filesystem under Solaris may be a win. I suggest making
     this the first argument, as it is used to identify the ext_corr
     instance by default unless the tag is defined. There is no
     default and the NEB module will exit with an error if it is
     missing.

  --tag - (optional) the tag is used to identify log messages in the
     nagios log file when using multiple instances. If not present,
     the basename of the output file is used.

  --control - (optional, but recommended) commands can be sent to the
     NEB module via this service, for example to force the module to
     close and reopen the output file (i.e. rotate the file). The
     format of the service is <hostname>;<service description>. You
     need to create a service of this name. A sample definition is
     included in sec.cfg. There is no default. Using an initial
     character that doesn't occur in your other services (Q or Z, for
     example) minimizes the amount of processing needed to identify
     the control service. The sample nagios config (sec.cfg) and the
     sec rules file (corr.sr) have to be modified to insert the first
     (hostname) and second (service description) components in some
     rules.

  --output - "standard" generates an event stream to the correlation
     engine with a 4 character prefix (three status characters and a
     space) before the PROCESS_SERVICE_CHECK_RESULT (see below for
     format).
     If the format is "plain" the output stream is only the Nagios
     PROCESS_SERVICE_CHECK_RESULT. The default is standard.

In nagios.cfg add:

  cfg_file=/etc/nagios/sec.cfg

using the sample sec.cfg file from the distribution. You will have
to edit the file and set the hostname and service description for
host ops01, service ZSecControl to match what you specified to the
--control argument.

Running Multiple Instances
--------------------------

If you copy ext_corr.o to a new file name you can run multiple
instances of it (for example, to feed two correlators that look at
different subsets of events). You cannot create two instances from an
object file of the same name. You should also change the tag name for
each instance to be able to distinguish between the two in log
messages.

New service object properties
-----------------------------

The service object gets two new properties:

  ec_passive_action
  ec_active_action

They perform the same roles for passive and active actions. I will
discuss the active form only, as the passive form is analogous. A
sample invocation is:

  define service{
        service_description     CronDaemonCheck
        host_name               test
        use                     generic-service
        check_command           check_local_procs!:1!1:!-C crond
        ec_active_action        critical,unknown:warning
        }

This defines the following operations:

  for "ok" events - operate in mode 1. This is the default since none
      of the values specify "ok" or "o".
  for "critical" or "unknown" events - operate in mode 2. Send the
      unmodified event to Nagios and a copy to SEC.
  for "warning" events - operate in mode 3. Send a modified copy of
      the event to Nagios and a copy of the original event to SEC.

The format is '<status list>:<status list>'. A status list is a comma
separated list of the status names:

  o or ok
  w or warning
  c or critical
  u or unknown

Statuses that occur before the first ':' are handled in mode 2 (e.g.
critical and unknown above). Statuses that occur after the first ':'
are handled in mode 3 (e.g. warning above). Statuses that are not
specified are handled in mode 1.
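
To make the status list format concrete, here is a minimal Python sketch of how such a value could be turned into a status-to-mode table. It is not the parser from control.patch: the function name parse_ec_action is hypothetical, and the choice to let mode 3 win when a status appears on both sides of the ':' is my assumption, since the behaviour of the module in that case is not described here.

```python
# Hypothetical helper, NOT taken from the patch: it only mirrors the
# documented '<status list>:<status list>' format.
STATUS_NAMES = {
    "o": "ok", "ok": "ok",
    "w": "warning", "warning": "warning",
    "c": "critical", "critical": "critical",
    "u": "unknown", "unknown": "unknown",
}


def parse_ec_action(value):
    """Return a {status: mode} dict for an ec_active_action style value."""
    # Statuses that are never mentioned stay in mode 1.
    modes = {name: 1 for name in ("ok", "warning", "critical", "unknown")}
    # Split on the first ':' only: left side is mode 2, right side is mode 3.
    mode2_part, _, mode3_part = value.partition(":")
    for mode, part in ((2, mode2_part), (3, mode3_part)):
        for token in part.split(","):
            token = token.strip().lower()
            if not token:
                continue
            if token not in STATUS_NAMES:
                raise ValueError("unknown status name: %r" % token)
            # Assumption: a status listed on both sides ends up in mode 3.
            modes[STATUS_NAMES[token]] = mode
    return modes


print(parse_ec_action("critical,unknown:warning"))
# {'ok': 1, 'warning': 3, 'critical': 2, 'unknown': 2}
```

The printed table matches the CronDaemonCheck example above: ok stays in mode 1, critical and unknown go to mode 2, and warning goes to mode 3.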

In the examples below, modes are discussed in terms of the status of
an event for a service, e.g. "the ok status of SERVICE should be
handled with mode 3". Since the service object definition is unique
for a host/service_description pair, you can set the operating mode
uniquely for each (status, service, host) triplet.

Choosing what mode to use for services
--------------------------------------

As mentioned above, the plugin currently operates in three modes.
This section discusses which mode to put a service/status in.

Mode 1
~~~~~~

If the event doesn't need to be passed to the correlation engine,
choose mode 1. These events are isolated in that they don't have an
effect on other services, nor do other services have an effect on
them.

Mode 2
~~~~~~

If the event is going to be used as input to the correlator, but is
not going to be overridden (set) by the correlator, use mode 2. Also
use mode 2 if you want (or don't mind) a notification when the
correlator determines the proper state. E.g.:

  service is ok
  critical event comes in in mode 2
  nagios sends alert
  correlator determines it is ok
  correlator submits an ok event
  nagios notifies that problem is clear

A variant of the above that may be useful applies when
max_check_attempts (the retry count) for the service is set greater
than 1 (e.g. max_check_attempts 2):

  service is ok
  critical event comes in in mode 2
  nagios logs it and sets the count to 1
  correlator determines it is ok
  correlator submits an ok event
  nagios logs the state change and sets the count back to 0

Mode 3
~~~~~~

If the correlator will set the status of the event, or you don't want
notification/logging of events (although state stalking can be used
for debugging), use mode 3. These are services that depend on others
to determine if a problem is occurring. Mode 3 is also used for
implementing different problem thresholds in different time periods.

Plugin output format
====================

The plugin has two output modes: standard and plain.

Standard Output Format
~~~~~~~~~~~~~~~~~~~~~~

This format has the following form:

  <prefix> <passive service check result>

An example is:

  32a [1168276673] PROCESS_SERVICE_CHECK_RESULT;cook02.example.org;DeadLetterCheck;3;Remote command execution failed: ssh: connect to host 192.168.0.0 port 22: No route to host

The prefix consists of three characters:

  character 1 - current service status
     0 - current status of service is ok
     1 - current status of service is warning
     2 - current status of service is critical
     3 - current status of service is unknown
  character 2 - module mode (1 is never seen since mode 1 events are
                not passed to the correlation file)
     2 - pass event to nagios (mode 2)
     3 - pass modified event to nagios (mode 3)
     4 - remove event from nagios (mode 4, not yet implemented)
  character 3 - active or passive event
     a - active event
     p - passive event

Then a space character separates the prefix from the nagios passive
service check result. The passive check result can be passed into
nagios to record/set the status of the service. Further information
on the PROCESS_SERVICE_CHECK_RESULT passive command can be obtained
from:

  http://nagios.org/developerinfo/externalcommands/commandinfo.php?print=true&command_id=114

Dedicated nagios services
=========================

There are two parts to the dedicated services that allow nagios to
monitor sec and vice versa: a nagios config file and a SEC config
file. These two files are linked, so changing the service names in
one requires corresponding changes in the other.

Nagios Config
-------------

The nagios configuration file implements the 3 or so additional
services for use with the plugin:

  * a service to control the plugin (rotate log files primarily)
  * a service for nagios to probe the SEC process
  * a service to see if the external SEC process is running.

You will have to change the host_name for each service to be the name
of your nagios host. You will also want to change the contact_groups
for each service.
Also, if you don't have a check_dummy check command defined, add:

  # 'check_dummy' command definition
  define command{
        command_name    check_dummy
        command_line    $USER1$/check_dummy $ARG1$
        }

to one of your config files. The check_dummy plugin is a standard
part of the nagios plugins and should be installed.

Plugin Control (service ZSecControl)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The primary use of this service is to rotate the log file (see
Automatic File Rotation below). If a manual ok event is sent and the
text of the event is "rotate" (without the quotes), the file that
connects Nagios to SEC will be closed by the NEB module and reopened.
So you can move the file aside, submit the rotate request and a new
file will be created. SEC will automatically finish with the old file
and move on to the new file when it is created. This usually happens
within 10 seconds and always within 60 seconds.

The name of the control host and service description are passed using
the --control option in nagios.cfg.

Probe the SEC process (service SecReport)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a service that runs a check_dummy script every minute. A SEC
rule forwards the result to the process-running service
(SecAliveCheck). It can also be used by SEC to report things to the
user about ongoing operations or errors. It has state stalking turned
on so that reports from SEC will be recorded in the logs.

Is the external SEC process running (service SecAliveCheck)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a service that doesn't run an active check, but has a 2
minute freshness threshold. If the loop from nagios -> sec -> nagios
is not completed, it will execute the check_dummy check because the
result is stale. If this doesn't update the service, a notification
should be sent. (Note: this check could possibly be written better.
Also, I have never seen the notification because my test install runs
with notifications disabled, so it is quite possible I screwed up the
check definition in some way.)

Note that the freshness check will commonly trigger just after a
nagios startup, as the checks are scheduled out in time and it may be
more than 2 minutes before the probe runs. Once the first probe runs,
it comes to steady state and should work ok provided all of the SEC
writes to the nagios command pipe are completed without error (see
below for the EAGAIN failure from SEC).

SEC config
----------

A basic required SEC ruleset with the minimum set of rules is
supplied in the corr.sr file. This implements:

  Setting of variables upon SEC startup. See the comments in the file
  for configuration info.

  Ignoring nagios log messages (this is used if sec is watching both
  the nagios log and the event stream). This is the mode that
  start_sec operates in.

  Mapping the SecReport check to SecAliveCheck.

  Automatic rotation of the file sending events from nagios to SEC.

  Automatic forwarding of any mode 3 events that haven't been
  handled.

Search for CHANGEME in the file for info on what has to be
customized.

Setting of variables upon SEC startup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SEC permits the definition of variables that are used in action
statements. This includes email addresses, file names etc. See the
comment at the top of corr.sr for further info. You will need to
customize these for your site.

Mapping the SecReport check to SecAliveCheck
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following rule detects the periodic SecReport service messages
defined in the nagios sec.cfg file. The arrival of these messages
triggers the submission of a SecAliveCheck to nagios on the host
defined in %H.

  # default core rules for SEC/Nagios integration

  # Map the poll from SecReport to SecAliveCheck to prove that sec is running
  # and processing events.
  type = single
  desc = Handle SEC Keepalive (core)
  ptype = regexp
  rem = CHANGEME replace ops01 with the name of your nagios host
  pattern = ops01;SecReport.*Testing SEC.
  action = write %nagiosCmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%H;SecAliveCheck;0;OK: SEC is forwarding events)

Automatic File Rotation
~~~~~~~~~~~~~~~~~~~~~~~

The following SEC rules rotate the file that communicates events from
Nagios to SEC, and detect/notify if the rotation takes more than 60
seconds.

  # rule 1: detect the start of a new file.
  type = single
  desc = if file rotation occurs, delete the time_rotation context (core)
  ptype = substr
  rem = CHANGEME replace ops01 and ZSecControl with the name
  rem = of your nagios host and command service (specified by
  rem = --control passed to the nagios module) and replace
  rem = /tmp/sampler with the argument to the --file option.
  pattern = PROCESS_SERVICE_CHECK_RESULT;ops01;ZSecControl;0;Output file /tmp/sampler reopened
  context = time_rotation
  action = delete time_rotation

  # rule 2: generate the rotation every morning at 6AM.
  type = calendar
  time = 00 06 * * *
  desc = rotate /etc/sample within 60 seconds once a day (core)
  rem = move file, start context that will detect if rotation
  rem = not done is 60 seconds, write event to cause module to rotate file
  action = shellcmd /bin/mv %eventStreamFile %{eventStreamFile}.old; \
           create time_rotation 60 (pipe '%s' /bin/mail -s "%eventStreamFile file rotation failure" %notify ); \
           write %nagiosCmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%H;%S;0;rotate|)

Rule 2 does the work. Every day at 6AM it moves the file named by the
variable %eventStreamFile to the same name with .old appended. It
then sets up a context called time_rotation with a 60 second
lifetime, and writes the rotate request to the nagios command pipe.
If the context expires normally, it reports the file rotation failure
using mail.

Rule 1 detects the first line in a rotated file. It then deletes the
time_rotation context, which does not trigger the email report
mechanism.
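
If you want to trigger a rotation by hand rather than from SEC, the same two steps (move the file aside, then submit the rotate request) can be sketched in Python. This is only an illustration: the function names are hypothetical, the paths in the usage comment are placeholders for your own --file argument and nagios command pipe, and the "rotate|" text is simply copied from rule 2 above.

```python
import os
import time


def rotate_command(host, service, now=None):
    """Build the passive check line that asks the NEB module to rotate."""
    ts = int(time.time()) if now is None else int(now)
    # Same text that rule 2 writes: status 0 (ok) with plugin output "rotate|".
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;0;rotate|\n" % (
        ts, host, service)


def request_rotation(stream_file, command_pipe, host, service):
    """Move the event stream file aside, then submit the rotate request."""
    os.rename(stream_file, stream_file + ".old")
    # Opening a FIFO for writing blocks until nagios has it open for reading.
    with open(command_pipe, "w") as pipe:
        pipe.write(rotate_command(host, service))


# Example call; all four values are placeholders for your own site:
# request_rotation("/tmp/sec", "/path/to/nagios.cmd", "ops01", "ZSecControl")
```

Unlike rule 2, this sketch does not check that the "reopened" message shows up afterwards, so a failed rotation would go unnoticed.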

Note that the email of a failure message could be changed to report a
critical event using the SecReport service as well.

Feel free to change the rotation time and frequency. I have had it
rotating once a minute with only a minor increase in load, but once
an hour is probably the most often this is needed.

Automatic Forwarding of Mode 3 Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mode 3 events have not taken effect in nagios (they are forwarded to
the core maintaining the current state, not the new state determined
by the active event), so the default rule in the SEC file forces
these state changing events to be processed by nagios. If they reach
this final rule, the assumption is that they were not part of any
other rule that suppressed or otherwise handled the event, and that
they should be processed by nagios rather than discarded.

  type = single
  desc = Forward Active Mode 3 Events (core)
  ptype = regexp
  pattern = ^(.3a )(.*)$
  action = write %nagiosCmd $2

Adding new rules
~~~~~~~~~~~~~~~~

You can add new rules anywhere after the comment

  # Start rules that apply to the event stream

Some sample rules are located in examples.sr.

SEC use cases with examples
===========================

The package comes with some sample rules in the examples.sr file. In
this document the lines are numbered using (1). Do not use the
numbers in an actual implementation. The proper form is shown in the
examples.sr file.

Multiple Thresholds in Time
---------------------------

Suppress the warning state when 2 cron processes are running if it is
during the time window when backups would run.

To do this, set the warning state for the CronDaemonCheck to use mode
3 (since more than one process is a warning condition). Then set up 2
sec rules to define the backups_running time period. The first turns
the backups_running context/flag on:

  # see section 5.3.4 in
  # http://sixshooter.v6.thrupoint.net/SEC-examples/article-part2.html
  # to understand the reason for the calendar spec.
  type = calendar
  time = * 0-6 * * *
  desc = start backups_running context
  context = [! backups_running]
  action = create backups_running

by running every minute from midnight through 6:59 AM. A second
calendar rule turns the context off:

  type = calendar
  time = * 7-23 * * *
  desc = stop backups_running context
  context = [backups_running]
  action = delete backups_running

by running every minute from 7 AM through 11:59 PM.

Then define a rule to check the warning event generated by Nagios to
see if:

  * the number of cron processes is 2
  * the event occurred during the backups_running time period.

If so, it submits a modified ok event that clears any problem and
reports that it was cleared because backups are running.

  # Capture the mode 3 event CronDaemonCheck on host concord
  # If 2 cron processes are running submit the result as ok
  # between 6PM and 2AM. Normally 2 processes is a warning.
  #
  type=single
  desc = concord can have 2 cron processes between 6PM and 2 am for backups
  ptype=regexp
  pattern = ^... (\[[0-9]+\] PROCESS_SERVICE_CHECK_RESULT\;concord\;CronDaemonCheck);1;(PROCS WARNING: ([0-9]+) processes with command name .crond.*)
  rem = $1 is the part of the event before the state info.
  rem = $2 is the part of the event after the state info (plugin output)
  rem = $3 is the number of running processes
  rem = This rule applies only if backups_running context exists and
  rem = there are exactly two running processes.
  context = backups_running && =($3 == 2)
  rem = if it applies, write the event with status 0 (ok) and a note
  rem = before the plugin output indicating that it is a correlated
  rem = output
  action = write %nagiosCmd ($1;0;[backups running] $2)

Variant of Multiple Thresholds on Other Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A variant on this can be done by using another SEC rule to trigger
the context. E.g.
if you have a web application that operates slowly (takes 20 seconds
to respond vs 5) for the first 5 minutes after the web server comes
up, you could:

  create a "webServerStarted" context in SEC with a 5 minute lifetime
  based on the clear status of the WebServerCheck.

  create a rule that looks for the critical event coming from
  webAppCheck, verifies that the response time is within 20 seconds,
  and suppresses (or clears) the event if the webServerStarted
  context exists.

In this case the ok state of the WebServerCheck would be handled
using mode 2, and the critical state of the webAppCheck would be
handled using mode 3.

Flap detection replacement/Count ok polls before reset
------------------------------------------------------

The following 2 SEC rules make the service reset to ok only after 4
consecutive ok results have been received. The service ok state must
be handled in mode 3 for this to work properly.

  # rule 1
  type=single
  rem = use takenext to have the failure asserted by the default rule
  continue=takenext
  desc = detect non-ok state for rtables_split_check on host $2
  ptype=regexp
  rem = look for mode 3 active events.
  pattern = ^.3a (\[[0-9]*\] PROCESS_SERVICE_CHECK_RESULT\;([^;]+)\;rtables_split_check\;[123]\;.*)
  rem = reset the SingleWithThreshold if a non-ok check occurs
  action = reset +1 require 4 consecutive ok's for rtables_split_check on host $2

  # rule 2
  type = SingleWithThreshold
  desc = require 4 consecutive ok's for rtables_split_check on host $2
  ptype = regexp
  rem = match only when current service state is not ok
  pattern = ^[123]3a (\[[0-9]*\] PROCESS_SERVICE_CHECK_RESULT\;([^;]+)\;rtables_split_check\;0\;.*)
  action = write %nagiosCmd $1
  rem = window = 3 minute interval/cycle * 4 cycles * 60 sec/minute + 30 sec process time
  window=750
  thresh=4

The first rule detects non-ok states of the service on a host and
resets any active SingleWithThreshold operation that may have been
counting ok states for that service and host.
The second rule counts ok states of the service and host. When 4 are
received in the 750 second window, it passes the ok through to
nagios.

If a failure message comes through, it triggers rule 1 and is passed
on to further rules in the file. Because it is a failure message, it
does not trigger rule 2, but it does trigger the catchall rule that
passes all mode 3 events to nagios, thus changing the state (from,
say, warning to critical or vice versa).

Replace plugin output with a new message
----------------------------------------

You can also replace plugin output with a new message. This rule:

  type = single
  desc = shorten and diagnose ssh failure with row of @@@@@@@@
  ptype = regexp
  pattern = ^(... )(.*);Remote command execution failed: \@+(.*)$
  rem = $1 is the event prefix text.
  rem = $2 is the nagios passive command except the plugin output
  rem = $3 is the text after the row of @'s
  action = write %nagiosCmd ($2;ssh key failure, likely remote host identification has changed. $3.)

detects the failure mode that produces a row of @ signs and replaces
it with more useful text.

Reschedule checks on failure
----------------------------

We use check_by_ssh extensively. When the ssh service fails, the ssh
checks clump together. This increases the client host's load and can
also trigger limits on the maximum number of ssh connections that can
be in the opening state. The following SEC rule:

  type = single
  desc = reschedule ssh exchange failure some random 23-42 seconds in future
  rem = Have only one pending reschedule per host in any minute.
  ptype = regexp
  pattern = PROCESS_SERVICE_CHECK_RESULT\;([^;]+)\;([^;]+)\;[0123]\;Remote command\ execution failed: ssh_exchange_identification: Connection closed by remote host
  rem = $1 = host name $2 = service description
  context = ! resched_$1_in_progress
  action = eval %d (int(rand(20))+23+%u;); create resched_$1_in_progress 60 ; \
           write %nagiosCmd ([%u] SCHEDULE_FORCED_SVC_CHECK;$1;$2;%d)

detects this failure mode and reschedules one service a random 23-42
seconds into the future (int(rand(20)) yields 0 through 19). It will
only reschedule one service per minute (per host) to try to reduce
the log jam.

Nagios Automation (future)
--------------------------

When running a 24x7 operation, getting woken up in the middle of the
night is annoying. SEC can automate some steps to provide additional
sleep.

Normally if a problem occurs, you can fix it but the symptoms can
still persist, as the problem may have caused backups in other places
in the system. If you expect the problem to clear in the next hour,
you would normally schedule downtime for 1 hour. However this means
that you will also get the OK notification sometime during that hour.
To prevent this you could:

 1) schedule the downtime
 2) suppress notifications (to prevent the ok from being paged)
 3) submit an ok/clear manually
 4) re-enable notifications

Now the regularly scheduled check will cause the service to go back
into critical state, since there is still a problem that will clear
itself. However this critical change occurs during the hour of
downtime, so no notification is sent. Because the change occurred
during the downtime, the ok notification is also suppressed, allowing
an uninterrupted (we hope) night's sleep if the problem clears before
the downtime window expires. If the downtime window expires first,
then the event is paged.

However, performing these four steps is a pain, especially at 4 in
the morning. If the plugin registers for a scheduled downtime notice,
it can look at the message that comes through and perform the 4 steps
automatically if the message starts with a given keyword.

Drawbacks
=========

Note that for SEC correlated services, there is no SOFT state if you
use SEC for counting.

Open Issues
===========

These are the open issues.
If anybody has any feedback or ideas I would love to hear them.

Active check failure
--------------------

If the active check fails (command doesn't exist etc.) no event is
sent to the event stream. Is this desirable? You can see the failure
by monitoring the nagios log file, but should there be an event of
some form?

Correlation info missing from config cgi
----------------------------------------

The external correlation information isn't displayed in the
configuration cgi.

EAGAIN failure from SEC when writing to nagios pipe/fifo
--------------------------------------------------------

From 1 to 5% of the time on my system I am getting errors in the sec
log where it can't write to the nagios command pipe/fifo. You can see
the number of failures compared to the total number of writes by
running:

  grep Error sec_log.txt | wc -l ; grep 'Writing event' sec_log.txt | wc -l
  1236
  100932

This is with the ring buffer size set to 20480, but the failure
frequency didn't change much from when the setting was 10240.

If anybody has any bright ideas on solving this I am all ears. I will
be experimenting with increasing the buffer settings and trying to
determine under what conditions it occurs. In nagios 3 this should be
better, as you can have multiple command pipes and even command files
with no size limit.

Definitions
===========

Event Stream - the results from active plugin execution, or passive
check results for a service.
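
If you want to consume the event stream from something other than SEC, a line in the standard output format can be split apart as in the Python sketch below. It is based only on the format described under Standard Output Format: the names Event and parse_event are hypothetical, and the regular expression assumes that host names and service descriptions never contain a ';'.

```python
import re
from collections import namedtuple

# Hypothetical record type for one line of the event stream.
Event = namedtuple(
    "Event",
    "current_status mode source timestamp host service status output")

STATUS = ("ok", "warning", "critical", "unknown")

# Three prefix characters, a space, then the passive check result.
LINE_RE = re.compile(
    r"^([0-3])([234])([ap]) "
    r"\[(\d+)\] PROCESS_SERVICE_CHECK_RESULT;"
    r"([^;]+);([^;]+);([0-3]);(.*)$"
)


def parse_event(line):
    """Return an Event for a standard format line, or None if no match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    cur, mode, src, ts, host, service, status, output = match.groups()
    return Event(
        current_status=STATUS[int(cur)],
        mode=int(mode),
        source="active" if src == "a" else "passive",
        timestamp=int(ts),
        host=host,
        service=service,
        status=STATUS[int(status)],
        output=output,
    )


sample = ("32a [1168276673] PROCESS_SERVICE_CHECK_RESULT;"
          "cook02.example.org;DeadLetterCheck;3;"
          "Remote command execution failed: ssh: connect to host "
          "192.168.0.0 port 22: No route to host")
print(parse_event(sample))
```

For the sample line from this document, the result describes an active mode 2 event for DeadLetterCheck on cook02.example.org whose current and reported status are both unknown.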