Introduction
============

Ok folks. You asked for it and you got it, about three weeks late, but...
There is still one major bug that I can't seem to get a handle on.
Search for EAGAIN below for the details. However I think this is in
some semblance of sanity for you all to try out.

If you don't understand something, ask. If you are trying to solve a
particular issue of yours using this patch but can't quite figure out
how to do it, ask.

Sadly this isn't as polished as I was hoping for, and I apologize in
advance for anything that is broken or unclear. Now without further
delay:

Overview
========

These patches apply to nagios 2.6 and possibly other 2.x or
development 3.x series using the Nagios Event Broker (NEB).  They
allow nagios to supply its active event stream to an external process
for advanced correlation.  This allows you to write rules in SEC or
another correlation engine to change the way nagios displays and
responds to errors.

The control settings are assigned to a service object and allow you
to control the processing of an event based on:

   host running the service
   service
   source of event: active or passive
   status of event: ok, warning, critical and unknown

One of four things can happen to the event (depending on processing
mode 1, 2, 3, or 4):

  1) it is passed to the nagios core and not made available to the
     external correlation engine.

  2) it is passed unchanged to the nagios core and also passed to the
     external correlation engine. In this mode nagios sees the result
     of the active event even when it causes a change in state.

  3) a modified result of the active event is passed to the nagios
     core and also passed to the external correlation engine. In this
     mode nagios sees a modified event that does not change the
     current status of the service. This allows the external
     correlation engine to determine if the status should change and
     submit a passive event to change the status and trigger
     notifications etc.

  4) (NOTE: This is not yet implemented.) the active event is passed
     to the external correlation engine and the event is ignored by
     nagios as though it had never been received. This leaves the
     decision entirely to the external correlation engine and is
     preferred to mode 3 as it doesn't cause a temporary update of the
     service data that may be overwritten by the correlation engine.

Once the data arrives at the external correlation engine you can
perform many different operations. Using a text based event stream
should allow the use of different engines including:

  SEC - Simple Event Correlator:
    http://www.estpak.ee/~risto/sec/

  Esper - a Java 1.5 based correlation engine:
    http://esper.codehaus.org/

With SEC I have successfully:

  Required 4 consecutive ok active polls before changing the state of
  the service back to ok. This replaces the flap detection by making
  sure that the service is in an ok state 'long enough' before
  asserting it is ok.

  Suppressed the warning state when 2 cron processes are running if it is
  during the time window when backups would run. This is an example of
  changing the thresholds for a check depending on what time it is.

  Changed plugin output from a less useful error message to a message
  that describes the actual cause of the error. This allows mapping of
  plugin output from one message to another.

Manifest
========

sec.cfg - configuration file for nagios that creates three services
          for interacting with the patches and the SEC (or other)
          engine. Needs editing for your site.

corr.sr - core correlation rules for use with SEC.  Needs editing
                for your site.

examples.sr - sample rules for different correlation techniques.

nagios_log.sr - rules for analyzing the nagios log file.

start_sec - init.d style file for controlling the sec process. Needs
            editing for your site.

nagios_sec_manual.txt - this file

4 patches described below: patches/neb.patch, patches/control.patch,
                           patches/module.patch, patches/buffer_slots.patch

Building with the patch
=======================

There are three required patches and one optional but recommended
patch that increases the number of slots in the circular ring buffer
that is used to handle the external command pipe. See:

  http://permalink.gmane.org/gmane.network.nagios.devel/3216

A permanent fix that allows setting the ring buffer size in nagios.cfg
has been committed to some branch of a future nagios which makes this
patch obsolete. See:

  http://permalink.gmane.org/gmane.network.nagios.devel/3239

for the proper fix.

Use the same build directory that you used to build and install your
current test nagios setup.  Save the 4 patch files. From the top of
the nagios build tree use "patch -p1 < patchfile". The patch files
are:

   neb.patch - adds a new Nagios Event Broker callback needed by the
               module
   control.patch - adds the control attributes for the module to the
                   service object
   module.patch - creates the NEB module.

   buffer_slots.patch - increases the circular ring buffer size. May
                        not be needed depending on version of Nagios.
                        See comments at the beginning of this section.

All the files should patch without issue. Once this is done follow the
normal nagios build/install instructions. Run:

  make clean

  make nagios

  make modules

Then install base/nagios where you want to install your nagios binary
or run 'make install-base'. There is no nagios install target for
modules, so copy module/ext_corr.o to some location by hand. I use
/usr/lib/nagios/modules/ext_corr.o, but /etc/nagios/modules/ext_corr.o
could also work. You will have to reference this path in your
nagios.cfg file.

Using the patch
===============

Modify nagios.cfg and add (all on a single line):

  broker_module=/usr/lib/nagios/modules/ext_corr.o --file /tmp/sec
      --tag module1 --control ops01;ZSecControl --output standard

The arguments to the ext_corr.o module are:

  --file <outputfile> - (required) the path to the file where the
           events are written. It must be a file and not a named pipe
           for anything except the smallest nagios installation as the
           buffer size for named pipes is too small to buffer
           sufficient data.

	   If you are not using a named pipe, the directory must be
	   writable by the nagios user so that the file can be rotated
	   and a new file created by nagios. For performance using a
	   RAM disk or memory based filesystem under Solaris may be a
	   win.

	   I suggest making this the first argument as it is used to
	   identify the ext_corr instance by default unless the tag is
	   defined. There is no default and the NEB module will exit
	   with an error if it is missing.

  --tag <tagname> - (optional) the tag is used to identify log
           messages in the nagios log file when using multiple
           instances. If not present the basename of the outputfile is
           used.

  --control <service> - (optional, but recommended) to force the module
    to close and reopen the outputfile (i.e. rotate the file) commands
    can be sent to the NEB module via this service. The format of the
    service is <hostname>;<service description>. You need to create a
    service of this name. A sample definition is included in sec.cfg.
    There is no default.

    Using an initial character that doesn't occur in your other
    services (Q or Z for example) minimizes the amount of processing
    needed to identify the control service. The sample nagios config
    (sec.cfg) and the sec rules file (corr.sr) have to be modified to
    insert the first (hostname) and second (service description)
    components in some rules.

  --output <standard|plain> - "standard" generates an event stream to
    the correlation engine with a 4 character prefix (three status
    characters and a space) before the PROCESS_SERVICE_CHECK_RESULT
    (see below for format). If the format is "plain" the output stream
    is only the Nagios PROCESS_SERVICE_CHECK_RESULT. The default is
    standard.

  
In nagios.cfg add:

  cfg_file=/etc/nagios/sec.cfg

using the sample sec.cfg file from the distribution. You will have to
edit the file and set the hostname and service description for host
ops01, service ZSecControl to match what you specified to the
--control argument.
 
Running Multiple Instances
--------------------------

If you copy ext_corr.o to a new file name you can run multiple
instances of it (to feed two correlators for example that look at
different subsets of events). You cannot create two instances from
the same object file name. You should also change the tagname for
each instance to be able to distinguish between the two in log
messages.

New service object properties
-----------------------------

The service object gets two new properties:

  ec_passive_action

  ec_active_action

they perform the same roles for passive and active actions. I will
discuss the active form only as the passive form is analogous.

A sample invocation is:

  define service{
        service_description     CronDaemonCheck
        host_name               test
        use                     generic-service
        check_command           check_local_procs!:1!1:!-C crond
        ec_active_action        critical,unknown:warning
  }

defines the following operations:

  for "ok" events - operate in mode 1. This is the default since none
                    of the values specify "ok" or "o".
  for "critical" or "unknown" events - operate in mode 2. Send the
                    unmodified event to Nagios and a copy to SEC
  for "warning" events - operate in mode 3. Send a modified copy of
                    the event to Nagios and a copy of the original
                    event to SEC.

The format is '<mode 2 status list>:<mode 3 status list>'. A status
list is a comma separated list of the status names:

   o or ok
   w or warning
   c or critical
   u or unknown

Statuses that occur before the first ':' are handled in mode 2
(e.g. critical and unknown above). Statuses that occur after the first
':' are handled in mode 3 (e.g. warning above). Statuses that are not
specified are handled in mode 1.
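
To make the format concrete, here is a small Python sketch that turns
a status list string into a per-status mode table. It is only an
illustration of the rules described above; the real parsing is done in
C by the patched nagios, and the names parse_action and STATUS_NAMES
are made up for this example.

```python
# Illustrative only: the real parsing happens in C inside the patched nagios.
STATUS_NAMES = {
    "o": "ok", "ok": "ok",
    "w": "warning", "warning": "warning",
    "c": "critical", "critical": "critical",
    "u": "unknown", "unknown": "unknown",
}


def parse_action(spec):
    """Map each status to its processing mode (1, 2 or 3)."""
    # Every status starts in mode 1 (not sent to the correlator).
    modes = {name: 1 for name in ("ok", "warning", "critical", "unknown")}
    # Text before the first ':' is the mode 2 list, text after it the mode 3 list.
    mode2_list, _, mode3_list = spec.partition(":")
    for mode, status_list in ((2, mode2_list), (3, mode3_list)):
        for token in status_list.split(","):
            token = token.strip().lower()
            if token:
                modes[STATUS_NAMES[token]] = mode
    return modes


print(parse_action("critical,unknown:warning"))
# {'ok': 1, 'warning': 3, 'critical': 2, 'unknown': 2}
```

For the CronDaemonCheck sample this gives mode 1 for ok, mode 2 for
critical and unknown, and mode 3 for warning.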

In the examples below modes are discussed in terms of status of an
event for a service. E.G.

   the ok status of SERVICE should be handled with mode 3.

Since the service object definition is unique for a
host/service_description pair, you can set the operating mode
uniquely for a (status, service, host) triplet.

Choosing what mode to use for services
--------------------------------------

As mentioned above the plugin currently operates in three modes (mode
4 is not yet implemented).  This section discusses which mode to put a
service/status in.

Mode 1
~~~~~~

If the event doesn't need to be passed to the correlation engine
choose mode 1. These events are isolated in that they don't have an
effect on other services nor do other services have an effect on them.

Mode 2
~~~~~~

If the event is going to be used as input to the correlator, but is
not going to be overridden (set) by the correlator use mode 2.

Also if you don't mind or want a notification when the correlator
determines the proper state, you should use mode 2. E.G.

   service is ok
   critical event arrives in mode 2
   nagios sends alert
   correlator determines it is ok
   correlator submits an ok event
   nagios notifies that problem is clear

A variant of the above that may be useful is used when the
max_check_attempts for the service is set greater than 1
(e.g. max_check_attempts 2).

   service is ok
   critical event arrives in mode 2
   nagios logs it and sets the count to 1
   correlator determines it is ok
   correlator submits an ok event
   nagios logs the state change and sets the count back to 0.


Mode 3
~~~~~~

If the correlator will set the status of the event, or you don't want
notification/logging of events (although state stalking can be used
for debugging) use mode 3. These are services that depend on others to
determine if a problem is occurring. Also this is used for
implementing different problem thresholds in different time periods.

Plugin output format
====================

The plugin has two output modes:

   standard
and
   plain

Standard Output Format
~~~~~~~~~~~~~~~~~~~~~~

This format has the following form:

<prefix><space><nagios PROCESS_SERVICE_CHECK_RESULT>

An example is:

  32a [1168276673] PROCESS_SERVICE_CHECK_RESULT;cook02.example.org;DeadLetterCheck;3;Remote command execution failed: ssh: connect to host 192.168.0.0 port 22: No route to host

The prefix consists of three characters

 character 1 - current service status

   0 - current status of service is ok
   1 - current status of service is warning
   2 - current status of service is critical
   3 - current status of service is unknown

 character 2 - module mode

   (1 is never seen since mode 1 events are not passed to the
      correlation file)
   2 - pass event to nagios (mode 2)
   3 - pass modified event to nagios (mode 3)
   4 - remove event from nagios (mode 4 not yet implemented)

 character 3 - active or passive event

   a - active event
   p - passive event

Then a space character separates the prefix from the nagios passive
service check result. The passive check result can be passed into
nagios to record/set the status of the service.
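
If you want to consume the standard format from something other than
SEC, the following Python sketch splits a line into its parts. It is
my reading of the format described above, not code from the module,
and the names LINE_RE and parse_event are invented for the example.

```python
import re

LINE_RE = re.compile(
    r"^([0-3])([234])([ap]) "                    # 3 character prefix, then a space
    r"(\[(\d+)\] PROCESS_SERVICE_CHECK_RESULT;"  # timestamp and command name
    r"([^;]+);([^;]+);([0-3]);(.*))$"            # host;service;status;plugin output
)
STATUS = ("ok", "warning", "critical", "unknown")


def parse_event(line):
    """Return the fields of one standard format line, or None if it doesn't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "current_status": STATUS[int(m.group(1))],  # state nagios holds now
        "mode": int(m.group(2)),
        "source": "active" if m.group(3) == "a" else "passive",
        "command": m.group(4),                      # can be written back to nagios
        "timestamp": int(m.group(5)),
        "host": m.group(6),
        "service": m.group(7),
        "new_status": STATUS[int(m.group(8))],      # state reported by the check
        "output": m.group(9),
    }
```

For the example line above this yields current_status "unknown",
mode 2, source "active", host cook02.example.org and service
DeadLetterCheck.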

Further information on the PROCESS_SERVICE_CHECK_RESULT passive
command can be obtained from:

   http://nagios.org/developerinfo/externalcommands/commandinfo.php?print=true&command_id=114

Dedicated nagios services
=========================

There are two parts of the dedicated services that allow nagios to
monitor sec and vice versa. These two files are linked, so changing
the service names in one requires corresponding changes in the other.

Nagios Config
---------------
A nagios configuration file to implement the 3 or so additional
services for use with the plugin.

     * a service to control the plugin (rotate log files primarily)
     * a service for nagios to probe the SEC process
     * a service to see if the external SEC process is running.

You will have to change the host_name for each service to be the name
of your nagios host. Also you will want to change the contact_groups
for each service. Also if you don't have a check_dummy check command
defined, add:

  # 'check_dummy' command definition
  define command{
        command_name    check_dummy
        command_line    $USER1$/check_dummy $ARG1$
        }

to one of your config files. The check_dummy plugin is a standard part
of the nagios plugins and should be installed.

Plugin Control (service ZSecControl)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The primary use of this service is to rotate the log file (see about
automatic rotation below).

If a manual ok event is sent and the text of the event is "rotate"
(without the quotes), the file that connects Nagios to SEC will be
closed by the NEB module and reopened. So you can move the file aside,
submit the rotate request and a new file will be created. SEC will
automatically finish with the old file and move on to the new file
when it is created. It usually happens within 10 seconds and always
within 60 seconds. The name of the control host and service description
are passed using the --control option in nagios.cfg.
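
A rotate request can also be submitted by hand. The Python sketch
below builds the passive check result and writes it to the nagios
external command file. The function names are made up for the example,
and the command file path is whatever command_file is set to in your
nagios.cfg (an assumption about your setup, not something the module
dictates).

```python
import time


def rotate_command(host, service, now=None):
    """Build the external command that asks the module to reopen its file."""
    timestamp = int(time.time() if now is None else now)
    # Status 0 (ok) with the plugin output "rotate" is the trigger described above.
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;0;rotate" % (
        timestamp, host, service)


def send_command(command_file, line):
    """Write one command line to the nagios external command file."""
    # command_file is a hypothetical path; use the value from your nagios.cfg.
    with open(command_file, "w") as pipe:
        pipe.write(line + "\n")
```

Move the event file aside first, then call something like
send_command(path, rotate_command("ops01", "ZSecControl")).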

Probe the SEC process (service SecReport)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a service that runs a check_dummy check every minute. The
result is forwarded to the SecAliveCheck service by a SEC rule.

It can also be used by SEC to report things to the user about ongoing
operations or errors. It has state stalking turned on so that reports
from SEC will be recorded in the logs.

Is the external SEC process running (service SecAliveCheck)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a service that doesn't run an active check, but has a 2 minute
freshness threshold. If the loop from nagios -> sec -> nagios is not
completed, it will execute the check_dummy check because the result is
stale. If this doesn't update the service, a notification should be
sent. (Note: this check could possibly be written better. Also I have
never seen the notification because my test install runs with
notifications disabled, so it is quite possible I screwed up the check
definition in some way.)

Note that the freshness check will commonly trigger just after a
nagios startup as the checks are scheduled out in time and it may be
more than 2 minutes before the probe runs. Once the first probe runs,
it comes to steady state and should work ok provided all of the SEC
writes to the nagios command pipe are completed without error (see
below for EAGAIN failure from SEC).

SEC config
----------

A basic required SEC ruleset with the minimum set of rules is supplied
in the corr.sr file. This implements:

   Setting of variables upon SEC startup. See the comments in the file
   for configuration info.

   Nagios log messages are ignored by the rules in this file
   (this is used if sec is watching both the nagios log as well as the
   event stream). This is the mode that start_sec operates in.

   Mapping the SecReport check to SecAliveCheck.

   Automatic File rotation of the file sending events from nagios to
      SEC.

   Automatic forwarding of any mode 3 events that haven't been handled

Search for CHANGEME in the file for info on what has to be customized.

Setting of variables upon SEC startup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sec permits the definition of variables that are used in action
statements. This includes email addresses, file names etc. See the
comment at the top of corr.sr for further info. You will need to
customize these for your site.


Mapping the SecReport check to SecAliveCheck
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following rule detects the periodic SecReport service messages
from the nagios sec.cfg file. The arrival of these messages triggers
the submission of a SecAliveCheck to nagios on the host defined in %H.

  # default core rules for SEC/Nagios integration
  # Map the poll from SecReport to SecAliveCheck to prove that sec is running
  # and processing events.
  type = single
  desc = Handle SEC Keepalive (core)
  ptype = regexp
  rem = CHANGEME replace ops01 with the name of your nagios host
  pattern = ops01;SecReport.*Testing SEC.
  action = write %nagiosCmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%H;SecAliveCheck;0;OK: SEC is forwarding events)


Automatic File Rotation
~~~~~~~~~~~~~~~~~~~~~~~

The following SEC rules rotate the file that communicates events from
Nagios to SEC and detect/notify if it takes more than 60 seconds.

  # rule 1: detect the start of a new file.
  type = single
  desc = if file rotation occurs, delete the time_rotation context (core)
  ptype = substr
  rem = CHANGEME replace ops01 and ZSecControl with the name
  rem = of your nagios host and command service (specified by
  rem = --control passed to the nagios module) and replace
  rem = /tmp/sampler with the argument to the --file option.
  pattern = PROCESS_SERVICE_CHECK_RESULT;ops01;ZSecControl;0;Output file /tmp/sampler reopened
  context = time_rotation
  action = delete time_rotation

  # rule 2: generate the rotation every morning at 6AM.
  type = calendar
  time = 00 06 * * *
  desc = rotate /tmp/sampler within 60 seconds once a day (core)
  rem = move file, start context that will detect if rotation
  rem = not done in 60 seconds, write event to cause module to rotate file
  action = shellcmd /bin/mv %eventStreamFile %{eventStreamFile}.old; \
	   create time_rotation 60 (pipe '%s' /bin/mail -s "%eventStreamFile file rotation failure" %notify ); \
	   write %nagiosCmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%H;%S;0;rotate|)

Rule 2 does the work. Every day at 6AM it moves the file specified in
the variable %eventStreamFile to .old. It then sets up a context
called time_rotation for 60 seconds. If the context expires normally,
it will report the file rotation failure using mail.

Rule 1 detects the first line in a rotated file. It then deletes the
time_rotation context which does not trigger the email report
mechanism.

Note that the email of a failure message could be changed to report a
critical event using the SecReport service as well.

Feel free to change the rotation time and frequency. I have had it
rotating once a minute with only minor increase in load, but once an
hour is probably the max this is needed.

Automatic Forwarding of Mode 3 Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since mode three events have not taken effect in nagios (they are
forwarded to the core maintaining the current state, not the new state
determined by the active event), the default rule in the SEC file
forces these state changing events to be processed by nagios. If they
reach this final rule the assumption is that they were not part of
any other rule that suppressed or otherwise handled the event and they
should be processed by nagios rather than discarded.

  type = single
  desc = Forward Active Mode 3 Events (core)
  ptype = regexp
  pattern = ^(.3a )(.*)$
  action = write %nagiosCmd $2
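
For anyone feeding the stream to a home grown correlator instead of
SEC, the same catch-all can be sketched in Python. This only mirrors
the regular expression of the rule above; the name forward_unhandled
is invented for the example.

```python
import re

# Same pattern as the SEC rule: any current status, mode 3, active event.
MODE3_ACTIVE = re.compile(r"^(.3a )(.*)$")


def forward_unhandled(line):
    """Return the command to write to nagios, or None if the rule doesn't apply."""
    m = MODE3_ACTIVE.match(line)
    # Group 2 is the event with the prefix stripped, i.e. $2 in the SEC rule.
    return m.group(2) if m else None
```

Like the SEC rule, it leaves mode 2 events and passive events alone.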


Adding new rules
~~~~~~~~~~~~~~~~

You can add new rules anywhere after the comment

  # Start rules that apply to the event stream

Some sample rules are located in examples.sr.


SEC use cases with examples
===========================

The package comes with some sample rules in the examples.sr file.
In this document the lines are numbered using (1). Do not use the
numbers in actual implementation. The proper form is shown in the
examples.sr file.


Multiple Thresholds in Time
---------------------------

Suppress the warning state when 2 cron processes are running if it is
during the time window when backups would run. To do this, set the
warning state for the CronDaemonCheck to use mode 3 (since more than
one process is a warning condition). Then set up 2 sec rules to define
the backups_running time period. The first turns the backups_running
context/flag on:

  # see section 5.3.4 in
  #   http://sixshooter.v6.thrupoint.net/SEC-examples/article-part2.html
  # to understand the reason for the calendar spec.
  type = calendar
  time = * 0-6 * * *
  desc = start backups_running context
  context = [! backups_running]
  action = create backups_running

by running every minute between midnight and 6 am.
A second calendar rule turns the context off:

  type = calendar
  time = * 7-23 * * *
  desc = stop backups_running context
  context = [backups_running]
  action = delete backups_running

running every minute between 7AM and 11PM.

Then define a rule to check the warning event generated by Nagios to
see if:

   * the number of cron processes is 2
   * the event occurred during the backups_running time period.

If so, it submits a modified ok event that clears any problem and
reports that it was cleared because backups are running.

  # Capture the mode 3 event CronDaemonCheck on host concord
  # If 2 cron processes are running submit the result as ok
  # between 6PM and 2AM. Normally 2 processes is a warning.
  #
  type=single
  desc = concord can have 2 cron processes between 6PM and 2 am for backups
  ptype=regexp
  pattern = ^... (\[[0-9]+\] PROCESS_SERVICE_CHECK_RESULT\;concord\;CronDaemonChec\
  k);1;(PROCS WARNING: ([0-9]+) processes with command name .crond.*)
  rem = $1 is the part of the event before the state info.
  rem = $2 is the part of the event after the state info (plugin output)
  rem = $3 is the number of running processes
  rem = This rule applies only if backups_running context exists and
  rem =  there are exactly two running processes.
  context = backups_running && =($3 == 2)
  rem = if it applies, write the event with status 0 (ok) and a note
  rem = before the plugin output indicating that it is a correlated
  rem = output
  action = write %nagiosCmd ($1;0;[backups running] $2)
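
To see what the rule does to an event, here is the same match and
rewrite as a Python sketch. The backups_running flag stands in for the
SEC context, and the function name and the sample plugin output in the
usage note are assumptions made for the example.

```python
import re

# Same pattern as the SEC rule, written on one line.
CRON_WARNING = re.compile(
    r"^... (\[[0-9]+\] PROCESS_SERVICE_CHECK_RESULT;concord;CronDaemonCheck);1;"
    r"(PROCS WARNING: ([0-9]+) processes with command name .crond.*)"
)


def rewrite_cron_warning(line, backups_running):
    """Return the ok event to submit, or None if the warning should stand."""
    m = CRON_WARNING.match(line)
    # Mirrors: context = backups_running && =($3 == 2)
    if m is None or not backups_running or int(m.group(3)) != 2:
        return None
    # Mirrors: action = write %nagiosCmd ($1;0;[backups running] $2)
    return "%s;0;[backups running] %s" % (m.group(1), m.group(2))
```

A warning reporting 2 crond processes while backups_running is set
comes back with status 0 and "[backups running]" in front of the
plugin output. Anything else returns None and falls through to the
later rules.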


Variant of Multiple Thresholds on Other Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A variant on this can be done by using another SEC rule to trigger the
context. E.G. if you have a web application that operates slowly
(takes 20 seconds to respond vs 5) for the first 5 minutes after the
web server comes up you could:

  create a "webServerStarted" context in SEC with a 5 minute lifetime
     based on the clear status of the WebServerCheck.

  create a rule that looks for the critical event coming from
     webAppCheck, verifies that it is within 20 seconds and suppresses
     (or clears) the event if the webServerStarted context exists.

In this case the ok state of the WebServerCheck service would be
handled using mode 2, and the critical state of the webAppCheck
service would be handled using mode 3.

Flap detection replacement/Count ok polls before reset
------------------------------------------------------

The following 2 SEC rules make the service reset to ok only after 4
consecutive ok results have been received. The service ok state must
be handled in mode 3 for this to work properly.

  # rule 1
  type=single
  rem = use takenext to have the failure asserted by the default rule
  continue=takenext
  desc = detect non-ok state for rtables_split_check on host $2
  ptype=regexp
  rem = look for mode 3 active events.
  pattern = ^.3a (\[[0-9]*\] PROCESS_SERVICE_CHECK_RESULT\;([^;]+)\;rtables_split_check\;[123]\;.*)
  rem = reset the SingleWithThreshold if a non-ok check occurs
  action = reset +1 require 4 consecutive ok's for rtables_split_check on host $2

  #rule 2
  type = SingleWithThreshold
  desc = require 4 consecutive ok's for rtables_split_check on host $2
  ptype = regexp
  rem = match only when current service state is not ok
  pattern = ^[123]3a (\[[0-9]*\] PROCESS_SERVICE_CHECK_RESULT\;([^;]+)\;rtables_split_check\;0\;.*)
  action = write %nagiosCmd $1
  rem = window = 3 minute interval/cycle * 4 cycles * 60 sec/minute + 30 sec process time
  window=750
  thresh=4

The first rule detects non-ok states of the service on a host and
resets any active SingleWithThreshold operation that may have been
counting ok states for that service and host.

The second rule counts ok states of the service and host. When 4
are received within the 750 second window, it passes the ok through
to nagios.

If a failure message comes through, it triggers rule 1, and is passed
on to further rules in the file. Because it is a failure message, it
does not trigger rule 2, but it does trigger the catchall rule that
passes all mode three events to nagios thus changing the state (from
say warning to critical or vice versa).
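
The interplay of the two rules can be modelled with a few lines of
Python. This is a simplified model for illustration, not a
reimplementation of SEC: the window here starts at the first ok of a
run and restarts if it is exceeded, whereas SEC slides its window. It
also ignores the test on the current service state in rule 2. The
class name is made up.

```python
class ConsecutiveOk:
    """Simplified model of the two rules; not a reimplementation of SEC."""

    def __init__(self, thresh=4, window=750):
        self.thresh = thresh
        self.window = window
        self.start = None  # time of the first ok in the current run
        self.count = 0

    def event(self, now, status):
        """Feed one mode 3 result; return True when the ok should be forwarded."""
        if status != 0:
            # Rule 1: any non-ok result throws away the run of oks.
            self.start, self.count = None, 0
            return False
        if self.start is None or now - self.start > self.window:
            # Start (or restart) the counting window at this ok.
            self.start, self.count = now, 0
        self.count += 1
        if self.count >= self.thresh:
            # Rule 2 fires: forward the ok and start over.
            self.start, self.count = None, 0
            return True
        return False
```

With a 3 minute check interval, a sequence of ok, ok, critical, ok,
ok, ok, ok only forwards the last ok, since the critical result resets
the count.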

Replace plugin output with a new message
----------------------------------------

You can also replace plugin output with a new message. This rule:

  type = single
  desc = shorten and diagnose ssh failure with row of @@@@@@@@
  ptype = regexp
  pattern = ^(... )(.*);Remote command execution failed: \@+(.*)$
  rem = $1 is the event prefix text.
  rem = $2 is the nagios passive command except the plugin output
  rem = $3 is the text after the row of @'s
  action = write %nagiosCmd ($2;ssh key failure, likely remote host identification has changed. $3.)

detects the failing mode of a row of @ signs and replaces it with more
useful text.

Reschedule checks on failure
----------------------------

We use check_by_ssh extensively. When the ssh service fails, the ssh
checks clump together. This increases the client host's load as well
as triggering limits on the maximum number of ssh connections that can
be in the opening state. The following SEC rule:

  type = single
  desc = reschedule ssh exchange failure some random 23-42 seconds in future
  rem = Have only one pending reschedule per host in any minute.
  ptype = regexp
  pattern = PROCESS_SERVICE_CHECK_RESULT\;([^;]+)\;([^;]+)\;[0123]\;Remote command\
   execution failed: ssh_exchange_identification: Connection closed by remote host
  rem = $1 = host name $2 = service description
  context = ! resched_$1_in_progress
  action = eval %d (int(rand(20))+23+%u;); create resched_$1_in_progress 60 ; \
	   write %nagiosCmd ([%u] SCHEDULE_FORCED_SVC_CHECK;$1;$2;%d)

detects this failure mode and randomly reschedules one service 23-42
seconds into the future. It will only reschedule one service per
minute (per host) to try to reduce the log jam.
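
The arithmetic in the eval is easy to get wrong, so here it is as a
Python sketch. It assumes Perl's rand(20) returns a value from 0 up to
but not including 20, which int() truncates to 0-19, giving an offset
of 23 to 42 seconds. The function names are invented for the example.

```python
import random


def reschedule_time(now, rng=random):
    """Mirror of the rule's eval: int(rand(20)) + 23 + now."""
    # rng.random() is in [0, 1), so the offset is an integer from 23 to 42.
    return int(rng.random() * 20) + 23 + now


def reschedule_command(now, host, service, rng=random):
    """Build the SCHEDULE_FORCED_SVC_CHECK line the rule writes to nagios."""
    when = reschedule_time(now, rng)
    return "[%d] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%d" % (now, host, service, when)
```

Widening the spread is a matter of changing the 20. Changing the 23
moves the earliest time the forced check can run.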

Nagios Automation (future)
--------------------------

When running a 24x7 operation getting woken up in the middle of the
night is annoying. Sec can automate some steps to provide additional
sleep.

Normally if a problem occurs, you can fix a problem but it can still
persist as the problem may have caused backups in other places in the
system.

If you expect the problem to clear in the next hour, you would
normally schedule downtime for 1 hour. However this means that you
will also get the OK notification sometime during that hour. To
prevent this you could:

  1) schedule the downtime
  2) suppress notifications (to prevent the ok from being paged)
  3) submit an ok/clear manually
  4) re-enable notifications

now the regularly scheduled check will cause the service to go back
into critical state since there is still a problem that will clear
itself. However this critical change occurs during the hour of
downtime, so no notification is sent. Because the change occurred
during the downtime, the ok notification is also suppressed allowing
an uninterrupted (we hope) night's sleep if the problem clears before
the downtime window expires. If the downtime window expires, then the
event is paged.

However performing these four steps is a pain especially at 4 in the
morning. If the plugin registers for a scheduled downtime notice, it
can look at the message that comes through and perform the 4 steps
automatically if the message starts with a given keyword.

Drawbacks
=========

Note that for SEC correlated services, there is no SOFT state if you use
SEC for counting.


Open Issues
===========

These are the open issues. If anybody has any feedback or ideas I would
love to hear them.

Active check failure
--------------------

If the active check fails (command doesn't exist etc) no event is sent
to the event stream. Is this desirable? You can see the failure by
monitoring the nagios log file, but should there be an event of some
form?

Correlation info missing from config cgi
----------------------------------------

The external correlation information isn't displayed in the
configuration cgi.

EAGAIN failure from SEC when writing to nagios pipe/fifo
--------------------------------------------------------

From 1 to 5% of the time on my system I am getting errors in the sec
log where it can't write to the nagios command pipe/fifo. You can see
the total compared to the failing by running:

grep Error sec_log.txt | wc -l ; grep 'Writing event' sec_log.txt | wc -l
1236
100932

This is with the ring buffer size set to 20480, but its frequency
didn't change much from when the setting was 10240. If anybody has any
bright ideas on solving this I am all ears.  I will be experimenting
with increasing the buffer settings and trying to determine under what
conditions it occurs. In nagios 3, this should be better as you can
have multiple command pipes and even command files with no size limit.
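
For reference, the two counts from the grep above work out as follows.
This assumes every 'Writing event' line in the sec log is one
attempted write, which I have not verified. The function name is made
up.

```python
def failure_rate(errors, writes):
    """Percentage of attempted writes that failed."""
    return 100.0 * errors / writes


print("%.2f%%" % failure_rate(1236, 100932))
# 1.22%
```

So this particular log sits at the low end of the 1 to 5% range.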

Definitions
===========

Event Stream - the results from active plugin execution, or passive
               check results for a service.