[WBEL-users] Still killing me softly with mon
Ed Morrison
emorrison@ncen.org
Tue, 14 Sep 2004 10:34:27 -0700
My mon problems never seem to get solved. After a vacation I am once
again tackling them. I have a Red Hat 8 box running mon-0.99.2 (the
last stable version). I want to retire this box and use WBEL for
monitoring instead. To this end I have WBEL installed along with
mon-0.99.2. Mon on WBEL works great at sending me alerts that the
services I'm monitoring are down. Unfortunately, they are not down.
Below is a tail of my log file. I thought iptables might be the
problem, so I stopped it, with no change. My RH8 box is on the same
network as the WBEL box, behind the same firewall, with the same ports
opened to it, and it works just fine. I even reassigned the IPs of both
boxes, giving the WBEL box the RH8 box's IP, and that did nothing to
help either.
In addition, I set up a second WBEL box on the same network as the
services I am monitoring, to see whether I was missing anything with my
firewall or iptables. That WBEL box does exactly the same thing as
the WBEL box on my home network.
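One way to separate mon itself from the network is to run the monitor by hand on the WBEL box and look at what it returns. The sketch below is only an illustration: `run_monitor` is a hypothetical helper, and it assumes the usual mon convention that a monitor receives the hosts as trailing arguments, exits non-zero on failure, and prints a summary of the failed hosts on its first output line.

```python
import subprocess


def run_monitor(command, hosts, timeout=60):
    """Run a monitor command with the hosts appended.

    Returns (exit_code, summary), where summary is the first line the
    command printed on stdout (empty string if it printed nothing).
    """
    result = subprocess.run(
        list(command) + list(hosts),  # hosts go at the end of the command line
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    lines = result.stdout.splitlines()
    summary = lines[0].strip() if lines else ""
    return result.returncode, summary
```

With the mondir from the mon.cf in this post, the call would presumably be `run_monitor(["/usr/lib/mon/mon.d/fping.monitor"], ["www", "www2", "yubamail", "ncen"])`. A non-zero exit code there, with mon out of the picture, would point at the monitor script or its helper program rather than at mon's scheduling or alerting.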
Below is a tail of /var/log/messages and a copy of my mon.cf file.
Tail excerpt:
Sep 14 10:21:02 whitebox mon[7593]: mon server started
Sep 14 10:21:07 whitebox mon[7593]: failure for wwwservers ping 1095182467 ncen yubamail
Sep 14 10:21:07 whitebox mon[7593]: calling alert mail.alert for wwwservers/ping (/usr/lib/mon/alert.d/mail.alert,changed to protect the innocent) ncen yubamail
Sep 14 10:21:07 whitebox mon[7593]: calling alert mail.alert for wwwservers/ping (/usr/lib/mon/alert.d/mail.alert,changed to protect the innocent) ncen yubamail
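To make the failure entry easier to read, here is a minimal Python sketch that splits it into its parts, taking the entry as a single line. The field layout (group, service, a number, then the monitor's summary) is inferred from this excerpt only, and treating the number as a Unix timestamp is an assumption, although 1095182467 does correspond to 10:21:07 -0700 on 14 Sep 2004. `parse_failure` is a hypothetical helper, not part of mon.

```python
import re
from datetime import datetime, timezone

# Layout assumed from the excerpt: "failure for <group> <service> <epoch> <summary>"
FAILURE_RE = re.compile(
    r"mon\[\d+\]: failure for (?P<group>\S+) (?P<service>\S+) "
    r"(?P<epoch>\d+) (?P<summary>.*)$"
)


def parse_failure(line):
    """Return a dict describing a mon 'failure for' syslog line, or None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    epoch = int(match.group("epoch"))
    return {
        "group": match.group("group"),
        "service": match.group("service"),
        "time_utc": datetime.fromtimestamp(epoch, tz=timezone.utc),
        # The summary is whatever the monitor printed; here, host names.
        "failed_hosts": match.group("summary").split(),
    }
```

Read this way, the entry appears to name only ncen and yubamail as failing, while www and www2 from the same wwwservers group are not listed.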
mon.cf:
# Example "mon.cf" configuration for "mon".
#
# $Id: example.cf 1.1 Sat, 26 Aug 2000 15:22:34 -0400 trockij $
#
#
# This works with 0.38pre8
#
#
# global options
#
cfbasedir = /usr/lib/mon/etc
alertdir = /usr/lib/mon/alert.d
mondir = /usr/lib/mon/mon.d
maxprocs = 20
histlength = 100
randstart = 60s
#
# authentication types:
# getpwnam standard Unix passwd, NOT for shadow passwords
# shadow Unix shadow passwords (not implemented)
# userfile "mon" user file
#
authtype = getpwnam
#
# NB: hostgroup and watch entries are terminated with a blank line (or
# end of file). Don't forget the blank lines between them or you lose.
#
#
# group definitions (hostnames or IP addresses)
#
hostgroup servers-nccc nccc_01 nccc_02 sql_1 sql_2
# hostgroup serversbd2 dns-yp2 foo2 bar2 ola3
hostgroup mailhost yubamail
hostgroup routers admin ctec1 ctec2
hostgroup switch admin-switch
# hostgroup workstations blue yellow red green cornflower violet
# hostgroup netapps f330 f540
hostgroup wwwservers www www2 yubamail ncen
# hostgroup printers hp5si hp5c hp750c
# hostgroup new nntp
hostgroup ftp ftp
#
# For the servers in building 1, monitor ping and telnet
# BOFH is on weekend call :)
#
watch servers-nccc
service ping
description ping servers in bd1
interval 5m
monitor fping.monitor
period wd {Mon-Fri} hr {7am-10pm}
alert mail.alert changed to protect the innocent
alertevery 1h
period NOALERTEVERY: wd {Mon-Fri} hr {7am-10pm}
alert mail.alert changed to protect the innocent
period wd {Sat-Sun}
alert mail.alert changed to protect the innocent
alert mail.alert changed to protect the innocent
# service telnet
# description telnet to servers in bd1
# interval 10m
# monitor telnet.monitor
# depend serversbd1:ping
# period wd {Mon-Fri} hr {7am-10pm}
# alertevery 1h
# alertafter 2 30m
# alert mail.alert emorrison@ncen.org
# alert page.alert changed to protect the innocent
watch mailhost
service fping
period wd {Mon-Fri} hr {7am-10pm}
alert mail.alert changed to protect the innocent
alertevery 1h
# service telnet
# interval 10m
# monitor telnet.monitor
# period wd {Mon-Fri} hr {7am-10pm}
# alertevery 1h
# alertafter 2 30m
# alert mail.alert emorrison@ncen.org
# alert page.alert changed to protect the innocent
service smtp
interval 10m
monitor smtp.monitor
period wd {Mon-Fri} hr {7am-10pm}
alertevery 1h
alertafter 2 30m
alert mail.alert changed to protect the innocent
service imap
interval 10m
monitor imap.monitor
period wd {Mon-Fri} hr {7am-10pm}
alertevery 1h
alertafter 2 30m
alert mail.alert changed to protect the innocent
service pop
interval 10m
monitor pop3.monitor
period wd {Mon-Fri} hr {7am-10pm}
alertevery 1h
alertafter 2 30m
alert mail.alert changed to protect the innocent
watch wwwservers
service ping
interval 2m
monitor fping.monitor
allow_empty_group
period wd {Sun-Sat}
alert mail.alert changed to protect the innocent
alert mail.alert changed to protect the innocent
alertevery 45m
service http
interval 4m
monitor http.monitor
allow_empty_group
period wd {Sun-Sat}
alert netpage.alert edpage
upalert mail.alert -S "web server is back up" mis
alertevery 45m
# service telnet
# monitor telnet.monitor
# allow_empty_group
# period wd {Mon-Fri} hr {7am-10pm}
# alertevery 1h
# alertafter 2 30m
# alert mail.alert mis@domain.com
# alert page.alert mis-pagers@domain.com
#
# If the routers aren't pingable, send a page using
# a phone line and the IXO protocol, which doesn't
# rely on the network. Failure of a router is pretty serious,
# so check every two minutes.
#
# Send out one page every 45 minutes, but log the failure
# to a file every time.
#
watch routers
service ping
description routers which connect bd1 and bd2
interval 1m
monitor fping.monitor
period wd {Sun-Sat}
alert mail.alert changed to protect the innocent
alert mail.alert changed to protect the innocent
alertevery 45m
# period LOGFILE: wd {Sun-Sat}
# alert file.alert -d /usr/lib/mon/log.d routers.log
#
# If mon cannot ping one of the hubs, users will be calling soon
#
watch switch
service ping
interval 1m
monitor fping.monitor
period wd {Sun-Sat}
alert mail.alert changed to protect the innocent
alert mail.alert changed to protect the innocent
alertevery 45m
#
# Monitor free disk space on the NFS servers
#
# When space gets below 5 megs, send mail, and delete
# the oldest nightly snapshots.
#
# monitors that terminate with ";;" are not executed with the
# host group appended to the command line
#
#watch netapps
# service freespace
# interval 15m
# monitor freespace.monitor /f330:5000 /f540:5000 ;;
# period wd {Sun-Sat}
# alert mail.alert mis@domain.com
# alert delete.snapshot
# alertevery 1h
#
# workstations
#
#watch workstations
# service ping
# interval 5m
# monitor fping.monitor
# period wd {Sun-Sat}
# alert mail.alert mis@domain.com
# alertevery 1h
#
# news server
#
#watch news
# service ping
# interval 5m
# monitor fping.monitor
# period wd {Sun-Sat}
# alert mail.alert mis@domain.com
# alertevery 1h
# service nntp
# interval 5m
# monitor nntp.monitor
# period wd {Sun-Sat}
# alert mail.alert mis@domain.com
# alertevery 1h
#
# FTP server
#
watch ftp
service ftp
interval 5m
monitor ftp.monitor
period wd {Sun-Sat}
alert mail.alert changed to protect the innocent
alertevery 1h
#
# dial-in terminal server
#
#watch dialin
# service 555-1212
# interval 60m
# monitor dialin.monitor.wrap -n 555-1212 -t 80 ;;
# period wd {Sun-Sat}
# alert mail.alert mis@domain.com
# upalert mail.alert mis@domain.com
# alertevery 8h
# service 555-1213
# interval 33m
# monitor dialin.monitor.wrap -n 555-1213 -t 80 ;;
# period wd {Sun-Sat}
# alert mail.alert mis@domain.com
# upalert mail.alert mis@domain.com
# alertevery 8h
If anyone can help me with this I would appreciate it.
Thank you,
Ed