Big Brother - Help
A Web-based Systems and Network Monitoring and Notification System
Order of severity
Serious Trouble
No report lately
May need attention
All is well
All connections are checked every 5 minutes
Pager Codes
The administrator will be notified when conditions merit. The numeric message
is formatted as follows: [3 DIGIT CODE] [IP-ADDRESS]
- 100 - Disk Error. Disk is over 95% full...
- 200 - CPU Error. CPU load average is unacceptably high.
- 300 - Process Error. An important process has died.
- 400 - Message file contains a serious error.
- 500 - Network error, can't connect to that IP address.
- 600 - Web server HTTP error - server is down.
- 7-- - Generic server error - 7 + server port number, e.g. 721 = ftp down
- 800 - DNS server on that machine is down
- 911 - User Page. Message is phone number to call back.
- 999 - The host reporting an error could not be found in the etc/bb-hosts file.
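As a rough illustration of how a numeric page could be read back, the sketch below splits a message into its code and the rest of the message, then looks the code up in the list above. It assumes the two parts are separated by a space; the names PAGER_CODES and decode_page are made up for this example and are not part of Big Brother.

```python
# Meanings copied from the pager code list above.
PAGER_CODES = {
    "100": "Disk Error",
    "200": "CPU Error",
    "300": "Process Error",
    "400": "Message file contains a serious error",
    "500": "Network error",
    "600": "Web server HTTP error",
    "800": "DNS server is down",
    "911": "User Page",
    "999": "Host not found in etc/bb-hosts",
}


def decode_page(message):
    """Split '[3 DIGIT CODE] [IP-ADDRESS]' and describe the code.

    Assumes a single space separates the code from the rest (hypothetical).
    """
    code, _, rest = message.strip().partition(" ")
    if len(code) != 3 or not code.isdigit():
        raise ValueError("expected a 3 digit code, got %r" % code)
    if code in PAGER_CODES:
        meaning = PAGER_CODES[code]
    elif code.startswith("7"):
        # 7-- codes: the last two digits come from the server port number.
        meaning = "Generic server error, port digits %s" % code[1:]
    else:
        meaning = "Unknown code"
    return code, rest.strip(), meaning
```

For a 911 page the second element is the phone number to call back rather than an IP address.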
Severe Conditions
Most severe conditions result in the administrator being notified. These include loss of
network connectivity, loss of HTTP access, and disk conditions over 95% full, since these
can result in a system hang. Furthermore, any "NOTICE" message in
the message file causes a notification, since this may signal a disk fault.
Under these circumstances, the screen should turn red. Click on the corresponding red dot
for additional information about the condition.
If a severe situation is occurring that is not being noticed by Big Brother, use the PAGE/ACK
button on the main screen to notify the administrator manually.
Warning Conditions
These include HTTP server errors, disks 90-94% full, the death of important
processes, and "WARNING" messages in the system logs.
The screen should turn yellow if this is the most severe situation at the time.
Click on the corresponding yellow dot for additional information, and notify
the administrator manually if necessary.
No Report Warnings
Each report is checked for freshness. If any report is more than 30 minutes old, it is marked
with a purple dot, and the screen turns purple, assuming that it is the most serious situation
at the time.
These may be the result of heavily loaded systems, but may also indicate a more
serious loss of communication within the Big Brother system itself.
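The freshness rule and the order of severity can be sketched together as follows. This is only an illustration, not the code Big Brother runs: the function names are hypothetical, timestamps are assumed to be seconds since the epoch, and "green" is assumed as the color for "All is well".

```python
import time

# Most severe first, following the order of severity at the top of this page.
# "green" for "All is well" is an assumption of this sketch.
SEVERITY = ["red", "purple", "yellow", "green"]
STALE_AFTER = 30 * 60  # seconds; reports older than this go purple


def effective_color(color, report_time, now=None):
    """Return 'purple' for a stale report, otherwise the reported color."""
    now = time.time() if now is None else now
    if now - report_time > STALE_AFTER:
        return "purple"
    return color


def screen_color(colors):
    """Pick the most severe color among all dots on the display."""
    return min(colors, key=SEVERITY.index, default="green")
```

With this ordering a purple dot only colors the whole screen when no red dot is present, which matches "assuming that it is the most serious situation at the time".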
System Information
Click on any server name for additional details about the machine. Information
about all components is available, including serial numbers, partition sizes,
SCSI addresses, and the physical locations of the devices. This
information lives in the www/notes directory.
General Information
The current status of any individual component is always available by clicking the appropriate
dot in the display matrix. You may have to hit Reload to get the most recent entry.
Occasionally the screen changes color for CPU or HTTP warnings. These can usually be disregarded
since Big Brother has been instructed to be very sensitive during this initial test. Similarly,
Internet connections may turn yellow when the network is heavily loaded. Although this should be
checked out, it is usually not a problem unless the whole Internet section goes yellow.
Big Brother Column Information
conn
The conn column denotes the ping check performed periodically. This
code is located in bb-network.sh.
nntp
The nntp column denotes the nntp check performed periodically. This
code is located in bb-network.sh. It makes sure the news server is
alive and well.
cpu
The cpu column denotes the cpu check performed periodically. This figure
is based on the 5 minute load average as reported by the 'uptime' command,
in the second column. The code for this test is located in bb-local.sh.
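To make "the second column" concrete, here is a small sketch that pulls the 5 minute figure out of a line of uptime output. It is not the code in bb-local.sh; the function name is hypothetical, and the exact uptime layout varies between platforms, so treat the pattern as an assumption.

```python
import re


def five_minute_load(uptime_output):
    """Return the second load-average figure from 'uptime' output."""
    match = re.search(r"load averages?:\s*(.+)$", uptime_output.strip())
    if match is None:
        raise ValueError("no load average found")
    # Figures may be separated by commas, spaces, or both.
    figures = re.split(r"[,\s]+", match.group(1).strip())
    return float(figures[1])
```

The three figures are the 1, 5 and 15 minute averages, so index 1 is the one this column reports.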
disk
The disk column denotes the disk check performed periodically. This
test is just the 'df' command with the disk most full being reported.
The warning amount is 90% by default, and the system is set to panic at
95%. These values are set in $BBHOME/etc/bbdef.sh and may be changed.
The code for the disk test lives in bb-local.sh. You may also set
warning/panic levels individually in the etc/bb-dftab file. See
etc/bb-dftab.INFO for details.
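The disk test described above can be approximated as follows: find the fullest filesystem in df output and compare it with the default levels. This is a sketch only; the function names are hypothetical, the column layout of df differs between platforms, and treating exactly 95% as a panic is an assumption based on the "90-94%" warning range.

```python
WARN_PCT = 90   # default warning level from the text
PANIC_PCT = 95  # default panic level from the text


def fullest_disk(df_output):
    """Return (mount_point, percent) for the fullest filesystem in df output."""
    worst = None
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # The capacity field looks like '93%'; the mount point is the last field.
        pct = next((f for f in fields if f.endswith("%") and f[:-1].isdigit()), None)
        if pct is None:
            continue
        entry = (fields[-1], int(pct[:-1]))
        if worst is None or entry[1] > worst[1]:
            worst = entry
    return worst


def disk_color(percent, warn=WARN_PCT, panic=PANIC_PCT):
    """Map a usage percentage to a dot color using the warn/panic levels."""
    if percent >= panic:
        return "red"
    if percent >= warn:
        return "yellow"
    return "green"
```

Per-filesystem levels, as in etc/bb-dftab, would simply be passed in through the warn and panic arguments.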
dns
The dns column verifies the status of the DNS server on that machine.
The test is basically an nslookup with the server name and IP address
as arguments.
ftp
The ftp column denotes the ftp check performed periodically. This
code is located in bb-network.sh. It is part of the new group of
generic server tests performed. To test this service on a given
machine, just include 'ftp' on the line in the bb-hosts file.
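The idea behind a generic server test can be sketched as a plain TCP connection attempt to the service's port. This is not how bb-network.sh does it; it is a minimal stand-in, with hypothetical function names, that only shows the "is something answering on that port" check.

```python
import socket


def service_port(name):
    """Look up a TCP service name such as 'ftp' in the services database."""
    return socket.getservbyname(name, "tcp")


def port_answers(host, port, timeout=5.0):
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The service name has to match an entry in the services database, so a spelling that is unknown on a given platform makes the lookup fail.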
http
The http column denotes the http check performed periodically. This
code is located in bb-network.sh. It will return OK if the server is
there and does not return a string containing the word 'Error'. The check
should be more rigorous. Note that password-protected pages return
an error when they shouldn't.
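The rule above boils down to a substring test, which is also why it is easy to fool. The sketch below is an illustration with a hypothetical function name; the color mapping (unreachable is red, 'Error' in the reply is yellow) is an assumption drawn from the Severe and Warning Conditions sections, not from the code in bb-network.sh.

```python
def http_color(reachable, reply):
    """Classify an HTTP check from reachability and the text that came back."""
    if not reachable:
        return "red"      # loss of HTTP access is a severe condition
    if "Error" in reply:
        return "yellow"   # HTTP server errors are a warning condition
    return "green"
```

Any reply that happens to contain the word 'Error' trips the check, even when the server itself is healthy.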
msgs
The msgs column denotes the msgs check performed periodically. This
code is located in bb-local.sh. Only NOTICE and WARNING conditions are
considered. Note that a NOTICE condition will cause a notification (code red)
whereas a WARNING just turns the screen yellow. There is no way to
turn these messages off, short of clearing out the messages file
manually or modifying the tags from WARNING to wARNING and NOTICE to
nOTICE. You may also introduce tags in the etc/bbdef.sh file in
the PAGEMSG and MSGS variables.
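The tag matching can be pictured like this. It is a sketch, not the code in bb-local.sh: the function name is hypothetical, and the two tag lists are assumed to play the roles of the PAGEMSG (page, red) and MSGS (warn, yellow) variables.

```python
def msgs_color(lines, page_tags=("NOTICE",), warn_tags=("WARNING",)):
    """Return the dot color for a list of message-file lines."""
    color = "green"
    for line in lines:
        # Matching is case-sensitive, so 'nOTICE' and 'wARNING' are ignored.
        if any(tag in line for tag in page_tags):
            return "red"
        if any(tag in line for tag in warn_tags):
            color = "yellow"
    return color
```

This is why changing WARNING to wARNING in the messages file silences an entry: the altered tag no longer matches.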
oracle
The oracle column is for the status of the Oracle database instances on the
servers, their process status, and a visual as to what is up and what is not.
It checks and lists any and all processes used by the Oracle server(s), and
also checks the network listener(s). It gives the full status response from the
Oracle listener control program's status command. In addition, it now also
gives a list of any and all users currently connected, including Oracle's own
system logins. Code is run from the oracle extension script provided in the
new $BBHOME/ext directory which is only in release 1.2b and later. If you
want a copy, check out the Big Brother Info Page or email me at
paul@pluzzi.com.
cpu2
CPU2 is a column for the status of the processors. There isn't really any other column
devoted to this, and it is a big enough concern. It is currently tailored to SUN
specifically, with the mpstat and psrinfo commands. It checks for any off-line
processors, and will go red on that condition. It also checks swap information.
Code is run from the cpu2 extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
dmp
DMP is a column for the status of the dynamic multi pathing features provided
by EMC, Veritas, and Sun. It runs the format command for an O/S level check
and then queries the Veritas setup to see that it matches. It will go red and
page on any missing paths, whether it be from format or the Veritas check.
It is currently tailored to SUN specifically, but should work on any supported
Veritas platform, as the commands are standard.
Code is run from the dmp extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
ha
This is the High Availability column. It is currently set up for Veritas First Watch
but will definitely be Veritas VCS aware in the very near future. Once again, it is
specifically tailored to the SUN platform right now, but as the need arises, other
platforms will be added.
Code is run from the ha extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
iostat
This column is for the iostat command, so that it can be made available via the web.
It currently polls twice at 5 second intervals. The output can be pretty long and get
truncated on shared disk environments. In fact I am a victim of this already on the
dsgoddb01 host. This column also does a vmstat command for 5 intervals of 5 seconds.
Code is run from the iostat extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
ipcs
IPCS is a column for inter process communication (IPC) status.
Code is run from the cpu2 extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
logs
The logs column is dedicated to showing log files regardless of condition. I don't
currently set any negative color for this, because that is what the "msgs" column is for.
However, I can see the need to do some other parsing of these files in the future, so look
for that. Another difference between this and the "msgs" column, is that I currently look
for the syslog, sulog, and messages file, whereas the "msgs" column only checks the messages.
Code is run from the logs extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
network
The network column is designed to report any negative conditions for a downed interface.
It also reports information about the routing tables, and goes red on any collisions or
other errors from netstat -i. This may need to be modified on a network that is slow,
or is still shared. I happen to be on switched networks exclusively, so I don't have any
collisions. If that is not the case for you, you will need to add the line that checks
for a 10% ratio of collisions to total traffic.
Code is run from the network extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
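For shared networks, the 10% collision check mentioned above could look something like this. It is a sketch under assumptions: the function name is hypothetical, and the caller is expected to supply the collision and packet counts taken from netstat -i, whose column layout differs by platform.

```python
COLLISION_LIMIT = 0.10  # the 10% ratio mentioned in the text


def collisions_excessive(collisions, total_packets, limit=COLLISION_LIMIT):
    """True when collisions reach the limit as a share of total traffic."""
    if total_packets <= 0:
        return False  # no traffic yet, so no meaningful ratio
    return collisions / total_packets >= limit
```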
printers
The printers column is for checking the status of any network connected
printers. It is specifically written around HP's hpnp software so you will
need to have that for this to work.
Code is run from the printers extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
prtdiag
The prtdiag column is strictly to show the output of a verbose prtdiag command on SUN.
I know that a portion of this gets truncated, but it is very minimal. The negative status
indicators are flagged on any offline, unsteady, or otherwise negative conditions.
Code is run from the prtdiag extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
top
TOP is a column showing the top processes running on a system, as reported by
the freeware top utility. This uses the -b flag to capture output for a snapshot of time.
Code is run from the cpu2 extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
vx_check
The vx_check column is used again, on a SUN platform, to check the status of the disks,
thru Veritas Volume Management. It gets a list of all disks in the Veritas control, and
executes a "vxdisk check" on them. It dynamically gets the disk list, so there is no need
to supply a list. That would not be proper anyway, because if you add a disk, and it has
problems, you would not otherwise know about it, if it was a static disk list.
Code is run from the vx_check extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
vx_group
The vx_group column is for checking the Veritas processes running, the vxiod daemon output,
and of course a "vxdg list" command.
Code is run from the vx_group extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
vx_list
The vx_list column is dedicated to checking for any offline or invalid disks listed thru
Veritas that don't have a "double dash" line. In other words, if the disk is configured for
any disk group, whether it be raw or cooked, it will be checked. If it is not online, you
will get a negative color for it.
Code is run from the vx_list extension script provided in the new $BBHOME/ext
directory which is only in release 1.2b and later.
pop3
The pop3 column denotes the pop3 check performed periodically. This
is part of the generic test code in bb-network.sh. It checks that
the pop3 server is alive and well. To test a machine for the pop3
server, put the word 'pop3' on that server's line in the bb-hosts file.
You may have to put pop-3 instead on certain platforms. Check /etc/services
for the correct spelling.
procs
The procs column denotes the procs check performed periodically. This
code is located in bb-local.sh. It makes sure that the processes defined
in etc/bbdef.sh in the PROCS variable exist on the local machine. If
a process does not exist, and it has been defined in the PAGEPROCS
variable, then the code is red and a notification is sent out. The ps command
is used to get a current process listing.
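The relationship between PROCS and PAGEPROCS can be sketched as below. This is not the code in bb-local.sh: the function name is hypothetical, the match is a crude substring search over the ps listing, and turning yellow for a missing process that is not in PAGEPROCS is an assumption based on the Warning Conditions section.

```python
def procs_color(ps_output, procs, pageprocs):
    """Return (color, missing) for the watched process names."""
    # Watch both lists, keeping order and dropping duplicates.
    watched = list(dict.fromkeys(list(procs) + list(pageprocs)))
    missing = [name for name in watched if name not in ps_output]
    if any(name in pageprocs for name in missing):
        return "red", missing      # a PAGEPROCS process is gone: notify
    if missing:
        return "yellow", missing   # assumed: other missing processes only warn
    return "green", missing
```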
smtp
The smtp column denotes the smtp check performed periodically. This
is part of the generic server test code located in bb-network.sh. It
makes sure that the SMTP process (usually sendmail) is alive and well.
Copyright © 1997-1999 The MacLawran Group Inc - All Rights Reserved