Big Brother Information
Finally added a table listing the ports Big Brother uses, for getting it to work through a firewall.
The big news here is a new script that finally lets the disk monitor column work properly on HP-UX. In HP-UX 10.20 and 11.0, the bdf and df output wraps onto two lines if you use long names for volume groups. This wreaks havoc on the disk monitor column, because the fields where the percentage used and the free space should be are vacant on the first line and shown on the next line instead. To correct this, there is a new script called hpux_df.sh, which I install into the $BBHOME/ext area.
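To give an idea of the approach, here is a minimal sketch of rejoining wrapped bdf lines with awk. This is not the actual hpux_df.sh. The function name join_wrapped_df is made up for illustration, and it assumes that a wrapped entry always puts the device name alone on its own line, with the numbers following on the next line.

```shell
#!/bin/sh
# Illustrative sketch only, not the real hpux_df.sh.
# join_wrapped_df (hypothetical name) reads bdf or df style output on stdin
# and prints it with every filesystem entry on a single line.
join_wrapped_df() {
    awk '
        # A line with a single field is taken to be a device name that wrapped.
        NF == 1 { held = $1; next }
        # The line after a held name carries the numbers: glue the two together.
        held != "" { sub(/^[ \t]+/, ""); print held " " $0; held = ""; next }
        # Anything else is already a complete line.
        { print }
    '
}
```

You would use it as a filter, for example "bdf | join_wrapped_df", and then hand the result to whatever parses the percentage column.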
See below for some new updates to the extension scripts, which are not yet available at deadcat. The oracle, network, and prtdiag scripts have been updated.
Added a new page detailing my recent Sun Big Brother server installation.
See below for MANY new updates to the extension scripts, which are also available at deadcat. There is a nice new table showing current versions and dates released. (Updated 02-05-2002: the table now includes links to the scripts.)
Looking for my Big Brother page? Yes I am still on the "old" version, since a new version with a FRESH new look came out, but I did steal the backgrounds from the new version. I think the new backgrounds make the pages so much easier to read. I am testing the entire new version now, and may deploy soon .... (A version of my extensions can be seen in a "stale" version that I just made a copy of.)
Actually, as of July 25, 2000, I finally got my AIX box running again, and am beginning the port of all my Big Brother information to that server. Tonight was the first test, and I suspect it will be complete by the weekend. The new version is 1.4h2 and is also viewable.
I had to move the Big Brother display server from the machine's "help" facility web server to the Netscape Internet FastTrack server. I began having problems with all kinds of things on that "help" web server. Truth is that it should have been on the FastTrack server all along, but I had many problems configuring the cgi directory and the scripts wouldn't run correctly. Now I had a real reason to push a bit harder: the server kept crashing, and it was frustrating me to no end.
Well, for starters, I have been working with Big Brother since the early days, starting with revision 1.0. It has served me well through my posts at Edu-Met and AT&T, and on my own personal network. I am using it for my own local servers at Pershing now as well. As my network grew to the point that it is now bigger than companies I have worked for in the past, the "guaranteed" uptime became more critical. It is also a tremendous asset for proactively monitoring the clients I support: I can be paged and notified, and resolve problems before the client ever needs to call me.
I have implemented many of my own extension scripts, which I will offer here. The oracle script is shown in the next "paragraph" below. Here I will concentrate on the enterprise Sun servers I am using in my current position, with all their database and EMC configurations, and offer what I have done to monitor all of this.
These are all VERY Sun specific right now, but will very likely be ported to HP-UX within the next few months. It seems inevitable. In fact, as of 08-31-2001, I have finally got some HP-UX specific stuff into these agents. The network script for one is now HP-UX aware.
I wanted to be able to know at a glance (and with paging notification) if my failover software has gone down, or if there are any network collisions, or if Veritas is reporting any issues, or if dynamic multipathing is down, or any of the many other issues you will see below. Since Big Brother is very extensible and this was so easy to do, I wrote it all into Big Brother. (For the links below in this section, you can either left click them, to see the code, or right click, and download - they are all shell scripts)
Here is what I set up and what you need to do to use it :
- First off, I have included a complete tar file of all of what you will see below. If you prefer, you can grab it that way, and break it all apart yourself. I like to offer the flexibility, because that is how I would like to see it if I were downloading it. I must however mention, that the tar file is slightly out of date, and you would be better off getting each one individually, since many are updated, and the tar is all version 1 scripts.
- I have included a copy of my modified $BBHOME/www/bb-help.html file, for all fully documented column names. This way when you click on the column name, expecting to find information about what is being monitored, you will be able to.
- If using a version prior to 1.4, you will need to modify the $BBHOME/runbb.sh script, so that the section for custom "ext" scripts ($BBEXT) lists any or all of the extensions. My copy defines and configures all of them. The list needs the quotes. (Also note that I have modified the "restart" section of this script. I have carried it over from an older version, so that it will delete log files, and bogus html stuff for me, and then totally refresh - you may or may not like this modification, so all you really need to be concerned with is the "ext" section). I have also set these up for 15 minute updates, for what it's worth.
- I also recommend using the modified $BBHOME/ext/bbsys.local file. I have provided one for linux, solaris and hpux. I have set up all the variables that any of these extensions reference within this file. You could conceivably put it in place even without using the extensions and do no harm. If you are going to use ANY of the extensions, though, it is ABSOLUTELY NECESSARY to include this file and restart Big Brother. I have also provided a complete, combined one in case that is preferred.
- I have included a copy of my modified $BBHOME/etc/bbdef.sh file, for configuration information. Nothing can be more frustrating than not having something to reference against. So it is included here for information.
- I have included a copy of my modified $BBHOME/etc/bb-hosts file, so that you can get an example of the syntax and layout for all of it. It is mandatory that these extensions be configured in the bb-hosts file, because without it, they will not run.
- I have also included another copy of the paging scripts $BBHOME/bin/bb-page.sh and $BBHOME/bin/bb-page1.sh I had to modify to make this work. Keep in mind that I am still using version 1.2b for all of this. The newer versions 1.3 and later may handle this already, but I have not fully deployed them yet.
- The last configurable group before getting to the actual extension scripts is the configuration for the paging rules. These go with the bb-page.sh and bb-page1.sh scripts above. There is a $BBHOME/etc/bbwarnrules.cfg and a $BBHOME/etc/bbwarnsetup.cfg.
- cpu2 (sample output) - This is used to see if you have multiprocessors, and to make sure that they are all online. It uses mpstat and psrinfo among other things. Updated now to a new version which also uses a combination of "uname -X" info and prtdiag output (if available) to check this. It seems that a downed proc on a SPARC machine isn't reported via psrinfo and mpstat. They both agree that it is not there, but don't report it off-line!!! The new version gets around that by using other sources to validate this info. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- dmp (sample output) - This checks that the dynamic multipathing is functioning properly. This is a critical issue and should therefore be monitored. It checks that "format" on the Solaris platform reports no problems, then looks to Veritas to make sure that the number of configured paths is the same as the active paths. It will then show them to you in a table on screen. It also now works with newer versions of the Veritas suite of products. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- ha (sample output) - First Watch/VCS High Availability software check, to make sure that the state of the primary server in the cluster is up, the secondary is up and in the right mode, and that the heartbeats are correctly shown. This HAS BEEN ported for Veritas Cluster Server, as I am now fully entrenched in that product as well. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- hpux_df.sh (N/A) - This is now a multi platform script used to check disk space. It originally came about because HPUX has problems displaying long lvm names on one line and that creates hell for the parser in figuring out space issues. This script will convert it back to one-line format for easy parsing.
- iostat (sample output) - This is actually just an informational page right now, because I have not yet set up any warning conditions, other than the command not being found. It will run a vmstat and an iostat. I will eventually parse the results for bad values. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- ipcs (sample output) - This is also strictly informational right now. About all I can see to do is to alert that there is a segment that can be freed, but it doesn't seem proper to go yellow on that, so it stays clear for now. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- logins (sample output not available yet) - The idea here is to watch the btmp file on HPUX and the loginlog on solaris for failed login attempts. It also watches sudoers log files for failures. This is to proactively watch for attacks in a very primitive fashion. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- logs (sample output) - On my Sun boxes, I find that syslog is not always reporting to /var/adm/messages, and is sometimes reporting to /var/adm/messages.2 or similar, so the "msgs" field was not always sufficient. This script gets around that one. It also checks other log files for this condition. Lastly, I only grep out the current date, so you don't have to continually live with messages until you reboot or relocate the errors. I have also now incorporated the HA logs into this if they are in use. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- mail (sample output) - On my Sun boxes, I find that root mail is not always checked as it should be, so I like to know if there are any bad messages in there worth reading. (requires new $BBHOME/etc/bbsys.local file mentioned above). I never mentioned it before, but for this to work best, each night at midnight I roll over root's mail to an archive area via an archiving script I call arch_mail.sh - I keep it in the admin only scripts directory which is /var/adm/bin for me, but others use /usr/local/sbin. This way the bb column always reports on current information. I personally log all that stuff to /var/log/`uname -n`/mail/root.mail.`date ...` and clean it up every 30 days. I find that it works best this way.
- mailq (sample output not available yet) - This is used to watch the mailq, or spooled mail messages. The threshold is definable, so that you can watch to be sure your queue doesn't back up. It can also be configured to keep an eye on the next hop mail server to be sure that they can communicate. (requires new $BBHOME/etc/bbsys.local file mentioned above).
- network (sample output) - This script runs commands like netstat, for statistics, routing information and the like. It also checks that interfaces are all up, and will go red on any collisions or errors on an interface. Since I only work in a switched environment anymore, it should have no collisions or errors. If you are in a hub only environment, these will need modification. The new version of this now handles the display properly for the interfaces. (requires new $BBHOME/etc/bbsys.local file mentioned above). It is also capable of working on HP-UX (tested on 10.20 and later) with its missing "-a" flag to ifconfig.
- printers (sample output) - This was a quick test to allow me to see the status of all the remote printers across the globe. After my first production call about a critical printer problem, I immediately saw the need for this one. Uses/requires hpnp software from HP. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- prtdiag (sample output) - Sun specific - updated to new version, which now checks for things like power supplies, board temps, front lights, cpu's online, and many other checks. Shows real color status, not just green. Now checks E480s as well as Netras and all other SPARC IIs. (requires new $BBHOME/etc/bbsys.local file mentioned above). The prtdiag is now updated to work with the Netra T1 servers using LOM also.
- queue (sample output) - On my Sun boxes, I find that the print queues sometimes have problems, be it IP addresses on printers changing, printers moving, or other problems, so I want to know if any jobs have been queued over a day, or if there are more than 15 jobs queued at any time on any printer. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- top (sample output) - This reports the top 30 procs on the system for review. This was a mostly informational page because I was having problems with the error condition. I had been looking for any processes that consumed more than 20% of regular CPU activity, but my pattern match was not working well. This has been resolved in a newer version, which now handles 20-39% as yellow and red above that. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- vx_check (sample output) - Veritas "vxdisk check" command to get status of disk. Uses "vxdisk list" for a list of disks, thereby making it dynamic and portable. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- vx_group (sample output) - Veritas "vxdg list" command to get status of disk groups. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- vx_list (sample output) - Veritas "vxdisk list" command to get list and status of disk. (requires new $BBHOME/etc/bbsys.local file mentioned above)
- I have added some full examples, but they are static and don't update. I took them from my latest implementation, but it is not on my home network because I don't have EMC storage, Veritas Volume Manager, or High Availability software in my own "data center" yet. However, my host - castor - has Veritas installed now, so you can see the Veritas columns there. No dmp on the castor platform though.
- Good luck, and I hope you like it. By all means, let me know if you have any problems.
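As an illustration of the kind of threshold check described for the top column above (20-39% yellow, red above that), here is a rough sketch. It is not taken from the actual top script. The function name cpu_color is hypothetical, and I am assuming that the percentage is compared on its whole-number part and that red starts at 40.

```shell
#!/bin/sh
# Illustrative sketch only, not the real "top" extension.
# cpu_color (hypothetical name) maps a CPU percentage to a Big Brother color.
# Assumption: below 20 is green, 20-39 is yellow, 40 and above is red.
cpu_color() {
    pct=${1%%.*}      # keep only the whole-number part, e.g. 39.9 becomes 39
    pct=${pct:-0}     # treat an empty value (such as ".5") as 0
    if [ "$pct" -ge 40 ]; then
        echo red
    elif [ "$pct" -ge 20 ]; then
        echo yellow
    else
        echo green
    fi
}
```

Comparing numbers this way, rather than pattern matching on the text of the top output, avoids the sort of trouble a regular expression can have with values like 2.0 versus 20.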
I have implemented my own oracle extension scripts, which I will offer here. I wanted to, at a glance, be able to know if my database is up, and which processes are running, which instances are installed, and which are running, whether the network listener is up, and how many users are connected. This can tend to be a time consuming issue for a remote client, so I wrote this into Big Brother, since it is very extensible. (For the links below in this section, you can either left click them, to see the code, or right click, and download - they are all shell scripts). This is also aware of VCS clusters now. Instead of being red on the backup nodes, it will run clear, knowing that there is a primary node hosting the service. It now works in standalone and clustered environments.
Here is what you need to do :
- You will need to modify the $BBHOME/runbb.sh script, so that in the section for custom "ext" scripts ($BBEXT), you have "oracle" listed. It needs the quotes. (Also note that I have modified the "restart" section of this script. I have carried it over from an older version, so that it will delete log files, and bogus html stuff for me, and then totally refresh - you may or may not like this modification, so all you really need to be concerned with is the "ext" section)
- UPDATE : - as of September 16th, the new "oracle" script will do all of the pieces below in one script, and can handle Big Brother running as a user other than root. The main script "oracle" (sample output) goes into the $BBHOME/ext directory, and you need to have version 1.2b or later running, as far as I know. I had older versions on my clients, but they didn't support the "ext" type of setup, so make sure you have a new enough version.
- Secondly - I call some scripts that I keep in a database management area, called $ORACLE_HOME/db_mgmt. These two scripts are called "parse_dbs.sh" and "whos_on_db.sh.v2" (new version 2 which uses svrmgrl instead of sql*plus, hence no need for sql username/passwords so long as you run it from the local box). Since I call them so frequently I wrote scripts for them too, but truly you could probably combine all of this into one script. Keep in mind that I use the two db_mgmt scripts ALL THE TIME, so I just "re-used" them in their existing format.
- Third - security - I set the permissions on the "oracle" script so that only root can use it. That means perms of 500. Since I am always the SA on the job, and most times the DBA too, I tend to work with root level access regularly. Therefore, I can always switch to any user I want to, to do whatever I need to. I have the Oracle "system" user and password coded in my "oracle" script, so that when it calls the "whos_on_db.sh" script, it will have no problem. So make sure that nobody else can read this script. That is VERY important. Since I get to run Big Brother as root, only root can read that script. It is as safe as you can get. (Actually, with the new version 2 of the oracle script, I no longer include the username and password in the script. It now uses svrmgrl instead of sqlplus, so there is no longer a need for that.)
- So, you will find three scripts, the "oracle" script goes in the
$BBHOME/ext directory with perms of root/root/500, the "parse_dbs.sh" which
goes in the $ORACLE_HOME/db_mgmt directory with perms of oracle/dba/755, and
"whos_on_db.sh" script which goes in the $ORACLE_HOME/db_mgmt directory with
perms of oracle/dba/500.
- Good luck, and I hope you like it. Let me know if you have any problems.
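For what it is worth, the modes listed above could be applied with something like the following sketch. The function name set_oracle_script_modes and its two arguments are made up for illustration. It only sets the modes; the ownership (root/root and oracle/dba) needs chown run as root, so that part is shown only as comments.

```shell
#!/bin/sh
# Illustrative sketch only. set_oracle_script_modes is a hypothetical helper.
# $1 = the $BBHOME directory, $2 = the $ORACLE_HOME directory.
set_oracle_script_modes() {
    bbhome=$1
    orahome=$2
    chmod 500 "$bbhome/ext/oracle" &&
    chmod 755 "$orahome/db_mgmt/parse_dbs.sh" &&
    chmod 500 "$orahome/db_mgmt/whos_on_db.sh"
    # Ownership has to be set as root, for example:
    #   chown root:root  "$bbhome/ext/oracle"
    #   chown oracle:dba "$orahome/db_mgmt/parse_dbs.sh" "$orahome/db_mgmt/whos_on_db.sh"
}
```

It would be called as set_oracle_script_modes "$BBHOME" "$ORACLE_HOME" once the three scripts are in place.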
I had a great deal of trouble getting the paging to work for me, using the 1.2b version. I had never used the paging feature before, so I can't say if it worked well in older versions or not, but I will offer you my modified scripts for the paging. I am currently using only email, no paging, but switching requires no real change to the scripts, just to the config file (I plan to implement the Internet paging features of sites like Sprint PCS, Nextel, and AT&T).
I had to modify the "bb-page.sh" and the "bb-page1.sh" to get the paging to work. Now, for what it is worth, I am using this on an SCO OpenServer 5.0.4 machine, as my $BBDISPLAY, $BBPAGE server. I also had to modify the "bb-doack.sh" script, to get the web "ack" to work properly. I am very happy with it now that it all works properly.
To make the paging work properly, I needed to assign a value to each extension script. Otherwise when I send an ack for any one of the new services, it will stop the paging for all the others, since it will be the "unidentified 999". To get around this, you need to edit the "svcerrlist" and "SVCERRLIST" variables in bb-page.sh and bbwarnsetup.cfg. I tried to maintain the "classes" as provided by Sean, and then grouped the new extensions together with the existing classes. The list I use goes like this :
svcerrlist: disk:100 cpu:200 procs:300 msgs:400 conn:500 http:600 dns:800 ERR:999 vx_check:120 vx_group:125 vx_list:130 dmp:150 iostat:175 cpu2:250 prtdiag:275 ipcs:325 top:350 logs:450 ha:550 network:850 oracle:900 printers:950
Category | Column | Value
Disks - 100's | disk | 100
 | vx_check | 120
 | vx_group | 125
 | vx_list | 130
 | dmp | 150
 | iostat | 175
Processors/cpu - 200's | cpu | 200
 | cpu2 | 250
 | prtdiag | 275
Processes - 300's | procs | 300
 | ipcs | 325
 | top | 350
Messages - 400's | msgs | 400
 | logs | 450
Connectivity - 500's | conn | 500
 | ha | 550
http info - 600's | http | 600
General - 700's | N/A | 7xx
Network - 800's | dns | 800
 | network | 850
Other - 900's | oracle | 900
 | printers | 950
 | ERR | 999
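To show why each extension needs its own value, here is a small sketch of a name-to-code lookup over that list. This is not how bb-page.sh itself does it. The function name svc_code is made up for illustration, and the fallback to 999 simply mirrors the "unidentified 999" behaviour described above.

```shell
#!/bin/sh
# Illustrative sketch only, not code from bb-page.sh.
# The list is the svcerrlist shown above, as space separated name:code pairs.
SVCERRLIST="disk:100 cpu:200 procs:300 msgs:400 conn:500 http:600 dns:800 ERR:999 vx_check:120 vx_group:125 vx_list:130 dmp:150 iostat:175 cpu2:250 prtdiag:275 ipcs:325 top:350 logs:450 ha:550 network:850 oracle:900 printers:950"

# svc_code (hypothetical name) prints the code for a column name,
# or 999 when the name is not in the list.
svc_code() {
    for pair in $SVCERRLIST; do
        case $pair in
            "$1":*) echo "${pair#*:}"; return 0 ;;
        esac
    done
    echo 999
}
```

With the list above, "svc_code dmp" prints 150, while a column that was never added to the list comes back as 999, which is exactly the case where an ack for one service would silence all the others.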
I have "slightly" customized some of the page look and feel. Certainly not as much as some of the other demo sites that are shown, but just little subtle changes. Some show more information, like my bb-hosts file "box header" line :
group-compress <H3><I>Windows 95/98 Clients<BR>-Often Sleeping-</I></H3>
and some just give a footer (I link the $BBHOME/footer script to the $BBHOME/www/notes/footer script, so they are the same throughout all the pages). Others just seem a little bit more readable. For example, I had changed the default type face color to white (in the mkbb.sh script), so that I could read it easier across all the background colors, after looking at the links. In other words, if you look at the comments (www/notes directory), that say what the machines are, who supports them, what is installed on them, etc, upon returning to the original page, the color changes to reflect that the link was visited, and I could not read it clearly anymore, so I changed that behaviour.
I have it deployed across many machines, from NT workstations and servers, to different *NIX variants. I use a simple method to create the tar file, so that I can setup the other clients, so this is how I do that.
A lot of people have emailed me concerning the NT agent, and how to get it to work. I have used it for both the server and workstation versions of NT. The only real difference between the two for me, as far as the agent is concerned, is that my workstation has no processes defined that must be up and running all the time, whereas my server agents do. For example, my server must have my Oracle Web Application Server and Backup Exec processes running, or I want to be notified. My workstation simply needs to be up and running.
So, as an example, for the NT workstation, I have done the following :
- Installed client version 1.04e
- All checkboxes UNCHECKED, except for "Send Notification Alerts" and
"BBWARN Style Notifications"
- Ignore messages are any messages you want to ignore, just put the
portion of the event log messages in quotes, for the messages you want to be
immune from the BigBrother checks.
- Msg Levels, I have set as follows :
SYS:ERR:Y:1440 SYS:WARN:N:1440 APP:ERR:Y:1440 APP:WARN:N:1440
- Process List would be any processes that MUST be running at all times.
On the workstation I have none defined.
- Drives list is what drive you want checked for space, and the thresholds
for them (override defaults from above).
C:90:95 D:90:95
- Drive default thresholds of 90 and 95 for warning and panic respectively
- CPU default thresholds of 90 and 95 for warning and panic respectively
- BBDISPLAY host is whatever Unix machine runs your Big Brother server
daemons
- BBPAGER host is whatever machine has your modem for paging, in my case
same as BBDISPLAY
- IPPORT is default 1984, unless you have reconfigured everything
- TIMER is default of 300 seconds (5 minutes)
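The drive entries above follow a drive:warning:panic pattern, so C:90:95 means warn at 90% and panic at 95% on drive C. The sketch below shows how such an entry can be read. It is only an illustration of the format, not code from the NT client. The function name drive_status is hypothetical, and I am assuming that warning maps to yellow, panic maps to red, and that reaching a threshold exactly counts as crossing it.

```shell
#!/bin/sh
# Illustrative sketch only, not part of the NT client.
# drive_status (hypothetical name): $1 = entry such as C:90:95, $2 = percent used.
drive_status() {
    spec=$1
    used=$2
    drive=${spec%%:*}     # text before the first colon, e.g. C
    rest=${spec#*:}       # remainder, e.g. 90:95
    warn=${rest%%:*}      # warning threshold
    panic=${rest#*:}      # panic threshold
    if [ "$used" -ge "$panic" ]; then
        color=red
    elif [ "$used" -ge "$warn" ]; then
        color=yellow
    else
        color=green
    fi
    echo "$drive $color"
}
```

For example, drive_status C:90:95 92 reports "C yellow" under these assumptions.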
For the Big Brother Unix server, just set up a line for the workstation in the bb-hosts file. I used a new group called "NT Workstation Clients", and the IP and name are shown below :
group-compress <H2><I><FONT COLOR="white">NT Workstation Clients</FONT></I></H2>
207.86.37.18 ntws
That is all I do to get them working.
I implemented a counter on the site, because I noticed I was getting tremendous activity compared to my regular websites. Of course, this was back in the day when I was Sean's premier demo site, but since I am still lagging behind on version 1.2b, it seems that I have been "demoted" to a lower numbered demo site, and the hits aren't as great. (Don't worry, I won't stop writing the extensions or help sections ....) Regardless, I wanted to see just how many of you are actually looking into this product, so I got the counter from SiteMeter, and you too can do the same.
And lastly, as you know if you have read any of my code changes, I fully document everything I do (you can always tell my comments by my trademark "five pound" lines "#####"). That is why you see comment fields in all the scripts, and also see that my machine names, and header variables (in bb-help.html) at the top, are all commented. This is truly of tremendous value. Think about a larger organization, where you are a consultant. By all rights, a very good consultant is only around for a little while. If the job is done correctly, you finish early, transition the knowledge and move on. So for any new admins, you want to leave enough information for them to be able to figure things out for themselves. That is what the comments are all about.
If I have missed anything or if you need additional help, try :
- Read thru the example file of the "$BBHOME/etc/bb-hosts" file. It is what tells Big Brother what to monitor. For example, ftp, telnet, http://local .... and the rest.
- If you want client machines to report back specific data, like disk space, then you need to install the Big Brother tool on those clients too. They send data over port 1984 back to the $BBServer at 5 minute intervals.
- Click on the picture at the top of the Big Brother html page, and view the
info page. It has some good links about how to get things to work.
- Lastly, check the archives of the mail list/postings of questions and cries for help with all of this. I have found many good discussions, and good answers there. It is at : http://www.fluentcomm.com/~bb/bb.htm. Also look into the ftp archive downloads at : http://www.deadcat.net/.
- And as a last possible resort, you can contact me, and maybe I can help you or point you in the right direction. I am at paul@pluzzi.com.
This page last updated 10-24-2002