This paper was originally published in the Proceedings
for The Third World Conference on System Administration,
Networking and Security, Washington DC, 1994.
The Good, The Bad and The Ugly.
Anecdotes from the System Administration Trenches
Linda Bissum
ABSTRACT
Making mistakes is never difficult. Making mistakes
when administering many UNIX systems is certainly easy. In
spite of this, most papers tend to focus only on what
works. This paper describes approaches I have seen and
experienced that do not work.
Introduction
I have been working with UNIX system
administration and security for over a decade, and like
this environment very much. In some ways, the UNIX
operating system is like a human being. It can be made to do
almost any task adequately; however, it will do almost
nothing perfectly. UNIX also allows most tasks to be
performed in a large variety of ways; unfortunately, this
also leaves plenty of room for mistakes.
According to an old saying, the way to avoid making
mistakes is to have experience. And how does one get
experience? By making mistakes! Learning from other people's
mistakes sounds good in theory, but unfortunately does not
work too well in practice. In spite of this, I hope some
people may gain some useful insights from reading this
paper.
This paper is a tour down memory lane, revisiting some
of my mistakes and experiences. It is mostly about mistakes
made by system administrators, users or management.
However, it is also about how such mistakes can be avoided,
and how, in a general sense, the work in the trenches
can be made more bearable.
Cleaning up the Hard Disk
I don't believe that
there is any real UNIX System Administrator anywhere who
has not, at one time or another, executed the famous
rm -rf * in the wrong place, and deleted a large number
of files which should never have been removed. In fact,
when I meet system administrators who claim that they have
never ever done anything like this, I make a mental note to
make sure that they never get access to one of my
systems; having an inexperienced system administrator can
be bad, but having one who is afraid of admitting mistakes
can be downright lethal.
The all-time worst case I have seen was at one computer
installation where management had, in their infinite
wisdom, decided that all key engineers should have the
root password for all the key servers. This
specific event happened late on a Friday afternoon, when an
engineer had discovered that a server was almost running
out of space in the root directory. Being a genuinely nice
guy, he decided that he did not need to disturb me, and
that he could fix this himself. He searched for some large
file in the root directory which could be removed. The two
largest files he could find were /boot and
/vmunix, and as they had not been modified for a
long time, they would surely be on the backup tape, and
could therefore be removed safely. After the files had been
removed, just to be sure that the disk space was recovered,
he decided to reboot the system. Much to his surprise, the
system would not boot, and I got to spend too much of my
weekend fixing the problem (it was back in the ``good''
old days, when you had to boot the system from tape).
For me, this experience was the drop which made the cup
run over. Ever since, I have been a firm believer in users
not having the root password for any
mission-critical machine. Of course, the politics of the
root password and superuser access have always been
at the forefront of life in the system administration
trenches.
The Root Of All Evil
It seems that almost everybody
who uses a UNIX system wants to be able to have root
access. In some ways this is very understandable, for it is
part of human nature to strive to become better at
whatever we do. For a UNIX user, it seems almost natural
that the way to grow is to go from being a normal user to
becoming a superuser. What most people forget is that
the increased power that a superuser enjoys, being able to
play God in the Unixverse, also comes with a big
responsibility to balance the needs of all users of the
system.
I don't know if I have just become better over the years
at playing the game of Root Politics, or if there
finally is a better understanding in the user
community of how high reliability of UNIX systems is
best reached. Here are some of the various means by which I and
other sysadmins have dealt with this dragon:
- In small companies which do not have a large staff of
system administrators, it is impossible to have a system
administrator on duty all the time. On the other hand, it
is highly undesirable to hand out the root
password to everybody who might work late at one time or
another. A good compromise can be reached by giving the
security guard a sealed envelope with the current
root password.
If a user comes into a situation where they think
they need the root password, they can obtain
it at any time from the security guard, but will be
required to sign for the envelope, and will also be
required to file a report of the incident within 24
hours. The sheer intimidation of having to go through
signing and filing a report will limit this to
very few cases, which are often justified.
If the company is too small for a guard, a similar
system can be established where the envelope is kept
in Mr. BigBoss's office (don't use your own office, it
won't work nearly as well).
- As an alternative, require everybody who has
the root password to carry a beeper and be
available for on-call duty on weekends and evenings. When
the sysadmin answers any user request for the
root password with ``as soon as I can get you a
beeper'', it has the interesting effect that most people
quickly determine that ``oh, by the way, I don't need it any
longer''.
- On the more humorous side, one system administrator I
met at a conference some years ago claimed that he had
solved this particular problem by renaming the
root account to clerk. While many
people would like to be able to become superuser or
root, the desire to be clerk was
much smaller.
To Know or Not to Know, That is the Question...
One of the biggest problems, which I have encountered so many
times that I have lost count, is the new UNIX system
administrator who has to cope with maintaining the systems
without having the necessary knowledge.
In many cases, these people have been given
responsibilities way beyond their capacity, mostly out of
desperation on the part of the organization's management and user
community, as the explosive growth in UNIX systems and
Internet connectivity has made it hard to find qualified
system administrators. For UNIX system administrators in
the early eighties, there were no books available on any
topic; anything you did not know you learned from the UNIX
manual or from the system itself. Today, besides
conferences and support from the Internet, a large number
of good books (and an even larger number of mediocre ones)
have been published on various UNIX system
administration topics. In one way this makes it a lot easier
to learn to administer the systems well, but in other
ways it makes me sometimes think that it must be almost like
drinking from a fire hose. In many cases the new people are
very eager to learn what is necessary, and often
deal with the steep learning curve remarkably well. It
helps that most old-time system administrators still
remember how it was to be new to UNIX and not have
anybody to ask questions. They are therefore more than
willing to give advice and share information. However,
there is still ample room for mistakes.
One common mistake I see newer system administrators make
is not learning to install the systems from scratch. Even
worse, they may not even have a clue as to where the necessary
system software is stored, or even know if the installation
media is present anywhere locally. In such cases, even if
they are able to call in a consultant who is capable of
installing the system, it might not be possible to complete
the installation because of lack of the necessary
software.
One situation which especially comes to mind was at a
financial institution with offices in San Francisco and Los
Angeles. These two locations were connected by a leased
line, but nobody onsite had any clue about the hardware or
software which was used to make the connection work. While
this was bad enough by itself, there was no documentation
on either hardware or software to be found anywhere. When I
stressed the need to get the information, and get the
hardware on a service contract, the answer was ``Why should
we, it never breaks''. When it did break, it took almost
three weeks before that network connection was back in
operation.
I have also seen that the lack of knowledge and the lack
of documentation can lead to almost overpowering
procrastination. One client site was using some
old Sun 4 file servers with equally old disk drives, along
with some special software to concatenate two of the
drives to get one large file system. I had seen messages in
the system log which indicated that one of the drives was
failing. As those specific drives had a notoriously low Mean
Time Between Failures, it was to be expected that the
drive would probably fail completely within a few months at
the most. However, because nobody on the local technical
staff understood the concatenation software or how the
drives were configured, management decided that it was
better not to replace the malfunctioning drive,
out of fear that it might not come back online again. When
the drive did fail a few weeks later, it happened at the
worst possible time: during the end-of-month processing, a
big all-weekend event. When I got the go-ahead to replace
the failed drive, it only took a few hours to get the
system back fully operational, while it took much longer
to redo the end-of-month process. If the replacement had been
done as scheduled maintenance, it would have been less
visible in high places, and much less painful to everybody
involved.
It has been my experience that upper management is
almost always without technical knowledge, and will most
often not want to know anything about the technical issues.
This can make life difficult for a system administrator.
However, some of the experiences I have had as a consultant
with the management of various organizations have been
almost ridiculous. I remember more than one occasion when
management insisted that
somebody should follow me around and keep an eye on what
I did. While this may be reasonable if the person
accompanying me has some UNIX background, in several
cases that person did not have a clue about UNIX. That I
had the root password to all the servers, and
therefore could do anything I wanted, should I have wanted
to do something bad, never seemed to occur to anybody.
What do I care
While lack of skilled people can be
bad, being in a situation where nobody cares is even worse.
I remember one client site where the newly hired sole
system administrator told me that she ``did not do
networks''. Part of her responsibility was to maintain
the organization's Internet connection, so I had to tell her
``well, you do now''. When their e-mail connectivity broke a
month later (their primary name server was out of
operation for some time) she got the point that it might be
a good idea to learn a little something about networking. However,
at that time upper management objected to having to pay for
her education. In many cases a bad situation can be
remedied and improved, but there are also situations where
getting uninvolved as fast as possible is the only feasible
alternative. This was certainly one of these.
Over the years, I have talked with many system
administrators who have been in a bad situation because
upper management or their boss makes their life miserable.
It is very difficult to fight your boss and win; if the
work situation is really unsustainable, get another
job.
Backup, Backup
Operators can also be a source of
problems. They usually do not have much knowledge about the
system they are running, and are often easily intimidated
by an outside consultant. This can make them defensive and
difficult to work with, at least initially. However, if
they get inventive, especially on their own, it can be much
worse. One system administrator I talked with at a
conference told me about how one of their operators had
decided that it was much easier if he just labeled the
backup tapes and put them directly into the storage vault;
actually doing the backup took a long time and was a lot of
work. Nobody found out until one day when it was necessary
to do a restore. Then they discovered that the needed
backup tape was blank, and in fact all the backup tapes
were blank. It apparently did not occur to anybody that the
real problem was that nobody had ever taken the time to
explain to the operator the purpose and function of the
backup. If he had understood the consequences of his little
shortcut, it would probably never have occurred.
Backups are important! While doing backups is a boring
routine job, it is also the only thing standing between a
fast, painless recovery and massive loss of data when a
mistake is made. As long as we have good backups available,
we will be able to recover from almost any failure, be it
human, software, electrical or mechanical.
Also, as a more practical consideration, people can lose
their jobs because of bad backups. Besides the two cases I
have heard about from other administrators, I have personally
been witness to two cases. In the first case, the operations
manager at a large hardware company lost his job when a
disk drive failed, and it was discovered that it had never been
backed up. As the drive was the home of the release source
tree, it had major repercussions everywhere. The root of the
problem was that the engineering department had purchased and
installed that drive, but never notified the operations
manager about the new addition to the machine. There are two
lessons to be learned here:
- Don't let users do any maintenance on the machines
you have responsibility for. If your management forces
that decision on you, make clear that you cannot be held
responsible for users' actions.
- And make sure you know your machines, how they are
configured, and what software they are running. This is
important for maintenance, for reliability and for
security.
- In the second case, where a system administrator got
fired, he was much more actively involved in the failure.
In my experience, whenever something goes wrong, it
requires at least three independent failures. In this
case, the system administrator had written his own backup
script without testing it properly; he used the old
AT&T cpio archive program instead of the
much more reliable dump program; he did the
backup over NFS; and he ran the backup from cron,
redirecting the diagnostic output to /dev/null,
the UNIX equivalent of a black hole. In other words, a
count of no fewer than four mistakes, which in combination were
a disaster waiting to happen.
As more disks were added to the systems, the total
amount of data started to exceed the capacity of the
tape, but all messages from the system were redirected
to /dev/null so nobody was any wiser. The
problem was first discovered when the CEO accidentally
removed an important file, and could not get it back,
because it was backed up somewhere past the end of
the tape.
So if for no other reason, simple job security makes
it worth spending some extra time to make sure that a
good control system is in place. Being careless, or
leaving backup to unsupervised operators or junior
system administrators can lead to a rude and sudden
awakening.
- I usually recommend to my clients that they establish
a system which is centered around a check list. Some of
the actions must then include:
- Finally, it is a good idea to restore a complete
partition to a spare disk drive once a month. This helps
to ensure that the people who are responsible for the
restore actually know how to do it, and that they can do
it in minimal time.
- Finally, if incremental backup is used, it is a good
idea to use a simple scheme to keep down the number of
tapes used. E.g. if dump is used, doing daily dumps at
level 9, weekly dumps at level 5 and monthly dumps at
level 0 will be sufficiently simple that you will most
likely not goof up when doing a full restore at 3 AM.
Using the Tower of Hanoi algorithm suggested in
the dump man page will make it difficult and time
consuming to do the restore correctly at a time when you
are not at your brightest.
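
To illustrate why the simple 0/5/9 scheme is easy to restore from, here is a minimal sketch in Python. The levels come from the text above; the calendar is an assumption made only for illustration (monthly dump on the 1st, weekly dump on Sundays), and the function names dump_level and restore_chain are hypothetical. It relies on the fact that a dump at a given level holds everything changed since the most recent dump at a lower level.

```python
from datetime import date, timedelta

# Dump levels taken from the text: monthly 0, weekly 5, daily 9.
MONTHLY, WEEKLY, DAILY = 0, 5, 9


def dump_level(day):
    """Return the dump level scheduled for `day`.

    Assumption (not from the paper): the monthly dump runs on the 1st
    and the weekly dump on Sundays; every other day gets a daily dump.
    """
    if day.day == 1:
        return MONTHLY
    if day.weekday() == 6:  # Monday is 0, so 6 is Sunday
        return WEEKLY
    return DAILY


def restore_chain(last_dump_day):
    """List the (date, level) tapes to restore, oldest first.

    A dump at level N holds everything changed since the most recent
    dump at a lower level, so walking backwards we only need a tape
    when its level is lower than any level already collected.
    """
    chain = []
    lowest = 10  # higher than any dump level
    day = last_dump_day
    while True:
        level = dump_level(day)
        if level < lowest:
            chain.append((day, level))
            lowest = level
        if level == MONTHLY:
            break
        day -= timedelta(days=1)
    chain.reverse()
    return chain


if __name__ == "__main__":
    for day, level in restore_chain(date(1994, 3, 17)):
        print(day.isoformat(), "level", level)
```

Under these assumptions a full restore never needs more than three tapes. For a failure after the dump of Thursday, 17 March 1994, the chain is the level 0 tape from 1 March, the level 5 tape from 13 March and the level 9 tape from 17 March, restored in that order.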
The New Gurus
One thing which is always scary is
when people have learned just a little bit about a topic,
and then consider themselves fully qualified on all aspects
of that item. I still remember one place where I was
helping an organization to bridge the gap after their
sysadmins had left for another job, and I was also spending
some time educating the new system administrator, who had
been promoted from within, but had little prior experience
with maintaining UNIX systems. I still vividly remember his
first comment to me, when he returned from a one-week
course introducing him to UNIX system administration.
Literally, he greeted me with the words ``Where is the
kernel software, I want to reconfigure our kernel!'' when
he walked in the door.
It has sometimes been difficult to make new people
understand that before you start changing a system, you
had better understand the one which is already in place. How
someone can think that they can improve a system they do
not yet understand is simply beyond my comprehension.
One area where I see a lot of ``New Gurus'' is in
connection with firewall technology. I have seen people who
have never worked with firewalls before starting to make
authoritative statements about what can and cannot be done,
without understanding the implications of their statements.
It seems to be another human trend that when a new, hot
topic appears, everybody wants to be able to speak about it
with authority, even if they do not fully understand all
the issues.
Another area where a lot of people speak loudly without
understanding their topic is Policies and Procedures. More
and more people are referring to this in a way where it
almost becomes a slogan. Unfortunately, many UNIX people
seem to think that implementing a policy only consists of
finding a policy somebody else wrote, and maybe making a few
changes here and there. There are even large FTP archives
on the Internet, with big collections of policies, where
you can go and pick one to your liking. In the real world
it does not work like this. There is no computer
installation anywhere which is without policies and
procedures! They may not be written down anywhere, the
people may not be in agreement about what they are, and they may
not be very helpful in getting real work done, but
nevertheless, they are there. Implementing new explicitly
documented policies which do not work well within the
implicit policies and the corporate culture will not work!
If you want policies which actually work, you need
to determine the existing set of policies and then set out
to document and change whatever is there so it becomes
something useful. Alternatively, if you have a sufficient level of power
(the CEO level) you can just lay down the law, if you are
willing to enforce it completely (do this or you are
fired). However, it might not be too good for
morale. Implementing policies is a slow and painstaking
process, but if it is done right, it will pay off in the
end.
Security
Firewalls are just one small aspect of
security, however interesting they may be. There are many
other aspects to security, and many ways to do it well.
However, it is important to monitor the security of systems
and networks. One of the jokes in the firewall community
is that the reason many sites think they have never been
broken into is that they have no means in place to
detect when it takes place. Unfortunately, this does not
only apply to firewalls; it often applies to almost any
computer security issue at most of the sites I have been to.
It does not help either that security is often the
inverse of convenience. When a security feature is
installed, it always means some level of inconvenience to
the user. This will be the case when a user is required to
use non-reusable passwords when logging in over the
network; when the backup is stored in a vault for fire and
theft protection; and when implicit trust between machines
has been removed. This is why many people are still getting
caught using old-style passwords and having their accounts
compromised by crackers.
This is why the World Trade Center bombing left
some companies not only without working computers, but also
without any recent backups to let new machines take over.
And this is why Tsutomu Shimomura was enabling trust
between his internal machines, thereby enabling Kevin
Mitnick to compromise them.
And in the End ...
Let me finish with one of the
funniest system administration tales I ever remember
hearing. This one comes from Steve Simmons, another
old-time UNIX system administrator turned consultant. One of
his clients had problems with their DEC line printer. DEC
hardware maintenance had been called time and again, in
order to fix the printer, but each time it broke down again
shortly after the maintenance people had left. Finally, the
lead sysadmin got so fed up with the problem, that he
decided to do something to really get the attention of the
maintenance people.
He removed the front cover of the printer, took it down
to his car and drove to the local shooting range, where he
put several bullets through the printer cover. Afterwards
he drove back, mounted the cover back on the printer, and
then called DEC hardware maintenance, telling them that the
printer was down again.