SUMMARY: monitoring condition of a disk array

From: Peter Schauss x 2014 (ps4330@gdatabr.mmac2.jccbi.gov)
Date: Mon Feb 26 1996 - 07:55:47 CST


My original question was:

>>What procedures should I be using to monitor my SPARCstorage Array
>>to ensure that I do not have any failures "sneaking up on me"?

>>I have an array with 27 2.1 GB drives, configured as RAID 5 volumes,
>>and I am using the vxva software (from VERITAS) which Sun provides
>>with the storage array.

Thanks to people who replied:

margarita suarez <marg@watsun.cc.columbia.edu>
jnapier@soemail.ucsd.edu (Jim Napier)

The summary is that:

1. You do need to watch this configuration carefully.
2. The only source of warnings is the /var/adm/messages file.

Does anyone have a script which watches that file for disk warnings?
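
A minimal sketch of such a script, in Python, is given here as a
starting point only. None of the replies included one. The keyword list
and the function names are assumptions made for illustration; the actual
wording of driver and volume manager messages varies by release, so the
pattern would need to be adjusted to whatever your system really logs.

```python
import re

# Assumed keywords only: tune this pattern to the messages your own
# drivers and volume manager actually write.
SUSPECT = re.compile(r"(error|fail|offline|retry)", re.IGNORECASE)


def suspicious_lines(lines, pattern=SUSPECT):
    """Return the log lines that match the pattern, without newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_file(path="/var/adm/messages", pattern=SUSPECT):
    """Read a syslog-style file and return the matching lines."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle, pattern)
```

Calling scan_file() from a periodic job and mailing any non-empty result
to the administrator would cover the basic case. It rescans the whole
file on every run, so a real script would probably also remember how far
it had read.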

>From marg@watsun.cc.columbia.edu Fri Feb 23 10:39 EST 1996
Date: Fri, 23 Feb 96 10:45:38 EST
From: margarita suarez <marg@watsun.cc.columbia.edu>
To: ps4330@okc01.rb.jccbi.gov (peter schauss x 2014)
Cc: unixsys@watsun.cc.columbia.edu
Subject: re: monitoring condition of a disk array

we have several open issues right now with sun, because we have lost
entire raid5 volumes under veritas volume manager several times now (6x
in the last year, 3x in the past week!). we have lost data
catastrophically whether or not there was a disk failure (and we have
never had more than a single disk fail at one time, though we have seen
many, many disks fail).

for one thing, hot sparing doesn't work correctly yet. if you have an
e-mailable pager, you might want to find the script that reports that a
disk has gone bad, and add a line to page you. several times we have
seen volume manager mark a disk bad, but fail to employ the hot spare.
this is an open bug which is supposed to be fixed in the next release
of volume manager.
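
[The paging step described above might look roughly like the following
Python sketch. It is not the volume manager's own script. The function
names, the subject wording, and the use of a local SMTP server are all
assumptions for illustration, and the pager address is whatever your
paging service provides.]

```python
import smtplib
from email.message import EmailMessage


def build_alert(hits, to_addr, from_addr, max_lines=5):
    """Build a short mail message listing the first few suspect lines."""
    msg = EmailMessage()
    msg["Subject"] = f"disk array warning: {len(hits)} suspect log line(s)"
    msg["From"] = from_addr
    msg["To"] = to_addr
    # Pagers usually truncate long messages, so send only a few lines.
    msg.set_content("\n".join(hits[:max_lines]))
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Hand the message to an SMTP server (assumed to run locally)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

[Keeping message construction separate from delivery makes it possible
to check the text of the page without actually sending mail.]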

sun is now telling us to move to solstice disksuite (on-line disksuite)
instead, since they are eventually going to drop support for veritas.

actually, we are considering buying auspex boxes or something else and
throwing these arrays away.

marg

>From jnapier@soemail.ucsd.edu Sun Feb 25 13:36 EST 1996
Date: Sun, 25 Feb 1996 10:43:19 -0800
From: jnapier@soemail.ucsd.edu (Jim Napier)
To: ps4330@okc01.rb.jccbi.gov
Subject: Re: Monitoring condition of a disk array

It sounds like you're talking about hardware failures and I don't know
of a good way to insure against that. If you are beginning to have
problems they will most likely turn up in your /var/adm/messages file.
You may see complaints about reads or writes failing on a particular
disk or file system. This can sometimes be remedied by taking the
disk offline and using format to do a media analysis and repair
and then restoring the disk from tape. I don't believe there are
any tools to keep tabs on the controllers and cpus in an SSA. Those
kinds of things usually fail completely and suddenly when they go.
If this is a mission-critical system, your best bet, I think, is to
invest in a hardware maintenance contract with Sun. This will guarantee
that any downtime can be reduced to as little as half a day. For
a few thousand dollars a year it'll give you a lot of peace of mind.

If you get any other opinions on this I'd love to hear them.

/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/

Jim Napier jnapier@soe.ucsd.edu
Systems Administration (619)534-5212
School of Engineering
UC San Diego

/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/=/

Please reply to the address below and not to the address in the message
header.

Peter Schauss
ps4330@okc01.rb.jccbi.gov
Gull Electronic Systems Division
Parker Hannifin Corporation
Smithtown, NY


