SUMMARY: Replacing failed disks on a Raid 5 Disk array

From: Peter Schauss x 2014 (ps4330@gdatabr.mmac2.jccbi.gov)
Date: Thu May 09 1996 - 08:05:09 CDT


My original post was:

What is the proper procedure for replacing a failed disk in a RAID 5
configuration on a SPARCstorage Array? My volumes consist of 11
subdisks plus a log disk, and I am using the Veritas volume management
software.

I want to simulate a failure in order to test my recovery procedures.
My idea would be to unmount all of the volumes involved, then initialize
one of the disks by creating a large partition on it and running newfs.
Then I would bring it online as though it were a new disk. So my
question is: how do I do this last step?

--------------------------------------------------------------------

I only received one response.

From ottenber@mr.med.ge.com Tue May 7 14:53 EDT 1996
From: "Paul A. Ottenberg 4-6166 MR" <ottenber@mr.med.ge.com>
Date: Tue, 7 May 1996 14:02:10 +0600
To: ps4330@okc01.rb.jccbi.gov
Subject: Replacing failed disks on a Raid 5 Disk array

Peter:

Back up that system before toasting anything....

highly recommend you review: http://www.columbia.edu/~marg/misc/ssa/
before simulating a failure.

paul.

-- 
                               '''  
                              (o o)
--------------------------o00--(_)--00o-------------------------------
Paul A. Ottenberg               |       email : ottenbergp@med.ge.com
EIS Admin Team Leader           |       voice : 414.521.6166
GE Medical Systems              |       fax   : 414.521.6800
PO Box 414; Mail Stop: W832     |
Milwaukee, WI  53201-0414       |
--

--------------------------------------------------------------------

I checked the web site at Columbia and there was some pretty scary stuff there about the Sun/Veritas implementation of RAID 5. Nevertheless, I pushed ahead in the faith that my daily backups would bail me out if I lost everything.

Here is what I did:

1. I selected one physical disk to sacrifice, disk03. Before starting the process, I used vxprint | grep disk03 to list all of the virtual disks which used disk03.
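
For reference, the command (run as root) was just the one below; I redirected the output to a file - the file name is only an example - so that I could compare against later vxprint runs:

    # vxprint | grep disk03 > /var/tmp/disk03.before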

2. I unmounted all of the virtual disks which used disk03.
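
The unmounting itself is just the usual umount command for each file system on the list from step 1 (only /u02 is named in this summary; substitute your own mount points):

    # umount /u02
    (repeat for each file system which uses disk03)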

3. Used format to create a single partition on disk03 (physical device /dev/rdsk/c1t0d2).

4. Used newfs to create a file system on c1t0d2s0, thus wiping out whatever information had been on this disk.
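
The commands for steps 3 and 4 were roughly the following (format is interactive - I used its partition menu to make slice 0 cover the whole disk):

    # format c1t0d2
    # newfs /dev/rdsk/c1t0d2s0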

5. Mounted /u02, one of the virtual disks which uses disk03.

6. At this point vxvm sent me two email messages warning me that a hardware failure had occurred on disk03, listing all of the affected volumes. It also said that no hot spare was found (correct - I did not have any defined) and that "apparently" no data had been lost.

7. At this point vxprint -l vol02 showed the "degraded" flag. (Same for all other virtual disks which use disk03).

8. I followed the instructions in the manual for "Replacing Physical Disks". (I unmounted /u02.)

In the GUI:

Select the view for the disk group.

Basic Ops->Disk Operations->Replace Disks

9. vxprint -l vol02 still showed the degraded flag set.
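
A quick way to check every affected volume at once is a loop along these lines (the volume names are the ones from my configuration; use whatever the vxprint listing in step 1 showed):

    # for v in vol02 vol04 vol07 vol09 vol11 vol13 vol15 vol18 vol20 vol22
    > do echo $v; vxprint -l $v | grep -i degraded
    > done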

10. Called Sun support.

11. Sun Support told me to use vxdiskadm:

Select menu option "remove disk for replacement", specifying disk03.

Select menu option "replace disk" specifying disk03.

12. I mounted all of the disks which used disk03. One by one, the "degraded" flag disappeared from the vxprint -l listing.

While this was going on, vxprint | grep disk03 looked like this:

dm disk03     c1t0d2s2  -         4152640  -  -        -      -
sd disk03-01  vol02-01  ENABLED   419520   0  -        -      -
sd disk03-02  vol04-01  ENABLED   419520   0  -        -      -
sd disk03-03  vol07-01  ENABLED   419520   0  -        -      -
sd disk03-04  vol09-01  ENABLED   419520   0  -        -      -
sd disk03-05  vol11-01  ENABLED   419520   0  -        -      -
sd disk03-06  vol13-01  DETACHED  419520   0  RECOVER  RECOV  -
sd disk03-07  vol15-01  ENABLED   419520   0  -        -      -
sd disk03-08  vol18-01  ENABLED   419520   0  -        -      -
sd disk03-09  vol20-01  ENABLED   419520   0  -        -      -
sd disk03-10  vol22-01  ENABLED   376960   0  RECOVER  RECOV  -

I reran this command periodically so that I could track the recovery process.
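
Rather than retyping the command by hand (which is what I did), a trivial loop will refresh the listing every minute or so:

    # while true
    > do vxprint | grep disk03; sleep 60
    > done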

13. During the recovery process, I got a message:

vxvm:vxvol: ERROR: Subdisk disk03-06 in plex vol13-01 is locked by another utility

When vol13 stayed in the status shown above for a long time, I rebooted. This seemed to clear whatever conflict the above message was talking about, and vol13 finally completed its recovery process.
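
In hindsight, a gentler first step than rebooting might have been to kick the stalled recovery along with vxrecover, something like the command below (again assuming the rootdg disk group) - I have not verified that this clears the lock, so treat it as a suggestion only:

    # vxrecover -g rootdg -b vol13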

As a result of this exercise, my customer and I have considerably more confidence in the disk array as we have configured it, as well as a better understanding of how the whole error detection/correction process works. I hope this was helpful.

Peter Schauss
ps4330@okc01.rb.jccbi.gov
Gull Electronic Systems Division
Parker Hannifin Corporation
Smithtown, NY


