SUMMARY: A1000 problems part 1 & 2

From: Jeff Welsch <jeff.welsch_at_enviz.com> Date: Tue Nov 27 2001 - 16:58:42 EST

Original messages below.

My problem has been solved by an unbelievable effort on the part of Sun.
The root of the problem is that, either during transport or when the
A1000 was powered up after the move, the controller blew and spewed some
data onto the disks, corrupting the RDAC information.

I spent many hours on the phone with Sun and every attempt to revive the
LUN failed.  In the end I met with three very competent SSEs and they
took the A1000 to their offices.  There they rounded up a team of
engineers and were able to modify the firmware in the A1000 so that
recreating the LUNs did not overwrite the data with 0s.  While I lost my
RAID-0 LUN1, Sun was able to recreate the entire RAID-5 LUN0 and I lost
no data!  

While my data was visible to Solaris, RM6.22 did not see the array.  To
fix this problem I followed this solution provided by Sun:

To fix this problem you will need to remove the rdac logical devices
(c#t#d#) as seen by Solaris and Raid Manager in order to recreate the
logical device controller #s.  This procedure can also be used to sync
controller #s between format and lad if the c#s don't match.

To sync up the c#s in lad and format with RM6.2.2, and to set the c#s
back to an acceptable value:

cd /dev/dsk
rm c#'s for A1000 devices

cd /dev/rdsk
rm c#'s for A1000 devices

cd /dev/osa/dev/dsk
rm c#'s for A1000 devices

cd /dev/osa/dev/rdsk
rm c#'s for A1000 devices

(Run the following rdac_disks to remove all rdac devices from format)

/usr/lib/osa/bin/rdac_disks

(Run the following hot_add to recreate proper rdac device controller #s
for all of the following: format, lad, /dev/(r)dsk /dev/osa/dev/(r)dsk
instantly with no need to reboot or boot -r)

/usr/lib/osa/bin/hot_add

Note: It is also possible that after a "boot -r", the rdac devices MIGHT
NOT show up in format at all. Simply follow the same guidelines as
above, to recreate the rdac devices and sync up Solaris with Raid
Manager.

While tempting, do not try to run devfsadm to create links in place of
hot_add, because it will create a Solaris path such as
/sbus@3,0/QLGC,isp@3... as opposed to the correct
/pseudo/rdnexus@2,0.. path that is required for the device to be
properly addressed.
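
For anyone who wants to script the link-removal part of the procedure
above, here is a rough sketch, not something Sun supplied.  The function
name remove_a1000_links, the DRY_RUN switch, the optional root prefix
and the example controller c1 are all my own assumptions for
illustration; substitute the c# that format actually shows for your
A1000.  It only prints what it would remove unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: remove the stale device links for ONE controller number
# from the four directories named in the procedure above.
# Assumptions (not from Sun): the function name, the DRY_RUN switch and
# the optional root prefix, which lets you try it on a scratch tree.
remove_a1000_links() {
    ctrl="$1"        # controller name as shown by format, e.g. c1 (hypothetical)
    root="${2:-}"    # optional prefix; empty means the real /dev tree
    for dir in /dev/dsk /dev/rdsk /dev/osa/dev/dsk /dev/osa/dev/rdsk; do
        # "$ctrl"t* requires a "t" right after the number, so c1 does
        # not match c10t0d0s0.
        for link in "$root$dir/$ctrl"t*d*s*; do
            # Skip the unexpanded pattern when nothing matches;
            # -h also catches dangling symlinks.
            [ -e "$link" ] || [ -h "$link" ] || continue
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "would remove $link"
            else
                rm -f "$link"
            fi
        done
    done
}

# Example use (c1 is a placeholder), followed by the two commands from
# the procedure:
#   remove_a1000_links c1              # dry run, prints only
#   DRY_RUN=0 remove_a1000_links c1    # actually remove the links
#   /usr/lib/osa/bin/rdac_disks
#   /usr/lib/osa/bin/hot_add
```

Running the dry run first and checking the list against format seems
wise, since removing links for the wrong controller would take the
internal disks' entries with it.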

Pravin Nair sent me a similar procedure which requires a 'boot -r' to
correct.  The hot_add command is a great way to avoid rebooting.

I would like to thank Pravin Nair, Jed Dobson, Christian Nicca, Tom
Chipman, Tony Walsh, and Patricio Mora for their suggestions.

---------------------------
Part 1:

Gurus,

I have been charged with moving cages within our colo provider this
weekend.  I am having problems bringing up my A1000 after moving it.

The A1000 is a 12 bay model with 10 36.4GB disks and 2 18GB disks
installed.  It is connected via SCSI to an E220R.  When the device is
powered up now, the LEDs all light up correctly (all 12 are green), but
then four of them (0-3, 0-4, 1-3, 1-4) switch to amber and the service
LED turns on.  Coincidentally (I really hope), those four disks were
installed in the array Thursday morning, and the array was rebuilt
shortly after.  I was able to copy my data from my backup to the new
array and had been running for more than a day prior to the move.  The
array contains a 10 disk RAID-5 array and a 2 disk RAID-0 array.

arraymon claims that no RAID modules were found, and RaidManager 6.22
reports that the controller has failed.  I am assuming that this is
referring to the RAID module, and it would make sense because
probe-scsi-all returns only the two internal disks and DVD drive.

Can anyone shed some light on the subject?  Perhaps an explanation for
why those four LEDs are on (is it a POST message?  I am investigating
that now).  I really need to get either of the RAID devices running.
This is rather urgent because of the nature of the problem.

Thanks, and I will summarize.

Part 2:

Gurus,

It turns out that my A1000 controller had failed.  Sun replaced the
controller and Solaris 8 is seeing the array.  The new problem, and a
very scary one, is that my LUNs are not configured/identified correctly.
I had two LUNs in the array.  LUN 0 is a RAID5 device with 9 disks.
LUN1 is a RAID0 device with two disks.  Now, RaidManager is reporting
that LUN0 is dead, and does not even see LUN1.  The two disks in LUN1
are recognized as unassigned and optimal.  Is there a way to force
RaidManager to reload the configuration of LUN1?  Rebuilding will
overwrite the data, correct?  

Ahh!  Murphy strikes again.  I would appreciate any help possible.

I will combine the two messages into one summary.  Thanks,