SUMMARY(?): Sol8/Disksuite 4.2.1 Raid 5 Data Loss Postmortem

From: Mike Kiernan <mkiernan_at_onet.pl>
Date: Thu Sep 13 2001 - 07:21:57 EDT
First question I've sent here and not received a
single answer. In my experience this usually
means the question is badly phrased or just
too boring ;). On either count I apologize :)

I still don't understand how/why disksuite
got itself into this state. I guess I'll 
file this in the 'Acts of God' folder.

all the best,

Mike

-- 
Michael Kiernan
Onet.pl S.A.
Krakow, Poland
http://www.onet.pl/


> Last night we lost a 170GB Disksuite RAID-5 volume.
> We were unable to rescue any data and are still
> running a recovery.
> 
> E420R running Solaris 8, Disksuite 4.2.1.
> 
> There are 6 disks in the volume, all on controller 1
> (no comments). At the time, metastat reported one
> disk as 'Maintenance' then another as 'Last Erred'.
> 
> c1t10d0s0               5082     No    Maintenance
> c1t11d0s0               5082     No    Okay
> c1t12d0s0               5082     No    Last Erred
> 
> sd24 = c1t10d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@a,0:a
> sd25 = c1t11d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@b,0:a
> sd26 = c1t12d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@c,0:a
> 
> The first error (in the messages file) does indeed
> appear to have occurred on the disk that was marked
> Maintenance, and the second error on the Last Erred
> disk. The timestamp for the messages is the same.
> An error appears a few secs later for the 3rd disk
> shown above.
> 
> The problem is that the disk that had actually failed
> was the one that was marked 'Last Erred' (ie t12) and
> not the one that was marked 'Maintenance' (t10).
> On replacing the physically bad disk t12 it was thus
> not possible to add t10 back [there is no way to remove
> the newly added empty disk from the volume]. metareplacing
> t10 would have caused further data corruption.
> 
> I suspect that even if we had managed to recover the
> volume there was data corruption [look at the vxfs
> errors below].
> 
> What I want to know is:
> 
> a) Why was the wrong disk marked bad.
> b) Why is there data corruption in these 'last erred' scenarios?
> c) Is there any way to recover from this situation.
> d) bonus question: where does disksuite store raid5 vol
>    status information [the /etc/md files?]
> 
> Will Summarize.
> 
> Regards,
> Mike Kiernan
> 
> [ more data/messages below ]
> 
> 533 $>uname -a
> SunOS random 5.8 Generic_108528-06 sun4u sparc
> 
> 532 $>pkginfo -l SUNWmdu
>    PKGINST:  SUNWmdu
>       NAME:  Solstice DiskSuite Commands
>   CATEGORY:  system
>       ARCH:  sparc
>    VERSION:  4.2.1,REV=1999.11.04.18.29
>    BASEDIR:  /
>     VENDOR:  Sun Microsystems, Inc.
>       DESC:  Solstice DiskSuite Commands
>     PSTAMP:  11/04/99-18:31:00
>   INSTDATE:  Apr 04 2001 10:10
>     VSTOCK:  258-6252-11
>    HOTLINE:  Please contact your local service provider
>     STATUS:  completely installed
>      FILES:     57 installed pathnames
>                 11 shared pathnames
>                 12 directories
>                 20 executables
>              11230 blocks used (approx)
> 
> Did not have the Disksuite 4.2.1 product patch 108693-07,
> but this patch contains nothing relevant.
> 
> 534 $>metastat -p | grep d20
> d20 -r c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t10d0s0 c1t11d0s0 c1t12d0s0 -k -i
> 32b
> 
> 535 $>metastat
> d20: RAID
>     State: Needs Maintenance
>     Invoke: metareplace -f d20 c1t10d0s0 <new device>
>     Interlace: 128 blocks
>     Size: 358355376 blocks
> Original device:
>     Size: 358356480 blocks
>         Device              Start Block  Dbase State        Hot Spare
>         c1t2d0s0                5082     No    Okay
>         c1t3d0s0                5082     No    Okay
>         c1t4d0s0                5082     No    Okay
>         c1t10d0s0               5082     No    Maintenance
>         c1t11d0s0               5082     No    Okay
>         c1t12d0s0               5082     No    Last Erred
> 
> the disks are:
> sd24 = c1t10d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@a,0:a
> sd25 = c1t11d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@b,0:a
> sd26 = c1t12d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@c,0:a
> 
> /var/adm/messages:
> 
> Sep 11 14:57:51 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@a,0 (sd24):
> Sep 11 14:57:51 random         SCSI transport failed: reason
> 'incomplete': retrying command
> Sep 11 14:57:51 random
> Sep 11 14:57:51 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@c,0 (sd26):
> Sep 11 14:57:51 random         SCSI transport failed: reason
> 'incomplete': retrying command
> Sep 11 14:57:51 random
> Sep 11 14:57:52 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@b,0 (sd25):
> Sep 11 14:57:52 random         SCSI transport failed: reason
> 'incomplete': retrying command
> Sep 11 14:57:52 random
> <-- snip
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         Cmd (0x3f094030) dump for Target 12 Lun
> 0:
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random                 cdb=[ 0x28 0x0 0x3 0x9f 0xd9 0xe6
> 0x0 0x0 0x4 0x0 ]
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         pkt_flags=0x4000 pkt_statistics=0x61
> pkt_state=0x7
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         pkt_scbp=0x0 cmd_flags=0x8e1
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         Cmd (0x3f095b30) dump for Target 12 Lun
> 0:
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random                 cdb=[ 0x28 0x0 0x0 0x7b 0x1 0x4a
> 0x0 0x0 0x4 0x0 ]
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         pkt_flags=0x4000 pkt_statistics=0x61
> pkt_state=0x7
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         pkt_scbp=0x0 cmd_flags=0x8e1
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         Cmd (0x3f36f6c0) dump for Target 12 Lun
> 0:
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random                 cdb=[ 0x28 0x0 0x0 0xdb 0xd9 0x4e
> 0x0 0x0 0x4 0x0 ]
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         pkt_flags=0x4000 pkt_statistics=0x61
> pkt_state=0x7
> Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1
> (glm1):
> Sep 11 14:58:58 random         pkt_scbp=0x0 cmd_flags=0x8e1
> Sep 11 14:58:58 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1 (glm1):
> Sep 11 14:58:58 random         Disconnected tagged cmd(s) (3) timeout
> for Target 12.0
> Sep 11 14:58:58 random glm: [ID 401478 kern.warning] WARNING:
> ID[SUNWpd.glm.cmd_timeout.6018]
> Sep 11 14:58:58 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@c,0 (sd26):
> Sep 11 14:58:58 random         SCSI transport failed: reason 'reset':
> retrying command
> Sep 11 14:58:58 random
> Sep 11 14:58:58 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@c,0 (sd26):
> Sep 11 14:58:58 random         SCSI transport failed: reason 'timeout':
> retrying command
> Sep 11 14:58:58 random
> Sep 11 14:59:03 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@a,0 (sd24):
> Sep 11 14:59:03 random         Error for Command:
> read(10)                Error Level: Fatal
> Sep 11 14:59:03 random scsi: [ID 107833 kern.notice]   Requested Block:
> 57638178                  Error Block: 57638178
> Sep 11 14:59:03 random scsi: [ID 107833 kern.notice]   Vendor:
> SEAGATE                            Serial Number: 3DE047KQ
> Sep 11 14:59:03 random scsi: [ID 107833 kern.notice]   Sense Key: Not
> Ready
> Sep 11 14:59:03 random scsi: [ID 107833 kern.notice]   ASC: 0x4 (<vendor
> unique code 0x4>), ASCQ: 0x1, FRU: 0x2
> Sep 11 14:59:03 random md_raid: [ID 112651 kern.warning] WARNING: md
> d20: read error on /dev/dsk/c1t10d0s0
> Sep 11 14:59:04 random proftpd[26721]: [ID 567783 daemon.notice] random
> (dial521.walbrzych.dialog.net.pl[217.30.158.13]) - FSep 11 14:59:05
> random proftpd[26726]: [ID 567783 daemon.notice] random
> (dial521.walbrzych.dialog.net.pl[217.30.158.13]) - FSep 11 14:59:08
> random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@b,0 (sd25):
> Sep 11 14:59:08 random         Error for Command:
> read(10)                Error Level: Fatal
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   Requested Block:
> 32651478                  Error Block: 32651478
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   Vendor:
> SEAGATE                            Serial Number: 3DE046S4
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   Sense Key: Not
> Ready
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   ASC: 0x4 (<vendor
> unique code 0x4>), ASCQ: 0x1, FRU: 0x2
> Sep 11 14:59:08 random md_raid: [ID 371651 kern.warning] WARNING: md
> d20: write error on /dev/dsk/c1t11d0s0
> Sep 11 14:59:08 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@a,0 (sd24):
> Sep 11 14:59:08 random         Error for Command:
> read(10)                Error Level: Fatal
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   Requested Block:
> 47632810                  Error Block: 47632810
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   Vendor:
> SEAGATE                            Serial Number: 3DE047KQ
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   Sense Key: Not
> Ready
> Sep 11 14:59:08 random scsi: [ID 107833 kern.notice]   ASC: 0x4 (<vendor
> unique code 0x4>), ASCQ: 0x1, FRU: 0x2
> Sep 11 14:59:08 random md_raid: [ID 112651 kern.warning] WARNING: md
> d20: read error on /dev/dsk/c1t10d0s0
> Sep 11 14:59:11 random proftpd[395]: [ID 567783 daemon.warning] random -
> MaxInstances (50) reached, new connection denied.
> Sep 11 14:59:11 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@c,0 (sd26):
> Sep 11 14:59:11 random         Error for Command:
> read(10)                Error Level: Fatal
> Sep 11 14:59:11 random scsi: [ID 107833 kern.notice]   Requested Block:
> 8441182                   Error Block: 8441182
> Sep 11 14:59:11 random scsi: [ID 107833 kern.notice]   Vendor:
> SEAGATE                            Serial Number: 3DE046W4
> Sep 11 14:59:11 random scsi: [ID 107833 kern.notice]   Sense Key: Not
> Ready
> Sep 11 14:59:11 random scsi: [ID 107833 kern.notice]   ASC: 0x4 (<vendor
> unique code 0x4>), ASCQ: 0x1, FRU: 0x2
> Sep 11 14:59:11 random md_raid: [ID 112651 kern.warning] WARNING: md
> d20: read error on /dev/dsk/c1t12d0s0
> Sep 11 14:59:13 random proftpd[395]: [ID 567783 daemon.warning] random -
> MaxInstances (50) reached, new connection denied.
> Sep 11 14:59:13 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@b,0 (sd25):
> Sep 11 14:59:13 random         Error for Command:
> read(10)                Error Level: Fatal
> Sep 11 14:59:13 random scsi: [ID 107833 kern.notice]   Requested Block:
> 41253546                  Error Block: 41253546
> Sep 11 14:59:13 random scsi: [ID 107833 kern.notice]   Vendor:
> SEAGATE                            Serial Number: 3DE046S4
> Sep 11 14:59:13 random scsi: [ID 107833 kern.notice]   Sense Key: Not
> Ready
> Sep 11 14:59:13 random scsi: [ID 107833 kern.notice]   ASC: 0x4 (<vendor
> unique code 0x4>), ASCQ: 0x1, FRU: 0x2
> Sep 11 14:59:13 random scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3,1/sd@a,0 (sd24):
> Sep 11 14:59:13 random         Error for Command:
> read(10)                Error Level: Fatal
> 
> <-- snip
> Sep 11 15:01:38 friko02 vxfs: [ID 710715 kern.warning] WARNING: msgcnt 1
> vxfs: mesg 037: vx_metaioerr - /dev/md/dsk/d20 file system meta data
> write error
> Sep 11 15:01:38 friko02 vxfs: [ID 486312 kern.warning] WARNING: msgcnt 2
> vxfs: mesg 025: vx_wsuper - /dev/md/dsk/d20 file system super-block
> update failed
> Sep 11 15:01:38 friko02 vxfs: [ID 863296 kern.warning] WARNING: msgcnt 3
> vxfs: mesg 017: vx_delbuf_flush - /fw1 file system inode 3280982 marked
> bad
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000000     41ed        5      258      258
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000010        0     2000 3b9e0a4d    8ccb6
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000020 3b9e0a4d    80966 3b9e0a4d    80966
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000030    10000        0   676a2f        0
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000040        1        1        0       92
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000050        0        0        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000060  141b240        1        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000070        0        0        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000080        0        0        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x00000090        0        0        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset
> 0x000000a0        0        0        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 214594 kern.info] vxfs msgcnt 3 offset
> 0x000000b0        0        0
> Sep 11 15:01:38 friko02 vxfs: [ID 486440 kern.warning] WARNING: msgcnt 4
> vxfs: mesg 025: vx_wsuper - /dev/md/dsk/d20 file systSep 11 15:01:38
> friko02 vxfs: [ID 217438 kern.warning] WARNING: msgcnt 5 vxfs: mesg 015:
> vx_ilisterr - /fw1 file system can't
> Sep 11 15:01:38 friko02 vxfs: [ID 962359 kern.warning] WARNING: msgcnt 6
> vxfs: mesg 031: vx_disable - /dev/md/dsk/d20 file sys
> 
> <---snip
> Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344
> offset 0x00000000    80100 17481e04 191404f0 1c541a4c
> Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344
> offset 0x00000010  9bc1248 1b280000 17341fb4     16c4
> Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344
> offset 0x00000020      e0c 165418f4  c8003b0 1cb01018
> Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344
> offset 0x00000030  ff41ce4     1098  d900000 1a281e14
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Thu Sep 13 06:19:25 2001

This archive was generated by hypermail 2.1.8 : Thu Mar 03 2016 - 06:42:25 EST