Summary: IPI croaking?

From: John A. Johnston (johnj@welchlab.welch.jhu.edu)
Date: Thu Jan 23 1992 - 21:16:32 CST


The votes are in on our IPI woes of about a week ago.

There were two classes of system messages. The first related to the
controller and both disks:

  Jan 15 09:31:17 welchlab vmunix: idc1: ctlr message: 'panic: user_int '
  Jan 15 09:31:17 welchlab vmunix: idc1: ctlr message: 'Did panic dump to drive 0 '
  Jan 15 09:31:18 welchlab vmunix: ipi 100: missing interrupt. refnum 32f
  [ ... ]
  Jan 15 09:31:18 welchlab vmunix: is1: resetting slave
  Jan 15 09:31:18 welchlab vmunix: idc1: ctlr message: 'FW revision date = 4/18/91 , level = 50 '
  Jan 15 09:31:18 welchlab vmunix: idc1: Recovery complete.

The second related to one disk (on the above controller) having read problems:

  Jan 21 09:03:47 welchlab vmunix: id011a: block 10300 (10300 abs): read: Conditional Success. Data Retry Performed.
  Jan 21 09:03:57 welchlab vmunix: id011a: block 29994 (29994 abs): read: Conditional Success. Data Retry Performed.
  Jan 21 09:04:05 welchlab vmunix: id011a: block 42262 (42262 abs): read: Conditional Success. Data Retry Performed.
  
The first type never repeated itself. No hardware was changed, and the
diagnostics (sundiag), run overnight, came up with no errors.

Type two is still there. The replies from sun-managers varied: some had
these errors and saw the drive fail shortly afterward; others had less
serious problems that were tied to hardware revisions; some saw the
problems grow (to eventual data loss); and several found that
reformatting cured the messages.

On page 104 of the Special Notes section of the 4.1.1 release manual
this is recorded as a known "feature" under heavy disk activity.

For now we are staying with the status quo. We'll reformat the drive
but keep it in service. In the meantime, we're keeping a close eye on
the disk and controller.
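
One way to keep that eye on the disk, given the "several per hour"
rule of thumb quoted in one of the replies below, is to tally the
"Conditional Success" messages per disk per hour. What follows is only
a minimal sketch in Python, assuming syslog lines in exactly the format
shown above. The function names (retries_per_hour, busy_hours) and the
default threshold of 3 are hypothetical choices for illustration, not
anything Sun documents.

```python
import re
from collections import Counter

# Matches lines shaped like the "Conditional Success" examples above.
# Syslog timestamps carry no year, so counts are keyed on
# (disk, month, day, hour).
RETRY_RE = re.compile(
    r"^\s*(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+vmunix:\s+(?P<disk>id\w+):\s+block\s+\d+\s+\(\d+\s+abs\):\s+"
    r"read:\s+Conditional Success"
)


def retries_per_hour(lines):
    """Count data-retry messages per (disk, month, day, hour)."""
    counts = Counter()
    for line in lines:
        m = RETRY_RE.match(line)
        if m:
            key = (m["disk"], m["mon"], int(m["day"]), int(m["hour"]))
            counts[key] += 1
    return counts


def busy_hours(counts, threshold=3):
    """Return the hours with at least `threshold` retries (assumed cutoff)."""
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = [
        "Jan 21 09:03:47 welchlab vmunix: id011a: block 10300 (10300 abs): "
        "read: Conditional Success. Data Retry Performed.",
        "Jan 21 09:03:57 welchlab vmunix: id011a: block 29994 (29994 abs): "
        "read: Conditional Success. Data Retry Performed.",
        "Jan 21 09:04:05 welchlab vmunix: id011a: block 42262 (42262 abs): "
        "read: Conditional Success. Data Retry Performed.",
    ]
    for key, n in sorted(busy_hours(retries_per_hour(sample)).items()):
        print(key, n)
```

In practice the lines would come from the system log file rather than
a hard-coded sample, and the threshold should be whatever rate you
consider worrying for your own drives.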

Below are the replies from sun-managers. Many thanks to all for sharing
their experience.

>From: speicher@mwunix.mitre.org
>
>I experienced the same thing 1 day before I lost the disk.
>I hope you have good backups!

--

>From: blc@sol.med.ge.com (Brett Chapman x7-4391)
>
> I have seen these errors on my IPI drives. When I asked Sun, they
>said "Don't worry unless you get several per hour." Finally, one of my drives
>did that, and eventually got to the point where users were losing data and
>having programs crash.
>
> I had to reformat that drive. The errors have not returned on that
>drive, but do appear on all my other drives. From what I have been able to
>gather, it appears that the format that Sun puts on the drives is not what
>it should be when coming from the factory. I suspect that the format used
>was done while the drives were margined.
>
> We had, foolishly, not reformatted the drives when they arrived. I
>will probably need to format all my drives in the future.
>
> I suggest you reformat your drive asap, before it becomes a real
>problem.

--

>From: Tim Gibbs <gibbs@src.bae.co.uk>
>
>> Jan 15 03:56:14 welchlab vmunix: id011d: block 374368 (1478908 abs): read: Conditional Success. Data Retry Performed.
>
>I've seen exactly this happen 4 times on separate IPIs on 1 of my 4/490s.
>
>Every time, the disk has completely failed within 48 hours of the start of the
>croaking. Once a disk started complaining, I backed it up with a full
>level 0 to grab the latest data. The dumps have always been OK, i.e. the data
>has been fine.
>There was a bad batch of IPIs produced about a year and a half ago; we had 4
>of those, maybe you have got one as well. The Sun engineers say it has
>something to do with the drive electronics.

--

>From: "Lawrence R. Rogers" <lrr@Princeton.EDU>
>
>Sun told me there was a new rev of the controller that sun says addresses
>these problems. Don't know the rev, though.

--

>From: sitongia@ozzel.hao.ucar.edu (HAO Computer System Managment Group)
>
>In cases in which we've gotten the conditional success message, we've
>reformatted the disk to get rid of them. I've not seen (the more serious
>appearing) messages about missing interrupt.

--

>From: trudel@cs.rutgers.edu
>
>Good question. It's hard to tell.
>
>Have you considered backing up the disk, and then re-formatting it?
>You could also back it up, run the format program, select the
>analyze sub-function "refresh".
>
>If you can spend time on this, I'd give one of those tasks a try
>before worrying about the disk itself. If, after doing one of them,
>you still get errors, then I'd worry about the disk's integrity.

--

>From: liz@ra.cgd.ucar.EDU (Liz Coolbaugh)
>
>We had the same messages. The README file from 4.1.1 says you can
>ignore them. Don't! I lost a server for most of a week before I
>finally had the drive replaced. I doubt replacing the drive is
>truly necessary. I would recommend reformatting the drive as soon
>as possible, though. If that doesn't work, replace it. What happened
>to me is that the disk started getting slower and slower responding
>to requests on a specific portion of the disk. Eventually, it
>got so slow that both drives on that controller were badly affected.
>The server was almost useless.


