[SUMMARY] Sunfire v880 reboot

From: Bill R. Williams <brwms_at_etsu.edu>
Date: Fri Feb 18 2005 - 15:40:19 EST
For the benefit of others who might run into the same problem, here
are the responses to my post regarding: Sunfire v880 reboot
(My original post from Mon, Feb 14, 2005 is included at the end.)

I should have mentioned in my original post that this system has been
running the same levels of software (Solaris 9), firmware (OBP), and
hardware configuration since August 2004, and such a thing has never
happened before.

Many thanks to those who offered opinions, theories, possibilities,
and suggestions!  My remarks concerning my particular situation are
intertwined in their responses.
 Bill R. Williams               <brw@etsu.edu>
 ------------------------ ETSU Library Systems

From: Peter A. van Gemert
>I have no clue on what went on in your system but could it be a
>faulty UPS?
Possibly, but I don't think the UPS is the culprit.

From: Eric Noriega
>Have you looked for a crash dump under /var/crash ?
There was no crash dump in there.  
(That is the area defined in my 'dumpadm'.)

The following from joe_fletcher gets my vote for most probable cause:
From: "joe_fletcher"
>Usual thing in these situations is a watchdog reset. Tends
>to be nothing in the logs as it's about as hard a reset as
>you can get short of using a hammer. The only place you will
>see anything is on the console so, assuming you have it
>configured, take a look in the RSC buffer logs for whatever
>records remain.

>Cause is generally hardware related. I'd also run psrinfo.
>You might find the thing is now running on an odd number of CPUs. I've
>seen this happen a few times.
My CPUs are all online & functioning.
Also, prtdiag -v indicates everything within tolerances and "OK".
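
For anyone who wants to automate the psrinfo check after an unexplained
reboot, here is a minimal sketch in Python. It assumes plain `psrinfo`
output has one line per processor in the form "<id> <state> since <date>".
The function names and the sample text are hypothetical illustrations, not
output captured from the machine discussed in this thread.

```python
def parse_psrinfo(text):
    """Return {cpu_id: state} parsed from plain `psrinfo` output."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect at least "<id> <state>"; skip anything that does not fit.
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states


def offline_cpus(states):
    """Return the sorted ids of processors whose state is not on-line."""
    return sorted(cpu for cpu, state in states.items() if state != "on-line")


# Illustrative sample only (assumed format, made-up timestamps).
SAMPLE = """\
0       on-line   since 02/14/2005 15:02:11
1       on-line   since 02/14/2005 15:02:13
2       off-line  since 02/14/2005 15:02:15
3       on-line   since 02/14/2005 15:02:17
"""

if __name__ == "__main__":
    bad = offline_cpus(parse_psrinfo(SAMPLE))
    print("CPUs not on-line:", bad if bad else "none")
```

On the host itself, the text could come from something like
subprocess.run(["psrinfo"], capture_output=True, text=True).stdout
instead of the sample string.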

From: "Michael Horton"
>How is your power run?
>3 v880 power supplies into 1 ups?
>(no redundancy)
>3 v880 power supplies into 1 power circuit?
>(no redundancy)
>if your ups has a glitch (and they do), you have a power event.
I am not going to rule this out.

From: "Eric Paul"
>We had a similar issue a few months ago with one of our servers...
>They replaced two CPU modules, and several banks of RAM before the
>problem went away.  Something to be aware of, there is an FCO for
>certain memory modules which were installed on a number of 880s
>(though Sun is not talking about it much...)  I only found out from
>my FE.  You might want to put in a call to tech support and see if
>they can give you the lot numbers and check the RAM out.
>The other thing you might want to do is set up syslog to point to a
>central logging server.  I've found a lot of times when Sun boxes go
>down hard, they don't flush the last logs to disk.  But the central
>server does get the logs and that's given me more information to go

From: Daniel Vega
>obp down rev maybe?

On Mon, Feb 14, 2005 at 06:00:17PM -0500, Bill R. Williams wrote:
> SunOS localhost 5.9 Generic_117171-07 sun4u sparc SUNW,Sun-Fire-880
> This afternoon, this machine just rebooted, and I cannot find the why!  
> Following the reboot, all status lights on the v880 are normal, and
> all disk drives are functioning.
> There is no crash dump, and the only thing I can find in the logs
> which indicates a glitch is in the /var/adm/messages file:  the last
> entry before the "new" boot-up entries is a "line" of ~308 NULL bytes.
> I've run prtdiag and all temperatures, fans, etc. look Ok.
> Things look correct from 'metastat'.
> This unit has 3 power supplies which are plugged to UPS, so it wasn't
> a glitch in power service coming to the machine, and if it's a power
> supply the thing is supposed to be able to continue with two of them
> functioning.  And there's no indication of any problems (prtdiag) with
> any of the three.
> Anybody seen this sorta thing happen?
> (Maybe there's some gremlin in the v880 and/or Solaris 9 that I've
> missed.)
> This sorta thing makes me nervous.
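
The run of NULL bytes in /var/adm/messages described in the original post
can be located programmatically, which may help pin down the time of a hard
reset when nothing else was logged. The following is a minimal sketch in
Python. The function names and the min_len threshold are assumptions chosen
for illustration, and the file is read into memory in one piece, which
assumes a log of modest size.

```python
import re


def find_nul_runs(data, min_len=16):
    """Return (offset, length) for each run of at least min_len NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_file(path, min_len=16):
    """Read a log file as bytes and report its runs of NUL bytes."""
    with open(path, "rb") as fh:
        return find_nul_runs(fh.read(), min_len)


if __name__ == "__main__":
    # Synthetic data standing in for a messages file; not a real log.
    demo = b"last entry before reset\n" + b"\x00" * 308 + b"\nfirst boot entry\n"
    for offset, length in find_nul_runs(demo):
        print("NUL run at byte offset %d, length %d" % (offset, length))
```

The log entry just before a reported offset is the last one that made it to
disk; on a real system, scan_file("/var/adm/messages") would be the
starting point.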
sunmanagers mailing list
Received on Fri Feb 18 15:40:55 2005
