SUMMARY: "frequent" crashes without a trace

From: Jarkko Airaksinen <JAiraksinen_at_zed.com>
Date: Thu Feb 08 2007 - 07:46:55 EST

Hello,

Thanks to everyone who replied.

I didn't find any trace of why the server crashed. The POR in the power log is
just the "power on reset". As the console wasn't connected at the time of the
crash, there's nothing there either. Also, savecore didn't work.

I did get some good suggestions, though, that you might find useful. Thanks to
Bertrand, Jon, Mehran, Chuck, Brad & JV for the following replies:


Console, log & maintenance tool stuff:
----
POR: Power On Request

Check the front panel switch position; it must be in the lock position.

Also use /opt/FJSVmadm/sbin/hrdconf -l and /opt/FJSVhwr/sbin/fjprtdiag -v.

The XSCF shell may give more information (telnet to the SCF IP address on
port 8010).
----
Check that your PC does not go into sleep mode; this could cause a break on the
serial line and may reset the PrimePower.

Putting the switch in the lock position makes the machine ignore the break.
----
Yes, that is indeed true, i.e., a PC console will issue a BREAK signal on
resume. I've also seen assertions that it sends a BREAK over the serial line
when it goes to sleep, though that doesn't seem possible.

That seems like the most likely scenario, since a BREAK-initiated reset won't
trigger a core dump. If you have the space and time, you might want to test this:
connect PC to non-critical system, set it to go to sleep in 60 seconds, and
watch the machine. That way you can (a) prove the assertion, and (b) determine
whether the BREAK is sent when the PC goes to sleep and/or whether it sends a
BREAK when it wakes up.

This is good information, please do summarize. I just got back from our DR
site, where all machines are set to ignore the BREAK for this reason--and I
needed to send a BREAK !!! :-)
----
Not familiar with Fujitsu hardware, but on Sun hardware, POR indicates a
"power on reset": either the system crashed or it was rebooted. I was hoping to
see FATAL there, which generally indicates a fatal hardware problem that
happened so fast that the system couldn't log anything - if it's any
consolation, it *does* output stuff to the console when that happens.

You might want to consider hooking something up to catch the serial console
output in case it happens again.
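
A rough sketch of what such a catcher could look like, assuming a spare machine
whose serial port is wired to the console and already configured (speed, parity
and so on) outside the script. The function name log_console and the idea of
stamping every line in UTC are my own assumptions, not something from the
replies. It only reads from the port and never writes to it.

```python
import io
import time


def log_console(source, sink, clock=time.time):
    """Copy lines from source to sink, prefixing each with a UTC timestamp.

    source and sink are any file-like objects; log_console is a hypothetical
    helper name. Returns the number of lines copied.
    """
    count = 0
    for line in source:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(clock()))
        sink.write("%s %s\n" % (stamp, line.rstrip("\r\n")))
        # Flush after every line so the last messages before a crash
        # are already on disk.
        sink.flush()
        count += 1
    return count


# Small self-check with in-memory streams instead of a real serial port.
demo_out = io.StringIO()
log_console(io.StringIO("panic: test\r\nrebooting...\n"), demo_out,
            clock=lambda: 0)
print(demo_out.getvalue(), end="")
```

For real use you would pass an opened serial device as source and a log file
opened in append mode as sink; the device path depends entirely on the logging
host. The timestamps make it easy to line the console output up with the
entries in the power log.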
----

Crash dump stuff:
----
There might be something on the console. There might also be a dump; use isda
(it's on SunSolve) to analyze it. Good luck.

Mike Salehi
----
Boot from CD-ROM and run SUNWvts for a couple of days.

If it doesn't crash, it's your OS image.
If it does crash, it's your HW.

JV711
----
Unless you'd already turned on the crash dump facility, there's no evidence
besides what was written to the logs on the filesystems--and those may have
been clobbered by the fsck at reboot. Unless you're running VxFS, that is.

To enable crash dumps, see http://slacksite.com/solaris/crashdump.html. In
general, look at /etc/init.d/savecore. You can quickly verify whether crash
dumps are enabled by running dumpadm(1M).
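
If you want to check this on many hosts, something like the sketch below might
do. It assumes dumpadm, run without arguments, prints "Key: value" lines that
include "Savecore enabled". The sample text is illustrative only (the device
and directory are made up), and parse_dumpadm and savecore_enabled are
hypothetical helper names.

```python
def parse_dumpadm(text):
    """Turn the 'Key: value' lines of dumpadm-style output into a dict."""
    settings = {}
    for line in text.splitlines():
        # Split on the first colon only; values may contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def savecore_enabled(text):
    """Return True if the output reports 'Savecore enabled: yes'."""
    return parse_dumpadm(text).get("Savecore enabled", "").lower() == "yes"


# Illustrative sample, not captured from a real system.
SAMPLE = """\
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/myhost
  Savecore enabled: yes
"""

print(savecore_enabled(SAMPLE))
```

Feed it the captured output of dumpadm from each host; a host that reports
anything other than "yes" will not save a dump at the next panic.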

Check for loose power cables, and see about replacing the power supply. If
this machine has more than one power supply, e.g., for failover, this is a bad
sign. Also check where this machine is connected to AC power. Even though
other machines don't crash like this, this machine may not be connected to the
same power source.
----

Power failure stuff:
----
Jarkko,
I have had 2 systems that had these symptoms and the fix was a new power
supply.

chuck

-----Original Message-----
From: sunmanagers-bounces@sunmanagers.org
[mailto:sunmanagers-bounces@sunmanagers.org] On Behalf Of Jarkko Airaksinen
Sent: Wednesday, February 07, 2007 15:05
To: sunmanagers@sunmanagers.org
Subject: "frequent" crashes without a trace

Hello to all Gurus out there,



One of our Fujitsu-Siemens PP450's running sol8 just rebooted. It didn't
leave anything in the messages files: the last entry before the crash is
just a normal ftp login message and then 20 minutes later the normal
boot messages start. This has happened twice before as well, last time
105d ago.



I don't think we had a power outage as there are other servers connected
to the same power rails; more servers would have at least shown "psu
failures" but there's nothing there.



In the madmin in the power log at the time of the crash there are two
entries:

  1. Feb  7 14:37:41 2007 CET Reset-Release                   [Unlock]
                              Nothing (Detail=00,00,00,00)
  2. Feb  7 14:37:35 2007 CET POR                             [Unlock]



How should I interpret those messages?



Any ideas on how to start investigating what caused the server to crash
out of the blue like that?



Thanks to everyone,

Jarkko



_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Thu Feb 8 07:47:54 2007
