SUMMARY: What happens when a CPU fails?

From: Jonathan Day <jond_at_jond.demon.co.uk>
Date: Mon Sep 22 2003 - 16:57:37 EDT
Thank you to everyone who took the time to respond; very helpful and
informative.


Original Question:

>I have a number of Sun 280R machines that I'm currently installing for a
>client, and I got asked the question about what happens when a CPU fails.
>
>I guess one of the following may happen:-
>
>1) Server Reboots with a single CPU
>2) Server Dies
>3) Server Continues running with a single CPU
>
>The rough configuration of each machine:-
>
>280R with 2x1.2 CPU/ 2GB RAM / 2x73gb Mirrored disks / Dual PSU's
>Solaris 2.8 / Patched to April 03

Additional Option I missed out:

4) Server reboots, runs until the failure occurs again, and again, and
again...


Short Answer:

The general consensus is that any of the above can occur, though the nature
of the failure (consistent or intermittent) has a large part to play in what
happens in real life.

Most types of CPU or CPU cache failure will result in a panic restart. If
the failure is on the primary CPU, it may render the machine useless (1
reported).

If the failure is consistent, the OBP will mark the CPU as failed and restart
the machine with the other good CPU(s).

If the failure is intermittent, the machine may not know the CPU is a problem
and will restart with it running. This will result in further panic restarts
until the problem is resolved.

Intermittent failures look like a pain, so it's worth scanning through the
logs, crash dump analysis, etc. if your machine is restarting intermittently.
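
As a rough illustration of the log scanning suggested above, the Python
sketch below tallies which CPU number is named most often on panic lines.
The regular expression, the sample lines, and the function name are
assumptions made for illustration; real panic messages differ between
platforms and Solaris releases, so adjust the pattern to what your own logs
actually contain.

```python
import re
from collections import Counter

# Assumed pattern: a CPU number written as "cpu1", "cpu 1" or "CPU[1]".
# Real panic messages vary; adjust this to match your own logs.
CPU_RE = re.compile(r"cpu\s*\[?(\d+)\]?", re.IGNORECASE)


def count_cpu_mentions(lines, keyword="panic"):
    """Count CPU numbers named on lines that contain `keyword`."""
    counts = Counter()
    for line in lines:
        if keyword.lower() not in line.lower():
            continue  # only look at lines that mention the keyword
        for cpu_id in CPU_RE.findall(line):
            counts[int(cpu_id)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    sample = [
        "Sep 20 03:01:12 host unix: panic[cpu1]/thread=...: example text",
        "Sep 20 03:25:40 host unix: panic[cpu1]/thread=...: example text",
        "Sep 20 09:00:00 host sshd: unrelated message",
    ]
    for cpu_id, hits in count_cpu_mentions(sample).most_common():
        print(f"CPU {cpu_id}: named in {hits} panic line(s)")
```

To run it against a real system, pass an open file (for example
/var/adm/messages) instead of the sample list. If the same CPU number keeps
turning up, that is a hint rather than proof; one of the reports below ended
with a motherboard replacement.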


Additional Information:

Randomly selecting some of the reports on what happens in detail:-

A CPU error detected by the OS will almost certainly cause a panic and
core dump.  Portions of the kernel may be cached on the CPU, so it gives
up.

After the core dump, the hardware is reset and goes through a small POST
procedure.  If the error is persistent, then the OBP will detect the
failure and will not provide the CPU to the OS for use.  If the error
was transient, then the OBP will not know about the failure, and the
reboot will be similar to the last time.

>>

What I've seen on other servers (of any architecture, not just SPARC) is that
typically a CPU will not "die" outright - it'll partially (and often, only
intermittently) fail and make the machine unstable until you diagnose the
problem and either fix or disable the CPU.

>>

It's variable dependent on the nature of the fault.
The most probable course of events is a crash followed
by an attempt to restart using the one good CPU. The
success of this will be decided by the nature of the initial
failure. The 280R is marginally better in this respect
than say the 240/480/880 series of machines as it
shares some architectural features with the higher end
4800/6800 systems.

>>

The server rebooted about twice a day with both CPUs installed but, in my
case, the CPU was not completely dead, just defective. I had to turn off the
defective CPU manually and, until the CPU was replaced, I was able to run
the machine with only one CPU.

>>

Around here, the server panics, reboots, panics,
reboots, pa....

>>

I had a Netra 20 with 2 CPUs.  The error messages (as I now remember them)
referred to a dcache error, then said something about CPU failure and then
the machine would go back to the 'OK' prompt.  I do not know if that is the
standard behavior on Sun machines when there is a CPU problem on a computer
with multiple CPUs, but that is the experience I had.

>>

I had the misfortune of a CPU failure on a Blade 2000.  The Sun rep said
that the 280R and Blade 2000 use the same motherboard, so I think you would
see the same results.  The system would not boot at all.  On power up the
fans would spin and the drives would spin up.  That's about it.  The system
would never go green no matter how long it sat in that state.

>>

The system would panic and reboot itself.  Within 15 or 20 minutes of coming
up to login it would panic again and reboot again.  It always listed the
same CPU as the problem in messages so I removed the listed CPU and the
system stabilized out.  This was at 0300 Saturday morning so I waited until
later that day and had SUN come out with a new CPU.  As soon as we put in
the second CPU the system became unstable again.  We moved CPUs around and
it didn't matter which slot a CPU was in; it still kept rebooting.  The
SUN tech had to send the crash files to his backline and they determined the
problem might be the motherboard.  We replaced the motherboard on Sunday and
the system has been stable since.

>>

"Fails" is a surprisingly vague word, at least for Sun. :-) Here are the
scenarios.

1) If a system detects a problem with a CPU that can be recovered from, then
   it will offline the CPU and continue on the remaining one. This is
   theoretical, and I've NEVER seen it happen.
2) If something happens (bad CPU or memory) that leaves the system unable to
   guarantee the integrity of the system (i.e. OS and kernel), it will panic
   and reboot. Now...
2a) If the hardware is detected as faulty during the pre-boot diagnostics,
   it will offline the suspect processor and boot on only one CPU.
2b) If the problem was transient or at least not a really consistent error,
   then the system will boot with both processors active. If you do have a
   bad processor, then it will probably panic and reboot again before long.

If you have a bad processor but the system isn't offlining it on
panic/reboot, then you can force it off from the boot prom (which masks it
entirely from the OS), or offline it from the OS level (which _mostly_
protects you from using that processor--but not entirely, especially if
it's CPU 0).
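
For the OS-level option, Solaris provides psrinfo to list processor states
and psradm -f to take a processor offline. The Python sketch below only
parses psrinfo-style text and builds the psradm command line without running
it. The sample output is an assumed format typed for illustration, and the
helper names are made up for this example; check the man pages on your
release before relying on it.

```python
def parse_psrinfo(text):
    """Map processor id -> state from psrinfo-style output.

    Assumes each line starts with the processor id followed by its
    state (for example "on-line" or "off-line").
    """
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states


def offline_command(cpu_id):
    """Build (but do not run) the command to take one CPU offline."""
    return ["psradm", "-f", str(cpu_id)]


if __name__ == "__main__":
    # Assumed sample output; ids and timestamps are invented.
    sample = (
        "0       on-line   since 09/20/2003 03:30:01\n"
        "1       on-line   since 09/20/2003 03:30:03\n"
    )
    states = parse_psrinfo(sample)
    suspect = 1  # hypothetical: the CPU named in the panic messages
    if states.get(suspect) == "on-line":
        print("would run:", " ".join(offline_command(suspect)))
```

The sketch deliberately stops short of executing anything. psradm -n brings
a processor back online, and as noted above, offlining from the OS does not
mask the processor as completely as disabling it from the boot prom.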

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<<

BTW: Perhaps I spoke too soon. The dual-CPU 250 in my office has started
throwing wobblies and rebooting intermittently. Guess what it's reporting
in the logs: EBP CPU problem :O

/me calls sun support


Regards

Jonathan

Received on Mon Sep 22 16:57:33 2003