SUMMARY: V440 rebooted unexpectedly !?

From: Pascal Grostabussiat <pascal_at_azoria.com>
Date: Tue Nov 16 2004 - 15:15:19 EST

Sorry, I realized I never summarized this topic.

Subject was (June 2004):
=====================================================================
Hi all,

I was just wondering if any of you have experienced similar problems
with V440 machines?

One of my customers has several V440s, and one of them has already
rebooted unexpectedly three times within a month. Nothing can be found
in /var/adm/messages: everything looks fine, no warnings or errors, and
from one line to the next I suddenly have the beginning of a reboot !?
Nor can I find anything in the other log files pointing to a
potentially serious problem or warning. The system is currently
part of a test cluster, and when I came in this morning a service group
had failed over during the night and the logs tell me the node went down.

Jun 21 11:59:55 sam gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen   31df4d membership 0123
Jun 23 20:00:37 sam genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_112233-12 64-bit

The customer told me they have already seen unexpected reboots on
V440s on site at some of their own customers, though only seldom.

So ... have any of you experienced similar behaviour? The cluster has
other V440s with identical configurations (hardware, firmware,
patches and software), but only one V440 is unstable !? The reboots,
even if seldom, appear to be random (I have so far not been able to
identify any pattern).

The V440 has 4 CPUs (1062 MHz), 8GB RAM, 4x36GB internal disks and
OBP 4.13.0, and is connected to a SAN (using QLogic 2300 cards). As far
as I can see everything looks fine !?

/Pascal
=====================================================================


Solution or let's say explanation:
=====================================================================

As a few of you mentioned at the time, the problem was suspected to be
due to a bug on the V440 machines. You were right. I have been talking
with Sun these past few months and the problem is related to the V440
motherboards. Before going into detail: there is no patch for this
problem; the motherboard has to be replaced. New boards are now
expected within a few weeks.

Let me now provide you with more information. There is an internal
document at Sun, "InfoPartner Document FIN Doc ID: I1099-1", which
describes in detail the problem, which has been known for a while now.
I only got a copy of the document (5 pages), so I cannot just cut and
paste.

Synopsis: Sun Fire V440 and Netra 440 systems using a specific
networking configuration may unexpectedly reset.

Platform: A42 and N42, model ALL.

Part number affected: 540-5919-XX, FRU, ASSY, Motherboard, Netra440
and 540-5418-XX, ASSY, Motherboard W/CPU cage, CHLPA

BugID: 5039862

Problem description: In an extremely limited number of applications,
and with a single system configuration, the Sun Fire V440 or Netra 440
system may experience an unexpected reset and will reboot.

The specific configuration which triggers this situation is as follows:

Some or all of the data being transferred is transported via the first
onboard ethernet interface "ce0" (Cassini ASIC).

When this issue occurs, the system will reset and an error message
appears on the console. The system then reboots. No core files are
generated and the reset output will not be logged to the
/var/adm/messages file.
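
Since nothing is logged at the time of the reset, the only trace in
/var/adm/messages is a boot banner that follows ordinary messages, as in
the two log lines from my original post. Below is a minimal Python
sketch for spotting such places. It is only a heuristic and rests on an
assumption of mine that is not from the Sun document: that a clean
shutdown leaves the usual "syslogd: going down on signal 15" line
directly before the banner. The function name and the sample data are
my own.

```python
import re

# Boot banner as it appears in the log excerpt from the original post.
BOOT_BANNER = re.compile(r"genunix: .*SunOS Release")
# Assumption: a clean shutdown logs this syslogd message last.
CLEAN_SHUTDOWN = re.compile(r"syslogd: going down on signal 15")


def find_unlogged_reboots(lines):
    """Return (previous_line, banner_line) pairs for every boot banner
    whose preceding log line is not a syslogd shutdown message."""
    suspects = []
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if BOOT_BANNER.search(line):
            # A banner at the very top of the file cannot be judged.
            if previous is not None and not CLEAN_SHUTDOWN.search(previous):
                suspects.append((previous, line))
        previous = line
    return suspects


# The two lines from the original post, unwrapped.
SAMPLE = [
    "Jun 21 11:59:55 sam gab: [ID 316943 kern.notice] GAB INFO "
    "V-15-1-20036 Port h gen   31df4d membership 0123",
    "Jun 23 20:00:37 sam genunix: [ID 540533 kern.notice] ^MSunOS "
    "Release 5.9 Version Generic_112233-12 64-bit",
]

if __name__ == "__main__":
    for before, banner in find_unlogged_reboots(SAMPLE):
        print("last line before reboot:", before)
        print("boot banner:            ", banner)
```

To run it on a real system, pass it the open file instead of SAMPLE,
e.g. find_unlogged_reboots(open("/var/adm/messages")).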

If it is suspected that the V440 is experiencing this issue, change
the OBP variables as follows to provide more verbose output on the
next failure:

diag-switch? true
post-trigger none
obdiag-trigger none
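
The Sun document sets these variables at the "ok" prompt. If you would
rather set them from a running Solaris, eeprom(1M) writes the same
NVRAM variables. The small Python sketch below only generates the
command lines to paste into a root shell; it does not run anything. The
function name is mine, and using eeprom instead of the "ok" prompt is my
suggestion, not part of the Sun document. Note the quoting needed for
the "?" in diag-switch?.

```python
import shlex

# OBP variables and values exactly as listed above.
VERBOSE_RESET_SETTINGS = {
    "diag-switch?": "true",
    "post-trigger": "none",
    "obdiag-trigger": "none",
}


def eeprom_commands(settings):
    """Return one shell-quoted eeprom command line per OBP variable."""
    return [
        shlex.join(["eeprom", f"{name}={value}"])
        for name, value in settings.items()
    ]


if __name__ == "__main__":
    for command in eeprom_commands(VERBOSE_RESET_SETTINGS):
        print(command)
```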

Corrective action: There is currently no permanent resolution. Customer
sites experiencing this issue should use the workaround procedures
provided below. A long-term corrective action plan is being developed
by Sun and will be delivered via Sun's service team.

- Use only the second "ce1" (net1) onboard network interface

OR

- Install a PCI ethernet card in any available PCI slot. The following
Sun card is tested and supported as a workaround providing full gigabit
network replacement functionality: X1150. Another tested and supported
card, but without gigabit support, is the X2222A.

To ensure that "ce0" (net0) is never accessed inadvertently in a manner
that could trigger this issue, it is highly recommended that the "ce0"
interface be completely disabled. Because of Solaris instance
numbering, it is also recommended that this be done after the initial
Solaris installation, to ensure net1 is assigned the "ce1" instance
instead of "ce0".
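
To see which physical path Solaris has bound to each "ce" instance, you
can look at /etc/path_to_inst, where each entry has the form
"physical-path" instance "driver". Here is a minimal Python sketch that
extracts the entries for one driver. The function name is mine. In the
sample data, the ce0 path is the one from the Sun document, while the
ce1 path is only a hypothetical value for illustration; check your own
file for the real one.

```python
import re

# One path_to_inst entry: "physical-path" instance "driver"
ENTRY = re.compile(
    r'^"(?P<path>[^"]+)"\s+(?P<instance>\d+)\s+"(?P<driver>[^"]+)"\s*$'
)


def driver_instances(text, driver):
    """Map instance number -> physical path for the given driver."""
    found = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = ENTRY.match(line)
        if match and match.group("driver") == driver:
            found[int(match.group("instance"))] = match.group("path")
    return found


SAMPLE = "\n".join([
    "# sample data only",
    '"/pci@1c,600000/network@2" 0 "ce"',  # ce0 path from the Sun document
    '"/pci@1f,700000/network@1" 1 "ce"',  # hypothetical ce1 path
])

if __name__ == "__main__":
    for instance, path in sorted(driver_instances(SAMPLE, "ce").items()):
        print(f"ce{instance} -> {path}")
```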

To completely disable "ce0" (net0) from the system, use the following
commands to install an NVRAM script at the OBP "ok" prompt:

ok nvedit
  0: probe-all install-console banner
  1: " /pci@1c,600000/network@2" $delete-device drop
  2:
   (type "Ctrl-C" to exit nvedit)
ok nvstore
ok setenv use-nvramrc? true
   use-nvramrc? =      true
ok reset-all

After the system resets, "ce0" should not be visible to OBP (i.e. you
should not see a path to "ce0" (/pci@1c,600000/network@2) when you run
"show-devs" from OBP). The ce0 device should not be seen by Solaris
either (i.e. in prtconf or prtpicl output).
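
If you capture the "show-devs" output from the console, the check
described above amounts to confirming that the ce0 path is no longer
among the listed devices. A minimal Python sketch, assuming one device
path per line; the function name is mine and the sample listings are
invented for illustration, not real V440 output:

```python
# Device path of ce0 as given in the Sun document.
CE0_PATH = "/pci@1c,600000/network@2"


def device_listed(show_devs_output, path=CE0_PATH):
    """Return True if `path` appears as a complete line in the output."""
    return any(
        line.strip() == path for line in show_devs_output.splitlines()
    )


# Invented sample listings (not real V440 output).
BEFORE = "/pci@1c,600000\n/pci@1c,600000/network@2\n/openprom\n"
AFTER = "/pci@1c,600000\n/openprom\n"

if __name__ == "__main__":
    print("ce0 listed before nvramrc change:", device_listed(BEFORE))
    print("ce0 listed after nvramrc change: ", device_listed(AFTER))
```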


Anyway, if you experience this issue, contact Sun and proper action
will be taken.

Sorry for summarizing so late, but better late than never.

Regards,
/Pascal
Received on Tue Nov 16 15:14:53 2004