SUMMARY: System Failures

From: JSirois@Forrester.com
Date: Mon Jun 12 2000 - 09:05:30 CDT


Many thanks for the many useful replies. Of course, without a HW contract it
took some time to get someone in here with all the correct parts to figure
out what was really going on. I initially replaced the FC cable and both of
the GBIC adapters, which did not do the trick. Eventually we determined it
was a bad interface board (there was also a failing power supply that wasn't
helping matters).

Thanks to:
James Ranks
Matthew Santimore
Amit Mohan
Terry Franklin
Merrell, Vince
oleg olovyannikov
Rick Reineman
David Evans
Walter Reed
jonathan loh
vogelke
Ross Lonstein
Mike Penny

===========================================================================================
The original question:
Hello, I have a SUN E-3500 running 2.6 that is repeatedly reporting the
following errors in /var/adm/messages. I am not sure if they are all
related and have not been able to find any info in the archives. Eventually
the system will not accept any connections at all and has to be rebooted. I
would appreciate any help from admins who may have seen these types of
errors before and how they corrected the problem. Thanks in advance, I
will summarize.

May 19 08:55:24 oberon unix: ID[SUNWssa.socal.link.5010] socal0: port 0: Fibre Channel is OFFLINE
May 19 08:55:24 oberon unix: ID[SUNWssa.socal.link.6010] socal0: port 0: Fibre Channel Loop is ONLINE
May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0 (sf0):
May 19 08:55:24 oberon unix: soc lilp map failed status=0x5
May 19 08:55:24 oberon unix: sf0: Target 0x0 Reset Failed. Ret=105sf0: sf:Target driver initiated lip
May 19 08:55:24 oberon unix: sf0: Target 0x0 Reset successful
May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2200002037130094,0 (ssd0):
May 19 08:55:24 oberon unix: SCSI transport failed: reason 'tran_err': retrying command
May 19 08:55:24 oberon unix:
May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2200002037130094,0 (ssd0):
May 19 08:55:24 oberon unix: SCSI transport failed: reason 'reset': retrying command
May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2200002037130094,0 (ssd0):
May 19 08:55:24 oberon unix: transport rejected (-2)
===========================================================================================
The responses:

I had a similar problem with fiber channel going offline and online all the
time, then disk errors, then a complete halt of the system.
The problem turned out to be bad GBICs.
:
We had similar problems that ended up being bad fiber channel cards. I guess
there were a bunch of them out there that Sun had to replace. Don't know if
it's your problem, but it's a start.
:
Looks like your GBIC module or SCSI controller has gone bad.
:
These are fibre channel errors for your internal disks. I doubt they are
directly related to not being able to accept connections. Incidentally, it
looks like one disk is bad. A bad disk can often cause channel
offline/online events on the enclosure.
:
If memory serves, it turned out to be the fibre cable that connects your box
to the associated disk array.
Other suspects were the cards that the fibre cable connects between on the
machine side and the disk side.
:
In a situation such as this, it is almost impossible to tell what the
actual problem is. Basically, an error occurs on the FC-AL chain and is
propagated all the way up to the kernel from the hardware.

Basically, what can happen is this: an apparent error may occur during the
access of a single disk within an enclosure, such as an A5200, for example
(let's say that the disk arm could not be positioned over a desired track
during a specified time period); this error will be picked up by the disk's
firmware; reported through the disk's controller interface through the
FC-AL loop to the enclosure's interface board (IB), which has its own
firmware; then, the error goes through the GBIC on that IB, perhaps through
an FC-AL hub, to the corresponding GBIC on the E3500, through the socal
firmware, the I/O board firmware, to the kernel, possibly going through
some volume-management software, such as Veritas. Whew!

At any point, the error (or, simply a warning) might be magnified,
misinterpreted or, simply, obfuscated.

Sorry to be so long-winded, but, we need a little background in order to
track down the problem.

Try to approach it as follows, starting at the bottom of the chain:

- Is the SCSI transport error always a "transport rejected" and always from
  the SAME disk (look at the WWN)? If so, it could mean that it's a bad disk
  that needs to be replaced.

- If not, is the error always from the same "sf" driver instance (which
  drives the FC interface, such as a GBIC)? For example, sf0 on socal0. If
  so, this could simply be a bad GBIC, which can be hot-swapped (GBICs go
  bad frequently).

- If not, it could be the socal host adapter itself, which can host two
  FC-AL interfaces, sf0 and sf1, in ports 0 and 1. Are there two interfaces
  on it? If yes, and sf1 never gives an error, then, most likely, it's not
  the socal adapter, since it would probably affect both (this, however, is
  not a certainty).

Please note that since there is a socal/sf combo on EACH END of a
connection, the above could apply to either the server end or the storage
end. I'd recommend starting at the storage end, as it's a bit safer and more
hot-swappable, especially if you have any mirroring going on.

- Additionally, the problem could be with the IB on the storage-array end,
which is also hot-swappable, or, in a remote case, the storage enclosure's
midplane.

- Don't neglect the possibility that someone damaged the fiberoptic cable
between the two.

All of the above could produce the errors that you are getting.
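
The first two questions above (always the same disk? always the same sf
instance?) can be answered by counting WARNING lines per device path. Here is
a minimal Python sketch, assuming /var/adm/messages lines look like the
unwrapped form of the excerpt in the original question. The function name,
the regular expressions and the SAMPLE list are illustrative only and are not
part of any Sun tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in the original question:
#   ... unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w<WWN>,0 (ssd0):
WARNING_RE = re.compile(r"WARNING:\s+(/\S+)\s+\((\w+)\)")
LOOP_RE = re.compile(r"(/.*?/sf@[0-9a-f]+,[0-9a-f]+)")
WWN_RE = re.compile(r"/ssd@w([0-9a-fA-F]+),")


def tally_warnings(lines):
    """Count WARNING lines per disk WWN and per socal/sf device path."""
    by_disk = Counter()
    by_loop = Counter()
    for line in lines:
        warning = WARNING_RE.search(line)
        if warning is None:
            continue  # not a WARNING line that names a device path
        path = warning.group(1)
        loop = LOOP_RE.match(path)
        if loop:
            by_loop[loop.group(1)] += 1
        wwn = WWN_RE.search(path)
        if wwn:
            by_disk[wwn.group(1)] += 1
    return by_disk, by_loop


# Sample input: lines from the original question, joined back onto one line each.
SAMPLE = [
    "May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0 (sf0):",
    "May 19 08:55:24 oberon unix: soc lilp map failed status=0x5",
    "May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2200002037130094,0 (ssd0):",
    "May 19 08:55:24 oberon unix: SCSI transport failed: reason 'tran_err': retrying command",
    "May 19 08:55:24 oberon unix: WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2200002037130094,0 (ssd0):",
    "May 19 08:55:24 oberon unix: transport rejected (-2)",
]

if __name__ == "__main__":
    disks, loops = tally_warnings(SAMPLE)
    for wwn, count in disks.most_common():
        print("disk WWN", wwn, "warnings:", count)
    for path, count in loops.most_common():
        print("loop", path, "warnings:", count)
```

To run it against a real log, pass an open file instead of SAMPLE, for
example tally_warnings(open("/var/adm/messages")). If one WWN dominates,
suspect the disk; if the warnings are spread over many WWNs behind a single
sf path, suspect the GBIC, cable or adapter on that loop.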

BTW, the I/O board that's reporting the error is in slot "1"; the
fibre-channel host adapter (socal@d, ????) is right on the board (right-hand
side) and the FC-AL interface (e.g., GBIC) is in port 0 on that card.

I'd recommend looking first at a faulty cable, GBIC, disk, or socal (on both
ends); if that doesn't uncover the problem, start looking at the more
difficult scenarios.
:
I have seen an old SSA Model 112 and 114 do this when a drive was failing.
Your messages do seem to list a specific drive, ssd0. I would start by
locating ssd0, removing it from the volume, and checking it out. A complete
power cycle may be necessary.

The initial errors seem to indicate a fiber channel problem but like I said
I've seen a disk do this.
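
For locating ssd0: Solaris keeps the instance-to-path mapping in
/etc/path_to_inst, one entry per line in the form
"physical-path" instance "driver". Here is a minimal Python sketch, assuming
that format. The function name is made up for illustration, and the two
sample entries are reconstructed from the device paths in the log above, not
copied from the poster's actual file.

```python
def find_instance(lines, driver, instance):
    """Return the physical path for e.g. driver="ssd", instance=0, or None."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split()
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # not a '"path" N "driver"' entry
        path, inst, drv = parts
        if drv.strip('"') == driver and int(inst) == instance:
            return path.strip('"')
    return None


# Illustrative entries only, built from the paths shown in the log excerpt.
SAMPLE_PATH_TO_INST = [
    "# comment line",
    '"/sbus@2,0/SUNW,socal@d,10000/sf@0,0" 0 "sf"',
    '"/sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2200002037130094,0" 0 "ssd"',
]

if __name__ == "__main__":
    print(find_instance(SAMPLE_PATH_TO_INST, "ssd", 0))
```

This only yields the device path and, from it, the disk's WWN; matching that
WWN to a physical slot in the enclosure is a separate step.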
:
I'd check the cables and termination first. The SCSI chain should be less
than 3m for all devices on the chain. Some devices that claim to be
autoterminating don't act that way, so using an active terminator may
help, but if this were the problem I doubt the device would be usable to
begin with.

I'd tend to check the cables and try reseating the disks and then
check each of the disks for bad blocks.
:
A failing GBIC caused it for us. We have an E3500 with an A5000 array
that back in January showed similar errors on one channel. Solution was
to replace the GBICs at both the I/O board and the Array (since it's not
easy to determine which is failing) and the fiber between them. Well, it
started just a few days ago on the other channel. Solution: replace the
GBICs and fiber.

I read in the SSA-Admin archive that circa 1999 there were GBICs
that were failing frequently. These were a known problem and Sun was
replacing them with ones made by IBM. Might be related.
:
===========================================================================================


