[SUMMARY] Qlogic fibre-channel failover problem

From: John Horne <john.horne_at_plymouth.ac.uk>
Date: Mon Jul 07 2008 - 08:44:30 EDT
Apologies for the late summary reply. I received a variety of hints and
suggestions from the following, for which many thanks:

Jim Musso
Markus Mayer
Dean Ross-Smith
JayJay Florendo
inemes
Chris Liles
Thomas Leyer
Andrey Borzenkov


There was no one specific 'answer' to the problem. Some people requested
a bit more information, to which I did not reply, because the problem
'resolved' itself when three things occurred:

1) The '/kernel/drv/fp.conf' file had two fibre-channel entries in it,
as if a dual-port card were present. In our case we had only the one
port, so I commented out one of the entries. (Suggested by Markus
Mayer; see the sketch after this list.)

2) The 'mpathadm show lu ...' command showed the 'Current Load Balance'
as round-robin. This was changed to 'none'. (Suggested by Dean
Ross-Smith.)

3) It seems that Sun recently released a patch fixing some problems with
Qlogic cards. I tend to run 'pca' to patch my systems, and wasn't really
paying too much attention to it, I'm afraid! I think the patch was
113042.
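
For anyone wanting to make the same changes, here is roughly what they
looked like on our system. The exact lines and values below are only
illustrative; your own files will almost certainly differ.

In /kernel/drv/fp.conf the entry for the second (non-existent) port was
simply commented out, something along the lines of:

   name="fp" class="fibre-channel" port=0;
   # name="fp" class="fibre-channel" port=1;    <- only one port here

The global load-balancing policy lives in /kernel/drv/scsi_vhci.conf,
so changing round-robin to none amounts to:

   load-balance="none";

followed by a reboot, after which 'mpathadm show lu ...' reports the
'Current Load Balance' as none. To see whether the Qlogic patch is
already on a system, something like this should do (I use pca, but
showrev works just as well):

   showrev -p | grep 113042
   pca -l missing | grep -i qlc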

After rebooting and reconfiguring the system, the FC card seemed to work
correctly when one of the channels was disabled. Given that a few people
(including myself!) asked why we hadn't bought 2 cards, or at least a
dual-port card, if this was going to be a production server, we got
approval to buy a second card. As far as I can tell, Solaris 10 with
2 FC cards should work pretty much out of the box with respect to
failover. Because of this I did not analyse the initial problem any
further to see if there was any one solution. (I'm still awaiting
delivery of the second FC card, so this problem may yet come back and
bite me again!)
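
(By 'reconfiguring' I just mean a reconfiguration reboot and then
checking that both paths are visible again. Roughly, with the logical
unit name taken from the logs below and a made-up controller number:

   touch /reconfigure
   init 6          # or: reboot -- -r

   # after the reboot, check the paths and the load-balance setting
   mpathadm list lu
   mpathadm show lu /dev/rdsk/c4t6000D775000032D11ADA4F3E5D6A37EAd0s2

Both paths should be listed and reported as OK in the 'show lu' output.)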



Regards,

John.


On Tue, 2008-06-24 at 10:48 +0100, John Horne wrote:
> Hello,
> 
> We have a T2000 running Solaris 10 5/08 with a single QLA2460
> fibre-channel card in it - so one card, one port. I have no control over
> the fibre side of things, so am not completely sure what the
> configuration is, but I gather it (the SAN) is provided by FalconStor.
> The card/OS have been configured to see the switch the card is connected
> to, and this seems to work fine. I am told that the switch provides 2
> routes from the actual SAN, hence Solaris initially sees 2 disks (when
> using 'format'). I have configured multipathing (mpxio), and Solaris now
> sees one disk. I have formatted/newfs'd the disk, and mounted it with no
> problems. The disk provides user data, so it is not booted off.
> 
> However, when I asked our Ops people to disable one of the fibre
> 'channels' (on the fabric switch), to simulate a hardware fault, Solaris
> detected the problem but disabled all access to the disk. Trying to
> access the mounted disk gave an 'I/O error'; format showed the disk as
> 'disk information unavailable', and 'mpathadm' likewise gave I/O errors
> and stated that it could not get disk information. The only solution
> seemed to be to unload the qlc module (using modunload), and reload it.
> Then the system saw the disk again. Thinking this might just be a timer
> issue, I left the system for a good 30 mins, but the disk never became
> accessible again. The messages file showed errors such as:
> 
> ==================================================================
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 16:58:53 lib-srvr7       Error for Command: read(10)
> Error Level: Retryable
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]
> Requested Block: 64                Error Block: 64
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]         Vendor:
> FALCON                Serial Number: OF1S3WS894OA
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]         Sense
> Key: Unit Attention
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]         ASC:
> 0x29 (power on occurred), ASCQ: 0x1, FRU: 0x0
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 16:58:53 lib-srvr7       Error for Command: read(10)
> Error Level: Retryable
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]
> Requested Block: 64                Error Block: 64
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]         Vendor:
> FALCON                Serial Number: OF1S3WS894OA
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]         Sense
> Key: Unit Attention
> Jun 20 16:58:53 lib-srvr7 scsi: [ID 107833 kern.notice]         ASC:
> 0x3f (reported LUNs data has changed), ASCQ: 0xe, FRU: 0x0
> Jun 20 16:59:03 lib-srvr7 scsi: [ID 243001 kern.warning]
> WARNING: /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fcp1):
> Jun 20 16:59:03 lib-srvr7       INQUIRY to D_ID=0xe30700 lun=0x0 failed:
> sense key=IllegalRequest, ASC=24, ASCQ=0. Giving up
> Jun 20 16:59:03 lib-srvr7 scsi: [ID 243001
> kern.info] /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fcp1):
> Jun 20 16:59:03 lib-srvr7       offlining lun=0 (trace=0), target=e30700
> (trace=b10101)
> Jun 20 16:59:03 lib-srvr7 genunix: [ID 834635
> kern.info] /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2)
> multipath status: degraded,
> path /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fp1) to target
> address: w50060b00006441e2,0 is offline Load balancing: round-robin
> Jun 20 17:01:13 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:01:13 lib-srvr7       Error for Command: read(10)
> Error Level: Retryable
> Jun 20 17:01:13 lib-srvr7 scsi: [ID 107833 kern.notice]
> Requested Block: 1528                Error Block: 1528
> Jun 20 17:01:13 lib-srvr7 scsi: [ID 107833 kern.notice]         Vendor:
> FALCON                Serial Number: OF1S3WS894OA
> Jun 20 17:01:13 lib-srvr7 scsi: [ID 107833 kern.notice]         Sense
> Key: Unit Attention
> Jun 20 17:01:13 lib-srvr7 scsi: [ID 107833 kern.notice]         ASC:
> 0x3f (reported LUNs data has changed), ASCQ: 0xe, FRU: 0x0
> Jun 20 17:01:23 lib-srvr7 scsi: [ID 243001 kern.warning]
> WARNING: /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fcp1):
> Jun 20 17:01:23 lib-srvr7       INQUIRY to D_ID=0xe30900 lun=0x0 failed:
> sense key=IllegalRequest, ASC=24, ASCQ=0. Giving up
> Jun 20 17:01:23 lib-srvr7 scsi: [ID 243001
> kern.info] /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fcp1):
> Jun 20 17:01:23 lib-srvr7       offlining lun=0 (trace=0), target=e30900
> (trace=b10101)
> Jun 20 17:01:23 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:01:23 lib-srvr7       transport rejected fatal error
> Jun 20 17:01:48 lib-srvr7 ufs: [ID 702911 kern.warning] WARNING: Error
> writing master during ufs log roll
> Jun 20 17:01:48 lib-srvr7 ufs: [ID 127457 kern.warning] WARNING: ufs log
> for /m1 changed state to Error
> Jun 20 17:01:48 lib-srvr7 ufs: [ID 616219 kern.warning] WARNING: Please
> umount(1M) /m1 and run fsck(1M)
> Jun 20 17:02:23 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:02:23 lib-srvr7       offline or reservation conflict
> Jun 20 17:03:16 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:03:16 lib-srvr7       offline or reservation conflict
> Jun 20 17:03:33 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:03:33 lib-srvr7       offline or reservation conflict
> Jun 20 17:03:37 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:03:37 lib-srvr7       offline or reservation conflict
> Jun 20 17:03:48 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:03:48 lib-srvr7       offline or reservation conflict
> Jun 20 17:03:50 lib-srvr7 scsi: [ID 107833 kern.warning]
> WARNING: /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2):
> Jun 20 17:03:50 lib-srvr7       offline or reservation conflict
> ==================================================================
> 
> 
> when the disk becomes available again (after modunload/modload), we see:
> 
> ==================================================================
> Jun 20 17:23:56 lib-srvr7 genunix: [ID 408114
> kern.info] /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2)
> offline
> Jun 20 17:23:56 lib-srvr7 genunix: [ID 834635
> kern.info] /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2)
> multipath status: failed,
> path /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fp1) to target
> address: w50060b00006441e2,0 is offline Load balancing: round-robin
> Jun 20 17:23:56 lib-srvr7 genunix: [ID 408114
> kern.info] /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2)
> offline
> Jun 20 17:24:06 lib-srvr7 scsi: [ID 243001 kern.warning]
> WARNING: /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fcp1):
> Jun 20 17:24:06 lib-srvr7       ns_registry: failed name server
> registration
> Jun 20 17:24:06 lib-srvr7 scsi: [ID 799468 kern.info] ssd2 at
> scsi_vhci0: name g6000d775000032d11ada4f3e5d6a37ea, bus address
> g6000d775000032d11ada4f3e5d6a37ea
> Jun 20 17:24:06 lib-srvr7 genunix: [ID 936769 kern.info] ssd2
> is /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea
> Jun 20 17:24:06 lib-srvr7 genunix: [ID 936769 kern.info] fp1
> is /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0
> Jun 20 17:24:06 lib-srvr7 genunix: [ID 408114
> kern.info] /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2)
> online
> Jun 20 17:24:06 lib-srvr7 genunix: [ID 834635
> kern.info] /scsi_vhci/ssd@g6000d775000032d11ada4f3e5d6a37ea (ssd2)
> multipath status: degraded,
> path /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0 (fp1) to target
> address: w50060b000064487a,0 is online Load balancing: round-robin
> ==================================================================
> 
> 
> Looking on the Internet, it seems that the 'cfgadm -c configure' command
> may re-enable the disk as well. The problem seems to be that the QLA
> card 'logs out' (?) from the switch, and cannot re-establish the disk
> connection until it logs in again. The point is that we want the
> failover to be automatic, and not to have to run commands should a fault
> occur on the SAN side.
> 
> Has anyone else had this problem, and if so was there a solution?
> Obviously what we want is not to have to run commands should a problem
> occur on the SAN; we want automatic failover (albeit that having
> 2 cards or 2 ports might have been better for resilience!).
> 
> 
> 
> Thanks,
> 
> John.
> 
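
P.S. For reference, the manual recovery mentioned in the quoted message
above boils down to either reloading the qlc driver or re-configuring
the FC attachment point with cfgadm. Roughly (the module id and the
controller name are placeholders - check your own system):

   modinfo | grep qlc         # note the qlc module id
   modunload -i <id>
   modload /kernel/drv/sparcv9/qlc

or:

   cfgadm -al -o show_FCP_dev
   cfgadm -c configure c2     # the fp/fc attachment point shown by -al

Neither is something you want to be running by hand on a production
box, which was of course the whole point of the question.
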
-- 
---------------------------------------------------------------
John Horne, University of Plymouth, UK  Tel: +44 (0)1752 587287
E-mail: John.Horne@plymouth.ac.uk       Fax: +44 (0)1752 587001