Well, it's not really a summary, but I was *finally* able to find
the fix. The real problem, which is not at all obvious from
my original message to sun-managers (appended), is that
our (fast differential SCSI) Micropolis 1924D disks were timing
out on an SS10 like this:
esp2: Disconnected command timeout for Target 0 Lun 0
This happens both with Sun's DSBE/S and with Performance Technologies'
PT-SBUS430 controllers. It can be triggered by any moderate disk
activity that causes lots of SCSI disconnects to be interleaved (say,
a find(1) running in parallel on each of two disks).
I won't bore you with how many dead ends we explored. The problem
turned out to be a bug in Micropolis' firmware. We haven't seen
a timeout since Micropolis updated the f/w two weeks ago. Micropolis
says that new disks are now being shipped with the fixed f/w.
So if your 1924D's are over a month old and if you are seeing these
timeouts, then you need to talk with Micropolis.
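
If you want to check whether a machine is seeing these timeouts, a quick
scan of the console/syslog messages is enough. The following is only a
minimal sketch in Python: the function name count_timeouts is made up for
illustration, and the pattern assumes the message looks exactly like the
esp2 line quoted above.

```python
import re
from collections import Counter

# Pattern modelled on the message quoted in this post:
#   esp2: Disconnected command timeout for Target 0 Lun 0
# Assumption: other esp instances log the same wording.
TIMEOUT_RE = re.compile(
    r"(esp\d+): Disconnected command timeout for Target (\d+) Lun (\d+)"
)


def count_timeouts(lines):
    """Count timeout messages per (controller, target, lun)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match.group(1), int(match.group(2)), int(match.group(3)))
            counts[key] += 1
    return counts


# Small in-memory demonstration; in practice, pass an open log file instead.
sample = [
    "esp2: Disconnected command timeout for Target 0 Lun 0",
    "sd4: disk not responding to selection",
    "esp2: Disconnected command timeout for Target 0 Lun 0",
]
for (ctrl, target, lun), n in sorted(count_timeouts(sample).items()):
    print(f"{ctrl} target {target} lun {lun}: {n} timeout(s)")
```

Any iterable of lines works, so an open file handle for your messages
file can be passed directly in place of the sample list.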
> From fletcher Wed May 26 23:30:03 1993
> From: Fletcher Mattox <fletcher@cs.utexas.edu>
> To: sun-managers@eecs.nwu.edu
> Subject: SCSI overruns?
>
> Our new SS10 running SunOS 4.1.3 is getting SCSI overruns.
> There are four 2.4GB Micropolis 1924 disks on this SCSI bus.
> The bus is passive terminated. (We will soon try active termination).
> This machine has prestoserve installed, and there appears to be a
> correlation with prestoserve and the overruns. I.e. the overruns
> haven't recurred since we turned off presto.
>
> I don't think it's the disk since I see errors on both sd4 and sd6.
>
> Is this a cable/termination problem? Is prestoserve known to aggravate
> this problem?
>
> Thanks
> Fletcher
>
>
> sd4: SCSI transport failed: reason 'data_ovr': retrying command
> sd4: SCSI transport failed: reason 'incomplete': retrying command
> sd4: disk not responding to selection
> sd4: disk not responding to selection
> presto: error on dev (7, 34)
> esp1: data transfer overrun
> State=DATA Last State=DATA_DONE
> Latched stat=0x11<XZERO,IO> intr=0x10<BUS> fifo 0x80
> last msg out: <unknown msg 0xff>; last msg in: COMMAND COMPLETE
> DMA csr=0x40040010<INTEN>
> addr=fff0017c last=fff00168 last_count=14
> Cmd dump for Target 3 Lun 0:
> cdb=[ 0x3 0x0 0x0 0x0 0x14 0x0 ]
> pkt_state 0xb<XFER,SEL,ARB> pkt_flags 0x0 pkt_statistics 0x0
> cmd_flags=0x25 cmd_timeout 35
> Mapped Dma Space:
> Base = 0x168 Count = 0x14
> Transfer History:
> Base = 0x168 Count = 0x14
> current phase 0x26=DATAIN stat=0x11 0x14
> current phase 0x20=SELECT stat=0x10 0x3 0x0
> current phase 0x1=CMD_START stat=0x10 0x3 0x20
> current phase 0xb=CMD_CMPLT stat=0x17 0xc00
> current phase 0x27=STATUS stat=0x17 0x2
> current phase 0xb=CMD_CMPLT stat=0x13
> current phase 0x20=SELECT stat=0x0 0x3 0x0
> current phase 0x1=CMD_START stat=0x0 0xa 0x20
> current phase 0x20=SELECT stat=0x0 0x3 0x0
> current phase 0x1=CMD_START stat=0x0 0xa 0x20
> current phase 0x20=SELECT stat=0x0 0x3 0x0
> current phase 0x1=CMD_START stat=0x0 0x3 0x20
> current phase 0x60=SELECT_SNDMSG stat=0x0 0x3 0x0
> current phase 0x23=SYNCHOUT stat=0x0 0x19 0xf
> current phase 0x1c=RESET stat=0x0 0x10
> current phase 0x1c=RESET stat=0x11 0x7
> sd4: SCSI transport failed: reason 'data_ovr': giving up
> presto: disabling...
> sd4: disk not responding to selection
> sd6: SCSI transport failed: reason 'reset': retrying command
>
>
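
For anyone reading the command dump in the quoted message: the cdb line
is a 6-byte SCSI command descriptor block. Opcode 0x03 is REQUEST SENSE,
and byte 4 (0x14 = 20) is its allocation length, which matches the
"Count = 0x14" in the dump. The following is a minimal sketch of a decoder
in Python; the function name decode_cdb6 and the small opcode table are
illustrative only and cover just a few common 6-byte commands.

```python
# A few well-known 6-byte SCSI opcodes (deliberately incomplete).
OPCODES_6 = {
    0x00: "TEST UNIT READY",
    0x03: "REQUEST SENSE",
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x12: "INQUIRY",
}


def decode_cdb6(cdb):
    """Decode the opcode, length byte and control byte of a 6-byte CDB.

    Byte 4 is an allocation length in bytes for REQUEST SENSE and INQUIRY,
    and a transfer length in blocks for READ(6) and WRITE(6).
    """
    if len(cdb) != 6:
        raise ValueError("expected exactly 6 bytes")
    opcode = cdb[0]
    return {
        "opcode": opcode,
        "name": OPCODES_6.get(opcode, "unknown"),
        "length_byte": cdb[4],
        "control": cdb[5],
    }


# The CDB from the dump: cdb=[ 0x3 0x0 0x0 0x0 0x14 0x0 ]
info = decode_cdb6([0x03, 0x00, 0x00, 0x00, 0x14, 0x00])
print(info["name"], "length byte =", info["length_byte"])
```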