SUMMARY: 4/380 running 4.1.1 w/ flakey ethernet behavior on le0

From: Jim Mattson (mattson@cs.UCSD.EDU)
Date: Mon Nov 25 1991 - 20:16:41 CST

Last week, I wrote:

  Recently, the le0 interface on one of our 4/380's has developed a tendency to
  "wedge" in a state where no packets can get through. Ifconfig shows the
  interface to be UP, in its normal state, but to everything on that subnet,
  it looks like the machine is down. Pinging the broadcast address from the
  4/380 results in a response only from its own address. Doing an "ifconfig
  le0 down" followed by an "ifconfig le0 up" always fixes the problem.

  There is nothing in /var/adm/messages about this. In fact, there is no
  indication that there is anything wrong at all. The 4/380 seems to think
  that there's a network out there with nobody on it, and the other machines
  seem to think that the 4/380 has disappeared from existence.

  Any ideas?

  The machine has three ethernet interfaces: le0, ie1, and ie2, where ie1 and
  ie2 are Multibus interfaces with VME adapters.

  We're running 4.1.1 with the following kernel patches:

  [ patches elided ]
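
The down/up workaround quoted above could be scripted as a stopgap. What
follows is only a minimal sketch under stated assumptions: the helper name
bounce_interface, its injectable run parameter, and the one-second pause are
all hypothetical, and it assumes ifconfig is on the PATH and the script runs
with root privileges.

```python
import subprocess
import time


def bounce_interface(name, run=subprocess.check_call, pause=1.0):
    """Take an interface down, then bring it back up (hypothetical helper).

    `run` is injectable so the command sequence can be checked without
    touching a real interface; by default each command is executed with
    subprocess.check_call, which raises if ifconfig exits non-zero.
    """
    down = ["ifconfig", name, "down"]
    up = ["ifconfig", name, "up"]
    run(down)
    time.sleep(pause)  # assumed settle time; the original post gives no figure
    run(up)
    return [down, up]
```

Calling bounce_interface("le0") mirrors the manual fix. Deciding when to call
it (for example, after a broadcast ping draws no replies from other hosts) is
left out, since the post does not describe an automated check.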

Leonardo Topa suggested using etherfind to see if anything was actually
getting out on the network. (When the interface is wedged, nothing is.) He
said that he has seen Lance chips bite the dust in power surges/outages.
(We have had instances of both, but nothing recently that might account for
the new misbehavior.)

Hal Stern recommended turning on ledebug to get some informative messages
from the Lance driver. He suggested that the possible problems included a
flaky cable, running out of mbufs, or something else (like running out of
tfds). Turning on ledebug was quite informative, but in this case not
helpful enough. With this flag on, I was rewarded with many messages like
the following:

Nov 21 03:17:58 thor vmunix: le0: Receive: overflow error
Nov 21 03:20:05 thor last message repeated 6 times
Nov 21 03:20:07 thor vmunix: le0: Transmit: BUFF set in tmd
Nov 21 03:20:07 thor vmunix: le0: Transmit underflow
Nov 21 03:20:07 thor vmunix: le0: Transmission stopped
Nov 21 03:20:07 thor vmunix: le0: csr: 4e3<RINT,INTR,INEA,RXON,STRT,INIT>
Nov 21 03:20:09 thor vmunix: le0: Receive: overflow error
Nov 21 03:20:10 thor vmunix: le0: Receive: overflow error

This happens every few minutes, until the interface wedges, and then the
messages just stop (in the example above, le0 wedged just after the last
message). However, I don't think these messages are "abnormal," because all
three of our 4/3xx systems complain like this when ledebug is turned on.
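
The csr value in the log above can be read bit by bit. The sketch below
assumes the driver is printing the Lance (Am7990) CSR0 register with its
standard bit layout; the table and the name decode_csr0 are illustrative and
not taken from the driver source.

```python
# Bit masks for Lance (Am7990) control/status register 0, high bit first.
# Assumption: the "csr" printed by the le driver is CSR0.
CSR0_BITS = [
    (0x8000, "ERR"), (0x4000, "BABL"), (0x2000, "CERR"), (0x1000, "MISS"),
    (0x0800, "MERR"), (0x0400, "RINT"), (0x0200, "TINT"), (0x0100, "IDON"),
    (0x0080, "INTR"), (0x0040, "INEA"), (0x0020, "RXON"), (0x0010, "TXON"),
    (0x0008, "TDMD"), (0x0004, "STOP"), (0x0002, "STRT"), (0x0001, "INIT"),
]


def decode_csr0(value):
    """Return the names of the CSR0 bits that are set in `value`."""
    return [name for mask, name in CSR0_BITS if value & mask]


print(",".join(decode_csr0(0x4E3)))  # RINT,INTR,INEA,RXON,STRT,INIT
```

Under that assumption, 4e3 decodes to the same names the driver printed. RXON
is set but TXON is not, which is consistent with the "Transmission stopped"
message logged just before it.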

Hal then suggested looking at the output from vmstat 10 and vmstat -i
to determine whether there might be a serial port coupling noise, a modem
gone mental, or something else on the machine eating CPU time.

Though nothing jumped out at me, Hal did point out that the attach rate and
interrupts reported by vmstat were both high. I installed patch 100259-03
(the ufs_inactive patch) to reduce the attach rate, and I kept an eye on the
interrupts to try to identify the culprit. The high interrupt rate (as many
as 1350 per second sometimes) was almost entirely due to software interrupts
posted by the zs driver when transmitting to a PostScript printer on ttyb.
It appears that the Zilog 8530 doesn't do DMA on transmit, so the zs driver
has to fake it by queuing a software interrupt for every character that gets
sent. The moral of this story is not to drive your printers on the zs lines
if you can help it.
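
As a rough sanity check on that figure, the per-character interrupt rate is
bounded by the line speed. The sketch below assumes 10 bits on the wire per
character (one start bit, eight data bits, one stop bit); the speed of ttyb
is not given above, so the baud rates listed are only examples.

```python
def chars_per_second(baud, bits_per_char=10):
    """Upper bound on characters sent per second at a given line speed.

    bits_per_char=10 is an assumption (start + 8 data + stop).
    """
    return baud / bits_per_char


for baud in (9600, 19200, 38400):
    # One software interrupt per character gives the same bound on interrupts.
    print(baud, chars_per_second(baud))
```

With one software interrupt per character, 1350 interrupts per second would
correspond to about 13,500 bits per second under these assumptions, so this
is an order-of-magnitude check rather than a measurement.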

Although the high interrupt rate may have exacerbated the problem with le0,
I don't think it's the final answer. (The logs show that nothing was
printed to that printer between 3:12 and 9:57 on the 21st, and yet le0 still
wedged at 3:20.) We've swapped CPU boards with a sister machine, and in so
doing, we've changed the rev level of the board from 54 to 70. We still see
complaints from the Lance driver when ledebug is on, but we have not seen
the interface get wedged since the swap was made (3 days ago). The problem
board in the sister machine has not wedged since the swap either, but the
other machine has no printers, fewer disks, and less ethernet traffic, so
the conditions which led to the problem on the original machine may not
arise as frequently over there (if at all).

Tres Hofmeister reported seeing the same problem on a similarly configured
4/360. I'd be curious to know what rev level your CPU is, Tres. Who knows?
Something in the hardware may have been fixed between revs 54 and 70.

Thanks to:
  (Leonardo C. Topa)
  stern@sunne.East.Sun.COM (Hal Stern - NE Area Tactical Engineering)
  tres@roke.rap.ucar.EDU (Tres Hofmeister)

Jim Mattson                  Internet:
UCSD CSE Dept. 0114          Bitnet:   jmattson@ucsd
9500 Gilman Drive            UUCP:     ...!uunet!ucsd!jmattson
La Jolla, CA 92093-0114      Voice:    (619) 534-7371
USA                          FAX:      (619) 534-7029
