SUMMARY: SPARCstation 5 lockup

From: John Reynolds x56352 (reynolds@acetsw.amat.com)
Date: Thu Jul 17 1997 - 09:02:14 CDT


Hello, again.

----- The Original Message -----

>From reynolds Wed Jul 16 07:33:12 1997
To: sun-managers@ra.mcs.anl.gov
Subject: SPARCstation 5 lockup

Hello.

I know we went through this in April, but the conditions have changed
and the problem is still there.

We are using SPARCstation 5s running SunOS 4.1.3_U1 as an operator's
platform for a product. We have a few hundred of them in use. Most
of the customization of the OS has been kernel parameter stuff, like
increasing MAXUSERS. We have added the SUN Serial/Parallel Controller
interface, and a lightpen from Interactive Computer Products. The
user interface is X11R5, based on OSF/Motif.
 
We are still seeing a few systems lock up completely : the
output to the monitor is stopped, the pointer does not move with the
mouse, even STOP-A on the keyboard does nothing. The operators have
to cycle power to get the workstation restarted.

Now, on these systems, we are seeing a lot (a LOT) of messages : "zs0:
silo overload", which means the Zilog 8530 character input silo (or
serial port FIFO) overflowed before it could be serviced. zs0 points to
the ttya port, which is where some kind of factory automation PC is
connected. It's supposedly running 9600 baud, no one knows what flow
control.
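
For a sense of the time budget involved, here is a rough sketch in
Bourne shell. It assumes 10 bits on the wire per character (8N1
framing) and a 3-character receive FIFO on the 8530; neither figure
comes from the site in question, and the function names are made up
for illustration.

```shell
# Assumptions (not measured on the affected systems):
#   - 10 bits per character on the wire (start + 8 data + stop)
#   - a 3-character receive FIFO on the 8530

# Microseconds between characters on a saturated line.
# usage: char_time_us BAUD BITS_PER_CHAR
char_time_us() {
    expr 1000000 \* "$2" / "$1"
}

# Microseconds for back-to-back characters to fill an empty FIFO.
# usage: fifo_fill_us BAUD BITS_PER_CHAR FIFO_DEPTH
fifo_fill_us() {
    expr 1000000 \* "$2" \* "$3" / "$1"
}

# char_time_us 9600 10     prints 1041
# fifo_fill_us 9600 10 3   prints 3125
```

Under those assumptions, a sender that streams continuously with no
flow control leaves the driver roughly 3 ms to service each interrupt
before characters are lost.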

Has anyone seen a case where overloading a serial port can disable the
keyboard and halt the system?

Will summarize for answers...

----- End Original Message -----

Thanks for fast responses : it seems that overloading the serial ports
will definitely clog up the workstation. The FA connection apparently
does not care about flow control (I've never heard of a serial
connection that didn't use some kind of flow control, but they say
no). Something is coming down the pipe too fast for the workstation to
handle. Michael Maciolek and Glenn Satchell give some good monitoring
advice, too.

Thanks all!

John Reynolds
Applied Materials                 There's so much comedy on television.
2901 Patrick Henry Dr. MS 5502    Does that cause comedy in the streets?
Santa Clara CA 95054                  -Dick Cavett
(408) 235-6352
reynolds@acetsw.amat.com
-------------------------------------------------------------------------

 Anecdotal evidence from our SUN service rep, and responses from :

--------------------------------------
 Michael Pavlov <misha@ml.com>

A few years ago I remember seeing these kinds of lockups/panics on 4.X
a) make sure your patch level is up to date
b) try high speed SBUS cards

good luck

--------------------------------------
 Michael Maciolek <mikem@centerline.com>

Yes. This is a shot in the dark, so you'll have to decide if it
applies to your particular situation.

What is the direction of data flow between the Sun and the factory
automation machine? If the Sun is READING data, check the settings of
the serial port to see if ECHO is turned on. If it is, your problem
may occur because the external device isn't reading its own echoed
data, and that data is backing up and filling the streams buffers,
ultimately consuming them all. Once the streams buffers are used up,
nothing that does stream I/O will work...mouse, keyboard, rlogin,
telnet, etc. Non-streams operations, like NFS service, may continue to
work. This behavior is described in Bug ID 1071453, see below.
Although this bug report talks specifically about pty/tty devices, the
same situation can arise on other streams devices like ttya/ttyb.

(bug ID 1071453 removed - jfr)

Diagnosis:

Set up cron job to do a "pstat -S" every 5 minutes or so, saving the
output in a time-stamped file. Note: pstat -S will spew a lot of
information, and you don't need most of it - you only need the portion
that talks about the particular serial port that's attached to your
external gizmo.

For each section of data produced by "pstat -S", look down the DEVICE
column (in this example, it's 12, 0) for your serial port device.
(12,0 is correct for /dev/ttya. 12,1 is ttyb) Follow down the "COUNT"
column and look for large numbers (10,000 - 50,000 or more).

If you save the output of 'pstat -S' to a time-stamped file, you'll
be able to go back to those files when the system crashes. Look for
increasing buffer COUNT values up to the time when the system crashed.

That will confirm whether your problem is actually caused by serial
port back-up. Figure out how to avoid the back-up : either turn off
character echo, or arrange for the external device to read its own
echoed characters from the serial port.
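
The snapshot step described above could be sketched as a small Bourne
shell helper along these lines. The function name, output directory
and file prefix are all hypothetical, and it has not been tried on
4.1.3_U1; it only shows the idea of saving each run to its own
time-stamped file.

```shell
# Hypothetical helper: run a command and save its output to a
# time-stamped file, one file per run.
# usage: snapshot OUTDIR PREFIX COMMAND [ARGS...]
snapshot() {
    outdir=$1
    prefix=$2
    shift 2
    # Stamp looks like 199707170902 (year month day hour minute).
    stamp=`date +%Y%m%d%H%M`
    mkdir -p "$outdir" || return 1
    # Keep stderr too, so a failing command leaves a trace in the file.
    "$@" > "$outdir/$prefix.$stamp" 2>&1
}

# Intended call on the workstation (directory is an assumption):
#   snapshot /var/tmp/pstat pstat-S pstat -S
```

Put the function and the call in a script and run it from cron. A
crontab line for every five minutes, with a made-up script path, would
look like:

  0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/adm/pstat-snap

After a lockup, the last few files before the power cycle are the ones
to compare.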

--------------------------------------
 celeste@celestial.stokely.com (Celeste Stokely)

Absolutely. There can be enough interrupts on a serial port that the
system can appear to grind to a screeching halt. The Sun SPC is a truly
bad board, and can cause no end of problems. If it were my system, I'd
replace the board with a DMA-type serial card from some other vendor.

--------------------------------------
 Karl von Jena <kvj@ix.netcom.com>

I have a similar problem with a graphics tablet. When I reboot the IPX,
the machine and the tablet fail to sync up about 30% of the time,
necessitating another reboot. However, after hundreds of messages just
like yours, it has never locked up the machine.

I am running 4.1.1 or 4.1.2 on these machines however, if that makes a
difference.

BTW, you might want to consider upgrading to 4.1.4, I've found that it
is quite stable.

--------------------------------------
 Glenn Satchell <Glenn.Satchell@Uniq.com.au>

Check vmstat -i on these systems to see what the interrupt load is
like. Something interrupting the hell out of the serial port could
cause some problems. Also make sure you have the TTY and kernel patches
installed.

101508-15 SunOS 4.1.3_U1: Sun4m kernel patch
101621-04 SunOS 4.1.3_U1: tty patch
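
One way to keep a record of the vmstat -i check mentioned above is a
sampling loop that appends time-stamped output to a single log. This
is only a sketch in Bourne shell; the function name, log path,
interval and sample count are assumptions.

```shell
# Hypothetical helper: append COUNT time-stamped samples of a command
# to LOGFILE, sleeping INTERVAL seconds after each sample.
# usage: sample_loop LOGFILE INTERVAL COUNT COMMAND [ARGS...]
sample_loop() {
    logfile=$1
    interval=$2
    count=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Marker line so samples can be told apart later.
        echo "=== `date`" >> "$logfile"
        "$@" >> "$logfile" 2>&1
        i=`expr "$i" + 1`
        sleep "$interval"
    done
}

# Intended call (one sample a minute for a day; path is an assumption):
#   sample_loop /var/tmp/vmstat-i.log 60 1440 vmstat -i
```

Comparing the serial-port interrupt counts between consecutive samples
shows whether the rate climbs before a lockup.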

--------------------------------------
 Mariel Feder <unix.support@central.meralco.com.ph>

WE DID HAVE THIS PROBLEM ONCE. I DON'T REMEMBER IF THE STOP-A WAS
DISABLED OR NOT, BUT WHAT HAPPENED WAS THAT WE STARTED GETTING A LOT OF
MESSAGES: ZS0: SILO OVERFLOW, THE MACHINE STARTED TO BE SLOWER AND
SLOWER, AND FINALLY HUNG.

WHEN THAT HAPPENS, TRY TO DISCONNECT WHAT YOU HAVE ON THE SERIAL PORT
TO SEE IF THE PROBLEM DISAPPEARS.

OUR PROBLEM WAS A HARDWARE PROBLEM (I DON'T REMEMBER IF WITH THE PORT
ITSELF OR THE WIRE CONNECTED TO IT).

--------------------------------------
 Stefan Voss <s.voss@terradata.de>

Our Sparc 10s (or better: some of them, others not) have locked up
completely, when the network was heavily overloaded. When we switched
from thinwire to twisted pair, upgraded some machines to 100 MBit and
used a 10/100 MBit switch, these problems did not occur any longer.

But in our case, we got no messages about the zs0 silo overflow. So you
might have a different problem.

Although... I remember an Ultra Sparc, which complained all day
about network cable problems AND zs? silo overflow. These problems
disappeared when we changed the CPU board (it was defective) and changed
our network. Hence I do not know exactly what the problem was (but I
suspect the CPU board in our case).

--------------------------------------


