SUMMARY: help with 690MP system hangs

From: Timothy Baum 432-2765 (satmb@gauss.med.harvard.edu)
Date: Mon Jul 11 1994 - 18:08:41 CDT


A followup to my posting regarding our mysterious 690 system hangs:

We have tracked down the problem to a PC attached to one of the ALM-2
serial ports. When the PC is turned on or off, the 690 hangs. The PC
user's arrival/departure schedule approximately matches the pattern of
the system hangs, and she was away from the office on days when the
hangs did not occur. Disconnecting the serial cable is a quick fix
(except for that user). I have not yet found the problem with the PC --
the machine is properly grounded and I don't see voltage spikes on the
RS-232 lines -- but will keep looking. There may be a transient
voltage on an RS-232 pin during on/off that my voltmeter is not
sensitive enough to detect. There is also a time-slicing multiplexor
in the path which may be the culprit. The PC hasn't been modified in
at least 6 months, so we don't know why it started acting up now.

I discovered this by methodically turning off and on pieces of
equipment directly connected to the 690 until one of them produced the
system hang.
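
The elimination procedure above can be sketched in a few lines. This is
only an illustration, not the procedure as actually carried out: the
names find_culprit, power_cycle and server_alive are hypothetical, and
in practice both steps were done by hand (someone flips a switch, then
someone checks the console). The demonstration uses a simulated server.

```python
def find_culprit(devices, power_cycle, server_alive):
    """Power-cycle each device in turn; return the first one that hangs the server."""
    for device in devices:
        power_cycle(device)      # operator switches the device off and on again
        if not server_alive():   # e.g. check that the console still responds
            return device
    return None                  # nothing on the list produced a hang


# Simulated demonstration (hypothetical device names, no real hardware involved).
BAD_DEVICE = "pc-on-alm-port"
state = {"alive": True}

def fake_power_cycle(device):
    # In this simulation only one device takes the server down.
    if device == BAD_DEVICE:
        state["alive"] = False

def fake_server_alive():
    return state["alive"]

devices = ["terminal-1", "printer", "pc-on-alm-port", "modem"]
culprit = find_culprit(devices, fake_power_cycle, fake_server_alive)
print(culprit)
```

Stopping at the first device that produces a hang keeps the number of
forced reboots to a minimum, which matters when every hang means
power-cycling the server.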

The problem appears to have been unrelated to the following possible
causes:
   * Sun 690 AC power or grounding
   * SunOS 4.1.3 multiprocessing deficiencies (we tried using a single
     supersparc processor instead of four Ross CPUs; the system still
     hung at the same times)
   * System board, CPUs, or power supply
   * ALM-2 board that had been acting up earlier and was replaced. (We
     even replaced it again for good measure. The bad PC was attached to
     a different ALM.)
   * Processes running at the time of hang
   * The weather (it was apparently a coincidence that it only happened
     on hot days, but that got us thinking of air conditioning and other
     power issues rather than a user turning on her PC)

Many thanks to the sun-managers who offered help. One reply hit the
bullseye:

Mike Frank:
> ... You mentioned "an ALM-2" board that
> stopped working. I'm guessing you have some devices "hard-wired"
> with copper wire to the 690. Since you seem to be looking at the right
> things regarding input AC power, I'm also assuming the server is in
> a computer-room-like environment that has had due attention paid to
> power supply and good building grounding. "AC grounding" might be the
> problem.
>
> We once experienced a problem with a PC that was sharing an AC
> extension cable with a hotplate or some such "user" device. The
> ground pin was removed to use an old AC receptacle, and the PC's
> chassis was "hot" and caused the async chips of a terminal server to
> cook when the user turned on the PC. In this case the terminal
> server felt the full force of 120v AC current through the RS-232
> wiring, but your case might be just a momentary hit caused by an
> on/off switch of a user device (considering your 9:00/5:00 weekday
> schedule of the problem).
>
> This problem derives from the computer having relatively new,
> up-to-date wiring designed to minimize potential to ground. Older
> wiring and big machinery might have missing grounds, higher
> resistance grounds, or might be wired to secondary paths to ground
> such as the building steel frame. These problems are especially bad
> in older buildings. I know, our building is sixty years old and is
> undergoing an electrical renovation that won't be complete until next
> year.
>
> Anyway, come Tuesday morning and you still don't have answers, try
> doing a walk-around eyeball survey of everything directly connected
> to the 690, looking for screwed up extension cords, faulty grounds
> using an electrician's receptacle tester ($5 at the hardware store),
> and any device having some kind of problem (this can be hard to find
> as we didn't find the culprit PC until a technician removed the case
> and touched the chassis to find it had a bad power supply that made
> the metal chassis 120v HOT).
>
(Note that we couldn't detect this problem by eyeball or by electrician's
outlet tester, only by power-cycling the PC.)

Some other helpful suggestions:

Kevin W. Thomas:
> o Try unplugging the keyboard and plugging it back in.
> o Try hitting L1-A a *lot* of times.
>
> One of them might get you to an "ok" prompt so you can get a crash dump.
(Unplugging the keyboard worked! This was the only way I was able to
produce a crash dump so we could see exactly what was going on at time
of hang, which helped us rule some things out. Note that even using
this procedure the system wouldn't reliably sync the disks -- usually
it didn't even try, but just proceeded to the crash dump; other times
it started to sync but failed; only once did the disk sync actually
succeed. This still seems easier on the system than power-cycling.)

Dotty Pon:
> Try looking to see who logs in close to 9am and who logs out at close
> to 5pm. It looks like someone is causing the machine to hang when they
> log in (by starting some process) and when they log out (start/stop
> a process). I've seen this happen before.
(In this case, it wasn't the login/logout processes, but the act of
turning on/off the PC.)
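
The same idea works for any timestamped event, whether a login/logout or
a device being switched on or off: line the events up against the hang
times and see whose events keep landing nearby. A minimal sketch,
assuming the hang times and per-user event times have already been
collected by hand or from logs. The function name suspects, the user
names, and all timestamps below are hypothetical sample data, not taken
from the real system. The 20-minute window matches the variation in hang
times reported in the original posting.

```python
from datetime import datetime, timedelta


def suspects(hangs, events, window_minutes=20):
    """Count, per user, how many hangs had one of that user's events nearby."""
    window = timedelta(minutes=window_minutes)
    counts = {}
    for hang in hangs:
        # Users with at least one event within the window around this hang.
        nearby = {user for user, when in events if abs(when - hang) <= window}
        for user in nearby:
            counts[user] = counts.get(user, 0) + 1
    # Most frequently implicated users first, ties broken by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data.
hangs = [
    datetime(1994, 7, 6, 9, 5),
    datetime(1994, 7, 6, 17, 10),
    datetime(1994, 7, 7, 8, 55),
]
events = [
    ("alice", datetime(1994, 7, 6, 9, 0)),
    ("alice", datetime(1994, 7, 6, 17, 2)),
    ("alice", datetime(1994, 7, 7, 9, 1)),
    ("bob", datetime(1994, 7, 6, 13, 30)),
    ("bob", datetime(1994, 7, 7, 9, 10)),
]
print(suspects(hangs, events))
```

A user who appears next to nearly every hang is worth a closer look,
though a match only shows coincidence in time, not the mechanism.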

Henk Melching:
(Henk offered several pieces of good advice: check power to the 690;
measure the voltage between neutral and ground; check for current in
the ground wire; check the connections in the main power plug and in
the service panel; and make sure the RS-232 data cable is grounded on
only one side for equipment attached to serial ports. He also
recommends using an isolation transformer for input power, especially
for 110V-110V circuits, which sounds like a good idea on general
principle; our engineer from Boston Edison also recommended one of
these, plus a separate ground wire back to the main building ground, to
isolate the computer's power from whatever else is happening in the
building.)

Mark Allyn:
> Is the system near any labs dealing with RF or high voltage
> work? Things such as Tesla coils, Van de Graaff generators, and medical
> equipment that uses RF could be culprits.
> [Try borrowing] a decent oscilloscope (one with a very high
> frequency bandwidth and very short rise time and with a very
> high impedance probe) and connect the scope to one of the
> power busses in the Sun and closely watch for spikes. ...
> Another suggestion is to put a spectrum analyser in the same
> location as the Sun with an antenna and see what kind of RF
> energy is in the environment.
(Plausible suggestions, not applicable here. I don't know where to
get an oscilloscope or spectrum analyzer, and am glad we didn't need them.)

David St. Pierre:
> Patch-ID# 101408-01
> Synopsis: SunOS 4.1.3: SS10-51 or SS600-51 may hard hang or watchdog reset
> [Patch description only mentions model 51 supersparc cpus,]
> ... but i also have ross cpus and it definitely fixed my problems.
> i feel that the readme understates the level of the problem. we had the
> 670 crashing every few days for about 2 months before we finally applied
> the patch.
(Didn't fix my problem, but might help someone else.)

Thanks again to all who responded:

dotty@elvis.tgivan.wimsey.bc.ca (Dotty Pon)
david@srv.PacBell.COM (David St. Pierre)
Chris Wozniak TISC <chris@tisc.edu.au>
Henk Melching <hmelchin@nl.oracle.com>
Birger.Wathne@vest.sdata.no (Birger A. Wathne)
allyn@netcom.com (allyn)
mfrank@ftc.gov (Mike Frank)
davee@lightning.mitre.org (David N. Edwards)
Dan Stromberg - OAC-DCS <strombrg@bingy.acs.uci.edu>
kwthomas@nsslsun.nssl.uoknor.edu (Kevin W. Thomas)

Original posting follows:

>
> My primary server has been hanging twice a day, around 9:00am and
> 5:00pm weekdays during hot/humid weather. Machine is a 690MP with 4
> original Ross CPUs (not supersparc), running SunOS 4.1.3 with numerous
> patches:
>
> 100075-11 100383-06 100513-04 100726-16
> 100170-10 100407-09 100557-03 100804-03
> 100173-10 100412-02 100564-07 100890-08
> 100224-06 100444-54 100581-04 100972-01
> 100257-05 100448-02 100623-03 101008-01
> 100296-04 100458-03 100631-01 101072-01
> 100342-03 100468-03 100645-01 101080-01
> 100359-06 100492-09 100713-01 101200-02
> 100377-09 100507-05
>
> System has hung in the past, but not this frequently, and SunOS
> patches (100726-16 in particular) seemed to have reduced that problem
> until now. No changes had been made to the kernel for over a month
> prior to these recent problems.
>
> System was originally a 490, upgraded to 690MP in 1992. It has 128MB
> memory (32 x 4MB SIMMS), 6 ALM-2 controllers, 3 ISP-80 IPI controllers,
> DSBE (diff. scsi) Sbus card, and an Sbus adapter for the old mono
> monitor.
>
> System is in an air-conditioned room that is kept between 60 and 65
> degrees Fahrenheit. System is not on a UPS.
>
> System has been hard hanging repeatedly around 9:00am and 5:00pm on
> weekdays (exact times vary by up to 20 minutes). This seems to happen
> on days when weather is particularly hot and/or humid (high 80s - 90s) --
> 6 days out of the last 18. System has not hung on weekends despite
> heat. When system hangs, L1-A does not interrupt, and CPU LED lights
> are frozen (but always in a different pattern). No error messages
> appear on console. We must reboot by cycling power to CPU; no crash
> dump is produced.
>
> These system hangs started soon after some hardware failures in which
> an ALM-2 board and an IPI disk controller were replaced.
>
> During system hangs, there are no visible changes in room lighting, in
> operation of PCs or other Sun workstations (except that they hang
> waiting for server to come up), or other electrical equipment. A
> continually-repeating "ps" listing, "perfmeter" graphs of cpu, load,
> collisions, errors, etc. just prior to hangs show nothing out of the
> ordinary. There are no "cron" jobs scheduled for those times. We are
> not aware of any new equipment nearby that could be emitting RF noise
> or affecting power, especially at those times. There is construction
> nearby, but their schedule is more like 7-to-3, not 9-to-5. Most
> building systems (e.g. air conditioning) run 24 hours.
>
> Because of the timing of the hangs, we suspect power involvement, but
> a voltage analyzer attached to the AC line does not show large fluctuations.
> 690MP electrical spec gives operating range of 180 - 264 volts, our range
> appears to be 196 - 205 volts. Dips to 196 volts were at night and not
> closely associated in time with system hangs. Analyzer sampling frequency
> has been increased, so we may yet see something that was missed before.
> I am still suspicious of external causes such as power but don't know
> what else to look for.
>
> System board, CPUs, and power supply have been replaced, and air
> filters cleaned to make sure air flow is adequate, without making
> a difference. One IPI disk controller, which had been replaced shortly
> before system started hanging, was removed and slot jumpered over (disks
> moved to another controller), with no effect.
>
> Our maintenance provider (Polaris), power company (Boston Edison) and I are
> running out of ideas. Would greatly appreciate suggestions.
>
> Thanks in advance for any help you can provide. Replies by
> email, I will summarize to the list.
>
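
The AC readings quoted in the original posting can be checked against
the quoted operating range with simple arithmetic. A minimal sketch
using only the four figures given above; the function and variable
names are made up for illustration.

```python
SPEC_MIN_V, SPEC_MAX_V = 180.0, 264.0  # 690MP operating range quoted in the posting
OBS_MIN_V, OBS_MAX_V = 196.0, 205.0    # range observed on the voltage analyzer


def margins(spec_min, spec_max, obs_min, obs_max):
    """Return (headroom above the spec minimum, headroom below the spec maximum)."""
    return obs_min - spec_min, spec_max - obs_max


low_margin, high_margin = margins(SPEC_MIN_V, SPEC_MAX_V, OBS_MIN_V, OBS_MAX_V)
in_spec = low_margin >= 0 and high_margin >= 0
print(f"low margin {low_margin} V, high margin {high_margin} V, in spec: {in_spec}")
```

The observed range sits inside the spec with room to spare on both
sides, which fits the eventual finding that the server's AC supply was
not the cause. A check like this only covers what the analyzer recorded;
a transient shorter than its sampling interval would not show up.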

--
Timothy Baum
System Administrator
Channing Laboratory
satmb@gauss.med.harvard.edu


