Dear Sun managers,
The problem that I reported on January 16 was this:
>Jan 16 14:07:23 cm2f vmunix: BAD TRAP
>Jan 16 14:07:23 cm2f vmunix: pid 944, `moldyn': Data fault
>Jan 16 14:07:23 cm2f vmunix: kernel read fault at addr=0x0, pme=0x0
>Jan 16 14:07:23 cm2f vmunix: Bus Error Reg 80<INVALID>
>Jan 16 14:07:23 cm2f vmunix: pc=0xf8006bf0, sp=0xf810cfa0, psr=0x110015c6, context=0x2a
>Jan 16 14:07:23 cm2f vmunix: g1-g7: f8006bf8, f8114dfc, 5f8, f8184b60, 2fc, f8132c00, f8132c00
>Jan 16 14:07:23 cm2f vmunix: Begin traceback... sp = f810cfa0
>Jan 16 14:07:23 cm2f vmunix: Called from f8060108, fp=f84a8ca0, args=0 0 ff24252c f8138800 14 ff1f8c30
>Jan 16 14:07:23 cm2f vmunix: End traceback...
>Jan 16 14:07:23 cm2f vmunix: panic: Data fault
>
>In all cases, the program causing the panic was the above "moldyn".
Note added: this is now known to occur for other programs as well.
>We are running SunOS 4.1.1, no Sun patches applied (to my knowledge).
>
>The 470 is primarily used as the front-end to a Connection Machine
>supercomputer, and it has a second SCSI controller (Ciprico) with a
>bunch of 1 GB disks. Both of these devices are attached to the VME bus.
>
>I do not know where to look for the cause of the problem. Is this a
>hardware or a SunOS problem? I looked in old sun-managers messages,
>and found that both types of problems have caused data faults in the
>past (SunOS releases prior to 4.1.1).
As pointed out by Hal Stern and Chris Drake, the first thing to do is
to produce a symbolic traceback:
    adb -k /vmunix /dev/mem
    physmem 1ffd
    f8006bf0?i
    _vme_read_vector+0x4c: ld [%o0], %o0
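For anyone who has to repeat this for several panics, the address typed into adb is simply the pc value from the panic message. Below is a minimal sketch that pulls the pc out of a logged line and formats the corresponding adb request. The function names are my own invention for illustration, and the pattern assumes the "pc=0x..." layout shown in the log above.

```python
import re

# Matches the program counter in a "pc=0x..., sp=0x..." panic line.
# Assumption: the log uses the same layout as the message quoted above.
PC_PATTERN = re.compile(r"\bpc=0x([0-9a-fA-F]+)")


def extract_pc(log_text):
    """Return the first faulting pc found in log_text as an int, or None."""
    match = PC_PATTERN.search(log_text)
    return int(match.group(1), 16) if match else None


def adb_disassemble_command(pc):
    """Format the adb request that disassembles one instruction at pc."""
    return "%x?i" % pc


sample = (
    "Jan 16 14:07:23 cm2f vmunix: pc=0xf8006bf0, sp=0xf810cfa0, "
    "psr=0x110015c6, context=0x2a"
)
pc = extract_pc(sample)
print(adb_disassemble_command(pc))
```

The printed line is what gets typed at the adb prompt after starting it with "adb -k /vmunix /dev/mem".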
The _vme_read_vector code chokes on a "spurious interrupt", i.e., an
interrupt which appears to originate from a nonexistent VME device.
The faulting instruction loads through %o0, which is consistent with
the "kernel read fault at addr=0x0" in the panic message: the register
held a null pointer. The intended action was to log a "spurious
interrupt" message, but a panic resulted instead. This analysis was
done by shj@ultra.com (Steve Jay) in a summary kindly provided by
Todd Pfaff.
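To illustrate the intended behaviour, here is a toy model of an interrupt dispatcher that checks for a missing handler before using it. This is only a sketch of the general idea and not the SunOS code: the function names, the table layout, and the vector numbers 0x40 and 0x7f are all made up for the example.

```python
# Toy model only: every name and vector number here is hypothetical.
def make_dispatcher(handlers, log):
    """Build a dispatch function over a vector-to-handler table."""

    def dispatch(vector):
        handler = handlers.get(vector)
        if handler is None:
            # No device registered for this vector: log it and carry on
            # instead of following a missing (null) entry.
            log.append("spurious interrupt, vector 0x%x" % vector)
            return False
        handler()
        return True

    return dispatch


serviced = []
messages = []
dispatch = make_dispatcher({0x40: lambda: serviced.append(0x40)}, messages)

dispatch(0x40)  # known device: its handler runs
dispatch(0x7f)  # unknown vector: logged, no crash
print(messages)
```

The panic we saw corresponds to the case where the check is skipped and the missing entry is dereferenced anyway.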
There may be a way to adb the kernel to catch the faulty interrupts.
There is no official Sun patch for the problem, although it is reported
to have been fixed in SunOS 4.1.2. The real cause of the problem is
that some device sends a bad interrupt onto the VME bus, or the CPU
makes one up itself, so in the end our problem results from some
(at the moment unknown) hardware failure. If you want a more complete
analysis, I can mail it upon request.
My thanks go to these kind folks:
stern@sunne.East.Sun.COM (Hal Stern - NE Area Systems Engineer)
cadence!esanborn@uunet.UU.NET (Ed Sanborn)
todd@flex.Eng.McMaster.CA (Todd Pfaff)
shj@ultra.com (Steve Jay)
Chris.Drake@Corp.Sun.COM (Chris Drake)
len@math.nwu.edu (Len Evens)
dm@Think.COM (Dave Mankins)
With best regards,
Ole
Ole Holm Nielsen
Laboratory of Applied Physics, Building 307
Technical University of Denmark, DK-2800 Lyngby, Denmark
E-mail: Ole.Holm.Nielsen@ltf.dth.dk
Telephone: (+45) 42 88 24 88 ext. 3187
Telefax: (+45) 45 93 23 99
Permanent address:
UNI-C, Building 305
Technical University of Denmark, DK-2800 Lyngby, Denmark
E-mail: Ole.H.Nielsen@uni-c.dk
Telephone: (+45) 42 88 39 99 (dial-tone) 2404 or 2244