SUMMARY: SPARCstation 10/41 crashes frequently

From: Claus Assmann (ca@informatik.uni-kiel.de)
Date: Fri Dec 17 1993 - 04:29:15 CST


Our SPARCstation 10/41 has now been running fine for about one week.
I'm not sure which of the following actions is responsible for
this, but it seems that the tcp_wrapper triggered a bug in the kernel.
From the README of a new version (5.1):

[option KILL_IP_OPTIONS]
! All this cannot be used with SunOS 4.x because of a kernel bug in the
! implementation of the getsockopt() system call. Kernel panics have been
! reported for SunOS 4.1.1 and SunOS 4.1.2. The symptoms are "BAD TRAP"
! and "Data fault" while executing the tcp_ctloutput() kernel function.
There is a patch (100804-02) for a problem like this:
Keywords: getsockopt, RESET, trap, mbuf, leak, bus, panic, TCP
Synopsis: SunOS 4.1.1,4.1.2,4.1.3: TCP socket and reset problems

        If an application does a getsockopt() on a SOCK_STREAM (TCP) socket
        after the other side of the connection has sent a TCP RESET for
        the stream, the kernel gets a Bus Trap in the tcp_ctloutput() or
        ip_ctloutput() routine.

We have not installed this patch so far, because we compiled the new
version of the tcp_wrapper without KILL_IP_OPTIONS.
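
As an illustration of the call that patch 100804-02 talks about, here is
a minimal C sketch of a getsockopt() on a TCP socket that asks for the
IP options, which is, as far as I understand it, the kind of request
KILL_IP_OPTIONS makes. The helper name read_ip_options is hypothetical
and not taken from the tcp_wrapper sources, and socklen_t is the modern
type (SunOS 4.x used a plain int for the length).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper, not from the tcp_wrapper sources: fetch the IP
 * options recorded for a connected TCP socket.
 * Returns the number of option bytes, or -1 with errno set on failure.
 */
int read_ip_options(int fd, unsigned char *buf, int buflen)
{
    socklen_t len = (socklen_t)buflen;

    /* A getsockopt() on a SOCK_STREAM socket, as named in the patch README. */
    if (getsockopt(fd, IPPROTO_IP, IP_OPTIONS, buf, &len) < 0)
        return -1;

    return (int)len;
}
```

According to the patch README, it is a call of this kind, made after the
other side has sent a TCP RESET, that ends in the Bus Trap on an
unpatched kernel.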

Additionally, we have installed patch 100726-12 (part of the README:
Synopsis: SunOS 4.1.3: sun4m jumbo patch for kernel performance and memory bugs

Problem Description:

1097555 kernel panics with kmem_free: block already free
1101875 Heavily Loaded SPARCstation 10 May Hang
1102235 User Programs May Halt and Coredump
1106399 fault address register (MFAR) failures on Viking machines
1116706 User Progs occasionally dump core on SS10/20, 30
1110382 bug in locore.s logic which made the system loop forever.
1125085 mfar workaround can fail for kernel store using out registers.
1118195 kernel panics with freeing free frag, mapsearch corrupted, or
        free block overlap.
1127988 User program core dumps with SIGSEGV only on a SS10 model 41
1121791 bad trap, Invalid Address on supv data store running lwp
1123885 BAD TRAP memory addr align from _hat_map_percpu
1130786 multiple mbus-to-sbus asynchronous faults panic system
1109160 4.1.3 sun4m hard hangs at random intervals: GENERIC kernel
)

Debugging the core in /var/crash/`hostname` can be done with:
# adb -k vmunix.N vmcore.N
$c

I got a script named corecheck, which produces some interesting information.

Many thanks to:
Boyd Johnson <johnson@spectra.com>
Dan Stromberg - OAC-DCS <strombrg@hydra.acs.uci.edu>
Houman Safai <hsafai@esri.com>
Kevin W. Thomas <kwthomas@nsslsun.nssl.uoknor.edu>
Stewart Doyle <s.m.doyle@ucc.hull.ac.uk>

Original question:
! Our main server (SPARCstation 10/41, SunOS 4.1.3 with various patches)
! crashed 5 times in a row today!
!
! From /var/adm/messages:
! BAD TRAP: cpu=0 type=9 rp=f0630d4c addr=20 mmu_fsr=326 rw=1
! MMU sfsr=326: Invalid Address on supv data fetch at level 3
! regs at f0630d4c:
! psr=419000c2 pc=f001f338 npc=f001f33c
! y: 0 g1: f001f32c g2: 8000000 g3: ffffff00
! g4: 0 g5: f0631000 g6: 0 g7: 0
! o0: f04ae600 o1: f04ae600 o2: ffbfffff o3: 20088001
! o4: f0630fe0 o5: effff378 sp: f0630d98 ra: f0631000
! pid 2811, `tcpd': Data fault
! kernel read fault at addr=0x20, pme=0x0
! MMU sfsr=326: Invalid Address on supv data fetch at level 3
! rp=0xf0630d4c, pc=0xf001f338, sp=0xf0630d98, psr=0x419000c2, context=0x189
! g1-g7: f001f32c, 8000000, ffffff00, 0, f0631000, 0, 0
!
! We have savecore enabled, but I can't debug the core:
! # file vmcore.7
! vmcore.7: data
!
! Using trace in /usr/etc/crash, I get:
! FP PC SYM+ OFF ARGS
! f068ad88 f004e064 _panic+ b4 41800ae3 f04ac9e0 fd00a800 0 fd009b84 3
! (?)
! f068ade8 f014022c _sleep+ 114 f01db180 1a 0 100 1a f04adf2c
! f068ae48 f005f3dc _connect+ 1b8 f068b000 0 f068b000 f068b000 0 ff64b20c
! f068aec0 f01414f0 _syscall+ 3bc f068b000
!
! or:
! FP PC SYM+ OFF ARGS
! f05d6cc0 f004e064 _panic+ b4 41800ae5 f01db1a0 fd00a800 0 fd009b84 3
! (?)
! f05d6d20 f014022c _sleep+ 114 f01db1a0 1a 0 100 1a f04ac344
! f05d6d80 f005e17c _sbwait+ 14 ff647ab8 fb08f9f8 1ced 0 ffffffff 2
! f05d6de0 f0075494 _svc_run+ 28 fb06ab20 186a3 2 f0023454 0 410000e7
! f05d6e40 f0021cd0 _nfs_svc+ 260 f05d6fe0 4d8 f01b0ef8 f01b13d0 f05d7000 f01b13d0
! f05d6ec0 f01414f0 _syscall+ 3bc f05d7000
!
! Last week, the system crashed 3 times in one hour. Sun exchanged the
! CPU and the motherboard (they assumed a defect in the MMU). It is
! always 'tcpd' (the tcp_wrapper) which is listed in the messages. The
! same tcpd runs on the rest of our workstations (about 40 Suns and
! others) without problems.
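
The sfsr value in the messages above can be taken apart by hand. The
following C sketch does this under the assumption that the low bits
follow the SPARC V8 reference MMU layout (level in bits 9:8, access type
in bits 7:5, fault type in bits 4:2, fault-address-valid in bit 1); the
struct and function names are made up for this example.

```c
/* Fields of the synchronous fault status register (assumed V8 layout). */
struct sfsr_fields {
    unsigned level;        /* bits 9:8, page table level of the fault */
    unsigned access_type;  /* bits 7:5, 1 = load from supervisor data */
    unsigned fault_type;   /* bits 4:2, 1 = invalid address error */
    unsigned addr_valid;   /* bit 1, fault address register is valid */
};

/* Split a raw sfsr value, e.g. 0x326 from the log, into its fields. */
struct sfsr_fields decode_sfsr(unsigned long sfsr)
{
    struct sfsr_fields f;

    f.level       = (unsigned)((sfsr >> 8) & 0x3);
    f.access_type = (unsigned)((sfsr >> 5) & 0x7);
    f.fault_type  = (unsigned)((sfsr >> 2) & 0x7);
    f.addr_valid  = (unsigned)((sfsr >> 1) & 0x1);
    return f;
}
```

For sfsr=326 (hex) this gives level 3, access type 1 and fault type 1,
which agrees with the kernel's own text "Invalid Address on supv data
fetch at level 3".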

Regards,

Claus
University of Kiel, Germany
Department of Computer Science, Preusserstr. 1 - 9, D - 24105 Kiel
ca@informatik.uni-kiel.de, Phone: +49-431-5604-57, Fax: +49-431-566143


