SUMMARY- Re: BAD TRAP - how serious?

From: Lisa Weihl (lweihl@cs.bgsu.edu)
Date: Tue Jun 02 1998 - 16:55:21 CDT


It's been a few weeks so I thought I should send this out. The consensus
was it could be hardware or software. Since I was just getting set up with a
company that sells Kingston memory (Univ. PO's are a pain) I was waiting to
see if any memory died. It'll be much easier to order quickly now if I need
to. If it had been the MMU it would have required a Sun call which would
have been another long process since we don't pay for service.

Chris O'Neal asked why we don't pay for a service contract for a main
server. Answer: It's just a department-level server. They used to pay
for service, but the faculty decided it was better to put the $5000 they
were paying toward another machine and have it configured to take the
server's place if hardware goes down. Well, that idea hasn't been
functioning for a while (the old admin always had a Sun on his desk
configured to take over if necessary; I don't have a desktop machine),
and that makes me nervous. We're attempting to clean up our Unix systems
in the next year. Being a new admin (it's been a year, so I'm going to
have to quit saying that soon :-) I have a lot to learn but am looking
forward to the challenge.

I solved 2 other problems today that I had posted about earlier and didn't
get definite answers on so I'm feeling good!!

Thanks to all who helped: (Original post and replies with names are below)
This is a long post but I always hope that a complete SUMMARY will help
someone searching the archives.

Lisa

ORIGINAL POST>>>>>>>>>

>My main server (SS10 running SunOS 4.1.3) rebooted twice today. It just has
>an old VT320 terminal as the console. By the time I got to the machine room
>the first time it was already rebooting itself and no messages of any
>consequence were left behind in the logs.
>
>The only thing I could find was a client machine that mounts /home off of
>the main server had a full / partition. I didn't think that that should
>cause the main server to crash. However, when I attempted to log on to that
>client to clear the full / my main server crashed immediately.
>Coincidence or could NFS be acting up?
>
>On the second reboot I did get relevant info:
>
>May 18 18:06:05 maestro vmunix: BAD TRAP: cpu=0 type=9 rp=f04bdd4c addr=20 mmu_fsr=326 rw=1
>May 18 18:06:05 maestro vmunix: MMU sfsr=326: Invalid Address on supv data fetch at level 3
>May 18 18:06:05 maestro vmunix: regs at f04bdd4c:
>May 18 18:06:05 maestro vmunix: psr=419000c7 pc=f0020634 npc=f0020638
>May 18 18:06:05 maestro vmunix: y: 0 g1: f0020628 g2: 8000000 g3: ffffff00
>May 18 18:06:05 maestro vmunix: g4: c g5: f04be000 g6: 0 g7: 0
>May 18 18:06:05 maestro vmunix: o0: 0 o1: 0 o2: 2e000 o3: 0
>May 18 18:06:05 maestro vmunix: o4: 0 o5: 0 sp: f04bdd98
>
>
>I searched the archives and found 2 different responses to questions like
>these. One definitely said an error of this type was bad MMU or RAM. The
>other said it was software.
>
>I have no Sun maintenance contract and no spare RAM on hand for this
>machine. Should I be quickly finding some RAM? And what about the MMU: is
>that easily replaceable like RAM, or will that require a Sun call?

REPLIES>>>>>>>>

****************************************************************
Chris Drake <Chris.Drake@Corp.Sun.COM>

>I searched the archives and found 2 different responses to questions like
>these. One definitely said an error of this type was bad MMU or RAM. The
>other said it was software.

To be honest, it could be either.

The best thing to do is get a corefile from a crash, if it happens again,
and check it out. To do this, you have to enable the "savecore" line inside
the /etc/rc.local file (I think; it should be towards the end). (My 4.x
knowledge is a bit rusty, I'm afraid). It may be prefaced by some shell
code that creates a directory to put it in which should be uncommented also.

Bad traps are caused by bogus pointers in the kernel, mostly. The system
equivalent of a Segmentation Violation. Hardware is certainly a possible
cause, but I assume software on these until it's proven otherwise.

With the information you have in the /var/adm/messages file, you should be
able to get a rough idea where it happened, if you're really interested. It's
not *completely* trivial, but it can be done. Drop me a line if you want
more info...
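
As a rough illustration of pulling the key fields out of such a log
record, here is a minimal Python sketch. The function name parse_bad_trap
and the regular expression are assumptions based only on the two log
samples in this thread, and the trap names are taken from the SPARC V8
trap type table for just the two types that appear here.

```python
import re

# Names for the two trap types seen in this thread (SPARC V8 tt values).
TRAP_NAMES = {0x7: "mem_address_not_aligned", 0x9: "data_access_exception"}

def parse_bad_trap(text):
    """Extract type, addr and pc (as integers) from one BAD TRAP record."""
    fields = {}
    for name in ("type", "addr", "pc"):
        # \b keeps "pc=" from matching inside "npc="; values are hex,
        # with or without a leading 0x.
        m = re.search(r"\b%s=(?:0x)?([0-9a-fA-F]+)" % name, text)
        if m:
            fields[name] = int(m.group(1), 16)
    fields["trap_name"] = TRAP_NAMES.get(fields.get("type"), "unknown")
    return fields

sample = (
    "BAD TRAP: cpu=0 type=9 rp=f04bdd4c addr=20 mmu_fsr=326 rw=1\n"
    "psr=419000c7 pc=f0020634 npc=f0020638\n"
)
info = parse_bad_trap(sample)
print(info["trap_name"], hex(info["addr"]), hex(info["pc"]))
```

The pc value is the one to look up in the kernel symbol table; the
sketch only extracts it and does not do that lookup.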

****************************************************************
Thomas Anders <anders@hmi.de>

First be sure to have installed all the critical patches -- at least
the ip driver can cause crashes when network traffic is high (this
could explain why you only see it when NFS mounting).
We've had similar problems with an SS10 under SunOS 4.1.4 that went away
after patching.

****************************************************************
"Duncan C. White" <css1dw@ee.surrey.ac.uk>

Certainly sounds like a RAM/MMU problem to me - one thing you could try is
taking the machine down and removing and reinserting all RAM SIMMs, ie. reseat
them! Then, at the monitor prompt (ok) try the diagnostics:

        setenv diag-switch? true
        reset

to do the extensive tests. You'll need to 'setenv diag-switch? false' again
at the end! Another test explicitly tests the memory:

        setenv selftest-#megs 64 (replace 64 with how much memory your
                                    machine has!)
        test-memory

If the fault continues, you can try (if you have the patience) removing one
SIMM at a time, putting the machine back into service and seeing if it crashes
after N days.. If you had one SIMM you could nick from another machine, you
could remove ALL the mail server's memory and just put in the borrowed SIMM,
see if the machine crashes with none of the original memory.. Failing that,
I'm afraid it's going to be a hardware support call - maybe it's a good time
to persuade your employer that it's insane not to maintain a machine that's
in a high-availability role!

****************************************************************
"O'Neal,Chris" <onealwc@AGEDWARDS.com>

We run a lot of SS10s w/ SunOS 4.1.3 and 4.1.4, and over time I have had
reboots without any errors being logged (bummer). I have never seen the
type of error that you encountered at the second reboot.

This is a primary server and it's not under a hardware contract?

To me, MMU replacement means motherboard replacement.

OTHER INFO THAT MAY BE OF HELP TO YOU:

OK PROMPT
Sometimes you can determine the cause of a system fault at the ok prompt.
The Synchronous Fault Status Register (SFSR), which an OpenBoot command can
display, provides information on exceptions (faults) issued by the Memory
Management Unit (MMU).

At the ok prompt type:

             .sfsr

Look at the fault type (FT). This number is in hexadecimal
format and means:

        0 no error
        1 Invalid address error
        2 Protection error
        3 Privilege violation error
        4 Translation error
        5 Access bus error (timeout)
        6 Internal error
        7 reserved
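
To show how the fault type sits inside the full register value, here is
a small Python sketch that decodes the sfsr=326 from the original post.
The bit positions and the access type names are an assumption taken from
the SPARC V8 Reference MMU layout (level in bits 9:8, access type in
bits 7:5, fault type in bits 4:2, address-valid in bit 1, overwrite in
bit 0); they do agree with the "Invalid Address on supv data fetch at
level 3" text the kernel logged. decode_sfsr is a made-up name.

```python
# Fault type names as listed above.
FAULT_TYPES = {
    0: "no error",
    1: "Invalid address error",
    2: "Protection error",
    3: "Privilege violation error",
    4: "Translation error",
    5: "Access bus error (timeout)",
    6: "Internal error",
    7: "reserved",
}

# Access type names, assumed from the SPARC V8 Reference MMU description.
ACCESS_TYPES = {
    0: "load from user data",
    1: "load from supervisor data",
    2: "load/execute from user instruction",
    3: "load/execute from supervisor instruction",
    4: "store to user data",
    5: "store to supervisor data",
    6: "store to user instruction",
    7: "store to supervisor instruction",
}

def decode_sfsr(value):
    """Split an SFSR value into its fields (assumed SPARC V8 layout)."""
    return {
        "level": (value >> 8) & 0x3,
        "access_type": ACCESS_TYPES[(value >> 5) & 0x7],
        "fault_type": FAULT_TYPES[(value >> 2) & 0x7],
        "address_valid": bool(value & 0x2),
        "overwrite": bool(value & 0x1),
    }

decoded = decode_sfsr(0x326)
for field, setting in decoded.items():
    print(field, "=", setting)
```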

What causes an "asynchronous memory fault" panic?

KERNEL ASYNCHRONOUS MEMORY FAULT PANICS

There are two main causes of asynchronous memory fault panics.

1) The CPU cache did not flush properly to main memory.

The CPU can modify cache rows in its cache, such as cached data which has
been changed by a program. This data must be written out to main memory
at some point if it is to be accessed by other processors or stored onto
disk. The write that takes place is asynchronous to the part of the CPU
that uses the data (the part which makes calculations, etc). It takes
place from an on-chip write buffer, to which cache rows are queued;
writes from this buffer out to main memory are completed by a different
part of the chip. The "asynchronous memory fault" occurs when the
asynchronous write from the cache to main memory terminates with an
error.

(Note that the actual write always takes place from the on-chip write
buffer regardless of whether the MMU is in write-through or copy-back
mode, or uses data that is marked non-cacheable.)

The error can be due to any hardware along the path between the cache
itself and the memory, including the CPU module, the motherboard or the
memory. Look elsewhere for more clues as to what could be causing the
problem, to narrow down the bad hardware. Check the /var/adm/messages
files and dmesg output for other kinds of errors, perhaps (ecc) memory
errors which would indicate memory problems, other kinds of CPU errors
which would indicate a bad cpu module, Mbus timeout errors (which point
to a potentially bad motherboard), and so on.

2) An external device attempted to read or write a bad memory address.

This could be a hardware problem where the device was properly set up
but accessed a bad address, or the memory could be bad; or it could be a
software problem, because a device driver did not set up its device to
access the proper part of memory. Such a memory fault is asynchronous with
respect to the CPU, because the device tried to do DMA to the memory,
independent of the CPU.

The way to tell whether the cause is (1) or (2) is to observe the
circumstances in which the problem happens. Does this problem happen
consistently while a particular thing is going on? Software problems
tend to be consistent, predictable and replicable, whereas hardware
problems tend to be more random. Things to look for:

 - Is there a third party device, which, when operated, triggers this
  panic condition?

 - Are there DMA errors in the /var/adm/messages file to point the way
  to a suspect device?

BAD TRAP:

One type of crash is a BAD TRAP. Bad traps happen when the kernel takes
an unexpected trap. Things that can cause a trap include trying to access
unaligned memory and trying to access memory which is not currently mapped.
An example of the messages from a bad trap follows:

Dec 21 03:36:49 mysun unix: BAD TRAP: type=7 rp=f0bbeb8c addr=0 mmu_fsr=0 rw=0
Dec 21 03:36:49 mysun unix: find: Memory address alignment
Dec 21 03:36:49 mysun unix: pid=916, pc=0xfc2550e4, sp=0xf0bbebd8, psr=0x1f0000c0, context=1930
Dec 21 03:36:49 mysun unix: g1-g7: f004f51c, 8000000, f007702c, c0, fd7a1a68, 1, fcbaa020
Dec 21 03:36:49 mysun unix: panic: cross-call at high interrupt level
Dec 21 03:36:49 mysun unix: syncing file systems... 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 done
Dec 21 03:36:49 mysun unix: 14849 static and sysmap kernel pages
Dec 21 03:36:49 mysun unix: 197 dynamic kernel data pages
Dec 21 03:36:49 mysun unix: 144 kernel-pageable pages
Dec 21 03:36:49 mysun unix: 1 segkmap kernel pages
Dec 21 03:36:49 mysun unix: 0 segvn kernel pages
Dec 21 03:36:49 mysun unix: 153 current user process pages
Dec 21 03:36:49 mysun unix: 15344 total pages (15344 chunks)
Dec 21 03:36:49 mysun unix: dumping to vp fcb00734, offset 121920

In order to troubleshoot this kind of crash, it is necessary to get a
stack trace of the thread which caused the crash. This stack trace can
then be compared with traces found in bug reports to see if this is a
known problem.

****************************************************************
"Jody L. Baze" <jody@BlueSkyTours.COM>

Hi Lisa -- We had a SS10 do this for a while (we were running Solaris 2.4
at the time, though). Sun replaced *every* piece of hardware in the machine
except for the power supply to no avail. We would get reboots at odd
intervals - sometimes we'd go a few days, sometimes we'd reboot several
times a day. We had been running 2.4 for about 6 months before this started
happening so we're pretty sure it wasn't the OS's fault - we finally
concluded that it was most likely a piece of software that had a bad
interaction combined with the OS and hardware. I had just written a
critical piece of software that did heavy memory mapping... don't know if
that was it, but it's my best guess.

Anyway, we upgraded to Solaris 2.5.1 and the problem went away. Weird. Being
that the problem went away, we never pursued it further. Sun was stumped
anyway ;^)

Good luck, I know I haven't given you a real solution but maybe an idea or
two as to what to look for. I certainly wouldn't start paying money to swap
out hardware until I did a thorough examination of what software I'd
installed recently...

****************************************************************
 Ed Weller <edw@solarsys.com>

Replace CPU

****************************************************************
Kevin Sheehan <Kevin.Sheehan@uniq.com.au>

>
> I searched the archives and found 2 different responses to questions like
> these. One definitely said an error of this type was bad MMU or RAM. The
> other said it was software.

Much more likely in general to be software.
>
> I have no Sun maintenance contract and no spare RAM on hand for this
> machine. Should I be quickly finding some RAM? And what about the MMU: is
> that easily replaceable like RAM, or will that require a Sun call?

The MMU is either implemented in hardware using SRAM (not easily replaced)
or with the SRMMU, which uses RAM anyway...
>
> The machine has been running since 6pm without a reboot. This machine is
> our dept. mail server and that's the most important service that my users
> can't live without for a while.

The bottom line is that you need to find out where it died. addr=20
looks like somebody dereferencing a null pointer.
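
To spell out why a small address like 0x20 suggests a null pointer:
reading a structure member through a pointer touches the pointer value
plus the member's offset, so a null (zero) pointer and a member 0x20
bytes into the structure give a fault at address 0x20. A minimal Python
sketch using ctypes follows; HypotheticalRecord and its layout are
invented for illustration and are not a real kernel structure.

```python
import ctypes

class HypotheticalRecord(ctypes.Structure):
    # Eight 32-bit words of padding place `target` at byte offset 0x20.
    _fields_ = [
        ("padding", ctypes.c_uint32 * 8),
        ("target", ctypes.c_uint32),
    ]

NULL = 0  # value of a null pointer

# Address touched when reading record->target through a null pointer.
fault_address = NULL + HypotheticalRecord.target.offset
print(hex(fault_address))
```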

Checking out the PC to see where it was, or getting a crash dump and
doing $c to see what it was doing, will be the clincher. Hardware
problems tend to be all over the place; software problems tend to die
in the same place over and over.
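
One way to apply that rule of thumb to the logs is to count how often
each faulting pc shows up across crashes. The Python sketch below does
that; crash_sites and looks_repeatable are made-up names, and the pattern
assumes the pc=... form seen in the log samples in this thread. With
only a couple of crashes on record the result is a hint, not proof.

```python
import re
from collections import Counter

# \b keeps "pc=" from matching inside "npc="; the 0x prefix is optional.
PC_PATTERN = re.compile(r"\bpc=(?:0x)?([0-9a-fA-F]+)")

def crash_sites(log_text):
    """Count how often each faulting pc value appears in the log text."""
    return Counter(int(m.group(1), 16) for m in PC_PATTERN.finditer(log_text))

def looks_repeatable(log_text):
    """True if some pc value shows up in more than one crash record."""
    return any(count > 1 for count in crash_sites(log_text).values())

log_text = (
    "vmunix: psr=419000c7 pc=f0020634 npc=f0020638\n"
    "vmunix: psr=419000c7 pc=f0020634 npc=f0020638\n"
)
for pc, count in crash_sites(log_text).items():
    print(hex(pc), count)
print("same place each time:", looks_repeatable(log_text))
```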
>
> Thanks and as usual I'll summarize.

I guess the last comment is you should think about upgrading. Been lots
of fixes since 4.1.3...

******************************************************************
Lisa Weihl, System Administrator E-mail: lweihl@cs.bgsu.edu
Department of Computer Science Office: Hayes 225
Bowling Green State University Phone: (419) 372-0116
Bowling Green, Ohio 43403-0214 Fax: (419) 372-8061
