AMENDED SUMMARY: nfsd "bad trap"

From: Doris Harrington (djh@igor.rational.com)
Date: Wed Jun 02 1993 - 19:43:32 CDT


My first summary was incorrect. So sorry. When the machine stayed
up for over an hour, I thought we'd fixed the problem. However, the
REAL problem was corrupt data from the file restores. It just took
over an hour before anyone hit the corrupt data again. What tipped
us off was that every time we rebooted and ran fsck, it reported
errors on the exact same file. Last night we re-restored the data to
the disk. The system has been up all day today, so I'm fairly sure
the problem is resolved now.
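
The check that gave it away (the same complaint showing up in every
fsck run) can be sketched roughly as below. This is only an
illustration in Python, assuming the output of each fsck run has been
saved as plain text; the function name lines_common_to_all_runs is
hypothetical and nothing here parses real fsck messages, it just
compares lines.

```python
from typing import Iterable, List


def lines_common_to_all_runs(runs: Iterable[str]) -> List[str]:
    """Return the non-blank lines that appear in every captured run.

    Each element of `runs` is assumed to be the saved text of one fsck
    run. Lines are compared after stripping surrounding whitespace, and
    the result keeps the order in which they appear in the first run.
    """
    run_list = [
        [line.strip() for line in run.splitlines() if line.strip()]
        for run in runs
    ]
    if not run_list:
        return []

    # Intersect the line sets of all runs.
    common = set(run_list[0])
    for lines in run_list[1:]:
        common &= set(lines)

    # Keep first-run order and drop duplicates.
    seen = set()
    ordered = []
    for line in run_list[0]:
        if line in common and line not in seen:
            seen.add(line)
            ordered.append(line)
    return ordered
```

A line that survives several reboot-and-fsck cycles points at damage
that fsck is not actually repairing, which is what suggested bad
restored data rather than an NFS client problem.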

Sorry for the previous, incorrect summary.

...Doris.
djh@rational.com

----- Begin Included Message -----

From sun-managers-relay@ra.mcs.anl.gov Wed Jun 2 02:53:42 1993
Sender: sun-managers-relay@ra.mcs.anl.gov
Reply-To: djh@igor.Rational.COM (Doris Harrington)
Followup-To: junk
To: sun-managers@eecs.nwu.edu
Subject: SUMMARY: nfsd "bad trap"
Cc: djh@igor.Rational.COM
Content-Length: 1919

Problem solved. An unknown number of machines still thought they
had files NFS-mounted from this machine's bad disk. We stopped the
machine from crashing by booting to single-user mode. Then we edited
the /etc/exports file so it would no longer export files from the
dead/rebuilt filesystem. Then we rebooted the system and it came up
fine. The next step was to go around to all the machines which might
be trying to mount that filesystem and reboot them, so their
automounters were no longer confused. Then we went back to the
server, restored the line that exports the filesystem, and ran
exportfs -a to export it again. The system has been up since then.
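
The /etc/exports edit in the steps above can be sketched as follows.
This is a minimal Python illustration, not the procedure that was
actually typed at the console: it assumes each entry starts with the
exported directory as its first whitespace-separated field and that a
leading "#" comments a line out. The function names disable_export
and enable_export are hypothetical. The sketch only transforms text;
writing the file back and running exportfs -a are left to the admin.

```python
def _rejoin(lines, original):
    """Join lines, keeping a trailing newline if the original had one."""
    text = "\n".join(lines)
    return text + "\n" if original.endswith("\n") else text


def disable_export(exports_text: str, directory: str) -> str:
    """Comment out every entry whose first field is `directory`."""
    out = []
    for line in exports_text.splitlines():
        fields = line.split()
        if fields and fields[0] == directory:
            out.append("#" + line)  # assumed comment syntax
        else:
            out.append(line)
    return _rejoin(out, exports_text)


def enable_export(exports_text: str, directory: str) -> str:
    """Undo disable_export for entries naming `directory`."""
    out = []
    for line in exports_text.splitlines():
        if line.startswith("#"):
            fields = line[1:].split()
            if fields and fields[0] == directory:
                out.append(line[1:])
                continue
        out.append(line)
    return _rejoin(out, exports_text)
```

Commenting the line out instead of deleting it keeps the original
options on hand, so putting the export back is an exact reversal.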

My original post follows. Thanks!

...Doris.
djh@rational.com

----- Begin Included Message -----

From djh Tue Jun 1 14:25:27 1993
To: sun-managers@eecs.nwu.edu
Subject: nfsd "bad trap"
Cc: djh
Content-Length: 958

HELP!!! We had to replace an external disk on a SPARCserver 690MP
last night. After the files were restored onto the new disk, the
system crashed with this:

pid123 'nfsd': Data fault
Bad Trap: cpu=1 type=9 rp=f844f7bc addr=14 mmu_fsr=126 rw=1
MMU sfsr=126: Invalid address on supv data fetch at level 1
regs at f844f7bc:

        {There are 5 rows of address values represented within
        these brackets that probably won't matter to you}

Then all these bad trap messages keep streaming across the console
screen for a while. Eventually they stop. We run fsck manually,
reboot, and it happens all over again.

Another message we saw flash by was:

        panic on 3: memory address alignment

I have a call in with Sun for assistance and am rapidly searching the
AnswerBook pages for ideas. If any of you have any pointers or
recommendations, please answer. I have lots of highly-paid software
developers twiddling their thumbs.

Thanks...Doris.
djh@rational.com

----- End Included Message -----

----- End Included Message -----


