My original question:
> 
> Hello,
>    One of our Sparc 5's rebooted the other day due to a memory fault.  The
> log files follow:
> 
> Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: MFSR=81804040 MFAR=c8d620
> Jan 28 10:46:42 jayhawk unix:  4164 static and sysmap kernel pages
> Jan 28 10:46:42 jayhawk unix:   108 dynamic kernel data pages
> Jan 28 10:46:42 jayhawk unix:   170 kernel-pageable pages
> Jan 28 10:46:42 jayhawk unix:     2 segkmap kernel pages
> Jan 28 10:46:42 jayhawk unix:     0 segvn kernel pages
> Jan 28 10:46:42 jayhawk unix:     0 current user process pages
> Jan 28 10:46:42 jayhawk unix:  4444 total pages (4444 chunks)
> Jan 28 10:46:42 jayhawk unix: dumping to vp fc1e4d0c, offset 271864
> Jan 28 10:46:42 jayhawk unix: WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
> Jan 28 10:46:42 jayhawk unix:  Unrecoverable DMA error on dma
> Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: MFSR=81004040 MFAR=c8d620
> 
>    My question is...does this indicate bad memory, which should be replaced?
> Or is this just something that happened, and will likely not happen again?
The people kind enough to reply:
RAVKRISH.IN.ORACLE.COM.ofcmail@in.oracle.com
kwong@scis.acast.nova.edu
raju@ecologic.net
css1dw@ee.surrey.ac.uk
peter.allan@aeat.co.uk
sweh@mpn.com 
reynolds@acetsw.amat.com
The answers:
1) Many people recommended going to the ok prompt and running some
   diagnostics, such as:

      setenv selftest-#megs 64    (or however many megabytes to test)
      test-memory

   I did this, and the memory tested clean.
2) One person said that it's possibly a symptom of a motherboard problem.
3) Here is a full explanation provided by RAVKRISH.IN.ORACLE.COM.ofcmail@in.oracle.com, who found it on comp.unix.solaris:
>  
> I have worked on Sun workstations for about 2 years but 
> only encountered a "Level 15 interrupt" a couple of times 
> (now, being the second time). [...] 
>  
> Can anyone give me a "clear" explanation of this...? 
 
Sure.  The level 15 asynchronous interrupt is caused by 
a memory error.  I don't know what precise error you have 
seen but you may have MFAR and MFSR values displayed in 
the error message.  These are the Memory Fault Address 
Register and the Memory Fault Status Register respectively. 
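
As a rough illustration of working with these two registers, here is a
small Python sketch that pulls the MFSR and MFAR values out of the panic
lines quoted in the original question and compares them.  The helper
names (parse_fault, differing_bits) are made up for this example.  It
only shows which bits differ between the two MFSR values; decoding what
each bit means needs the register layout for the specific CPU/MMU, which
is not given here.

```python
import re

# Matches the panic line format shown in the log excerpt in the question.
PANIC_RE = re.compile(
    r"asynchronous memory fault:\s*MFSR=([0-9a-fA-F]+)\s+MFAR=([0-9a-fA-F]+)"
)

def parse_fault(line):
    """Return (mfsr, mfar) as integers, or None if the line is not a fault panic."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)

def differing_bits(a, b):
    """List the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

# The two panic lines from the log in the original question.
first = parse_fault(
    "unix: panic: asynchronous memory fault: MFSR=81804040 MFAR=c8d620")
second = parse_fault(
    "unix: panic: asynchronous memory fault: MFSR=81004040 MFAR=c8d620")

print("same MFAR:", first[1] == second[1])
print("MFSR bits that differ:", differing_bits(first[0], second[0]))
```

For the log above, both panics report the same fault address, and the
two status values differ in a single bit.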
 
You may well be experiencing this as a fatal error (depending 
upon whether your machine is ECC or parity).  If it's fatal 
(i.e. a panic) and it happens again and again, test the memory 
from the ok prompt or use the values from the MFAR and MFSR 
 
This explanation is taken from the Sun information document 
repository: 
 
There are two main causes of asynchronous memory fault panics. 
 
 
1) The CPU cache did not flush properly to main memory. 
 
The CPU can modify cache rows in its cache, such as cached data 
which has been changed by a program.  This data must be written 
out to main memory at some point if it is to be accessed by other 
processors or stored onto disk.  The write that takes place is  
asynchronous to the part of the CPU that uses the data (the part  
which makes calculations, etc).  It takes place from an on-chip  
write buffer, to which cache rows are queued; writes from this  
buffer out to main memory are completed by a different part of  
the chip.  The "asynchronous memory fault" occurs when the 
asynchronous write from the cache to main memory terminates with an 
error. 
 
(Note that the actual write always takes place from the on-chip  
write buffer regardless of whether the MMU is in write-through or 
copy-back mode, or uses data that is marked non-cacheable.) 
 
The error can be due to any hardware along the path between the  
cache itself and the memory, including the CPU module, the  
motherboard or the memory.  Look elsewhere for more clues as to  
what could be causing the problem, to narrow down the bad hardware.   
Check the /var/adm/messages files and dmesg output for other kinds  
of errors, perhaps (ecc) memory errors which would indicate memory 
problems, other kinds of CPU errors which would indicate a bad cpu 
module, Mbus timeout errors (which point to a potentially bad  
motherboard), and so on. 
  
 
2) An external device attempted to read or write a bad memory 
   address. 
 
This could be a hardware problem where the device was properly set 
up but accessed a bad address, or the memory could be bad; or it  
could be a software problem, because a device driver did not set up  
its device to access the proper part of memory.  Such a memory fault  
is asynchronous with respect to the CPU, because the device tried to  
do DMA to the memory, independent of the CPU. 
 
The way to tell whether the case is (1) or (2) is to look at the  
circumstances in which the problem occurs.  Does this problem happen 
consistently while a particular thing is going on?  Software problems 
tend to be consistent, predictable and replicable; whereas hardware 
problems tend to be more random.  Things to look for:  
 
- Is there a third party device, which, when operated, triggers this 
  panic condition? 
 
- Are there DMA errors in the /var/adm/messages file to point the way 
  to a suspect device? 
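
The checks above can be sketched as a small Python scan over syslog
lines.  This is only a sketch: the patterns are guessed from the log
excerpt in the original question, the function name summarize is made up
for this example, and other drivers or OS releases may word their
messages differently.

```python
import re
from collections import Counter

# "WARNING: <device path> (<driver instance>):" as seen in the excerpt.
DEVICE_RE = re.compile(r"WARNING:\s*(/\S+)\s*\((\w+)\):")

def summarize(lines):
    """Count memory fault panics, DMA error lines and devices named in WARNINGs."""
    summary = {"panics": 0, "dma_errors": 0, "devices": Counter()}
    for line in lines:
        if "panic:" in line and "memory fault" in line:
            summary["panics"] += 1
        if "dma error" in line.lower():
            summary["dma_errors"] += 1
        m = DEVICE_RE.search(line)
        if m:
            # Key on the driver instance name, e.g. "esp0".
            summary["devices"][m.group(2)] += 1
    return summary

# Lines taken from the log in the original question (wrapped lines rejoined).
sample = [
    "Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: "
    "MFSR=81804040 MFAR=c8d620",
    "Jan 28 10:46:42 jayhawk unix: WARNING: /iommu@0,10000000/"
    "sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):",
    "Jan 28 10:46:42 jayhawk unix:  Unrecoverable DMA error on dma",
    "Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: "
    "MFSR=81004040 MFAR=c8d620",
]

result = summarize(sample)
print(result)
```

To run it against a real log, pass an open file instead of the sample
list, for example summarize(open("/var/adm/messages")).  A device that
keeps showing up next to DMA errors is a candidate for case (2).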
-mike
-----------------------------------------------------------------------------
Michael Hawk
mike@gi.net