My original message:
> I have an SS10/30 running Solaris 2.2 with the following patches:
>
> 100999-06
> 101027-03
> 101095-01
>
> I occasionally see a kernel panic on physio_unlock. The traceback is:
>
> complete_panic(0xf00b6c00,0x0,0xfffd,0xfc42a000,0x0,0x1) + 118
> do_panic(?) + 1c
> vcmn_err(0xf00c8e04,0xf0d9e6fc,0xf0d9e6fc,0x0,0xfc44cd6d,0x3)
> cmn_err(0x3,0xf00c8e04,0x5,0x1000,0x3,0x1) + 1c
> physio(0xfc44cd50,0xf0d9e730,0xf0d9e834,0x704000,0x1,0xf0d9e808) + 35c
> rw(0xfc6cb084,0xf0d9e920,0x2002,0x0,0xf0d9ee90,0x800) + 254
> syscall(0xf0d9e9d8) + 3b4

Sun-managers and Sun technical support both pointed me towards a later
revision of patch #100999. As of this writing, I have received 100999-21
from Sun. Several of the notices in the patch README indicate that
heavy I/O may cause a panic like this.
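
Patch IDs have the form base-revision, so deciding whether an installed patch is out of date comes down to comparing the revision numbers of two IDs with the same base. A minimal sketch of that comparison follows; the function names are my own and are not part of any Sun tool.

```python
def parse_patch_id(patch_id):
    """Split a patch ID such as '100999-06' into (base, revision) integers."""
    base, revision = patch_id.strip().split("-")
    return int(base), int(revision)


def needs_update(installed, available):
    """Return True if `available` is a later revision of the same base patch."""
    inst_base, inst_rev = parse_patch_id(installed)
    avail_base, avail_rev = parse_patch_id(available)
    if inst_base != avail_base:
        raise ValueError("patch IDs refer to different base patches")
    return avail_rev > inst_rev


# Patch levels from the original message, and the revision received from Sun.
installed_patches = ["100999-06", "101027-03", "101095-01"]
latest = "100999-21"

for patch in installed_patches:
    if parse_patch_id(patch)[0] == parse_patch_id(latest)[0]:
        print(patch, "->", latest, "update needed:", needs_update(patch, latest))
```

Comparing the revisions as integers rather than strings avoids surprises if a revision ever has a different number of digits.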

Thanks to Sun technical support and the following sun-managers:

  Casper Dik <casper@fwi.uva.nl>
  glenn@uniq.com.au (Glenn Satchell - Uniq Professional Services)

Here are excerpts from the 100999-21 README:

Keywords: nfs, lock, load, unix, adb, hung, srmmu, swap, mount, panic, NVSIMM
Synopsis: SunOS 5.2: jumbo patch for fixes to kernel/unix & NFS driver
Problem Description:
Loading the NVSIMM driver on a sun4d machine could
panic the system with "Memory ECC Error."
This fix is necessary to use NVSIMM on sun4d with on493.
(from 100999-20)
1103091:
On systems with very large amounts of memory, the size of kernel
corefiles can become extremely large. This patch should reduce
the size of corefiles on heavily loaded systems.
1137685:
If the dump device is not large enough to contain the dump,
the dump will be silently corrupted. This patch causes an
error message to be printed and no dump to be taken if there
is insufficient space.
(from 100999-19)
Loading the NVSIMM driver on a sun4d machine could
panic the system with "Memory ECC Error."
This fix is necessary to use NVSIMM on sun4d with on493.
(from 100999-18)
A software debugging message was mistakenly left in the production
kernel. The following debug message shows up under heavy VME bus
usage, and unduly alarms many customers:
unix: VME dropped an INT-ACK cycle
unix: MMU sfsr=b36:
unix: Bus Access Error
unix: on supv data fetch at level 3
unix: M-Bus Timeout Error
(from 100999-17)
1132866: running ODS 2.0 on Solaris 2.2 sun4m and sun4d machines
causes srmmu_unlock_ptbl panics.
1123762: heavy system loads can cause a data fault panic in
srmmu_setup()
These problems only affect sun4d and sun4m platforms.
(from 100999-16)
A kernel core dump can sometimes be interrupted by a "panic timeout"
because it takes longer than the (inadequate) built-in time limits for
kernel core dumping.
(from 100999-15)
SunOS 5.1 and SunOS 5.2 can panic with the following message:
panic: page_unlock: pp xxxxxxx is not locked
An I/O-intensive process runs much slower than it should when all CPUs
are busy with CPU-bound processes, because the CPU-bound processes
aren't preempted when they should be.
(from 100999-14)
A watchdog reset is caused when running sundiag on a diskless machine
(swapping over NFS) when the 'mod_uninstall_daemon()' runs out of kernel
stack space. This only happens when swapping over NFS because the call
stack is much deeper. The bug synopsis has nothing to do with the actual
problem. This patch does not fix the clget() warning.
(from 100999-13)
When the same number of cpu intensive programs are run as the
number of processors in a system, the system will appear to
be hung. Actually the system is just running very slowly, so
slowly that it can take minutes to echo characters on the console.
(from 100999-12)
The kernel panics because a fault taken by bcopy_asm()
is mis-handled.
(from 100999-11)
When NFS is used over the loopback (in other words, the NFS client and
server are the same system), a deadlock can occur which can cause the
NFS client and server to hang and, ultimately, the entire
system to hang. While one can avoid this by not deliberately doing
NFS mounts through the loopback, the automounter will sometimes use
NFS mounts to access local filesystems.
(from 100999-10)
Part of the fix for bug 1113596 (incorporated into these
two patches) was to create a new routine, klm_init, which
initialises the kernel locking (KLM) code.
Among other things the routine performs a lookuppn("/dev/ticlts"),
i.e. root had better be mounted.
This is not true for diskless clients, which panic with
a data fault trying to read address zero:
trap(0x9, ., 0, ...)
mutex_enter()
klm_init()
nfs_clntinit()
(from 100999-09)
You can panic the system if you try to unmap an area of an address
space that currently has raw i/o being done on it. This typically
requires two threads to be active in the address space at once,
one doing the unmap (e.g., munmap() or shmdt()) the other doing the
raw i/o.
(from 100999-08)
These patches enable support of the Presto driver on NVSIMMs on SS10.
They allow the system to correctly identify the presence of the NVSIMM,
correctly handle ECC errors generated by uninitialized NVSIMM boards,
and correctly handle checking the NVSIMM's battery low register.
And...
Using AF_UNIX sockets to pass file descriptors will either cause a kernel
panic or cause the processes passing the file descriptors to hang in
the kernel so that they can not be killed.
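
For readers who have not used descriptor passing, here is a minimal sketch of the operation this bug concerns. It uses the modern SCM_RIGHTS ancillary-data interface as exposed by Python, which is an assumption on my part and is not necessarily the interface a Solaris 2.2 program would have used. The helper names send_fd and recv_fd are hypothetical.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a UNIX-domain socket using SCM_RIGHTS."""
    fds = array.array("i", [fd])
    # One byte of ordinary data goes along with the ancillary data.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole integer.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


# Demonstration inside one process: pass the read end of a pipe
# across a connected pair of AF_UNIX sockets.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()
send_fd(sender, read_end)
received = recv_fd(receiver)

os.write(write_end, b"hello")
data = os.read(received, 5)  # the received descriptor refers to the same pipe
print(data)

for fd in (read_end, write_end, received):
    os.close(fd)
sender.close()
receiver.close()
```

In practice the two sockets would belong to different processes; a single process is used here only to keep the sketch self-contained.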
(from 100999-07)
The clean windows trap causes a segmentation violation on the sun4m systems.
(from 100999-06)
lockd may spin and generate multiple lock/unlock requests if it
receives a signal while waiting for a reply to an NFS lock/unlock request.
This is most often manifested when a ksh user logs in or out of a machine
which NFS mounts his/her home directory and types ^C during the brief period
that ksh is locking or unlocking its history file. This causes ksh to hang
and the machine's lockd to consume lots of CPU time.
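
The lock in question is an ordinary file lock taken on the history file; when the file lives on an NFS mount, the request is forwarded to lockd, and the bug is triggered if a signal arrives while the process waits for the reply. A minimal sketch of that kind of lock/unlock sequence follows. It is not ksh's actual code, and the function name and temporary file are made up for illustration.

```python
import fcntl
import os
import tempfile


def append_with_lock(path, line):
    """Append `line` to `path` while holding an exclusive file lock."""
    with open(path, "a") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)      # blocks until the lock is granted
        try:
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)  # always release the lock


# Stand-in for a shell history file; on a local disk no lockd is involved.
fd, history_path = tempfile.mkstemp(suffix=".sh_history")
os.close(fd)
append_with_lock(history_path, "ls -l")
append_with_lock(history_path, "make")

with open(history_path) as f:
    history_lines = f.read().splitlines()
print(history_lines)
os.remove(history_path)
```

The window for the bug is the time spent inside the blocking lock and unlock calls, which is why it shows up at login and logout.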
(from 100999-04)
An NFS server will leak 9 kilobytes of memory every time an NFS client
request is retransmitted and received by the server while it is
processing the original request. This kind of situation can occur
if the server experiences a heavy load (peak or sustained).
An NFS client that reads over the network a quantity of data that
isn't a multiple of 4 bytes will leak 64 bytes of main memory for every
such request.
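
To get a feel for how quickly these two leaks add up, the figures above can be turned into a small calculation. This is only an illustration: I am taking "9 kilobytes" to mean 9 * 1024 bytes, and the request counts and read sizes are invented.

```python
SERVER_LEAK_PER_RETRANSMIT = 9 * 1024  # bytes; assumes 1 kilobyte = 1024 bytes
CLIENT_LEAK_PER_READ = 64              # bytes per read that is not a multiple of 4


def server_leak_bytes(retransmits_during_processing):
    """Memory leaked by the server for the given number of overlapping retransmits."""
    return retransmits_during_processing * SERVER_LEAK_PER_RETRANSMIT


def client_leak_bytes(read_sizes):
    """Memory leaked by the client for a sequence of read sizes in bytes."""
    odd_sized_reads = sum(1 for size in read_sizes if size % 4 != 0)
    return odd_sized_reads * CLIENT_LEAK_PER_READ


# Hypothetical workload: 1000 overlapping retransmits on the server,
# and four client reads of which two are not multiples of 4 bytes.
print(server_leak_bytes(1000))                  # 9216000
print(client_leak_bytes([8192, 8191, 100, 3]))  # 128
```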
(from 100999-03)
In a classic deadlock algorithm, deadlock is not produced with
mandatory locking with write() or writev().
Specifically, GABI.os/ioprim/write_A and writev failed. This is a
standards violation.
(from 100999-02)
The t_kspoll is holding a mutex across an untimeout, where the timeout
routine is attempting to grab the same mutex contrary to untimeout(9f).
It deadlocks.
The symptom is that a process accessing an NFS file will hang and
become unkillable.
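
The locking pattern described here can be illustrated with ordinary threads. This is a user-level analogy, not kernel code: it assumes, as the description implies, that cancelling a timeout waits for a handler that is already running, and it uses a bounded wait so the sketch reports the stall instead of hanging. All names are hypothetical.

```python
import threading


def cancel_completes(hold_lock_while_waiting, wait_seconds=0.5):
    """Return True if waiting for the handler finishes within wait_seconds."""
    lock = threading.Lock()
    handler_started = threading.Event()

    def timeout_handler():
        handler_started.set()
        with lock:  # the handler needs the same lock as the canceller
            pass

    handler = threading.Thread(target=timeout_handler)
    lock.acquire()           # the canceller takes the lock first
    handler.start()          # the "timeout" fires while the lock is held
    handler_started.wait()

    if not hold_lock_while_waiting:
        lock.release()       # correct order: drop the lock, then wait

    handler.join(timeout=wait_seconds)  # stands in for waiting on the cancel
    finished = not handler.is_alive()

    if hold_lock_while_waiting:
        lock.release()       # clean up so the handler thread can exit
        handler.join()
    return finished


print("holding the lock:", cancel_completes(True))    # handler cannot finish
print("lock dropped first:", cancel_completes(False))  # handler finishes
```

With an unbounded wait, the first case never returns, which matches the hung, unkillable process described above.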
(from 100999-01)
Under moderate to heavy system loads, or under particular kinds of I/O (such as
meta-disk I/O), the system can panic with a message such as:
% panic[cpu3]/thread=0xf83e7600: srmmu_unlock_ptbl: ptbl
f61f80cc not locked
(from 100998-02)
Kernel panics with data faults while doing nfs activities.
(from 100998-01)
If an NFS client mounts a filesystem read-only, access() will still
claim that writes are possible.
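
One way to see this sort of mismatch is to compare what access() claims with what an actual open for writing does. The sketch below does that for a scratch file; on a correctly behaving system the two answers agree, while on an affected client with a read-only NFS mount access() would say yes and the open would fail. The helper name is my own.

```python
import errno
import os
import tempfile


def can_really_write(path):
    """Return True only if the file can actually be opened for writing."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        # Read-only filesystem or permission problems mean "not writable".
        if exc.errno in (errno.EROFS, errno.EACCES, errno.EPERM):
            return False
        raise
    os.close(fd)
    return True


# Scratch file on a local, writable filesystem for demonstration.
fd, path = tempfile.mkstemp()
os.close(fd)

claimed = os.access(path, os.W_OK)  # what access() reports
actual = can_really_write(path)     # what an open for writing does
print("access() says writable:", claimed, "open() succeeded:", actual)
os.remove(path)
```

Opening with O_WRONLY and closing immediately does not modify the file, so the check is safe to run against real data.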
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:08:04 CDT