>From mikej@mtvmail.Corp.Sun.COM Tue Jun 17 14:59 PDT 1997
Date: Tue, 17 Jun 1997 14:59:48 -0700
From: mikej@mtvmail.Corp.Sun.COM (Michael Jaffe)
To: miken@mwh.com
Subject: my paper

Optimizing and Measuring the Solaris Kernel For Large Oracle Servers.
by Mike Jaffee, Sun Microsystems

The first part of the paper discusses the basics of Solaris internals that
are relevant to the Oracle DBA, along with tips on common technical questions
and relevant header files. The second part is quoted tuning information taken
from Sun experts. The final part is a discussion of kernel memory allocation,
how to measure it, and some things that can be done to prevent starvation.

Solaris Internals
Sparc has two rings of execution. The inner ring is for kernel functions and 
the outer ring is for user process functions. The process address space is 
virtual, and normally only part of a process is in physical memory. The kernel 
stores the contents of the process address space in physical memory, on-disk 
files, and specially reserved swap areas. Over time the kernel shuffles pages 
of the processes between physical memory and disk. Each process has registers
that are stored in the kernel and are placed in the hardware registers at run
time. A process that is waiting for a resource must block and allow another
process to run. The kernel allows each process a brief period of time, usually
10 milliseconds, to run before performing a context switch. (Vahalia p.20-25)
Once the kernel is loaded at startup, user processes can request system
services from the kernel through the system call interface. If the process
misbehaves by dividing by zero or overflowing its stack, a hardware exception
occurs, and the kernel intervenes, usually aborting the process. Interrupts
come from peripheral devices usually indicating a status change or I/O 
completion. Two important processes that manage memory are the swapper and 
pagedaemon. (Vahalia p.22-25) 

Each process has a virtual memory address space (VMA) that is translated to 
physical memory addresses by page tables. This mapping is done by the chip's 
MMU. (Tip - System panics can be either hardware or software related. The MMU 
registers give helpful hints on what actually caused the panic.) In addition to
kernel and user mode, there are kernel and user space. These refer to regions
in the virtual memory address space of the process. There is only one kernel
and many processes, and hence every process must map in a single kernel address
space. The kernel portion of the VMA maintains global data structures and some
per-process objects. These can only be accessed by the kernel when the chip is
running in kernel mode (ring 0). Since the kernel is shared by all processes,
kernel space must be protected from user-mode access. This is done by requiring
the processes to use the system call interface.  This requires the chip to go 
into kernel mode, transfer program control to the kernel, have the kernel 
execute system code instructions, then switch back to user mode and user 
control of the process. (Vahalia p.22-23) 

System Services
Oracle uses many Solaris system services such as file and record locking,
interprocess communication, virtual memory, and process scheduling. Common
system calls are open, read, write, fcntl, kill, priocntl, plock, memcntl, and
sync. Common signals are SIGSEGV - usually means user stack overflow, SIGBUS
- out of the process address space, SIGTERM - user has "hung up" without
exiting gracefully, SIGUSR1 - defined signal for asynchronous events, SIGKILL
- kill process immediately no exceptions. Oracle uses file and record locking
by setting read and write locks on portions of a file. Any process can read a
file that is locked but only the owner of the lock can update the file. A
write lock is sometimes called an exclusive lock and a read lock is sometimes
called a shared lock. Process scheduling is usually managed very well by the
kernel, however a slow job can be sped up with the priocntl system call.
(System Services Guide p.1-25) Jim Skeen of Sunsoft - "Oracle gets locked-
down memory as a consequence of using intimate shared memory (ISM), not 
through plock.  It controls sharing inside shared memory through latches, not 
memcntl or plock." He also cautions against changing the priority of the 
Oracle processes "This is something we in DBE actually strongly discourage.  
Only the most daring and knowledgable DBA's should attempt this.  The problem 
is that system threads can get starved if Oracle processes are not "well 
behaved" when running in real time class.   Oracle processes may easily hog a 
cpu for extended periods of time (time being measured in Unix quantums).  We 
in DBE have experimented with changing the dispatch table in useful/clever 
ways, to minimize the number of involuntary context switches.  But Oracle 
processes still run in TS class." (private letter Skeen)
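
The shared (read) and exclusive (write) record locks described above can be
tried out with a short sketch. This is only an illustration in Python, assuming
a Unix system where the standard fcntl module is available. The helper names
lock_region and unlock_region and the byte ranges are made up for the example,
and it does not claim to show how Oracle itself calls fcntl.

```python
import fcntl
import os
import tempfile


def lock_region(fd, start, length, exclusive):
    """Place a non-blocking shared (read) or exclusive (write) lock on a byte range."""
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    # lockf(fd, cmd, len, start, whence) wraps fcntl() record locking.
    fcntl.lockf(fd, kind | fcntl.LOCK_NB, length, start, os.SEEK_SET)


def unlock_region(fd, start, length):
    """Release any lock this process holds on the byte range."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)


def demo():
    # TemporaryFile is opened read/write, which both lock types require.
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * 1024)
        f.flush()
        fd = f.fileno()
        lock_region(fd, 0, 512, exclusive=False)    # read lock on first half
        lock_region(fd, 512, 512, exclusive=True)   # write lock on second half
        unlock_region(fd, 0, 1024)
    return "locked and released"


if __name__ == "__main__":
    print(demo())
```

Because LOCK_NB is set, a second process asking for a conflicting lock on the
same range would get an error at once instead of blocking.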

Oracle Internals and Solaris System Services
Mark Johnson of Oracle and Jim Skeen provide the following expert insight and 
information. The system global area is defined as "One or more shared 
segments visible to all Oracle processes that are used to store precompiled 
SQL and PL/SQL (library cache), database buffers (buffer cache), and for 
interprocess communication" (Johnson). As far as process control - "Oracle 
does use semaphores, but latches are the usual synchronizing mechanism, as 
mutexes implemented as spin locks" (Johnson). On the subject of locks "Oracle 
maintains database transaction integrity through use of database locks of 
various sorts--shared read, exclusive read, exclusive write, etc.  These are 
implemented through database locks, not using Unix file locks.  Thus, the 
scope of a database lock can be limited to a single row in the database. Or, 
the database may choose to lock a database page (which may be quite a bit 
smaller than a Unix page).  Or, the database may choose to lock an entire 
database table (which may be composed of multiple database files, which in 
turn may or may not map into Unix files)." (private letter Skeen). 

Oracle uses heavyweight processes that are in the shared memory portion of the 
process address space. The DBWR (data buffer writer) process uses aio threads
known as light weight processes (LWP). An LWP is a kernel-supported user 
thread that is based on kernel threads. They are independently scheduled and
share the address space of the process. Vahalia's book has a nice discussion 
on LWPs. (Jaffee) Kernel Asynchronous I/O and Intimate Shared Memory are two
key technologies used by Oracle on the Solaris platform. 

Asynchronous I/O is needed because a single blocking thread in a multi-
threaded application causes all threads to wait until the thread wakes up. 
What needs to happen is for the thread to issue an asynchronous I/O request 
and then pass control to another thread in the process. Also heavy I/O is not 
efficient when done synchronously because of the large number of context 
switches that must occur every time a thread is blocked. (Hyuck Yoo)

Asynchronous I/O under Solaris is implemented two ways - under Solaris 2.3 it
is implemented in a library, and under Solaris 2.4 and beyond it is in the file
system layer of the kernel. The library approach uses kernel-level threads
where each I/O request is handled by a newly created kernel-level thread that 
acts synchronously (i.e. issuing read and write calls). The library lives 
outside of the kernel and the kernel threads that perform the I/O are 
separate from the calling process. The kernel approach is much more 
sophisticated and efficient. The basic concept is to not maintain the queue 
in user space but to put the request directly into the device driver queue. 
The biowait function is bypassed (which is the device driver equivalent to a 
blocking function) and the thread transfers control rather than sleep in the 
kernel. The kernel has buffers with slots called AIO that maintain a listing 
of all I/O requests. (Hyuck Yoo) 

Solaris has provided the ISM feature since 2.2.  The main feature of ISM is
that, in addition to sharing the "memory" pages (like normal shared memory), it
also shares the page table entries for those pages (therefore, it's
"intimate").  Another side feature, which is more important for this
discussion, is that ISM also locks down the shared memory segment in real 
physical RAM. Since the main purpose of ISM is for the DBMS products' buffer 
cache usage, this makes sense. (Jaffee)  

Sharing page table entries solves the problem of page table stealing which is 
expensive because all  the pages mapped in the stolen page table have to be 
flushed before being given to another process. This avoids the condition 
where the whole system may thrash as processes steal page tables from each 
other. (H. Yoo) 

The design team created a new segment in the process address space called 
segshm so that they could create one set of page tables for a shared memory 
segment and share the page tables among the processes that attach that same 
shared memory. In addition to saving page table allocation, sharing page
tables has other advantages such as having a higher cache hit rate on memory
map lookups because the tables are in a buffer cache rather than in memory.
It also reduces the amount of overhead in the hardware address
translation layer since it no longer needs to go through page tables for every
process to monitor whether a page has been modified. These are both huge
savings and speed up the virtual memory paging algorithm within Solaris. (H. 
Yoo) 
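
The saving in page table allocation can be made concrete with a rough count of
page table entries. The sketch below is illustrative only. It borrows the
figures of 60 processes and 25000 pages from the SZ tip in the IPC section of
this paper, and the function name page_table_entries is hypothetical.

```python
def page_table_entries(processes, shared_pages, intimate):
    """Count the page table entries needed to map one shared memory segment.

    Without ISM every attaching process carries its own entries for the
    segment; with ISM one set of entries is shared by all of them.
    """
    if intimate:
        return shared_pages
    return processes * shared_pages


PROCESSES = 60        # illustrative, taken from the SZ tip in the IPC section
SHARED_PAGES = 25000  # illustrative, 4 KB pages

private_entries = page_table_entries(PROCESSES, SHARED_PAGES, intimate=False)
shared_entries = page_table_entries(PROCESSES, SHARED_PAGES, intimate=True)

print("entries without ISM:", private_entries)
print("entries with ISM:   ", shared_entries)
print("reduction factor:   ", private_entries // shared_entries)
```

The reduction factor is simply the number of attached processes, so the
benefit grows with the number of Oracle server processes.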

IPC
The Oracle RDBMS is a complex program that uses multiple cooperating processes 
that must communicate with each other and share resources. The kernel provides 
a mechanism in user space called inter process communication or IPC. The 
processes operate in a shared memory segment such that if one process modifies 
data it will be immediately visible to the other processes. Data transfer and 
event notifications occur between the various Oracle processes in the Oracle 
SGA.  Semaphores are used for Oracle's own locking and synchronization 
scheme. Asynchronous events such as errors are reported to the processes 
using signals. The default action for most signals from the kernel is to 
terminate the process, however the process may specify an alternate response 
by providing a signal handler function. (Tip - Before installing the kernel 
jumbo patch read the readme file to see if there are any known signal 
problems with Oracle).  (Vahalia - p150)  The relevant IPC system calls 
Oracle makes are shmget, semget, shmat, shmdt, shmctl, and semctl.  The ipc 
information is stored in the kernel with the ipc_perm structure.  shmget(key, 
size,flag) creates a portion of shared memory (which will be the size of the 
Oracle SGA) and shmat(shmid, shmaddr, shmflag) attaches the region to a 
virtual memory address of the process. (shmsys is how Oracle sets up the 
intimate shared memory segment). The structure of a shared memory segment 
includes access permission, segment size, the PID of the process performing 
last operation, and the memory map segment descriptor pointer as well as 
other fields.  (Tip - sgabeg in the ksms.s file is a virtual address, not a
physical address (0-0xffffffff = 4 GB). Choose small beginning addresses for
large SGAs. Also watch out for 28 bit Sparc chips. They have a smaller
virtual address space. Hal Stern notes "They're really not 28 bit chips, but
instead the system architecture only passes 28 bits of virtual address space
on to the memory bus." [private letter]) Once attached the region may be
accessed like any other memory location without requiring system calls to 
read or write data to it. Hence shared memory is the fastest mechanism for 
processes to share data. (Tip - don't be confused by the SZ field in ps -elf. 
It is in 4 KB pages and represents shared memory in the case of Oracle. For 
example Oracle may have 60 server processes in a shared memory segment all 
approximately 25000 4 KB pages. A common misconception is to think that 
Oracle needs 60 X 4KB X 25000 = 6 GB of virtual memory. Those 60 processes 
are mainly using the shared memory region in the process address space).  
(Tip - shared memory pages are backed by swap space, not by a file. The 
absolute minimum swap must be at least the size of the SGA.)  A process 
detaches the shared memory with shmdt(addr) and destroys the shared memory 
region completely with the IPC_RMID command of the shmctl system call. (Tip
- the important commands are ipcs -b, look at field SEGSZ for shared memory
size in use; sysdef -i and sysdef -i -n /dev/ksyms for IPC and resource
table definitions; kill -9 <process id> to terminate (no core file) a hung
process or kill -6 <process id> to abort (core file) a hung Oracle process.
modload -p sys/shmsys at the command line or forceload: sys/shmsys in the
system file may be needed if ipcs -b doesn't work correctly. This is because
the kernel is dynamic, meaning that file systems, drivers, and modules are
loaded into memory when they are used, and the memory is returned if the
module is no longer needed.)  (Vahalia - p155-158, p162-164) Semaphores are
counters that are used by Oracle to monitor and control the availability of 
shared memory segments. Typically the process initializes the semaphore with 
semget, assigns ownership of the semaphore with semctl , and then updates the 
semaphore with semop. A process has to block until the semaphore operation 
has reached zero. A semaphore structure contains the following information - 
semaphore value, the PID of the process that last performed successfully, the 
number of processes waiting for the semaphore to increase, and the number of 
processes waiting for the semaphore to reach zero. (Tip - ipc_perm and sem are
defined in ipc.h and sem.h.) (System Services Guide - p68-77)

The shared memory and semaphore tunables in Solaris 2 that are relevant to
Oracle are listed in the table below. (Tip - semmnu = semmns = semmsl X
semmni). There is no harm in setting the numbers too high since the Oracle
instance will only allocate semaphores and shared memory as needed.  The
values are definitions not declarations.
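
The SZ misconception from the tip above can be checked with a little
arithmetic. The sketch below uses the paper's figures of 60 processes, 4 KB
pages, and an SZ of 25000 pages. The 500 private pages per process is a
made-up number, used only to show the shape of the calculation.

```python
PAGE_KB = 4
PROCESSES = 60
SZ_PAGES = 25000       # SZ reported by ps -elf for each server process
PRIVATE_PAGES = 500    # hypothetical private (non-shared) pages per process


def naive_total_kb(processes, sz_pages, page_kb):
    """The misconception: add up SZ as if nothing were shared."""
    return processes * sz_pages * page_kb


def shared_aware_total_kb(processes, sz_pages, private_pages, page_kb):
    """Count the shared segment once, plus each process's private pages."""
    shared_pages = sz_pages - private_pages
    return (shared_pages + processes * private_pages) * page_kb


naive = naive_total_kb(PROCESSES, SZ_PAGES, PAGE_KB)
actual = shared_aware_total_kb(PROCESSES, SZ_PAGES, PRIVATE_PAGES, PAGE_KB)

print("naive estimate:  %.1f GB" % (naive / 1e6))
print("shared counted once: %.2f GB" % (actual / 1e6))
```

With these assumptions the naive figure is 6 GB while the shared-aware figure
is a few hundred megabytes, which is why summing SZ overstates the requirement.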

Name     Default   Min       Max        Description                   Suggested
____     _______   ___       ___        ___________                   _________
shmmax   1048576   1048576   Available  Maximum shm segment           50% of RAM
                             RAM        size in bytes
shmmin   1         1         -          Minimum shm segment           1
                                        size in bytes
shmmni   100       100       -          Number of shm identifiers     100
                                        to pre-allocate
shmseg   6         6         -          Maximum number of shm         32
                                        segments per process
semmni   10        10        65535      Number of semaphore           64
                                        identifiers
semmns   60        -         -          Number of semaphores          1600
                                        in system
semmnu   30        -         -          Number of undo                1250
                                        structures in system
semmsl   25        -         -          Maximum number of             25 (fixed)
                                        semaphores per ID
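
As a rough illustration, the suggested values can be turned into system file
entries. The sketch below assumes the usual "set module:variable=value" form
and the shmsys:shminfo_* and semsys:seminfo_* variable names; verify these
against the release in use before copying anything. The function name
system_file_lines is hypothetical. It derives semmns from the tip
(semmsl X semmni) and shmmax from the 50% of RAM suggestion.

```python
SUGGESTED = {
    "shmsys:shminfo_shmmin": 1,
    "shmsys:shminfo_shmmni": 100,
    "shmsys:shminfo_shmseg": 32,
    "semsys:seminfo_semmni": 64,
    "semsys:seminfo_semmsl": 25,
    "semsys:seminfo_semmnu": 1250,
}


def system_file_lines(ram_bytes):
    """Build 'set' lines from the suggested column for a machine with ram_bytes of RAM."""
    settings = dict(SUGGESTED)
    # shmmax: 50% of RAM, as suggested in the table.
    settings["shmsys:shminfo_shmmax"] = ram_bytes // 2
    # semmns = semmsl X semmni, per the tip in the text.
    settings["semsys:seminfo_semmns"] = (
        settings["semsys:seminfo_semmsl"] * settings["semsys:seminfo_semmni"]
    )
    return ["set %s=%d" % (name, value) for name, value in sorted(settings.items())]


if __name__ == "__main__":
    for line in system_file_lines(1024 ** 3):   # a 1 GB machine
        print(line)
```

With the suggested semmsl of 25 and semmni of 64 the product is 1600, which
matches the suggested semmns in the table.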

Solaris Tuning According to the Experts
Every month in SunWorld Online, the performance experts at Sun write articles 
on tuning.  In addition to writing the well known book, "Sun Performance and
Tuning", Adrian Cockcroft with the help of Rich Pettit has put together a
series of scripts called se2.5 (www.sun.com/960301/columns/adrian/se2.5.html).
Hal Stern, another well known Sun tuning guru, has written an O'Reilly press
book on "Managing NFS & NIS" and he too writes articles that can be downloaded
off of the web. Fellow SunService Engineers Chris Drake and Kimberley Woods wrote
"Panic - System Core dump Analysis" which contains detailed information on the
Solaris kernel and common techniques used to analyze core files.  Brian
Wong the hardware expert has written a book called "Configuration and Capacity 
Planning of Large Sun Servers". Most of the tuning information for large Sun 
Servers running Oracle can be found in these sources. Since many customers 
often call SunService for further explanations, it is appropriate to highlight 
some common questions and answer them as the experts would. 

Question 1 - Where is all my Memory?
Probably the most common performance question of all is "Why does vmstat report
only xxxx amount of free memory available?" To use an example, type
vmstat 5 and suppose the system shows freemem of 80708 and available swap of
330000. Now start the application and observe that the freemem goes down to
8824 and swap goes to 300000. Now stop the application and observe that all
of the available swap returns to 330000 but the freemem returns only to
21260. Where then is all of the RAM? Do we have a memory leak? The answer
is probably no because as Cockcroft notes "(the app) starts up more quickly 
than it did the first time, and with less disk activity. The application code 
and its data files are still in memory, even though they are not active. The 
memory they occupy is not "free." If you restart the same application it 
finds the pages that are already in memory. The pages are attached to the 
inode cache entries for the files. If you start a different application, and 
there is insufficient free memory, the kernel will scan for pages that have 
not been touched for a long time, and "free" them. Once you quit the first 
application, the memory it occupies is not being touched, so it will be freed 
quickly for use by other applications." (Cockcroft 1) Leaving parts of the
app in memory even after termination is efficient because "Attaching to a 
page in memory is around 1,000 times faster than reading it in from disk." 
(Cockcroft 1) So how can one know if he has a memory leak in his application? 
The answer is there will be a shortage of swap space after the program runs 
a while and the SZ field in ps -elf for that app will grow over time. 
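
The leak test described above (SZ growing over time) can be sketched as a
simple check over periodic samples. The sample values and the function name
looks_like_leak are hypothetical; in practice the samples would come from
running ps -elf at intervals and recording SZ for the same process.

```python
def looks_like_leak(sz_samples, min_growth_pages=1):
    """Return True if SZ never shrinks and has grown overall across the samples."""
    if len(sz_samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(sz_samples, sz_samples[1:]))
    grew = sz_samples[-1] - sz_samples[0] >= min_growth_pages
    return never_shrinks and grew


# Hypothetical SZ readings (in pages) taken at regular intervals.
steady = [988, 990, 985, 988, 989]
growing = [988, 1200, 1500, 2100, 2900]

print("steady process leaking? ", looks_like_leak(steady))
print("growing process leaking?", looks_like_leak(growing))
```

This is a coarse heuristic. A process that legitimately builds up a cache will
also grow, so the result should be read together with the swap shortage
symptom mentioned above.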

Question 2 - My Oracle Server is slow. Can you help me tune the kernel?
The answer depends on the version of the operating system and the level of the
patches.  Early versions of the OS had performance bugs and incompatible
hardware that were the cause of slow performance.  The latest version of the OS
is self-tuning for high performance and will work quite successfully on systems
ranging from a huge SparcCenter 2000 to small desktops. As Cockcroft says "In
normal use there is no need to tune the Solaris 2 kernel, since it dynamically
adapts itself to the given hardware configuration and application workload."
(Cockcroft 2)  However for really large Oracle servers some tuning may be
needed if using early versions of Solaris 2.3, 2.4, and 2.5 without a kernel
patch that automatically adjusts the paging algorithm. Solaris 2.5.1 is
self-tuning for large memory systems. Paul Faramelli of the kernel TSE group
has put together the following list of tunables for Solaris. Recommendations
for large Oracle servers (RAM > 1 GB) are listed. (Tip - Use crash to display
kernel tunables. As root type crash. At the greater than prompt, type "od -d
maxusers" or "od -d lotsfree". The od stands for octal dump, and the -d stands
for decimal. By the way every Solaris tunable [even undocumented ones] can be
displayed by typing nm /kernel/unix). Note these recommendations are only
necessary for early versions of Solaris. Some recommendations are
provided by Steve O'Neil of SunService. (Caution - there is no right answer)

Parameter   Description                                            Recommended
---------   -----------                                            -----------
dump_cnt    Size of the dump                                                 
autoup      Used in struct var for dynamic configuration of the age    300
            that a delayed-write buffer must be, in seconds, before
            bdflush will write it out (default = 60)
bufhwm      Used in struct var for v_bufhwm; it's the high water mark  8000   
            for buffer cache memory usage, in Kbytes (2% of memory).       
maxusers    Maximum number of users (In 2.3 and 2.4 the default is            
            number of Megabytes in memory)                                    
max_nprocs  Maximum number of processes (10 + 16 * maxusers)
maxuprc     The maximum number of user processes. (max_nprocs - 5)
rstchown    POSIX_CHOWN_RESTRICTED is enabled (default = 1 )
ngroups_max Maximum number of supplementary groups per user (def 32).
rlim_fd_cur Maximum number of open file descriptors per process system
            wide (default = 64, max = 1024)
ncallout    Number of callout buffers (default = 16 + max_nprocs).
            (No longer exists in Solaris 2.2 and later releases)
nautopush   Number of entries in the autopush free list                1024
sadcnt      Number allowed of concurrent opens of both /dev/sad/user   2048
            and /dev/sad/admin (default 16).
npty        Number of 4.X pseudo-ttys configured (default 48)          1024
pt_cnt      Number of 5.X pseudo-ttys configured (default 48)          1024
physmem     Sets the number of pages usable in physical memory. Only          
            use this for testing, it reduces the size of  memory.             
minfree     Memory threshold which determines when to start swapping    100    
            processes, when free memory falls to this level swapping          
            begins (default: 2.4 - 4d = 50 pages, all others 25               
            pages, 2.3 -  physmem / 64 ).                                     
desfree     This is the "desperation" level, this determines when       200   
            paging is abandoned for swapping. When free memory stays          
            below this level for 30 seconds, swapping kicks in ( 2.4          
            4d = 100 pages, all others 50 pages, 2.3 physmem / 32 ).          
lotsfree    Memory threshold which determines when to start paging.     512   
            When free memory falls below this level paging begins (2.4        
            4d = 256 pages all others 128 pages, 2.3 physmem /16)             
fastscan    The number of pages scanned per second when free memory           
            is zero, the scan rate increases as free memory falls             
            from lotsfree to zero, reaching fastscan ( default: 2.4           
            physmem / 4 with 64Mb being max, 2.3 physmem / 2 ).               
slowscan    The number of pages scanned per second when free memory           
            is equal to lotsfree, also see fastscan ( defaults: 2.4           
            is fixed at 100, 2.3 fastscan /10 ).                              
handspr-    Is the distance between the front hand and backhand in            
eadpages    the clock algorithm. The larger the number the longer an          
            idle page can stay in memory (default: 2.4 physmem / 4            
            2.3 physmem / 2 ).                                                
maxpgio     The maximum number of page-out I/O operations per second.   120
            This acts as a throttle for the page daemon to prevent
            page thrashing ((DISKRPM * 2) /3 = 40). This parameter
            must be set higher if using two swap partitions.
t_gpgslo    2.1 through 2.3, Used to set the threshold on when to
            swap out processes (default 25 pages ).
ufs_ninode  Maximum number of inodes. (max_nprocs+16+maxusers+64)     34906
ndquot      Number of disk quota structures. (default = (maxusers *
            NMOUNT / 4) + max_nprocs)
ncsize      Number of dnlc entries. (default = max_nprocs + 16 +      34906
            maxusers + 64); dnlc is the directory-name lookup cache
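
Several of the defaults in the table are derived from maxusers, and the
formulas can be evaluated directly. The sketch below is a plain transcription
of the formulas listed above; the function name derived_defaults is
hypothetical. The value maxusers = 2048 is chosen only because it reproduces
the 34906 shown in the table. The table itself does not say which maxusers
was used.

```python
def derived_defaults(maxusers):
    """Evaluate the maxusers-based default formulas from the tunables table."""
    max_nprocs = 10 + 16 * maxusers
    return {
        "max_nprocs": max_nprocs,
        "maxuprc": max_nprocs - 5,
        "ufs_ninode": max_nprocs + 16 + maxusers + 64,
        "ncsize": max_nprocs + 16 + maxusers + 64,
    }


if __name__ == "__main__":
    for name, value in derived_defaults(2048).items():
        print("%-11s %d" % (name, value))
```

With maxusers = 2048 the formulas give max_nprocs = 32778 and
ufs_ninode = ncsize = 34906.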

Cockcroft on maxusers
"I never set maxusers. It sizes itself based on the amount of RAM in the 
system. In some cases on configurations with gigabytes of RAM it needs to be 
reduced to avoid problems with lack of kernel address space. The kernel uses up 
a lot of space keeping track of all the RAM in a system. Several other kernel 
table sizes and limits are derived from maxusers." (Cockcroft 2) 

Cockcroft on ncsize 
"The directory name lookup cache (DNLC) is sized to a default value based on 
maxusers. A large cache size (ncsize) significantly helps NFS servers that 
have a lot of clients. On other systems the default is adequate."(Cockcroft 2) 

Question 3: How much swap is needed for a large Oracle database?
Many people are under the impression that very little swap is needed for Oracle 
because the architecture uses temporary tablespaces for sorting and the SGA is 
fixed in memory. Well the truth is large databases require a lot of swap. The 
shared memory segment is backed by swap so the allocated swap MUST be at least 
as large as  the shared memory segments. In addition when the database uses 
intimate shared memory this is also backed by swap. All of the Oracle 
processes must be partially backed by swap. Steve Schuettinger, the Oracle 
applications specialist at Sun, recommends at least 2 GB of swap for benchmark 
testing on large servers. Obviously since RAM plus swap equals virtual memory, 
once swap is gone, the program will halt and  no new apps can be started until 
other programs have stopped. As Adrian Cockcroft says "The important thing to 
realize about swap space is that it is the combined total size of every program 
running and dormant on the system that matters. When a system runs out of swap 
space it can be very difficult to recover. Sometimes you find that there is 
insufficient swap space left to login as root or run the commands needed to 
kill the errant process that is consuming all the swap space." (Cockcroft 3) In
theory Solaris 2 changes the rules by adding the RAM and the disk space so if
the system has enough RAM for the workload, "it can run with no swap disk. In 
practice common database applications that are sized to run in a few gigabytes 
of RAM will actually need many gigabytes of disk allocated as swap space." 
(Cockcroft 3) In the same article Cockcroft says "The consequences of running 
out of swap space affect a larger number of users on a big server, so it wise 
to allocate a lot more than you normally need to cope with any usage peaks. To 
start with, add twice as much disk as you have RAM." (Cockcroft 3) (Tip - It is
not worth making a striped metadevice to swap on - that would just add overhead
and slow it down. There is also a limit of 2 gigabytes on the size of each swap
partition, so striping disks together tends to make them too big.)
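
The sizing rules quoted above can be combined into a small calculation. This
is a sketch under stated assumptions, not a sizing tool. It takes the larger
of "twice RAM" and "at least the shared memory segment size" as the target,
which is one reading of the advice above, and splits it into partitions of at
most 2 GB. The function name swap_plan and the example machine are
hypothetical.

```python
GB = 1024 ** 3


def swap_plan(ram_bytes, sga_bytes, max_partition_bytes=2 * GB):
    """Return (target swap in bytes, number of swap partitions needed)."""
    # Start with twice RAM, but never less than the shared memory segment.
    target = max(2 * ram_bytes, sga_bytes)
    # Ceiling division: each swap partition is limited to 2 GB.
    partitions = -(-target // max_partition_bytes)
    return target, partitions


if __name__ == "__main__":
    target, partitions = swap_plan(ram_bytes=3 * GB, sga_bytes=1 * GB)
    print("swap target: %d GB in %d partitions" % (target // GB, partitions))
```

For a hypothetical 3 GB machine with a 1 GB SGA this gives a 6 GB target
spread over three 2 GB partitions.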

Useful commands: /usr/ucb/ps alx (fields SZ or SIZE) and /usr/proc/bin/pmap.

% /usr/ucb/ps alx
F   UID   PID  PPID CP PRI NI   SZ  RSS    WCHAN S TT        TIME COMMAND
8  2595  1133  1130  0  48 20  988  360 modlinka S pts/4     0:00 -bin/csh

There is confusion about what ps reports. The "/bin/ps prints a field
labelled SZ, but this is the resident set size in RAM -- printed as RSS by the 
/usr/ucb/ps. You need to use the SZ or SIZE field reported by /usr/ucb/ps alx 
in units of kilobytes to determine the amount of swap space used by the 
process." (Cockcroft 3) 
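
Pulling SZ and RSS out of the /usr/ucb/ps alx listing is easy to script. The
sketch below parses the sample output shown above by splitting on white space.
The function name parse_ps_alx is hypothetical, and the approach assumes no
column is empty, which may not hold for every line of real output.

```python
PS_OUTPUT = """\
F   UID   PID  PPID CP PRI NI   SZ  RSS    WCHAN S TT        TIME COMMAND
8  2595  1133  1130  0  48 20  988  360 modlinka S pts/4     0:00 -bin/csh
"""


def parse_ps_alx(text):
    """Turn ps alx output into a list of dicts keyed by the header names."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Limit the splits so a COMMAND containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    for row in parse_ps_alx(PS_OUTPUT):
        print("pid %s: SZ %d KB, RSS %d KB, %s"
              % (row["PID"], int(row["SZ"]), int(row["RSS"]), row["COMMAND"]))
```

Summing SZ over all processes still overstates swap use when processes share
memory, so the total is best treated as an upper bound.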

Oracle's Mark Johnson adds the following "I had thought the standard Oracle 
rule of thumb was 2 to 4 times physical memory (can be a bit less on very 
large memory systems).  Smaller memory systems may want to use higher ratios 
of SGA size to physical memory size and higher swap space ratios.  (I ended 
up using ratios of 1:1 and 1:4 for a very small Solaris for Intel system with 
surprisingly good results.)" 

Hal Stern says "So why do you need swap space if your SGA << phys mem? The 
short answer is that the "phys mem" in that calculation is the non-locked-
down physical memory, and when you allocate an oracle SGA, you allocate 
intimate shared memory (ISM) that is taken out of the physical memory pool 
(ie, it gets locked down).  so on a 1 Gbyte machine, you may think you're ok 
with a 256M SGA, leaving 700M+ for processes.  BUT: the 256M SGA gets taken 
out of the available memory pool, so your maximum VM is only 700M+, and you 
could probably use the swap space....as the SGA/memory ratio goes up, this is 
even more true." (private letter from Stern)
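
Stern's example reduces to one subtraction, shown here so that other SGA sizes
can be tried. The function name pageable_memory_mb is hypothetical, and the
sketch ignores memory used by the kernel itself.

```python
def pageable_memory_mb(physical_mb, sga_mb):
    """Physical memory left for processes once the ISM-backed SGA is locked down."""
    return physical_mb - sga_mb


# Stern's case: a 1 Gbyte machine with a 256M SGA.
print(pageable_memory_mb(1024, 256))   # the "700M+" in the quote
# A larger SGA on the same machine leaves far less for processes.
print(pageable_memory_mb(1024, 768))
```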


Question 4 - Will a faster cpu help performance?
The question is not easy to answer. As Hal Stern noted "Noticing that you're
using 20 percent of the CPU doesn't mean anything until you know the kind of 
work that's using the cycles. If you're CPU-bound, then you have headroom to 
increase the workload by a factor of four or five. An I/O-bound job, however, 
that uses 20 percent of the CPU might be improved by adding disk spindles. As 
you increase the disk count and I/O load, to ease the bottleneck, you'll use 
more CPU to deal with the I/O setup, system calls, and interrupts from the 
additional work. You run the risk of morphing a disk problem into a CPU 
shortage. How do you know when relaxing one constraint pops another one into 
the foreground? Define the right relationships -- CPU time used per disk I/O 
tells you how much system time you eat up as you add disk load -- and measure 
with your tailored yardstick." (Stern 1) 
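
The "CPU time used per disk I/O" yardstick that Stern describes can be
computed from ordinary interval samples. The sketch below is a simplification
that charges all measured CPU time to I/O, and every number in it is
hypothetical. The function names cpu_ms_per_io and projected_cpu_fraction are
also made up for the example.

```python
def cpu_ms_per_io(cpu_busy_fraction, ncpus, interval_s, ios):
    """CPU milliseconds consumed per disk I/O over one sample interval."""
    cpu_ms = cpu_busy_fraction * ncpus * interval_s * 1000.0
    return cpu_ms / ios


def projected_cpu_fraction(ms_per_io, ios_per_s, ncpus):
    """Fraction of total CPU needed to sustain a given I/O rate."""
    return ms_per_io * ios_per_s / (ncpus * 1000.0)


if __name__ == "__main__":
    # Hypothetical sample: 20 percent busy, 1 CPU, 5 seconds, 400 I/Os.
    cost = cpu_ms_per_io(0.20, ncpus=1, interval_s=5, ios=400)
    print("CPU per I/O: %.2f ms" % cost)
    # What happens if added spindles raise the rate to 300 I/Os per second?
    print("projected CPU use: %.0f%%"
          % (100 * projected_cpu_fraction(cost, 300, ncpus=1)))
```

In this made-up case the I/O-bound job at 20 percent CPU would need about 75
percent of the CPU once the disk bottleneck is eased, which is the kind of
shift from a disk problem to a CPU shortage that the quote warns about.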

Preventing Kernel Memory Starvation
When Oracle is working very hard and the operating system is Solaris 2.3 or 
early Solaris 2.4, it is possible to have kernel memory allocation faults 
that can eventually lead to kernel memory starvation. A new memory allocator
algorithm has been developed and integrated into Solaris 2.5.1 (the old
allocator had paging thresholds that were too low, which caused kernel memory
allocation failures on very large systems). The allocator has been back
ported to rev 40 of the Solaris 2.4 jumbo patch and will be included in a
future rev of the 2.5 jumbo patch.  No fix has yet been developed for Solaris
2.3. (Tip - large database users should upgrade to Solaris 2.4 or better). In
the past Oracle customers could manually adjust paging thresholds. The actual
value that needed to be set depended upon the amount of memory and
the number of cpus on the system. Also in some cases decreasing maxusers and
bufhwm would mitigate the problem. The total allowable size for the kernel on
the UltraSPARC servers running 2.5 is now so large that kernel memory
allocation problems on very large systems are virtually impossible.  See the
examples below. The crash output displaying kernel memory starvation is taken
from a SparcServer 1000 running Solaris 2.3 with 1 GB of RAM and 8 cpus.

Kernel memory limits
Solaris 2.4:          Solaris 2.5:
    sun4c 33MB            sun4c 33MB
    sun4m 61MB            sun4m 100MB
    sun4d 139MB           sun4d 251MB
                          sun4u 2525MB
$> kas crash 15 
> map kernelmap
FREE: 2042      WANT: 1 SIZE: 2042
SIZE    ADDRESS
TOTAL NUMBER OF SEGMENTS 0 TOTAL SIZE 0
> kmastat 
                       total bytes     total bytes
size        # pools       in pools       allocated     # failures
-----------------------------------------------------------------
small       6807           26138880        25677584     1989915
big         2652           75276288        73046528       0
outsize       -                  -        18571264     45351
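
The kmastat figures above can be reduced to a pool utilization percentage,
which makes the pressure on the allocator easier to see. The sketch below just
divides allocated bytes by bytes in pools for the two size classes that report
both numbers. The names KMASTAT and pool_utilization are hypothetical.

```python
KMASTAT = {
    # size class: (bytes in pools, bytes allocated, failures)
    "small": (26138880, 25677584, 1989915),
    "big": (75276288, 73046528, 0),
}


def pool_utilization(stats):
    """Fraction of each pool's bytes that is currently allocated."""
    return {
        name: allocated / in_pools
        for name, (in_pools, allocated, _failures) in stats.items()
    }


if __name__ == "__main__":
    for name, used in pool_utilization(KMASTAT).items():
        failures = KMASTAT[name][2]
        print("%-6s %.1f%% allocated, %d failures" % (name, 100 * used, failures))
```

Both pools are more than 97 percent allocated, and the small pool has recorded
nearly two million allocation failures.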

Crash is a very powerful tool that helps analyze kernel memory allocation
failures. In the output, "TOTAL SIZE 0" indicates that no more free
kernel memory exists. The FREE field (2042) indicates that there is still 
plenty of memory in the user portion of the virtual address space. Carl of
Sunsoft provides an explanation of kernel map scarcity under Solaris 2.3 and 
Solaris 2.4.  "In the overwhelming majority of cases on large database 
servers, we have found that 64MB is overly generous for bufhwm in that it can 
be cut back by one-half (to 32MB) without too much of an impact on the cache 
hit ratio. What is usually in short supply on these machines is not the 
buffer cache but the amount of kernel heap (mapped by kernelmap) that remains 
for non-buffer cache usage.  Limiting buffer cache growth to 32MB frees up an 
additional 32MB to the heap and has proven successful in avoiding kernelmap 
scarcity at a number of sites running large database applications. Kernelmap 
scarcity (or equivalently kernel heap scarcity as the size of the kernel heap 
is limited by the size of the address space the kernelmap can map) results in 
an extreme slowdown of processing in the systems.  All of a sudden kernelmap 
becomes a scarce resource that every thread contends for and to exacerbate 
the situation the rate of release is slowed by the very same contention to 
the point that kernelmap turnover grinds down almost to the point of 
deadlock.  Why 64MB's worth of kernelmap is inadequate for the largest 
database servers is unknown.  The sites on which this has been a problem have 
been checked for kernelmap leakage and none has been found.  There has also 
been a problem in the past with some kernel data structures being 
preallocated from the heap and the size of this preallocation being 
inappropriately scaled to physical memory.  As it is fairly common now for 
machines to be equipped with 3GB of physical memory, this was not the right 
thing to do and did account for some kernelmap depletion headaches. But this 
particular bug has been fixed.  With these two things discounted, the only 
conclusion is that modern database workloads are driving up peak transient 
demands for kernelmap to the 100MB level." 

(Tip - For large databases running Solaris 2.4 or earlier, set bufhwm to 8000 
on sun4c, sun4m, and sun4d, or upgrade to Solaris 2.5, which has a larger 
kernel map address space.) 
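
The bufhwm arithmetic in the quote and the tip can be sketched as follows. 
This is an illustration in Python under two assumptions that the paper does 
not state: that bufhwm is expressed in kilobytes, and that it is set with a 
"set" line in /etc/system that takes effect at the next reboot. Both helper 
names are hypothetical. Check the tuning documentation for your release 
before applying any value. 

```python
KB_PER_MB = 1024


def bufhwm_line(limit_kb):
    """Build an /etc/system line capping the buffer cache (assumed unit: KB)."""
    if limit_kb <= 0:
        raise ValueError("bufhwm limit must be a positive number of kilobytes")
    return f"set bufhwm={limit_kb}"


def heap_freed_mb(old_limit_mb, new_limit_mb):
    """Address space returned to the kernel heap when the cap is lowered."""
    return max(0, old_limit_mb - new_limit_mb)


# The quoted case: cutting the buffer cache cap from 64MB to 32MB.
freed = heap_freed_mb(64, 32)
print(f"heap gained: {freed}MB")
print(bufhwm_line(32 * KB_PER_MB))

# The value from the tip, which is well below 32MB if the unit is KB.
print(bufhwm_line(8000))
```

The point of the calculation is that the buffer cache and the rest of the 
kernel heap compete for the same kernelmap address space, so every megabyte 
removed from the cap is a megabyte available to other kernel allocations. 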

Acknowledgements
I want to thank Sun performance gurus Adrian Cockcroft and Hal Stern for 
their contributions to this paper. UNIX architect Mark Johnson of Oracle and 
database expert Jim Skeen of SunSoft provided comments on Oracle internals. 
Kernel architect Jeff Bonwick added explanations and suggestions regarding 
kernel memory allocation and kernel memory starvation. SunService kernel 
engineer Paul Faramelli documented the Solaris tuning parameters, and 
SunService Technical Expert Steve O'Neil provided recommendations for tuning 
large Oracle databases on versions of Solaris that are not self-tuning. 
Finally, I want to thank Uresh Vahalia, who gave me permission to quote at 
length from his wonderful book "UNIX Internals - The New Frontiers". 

Disclaimer
The author alone is responsible for the contents of this paper. No one at Sun 
Microsystems, SunSoft, SunService, or the Oracle Corporation has reviewed or 
approved the paper for completeness or accuracy in its published format, and 
nothing in the paper can be construed as the official policy of Sun 
Microsystems or the Oracle Corporation. 

References 
UNIX Internals - The New Frontiers by Uresh Vahalia, Prentice Hall 1996 
"How the Solaris Kernel is Optimized for Oracle" by Mike Jaffee 1996
"Shared Page Table: Virtual Memory Enhancement for Data Sharing in UNIX" by H. Yoo
"Comparative analysis of Asynchronous I/O in Multithreaded UNIX" by Hyuck Yoo
"Help! I've lost my memory!" by Adrian Cockcroft, SunWorldOnline 1995 (1) 
"What are the tunable kernel parameters for Solaris 2?" by Adrian Cockcroft (2)
"How does swap space work?" by Adrian Cockcroft, SunWorldOnline 1995 (3) 
"We suggest creative ways to better your system performance" by Hal Stern 
System Service Guide - Solaris 2.4 Manual, SunSoft, 1994 
"The Slab Allocator: An Object-Caching Kernel Memory Allocator" by Jeff Bonwick