Date: Tue, 17 Jun 1997 14:59:48 -0700
From: mikej@mtvmail.Corp.Sun.COM (Michael Jaffe)
To: miken@mwh.com
Subject: my paper

Optimizing and Measuring the Solaris Kernel For Large Oracle Servers
by Mike Jaffee, Sun Microsystems

The first part of the paper discusses the basics of Solaris internals that are relevant to the Oracle DBA, along with tips on common technical questions and the relevant header files. The second part is quoted tuning information taken from Sun experts. The final part is a discussion of kernel memory allocation, how to measure it, and some things that can be done to prevent starvation.

Solaris Internals

Sparc has two rings of execution. The inner ring is for kernel functions and the outer ring is for user process functions. The process address space is virtual, and normally only part of a process is in physical memory. The kernel stores the contents of the process address space in physical memory, on-disk files, and specially reserved swap areas. Over time the kernel shuffles pages of the processes between physical memory and disk. Each process has a set of registers that is saved in the kernel and placed in the hardware registers at run time. A process must block if it is waiting for a resource, allowing another process to run. The kernel allows each process a brief period of time, usually 10 milliseconds, to run before performing a context switch. (Vahalia p.20-25)

Once the kernel is loaded at startup, user processes can request system services from the kernel through the system call interface. If a process misbehaves by dividing by zero or overflowing its stack, a hardware exception occurs and the kernel intervenes, usually aborting the process. Interrupts come from peripheral devices, usually indicating a status change or I/O completion. Two important processes that manage memory are the swapper and the pagedaemon. (Vahalia p.22-25)

Each process has a virtual memory address space (VMA) that is translated to physical memory addresses by page tables. This mapping is done by the chip's MMU. (Tip - System panics can be either hardware or software related. The MMU registers give helpful hints on what actually caused the panic.)

In addition to kernel and user mode, there are kernel and user space. These terms refer to regions in the virtual memory address space of the process. There is only one kernel and many processes, and hence every process must map in a single kernel address space. The kernel portion of the VMA maintains global data structures and some per-process objects. These can only be accessed by the kernel when the chip is running in kernel mode (ring 0). Since the kernel is shared by all processes, kernel space must be protected from user-mode access. This is done by requiring the processes to use the system call interface. A system call requires the chip to go into kernel mode, transfer program control to the kernel, have the kernel execute system code instructions, then switch back to user mode and user control of the process. (Vahalia p.22-23)

System Services

Oracle uses many Solaris system services such as file and record locking, inter process communication, virtual memory, and process scheduling. Common system calls are open, read, write, fcntl, kill, priocntl, plock, memcntl, and sync. Common signals are:

SIGSEGV - invalid memory reference, usually a user stack overflow
SIGBUS - access outside of the process address space
SIGHUP - user has "hung up" without exiting gracefully
SIGTERM - software termination request
SIGUSR1 - user-defined signal for asynchronous events
SIGKILL - kill process immediately, no exceptions

Oracle uses file and record locking by setting read and write locks on portions of a file. Any process can read a file that is locked, but only the owner of the lock can update the file. A write lock is sometimes called an exclusive lock and a read lock is sometimes called a shared lock.
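The shared and exclusive record locks described above can be tried out with a short sketch. It is written in Python rather than C purely for brevity, and it is an illustration only, not a description of what Oracle does internally. The helper names child_can_lock and demo_record_locks are made up for this example. It assumes a Unix system where os.fork is available. A second process is used because locks of this kind never conflict with other locks held by the same process.

```python
import fcntl
import os
import tempfile


def child_can_lock(path, start, length, exclusive):
    """Fork a child that attempts a non-blocking lock on a byte range.

    Returns True if the child obtained the lock. (Hypothetical helper,
    not part of any library.)
    """
    pid = os.fork()
    if pid == 0:
        # Child process: try the lock and report the result via exit status.
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
            fcntl.lockf(fd, kind | fcntl.LOCK_NB, length, start, os.SEEK_SET)
            status = 0
        except OSError:
            status = 1  # lock is held by another process (or open failed)
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def demo_record_locks():
    """Hold a shared lock on bytes 0-99 and see what a second process may do."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * 1024)
        f.flush()
        # Parent takes a shared (read) lock on the first 100 bytes.
        fcntl.lockf(f.fileno(), fcntl.LOCK_SH, 100, 0, os.SEEK_SET)
        shared_ok = child_can_lock(f.name, 0, 100, exclusive=False)
        exclusive_ok = child_can_lock(f.name, 0, 100, exclusive=True)
        elsewhere_ok = child_can_lock(f.name, 200, 100, exclusive=True)
        fcntl.lockf(f.fileno(), fcntl.LOCK_UN, 100, 0, os.SEEK_SET)
        return shared_ok, exclusive_ok, elsewhere_ok
```

Calling demo_record_locks() should show that a second shared lock on the same bytes is granted, an exclusive lock on those bytes is refused, and an exclusive lock on a different byte range is granted. Note that these locks are advisory: they only restrain processes that also ask for a lock.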
Process scheduling is usually managed very well by the kernel; however, a slow job can be sped up with the priocntl system call. (System Services Guide p.1-25)

Jim Skeen of Sunsoft - "Oracle gets locked-down memory as a consequence of using intimate shared memory (ISM), not through plock. It controls sharing inside shared memory through latches, not memcntl or plock." He also cautions against changing the priority of the Oracle processes: "This is something we in DBE actually strongly discourage. Only the most daring and knowledgeable DBA's should attempt this. The problem is that system threads can get starved if Oracle processes are not "well behaved" when running in real time class. Oracle processes may easily hog a cpu for extended periods of time (time being measured in Unix quantums). We in DBE have experimented with changing the dispatch table in useful/clever ways, to minimize the number of involuntary context switches. But Oracle processes still run in TS class." (private letter Skeen)

Oracle Internals and Solaris System Services

Mark Johnson of Oracle and Jim Skeen provide the following expert insight and information. The system global area is defined as "One or more shared segments visible to all Oracle processes that are used to store precompiled SQL and PL/SQL (library cache), database buffers (buffer cache), and for interprocess communication" (Johnson). As far as process control - "Oracle does use semaphores, but latches are the usual synchronizing mechanism, as mutexes implemented as spin locks" (Johnson).

On the subject of locks: "Oracle maintains database transaction integrity through use of database locks of various sorts--shared read, exclusive read, exclusive write, etc. These are implemented through database locks, not using Unix file locks. Thus, the scope of a database lock can be limited to a single row in the database. Or, the database may choose to lock a database page (which may be quite a bit smaller than a Unix page). Or, the database may choose to lock an entire database table (which may be composed of multiple database files, which in turn may or may not map into Unix files)." (private letter Skeen)

Oracle uses heavyweight processes that are in the shared memory portion of the process address space. The DBWR (database writer) process uses aio threads known as lightweight processes (LWP). An LWP is a kernel-supported user thread that is based on kernel threads. LWPs are independently scheduled and share the address space of the process. Vahalia's book has a nice discussion on LWPs. (Jaffee)

Kernel Asynchronous I/O and Intimate Shared Memory are two key technologies used by Oracle on the Solaris platform. Asynchronous I/O is needed because a single blocking thread in a multi-threaded application causes all threads to wait until the thread wakes up. What needs to happen is for the thread to issue an asynchronous I/O request and then pass control to another thread in the process. Heavy I/O is also inefficient when done synchronously because of the large number of context switches that must occur every time a thread is blocked. (Hyuck Yoo)

Asynchronous I/O under Solaris is implemented in two ways: under Solaris 2.3 it is implemented in a library, and under Solaris 2.4 and beyond it is implemented in the file system layer of the kernel. The library approach uses kernel-level threads, where each I/O request is handled by a newly created kernel-level thread that acts synchronously (i.e. by issuing read and write calls). The library lives outside of the kernel, and the kernel threads that perform the I/O are separate from the calling process. The kernel approach is much more sophisticated and efficient. The basic concept is not to maintain the queue in user space but to put the request directly into the device driver queue. The biowait function (the device driver equivalent of a blocking function) is bypassed, and the thread transfers control rather than sleeping in the kernel. The kernel has buffers with slots called AIO that maintain a listing of all I/O requests. (Hyuck Yoo)

Solaris has provided the ISM feature since 2.2. The main feature of ISM is that, in addition to sharing the "memory" pages (like normal shared memory), it also shares the page table entries for those pages (therefore, it's "intimate"). Another side feature, which is more important for this discussion, is that ISM also locks down the shared memory segment in real physical RAM. Since the main purpose of ISM is the DBMS products' buffer cache usage, this makes sense. (Jaffee)

Sharing page table entries solves the problem of page table stealing, which is expensive because all the pages mapped in the stolen page table have to be flushed before the table is given to another process. This avoids the condition where the whole system may thrash as processes steal page tables from each other. (H. Yoo)

The design team created a new segment in the process address space called segshm so that they could create one set of page tables for a shared memory segment and share the page tables among the processes that attach that same shared memory. In addition to saving page table allocation, sharing page tables has other advantages, such as a higher cache hit rate on memory map lookups because the tables are in a buffer cache rather than in memory. It also reduces the overhead in the hardware address translation layer, since it no longer needs to go through the page tables of every process to monitor whether a page has been modified. These are both huge savings and speed up the virtual memory paging algorithm within Solaris. (H. Yoo)

IPC

The Oracle RDBMS is a complex program that uses multiple cooperating processes that must communicate with each other and share resources. The kernel provides a mechanism in user space called inter process communication or IPC.
The processes operate in a shared memory segment such that if one process modifies data it will be immediately visible to the other processes. Data transfer and event notifications occur between the various Oracle processes in the Oracle SGA. Semaphores are used for Oracle's own locking and synchronization scheme. Asynchronous events such as errors are reported to the processes using signals. The default action for most signals from the kernel is to terminate the process; however, the process may specify an alternate response by providing a signal handler function. (Tip - Before installing the kernel jumbo patch, read the readme file to see if there are any known signal problems with Oracle.) (Vahalia - p150)

The relevant IPC system calls Oracle makes are shmget, semget, shmat, shmdt, shmctl, and semctl. The IPC information is stored in the kernel in the ipc_perm structure. shmget(key, size, flag) creates a portion of shared memory (which will be the size of the Oracle SGA) and shmat(shmid, shmaddr, shmflag) attaches the region to a virtual memory address of the process. (shmsys is how Oracle sets up the intimate shared memory segment.) The structure of a shared memory segment includes the access permissions, the segment size, the PID of the process that performed the last operation, and the memory map segment descriptor pointer, as well as other fields.

(Tip - sgabeg in the ksms.s file is a virtual address, not a physical address (0-0xffffffff = 4 GB). Choose small beginning addresses for large SGAs. Also watch out for 28 bit Sparc chips, which have a smaller virtual address space. Hal Stern notes "They're really not 28 bit chips, but instead the system architecture only passes 28 bits of virtual address space on to the memory bus." [private letter])

Once attached, the region may be accessed like any other memory location, without requiring system calls to read or write data to it. Hence shared memory is the fastest mechanism for processes to share data.
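The create, attach, access, detach, and remove lifecycle described above can be sketched briefly. Python's standard library has no binding for shmget and shmat, so this sketch substitutes POSIX shared memory through multiprocessing.shared_memory (Python 3.8 or later). It is an analogy to the System V calls named in the text, not the mechanism Oracle uses. The function name shared_segment_roundtrip and the 4096-byte size are assumptions for illustration.

```python
from multiprocessing import shared_memory


def shared_segment_roundtrip(payload: bytes) -> bytes:
    """Write through one mapping of a segment and read through another."""
    # Create a named segment (roughly the role shmget plays in the text).
    creator = shared_memory.SharedMemory(create=True, size=4096)
    try:
        # Attach a second, independent mapping by name (roughly shmat).
        attached = shared_memory.SharedMemory(name=creator.name)
        try:
            # Plain memory writes and reads; no system call per access.
            creator.buf[:len(payload)] = payload
            seen = bytes(attached.buf[:len(payload)])
        finally:
            attached.close()   # detach this mapping (roughly shmdt)
    finally:
        creator.close()
        creator.unlink()       # destroy the segment (roughly IPC_RMID)
    return seen
```

For example, shared_segment_roundtrip(b"SGA") returns the same bytes that were written, because both mappings refer to the same underlying pages. Both mappings live in one process here only to keep the sketch short; a second process attaching by name would see the same data.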
(Tip - don't be confused by the SZ field in ps -elf. It is in 4 KB pages and represents shared memory in the case of Oracle. For example, Oracle may have 60 server processes attached to a shared memory segment, each showing approximately 25000 4 KB pages. A common misconception is to think that Oracle needs 60 X 4 KB X 25000 = 6 GB of virtual memory. Those 60 processes are mainly using the same shared memory region in their process address spaces.) (Tip - shared memory pages are backed by swap space, not by a file. The absolute minimum swap must be at least the size of the SGA.)

A process detaches the shared memory with shmdt(addr) and destroys the shared memory region completely with the IPC_RMID command of the shmctl system call. (Tip - the important commands are ipcs -b (look at field SEGSZ for the shared memory size in use); sysdef -i and sysdef -i -n /dev/ksyms for IPC and resource table definitions; kill -9 to terminate a hung process (no core file) or kill -6 to abort a hung Oracle process (core file). modload -p sys/shmsys at the command line, or forceload: sys/shmsys in the system file, may be needed if ipcs -b doesn't work correctly.) This is because the kernel is dynamic, meaning that file systems, drivers, and modules are loaded into memory when they are used, and the memory is returned if the module is no longer needed. (Vahalia - p155-158, p162-164)

Semaphores are counters that are used by Oracle to monitor and control the availability of shared memory segments. Typically the process initializes the semaphore with semget, assigns ownership of the semaphore with semctl, and then updates the semaphore with semop. A process has to block until its semaphore operation can complete, for example until the semaphore value has reached zero. A semaphore structure contains the following information: the semaphore value, the PID of the process that last performed a successful operation, the number of processes waiting for the semaphore to increase, and the number of processes waiting for the semaphore to reach zero.
(Tip - the ipc_perm and sem structures are defined in ipc.h and sem.h) (System Services Guide - p68-77).

Shared Memory and Semaphore Tunables in Solaris 2 relevant to Oracle

(Tip - semmnu = semmns = semmsl X semmni). There is no harm in setting the numbers too high since the Oracle instance will only allocate semaphores and shared memory as needed. The values are definitions not declarations.

Name     Default   Min       Max            Reference                              Suggested
____     _______   ___       ___            _________                              _________
shmmax   1048576   1048576   Available RAM  Maximum shm segment size in bytes      50% of RAM
shmmin   1         1         -              Minimum shm segment size in bytes      1
shmmni   100       100       -              Number of shm id to pre-allocate       100
shmseg   6         6         -              Maximum number shm seg per process     32
semmni   10        10        65535          Number of semaphore identifiers        64
semmns   60        -         -              Number of semaphores in system         1600
semmnu   30        -         -              Number of undo structures in sys       1250
semmsl   25        -         -              Maximum number of semaphores per ID    25 (fixed)

Solaris Tuning According to the Experts

Every month in SunWorld Online, the performance experts at Sun write articles on tuning. In addition to the well known book "Sun Performance and Tuning", Adrian Cockcroft, with the help of Rich Pettit, has put together a series of scripts called se2.5 (www.sun.com/960301/columns/adrian/se2.5.html). Hal Stern, another well known Sun tuning guru, has written an O'Reilly press book, "Managing NFS & NIS", and he too writes articles that can be downloaded off of the web. Fellow SunService engineers Chris Drake and Kimberley Woods wrote "Panic - System Core dump Analysis", which contains detailed information on the Solaris kernel and common techniques used to analyze core files. Brian Wong, the hardware expert, has written a book called "Configuration and Capacity Planning of Large Sun Servers". Most of the tuning information for large Sun servers running Oracle can be found in these sources. Since many customers call SunService for further explanations, it is appropriate to highlight some common questions and answer them as the experts would.

Question 1 - Where is all my Memory?

Probably the most common performance question of all is "Why does vmstat report only xxxx amount of free memory available?" To use an example, type vmstat 5 and suppose the system shows freemem of 80708 and available swap of 330000. Now start the application and observe that freemem goes down to 8824 and swap goes to 300000. Now stop the application and observe that all of the available swap returns to 330000 but freemem returns only to 21260. Where then is all of the RAM? Do we have a memory leak? The answer is probably no because, as Cockcroft notes, "(the app) starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still in memory, even though they are not active. The memory they occupy is not "free." If you restart the same application it finds the pages that are already in memory. The pages are attached to the inode cache entries for the files. If you start a different application, and there is insufficient free memory, the kernel will scan for pages that have not been touched for a long time, and "free" them. Once you quit the first application, the memory it occupies is not being touched, so it will be freed quickly for use by other applications." (Cockcroft 1)

Leaving parts of the app in memory even after termination is efficient because "Attaching to a page in memory is around 1,000 times faster than reading it in from disk." (Cockcroft 1) So how can one know whether an application has a memory leak? The answer is that there will be a shortage of swap space after the program has run for a while, and the SZ field in ps -elf for that app will grow over time.

Question 2 - My Oracle Server is slow. Can you help me tune the kernel?

The answer depends on the version of the operating system and the level of the patches. Early versions of the OS had performance bugs and incompatible hardware that were the cause of slow performance.
The latest version of the OS is self-tuning for high performance and will work quite successfully on systems ranging from a huge SparcCenter 2000 to small desktops. As Cockcroft says, "In normal use there is no need to tune the Solaris 2 kernel, since it dynamically adapts itself to the given hardware configuration and application workload." (Cockcroft 2) However, for really large Oracle servers some tuning may be needed if using early versions of Solaris 2.3, 2.4, and 2.5 without a kernel patch that automatically adjusts the paging algorithm. Solaris 2.5.1 is self tuning for large memory systems.

Paul Faramelli of the kernel TSE group has put together the following list of tunables for Solaris. Recommendations for large Oracle servers (RAM > 1 GB) are listed. (Tip - Use crash to display kernel tunables. As root, type crash. At the greater-than prompt, type "od -d maxusers" or "od -d lotsfree". The od stands for octal dump, and the -d stands for decimal. By the way, every Solaris tunable [even undocumented ones] can be displayed by typing nm /kernel/unix.) Note that these recommendations are only necessary for early versions of Solaris. Some of the recommendations are provided by Steve O'Neil of SunService. (Caution - there is no right answer)

Parameter        Description                                                 Recommended
---------        -----------                                                 -----------
dump_cnt         Size of the dump
autoup           Used in struct var for dynamic configuration of the age     300
                 that a delayed-write buffer must be, in seconds, before
                 bdflush will write it out (default = 60)
bufhwm           Used in struct var for v_bufhwm; it's the high water mark   8000
                 for buffer cache memory usage, in Kbytes (2% of memory).
maxusers         Maximum number of users (In 2.3 and 2.4 the default is
                 number of Megabytes in memory)
max_nprocs       Maximum number of processes (10 + 16 * maxusers)
maxuprc          The maximum number of user processes. (max_nprocs - 5)
rstchown         POSIX_CHOWN_RESTRICTED is enabled (default = 1)
ngroups_max      Maximum number of supplementary groups per user (def 32).
rlim_fd_cur      Maximum number of open file descriptors per process
                 system wide (default = 64, max = 1024)
ncallout         Number of callout buffers (default = 16 + max_nprocs).
                 (No longer exists in Solaris 2.2 and later releases)
nautopush        Number of entries in the autopush free list                 1024
sadcnt           Number allowed of concurrent opens of both /dev/sad/user    2048
                 and /dev/sad/admin (default 16).
npty             Number of 4.X pseudo-ttys configured (default 48)           1024
pt_cnt           Number of 5.X pseudo-ttys configured (default 48)           1024
physmem          Sets the number of pages usable in physical memory.
                 Only use this for testing, it reduces the size of memory.
minfree          Memory threshold which determines when to start swapping    100
                 processes, when free memory falls to this level swapping
                 begins (default: 2.4 - 4d = 50 pages, all others 25
                 pages, 2.3 - physmem / 64).
desfree          This is the "desperation" level, this determines when       200
                 paging is abandoned for swapping. When free memory stays
                 below this level for 30 seconds, swapping kicks in
                 (2.4 4d = 100 pages, all others 50 pages, 2.3 physmem / 32).
lotsfree         Memory threshold which determines when to start paging.     512
                 When free memory falls below this level paging begins
                 (2.4 4d = 256 pages, all others 128 pages,
                 2.3 physmem / 16)
fastscan         The number of pages scanned per second when free memory
                 is zero, the scan rate increases as free memory falls
                 from lotsfree to zero, reaching fastscan (default:
                 2.4 physmem / 4 with 64Mb being max, 2.3 physmem / 2).
slowscan         The number of pages scanned per second when free memory
                 is equal to lotsfree, also see fastscan (defaults:
                 2.4 is fixed at 100, 2.3 fastscan / 10).
handspreadpages  Is the distance between the front hand and backhand in
                 the clock algorithm. The larger the number the longer an
                 idle page can stay in memory (default: 2.4 physmem / 4,
                 2.3 physmem / 2).
maxpgio          The maximum number of page-out I/O operations per second.   120
                 This acts as a throttle for the page daemon to prevent
                 page thrashing ((DISKRPM * 2) / 3 = 40). This parameter
                 must be set higher if using two swap partitions.
t_gpgslo         2.1 through 2.3, Used to set the threshold on when to
                 swap out processes (default 25 pages).
ufs_ninode       Maximum number of inodes.                                   34906
                 (max_nprocs + 16 + maxusers + 64)
ndquot           Number of disk quota structures.
                 (default = (maxusers * NMOUNT / 4) + max_nprocs)
ncsize           Number of dnlc entries.                                     34906
                 (default = max_nprocs + 16 + maxusers + 64);
                 dnlc is the directory-name lookup cache

Cockcroft on maxusers: "I never set maxusers. It sizes itself based on the amount of RAM in the system. In some cases on configurations with gigabytes of RAM it needs to be reduced to avoid problems with lack of kernel address space. The kernel uses up a lot of space keeping track of all the RAM in a system. Several other kernel table sizes and limits are derived from maxusers." (Cockcroft 2)

Cockcroft on ncsize: "The directory name lookup cache (DNLC) is sized to a default value based on maxusers. A large cache size (ncsize) significantly helps NFS servers that have a lot of clients. On other systems the default is adequate." (Cockcroft 2)

Question 3 - How much swap is needed for a large Oracle database?

Many people are under the impression that very little swap is needed for Oracle because the architecture uses temporary tablespaces for sorting and the SGA is fixed in memory. The truth is that large databases require a lot of swap. The shared memory segment is backed by swap, so the allocated swap MUST be at least as large as the shared memory segments. In addition, when the database uses intimate shared memory this is also backed by swap. All of the Oracle processes must be partially backed by swap. Steve Schuettinger, the Oracle applications specialist at Sun, recommends at least 2 GB of swap for benchmark testing on large servers. Obviously, since RAM plus swap equals virtual memory, once swap is gone the program will halt and no new apps can be started until other programs have stopped.
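
The two sizing statements above (virtual memory is RAM plus swap, and swap must be at least as large as the shared memory segments) can be turned into a small arithmetic check. This is only a sketch: the function name swap_check and the sample sizes are made up for illustration, and the check deliberately ignores the swap needed by the Oracle processes themselves, so passing it does not mean the system has enough swap.

```python
def swap_check(ram_mb, swap_mb, sga_mb):
    """Apply the two sizing statements from the text to one configuration.

    All sizes are in megabytes. (Hypothetical helper for illustration.)
    """
    return {
        # RAM plus swap equals virtual memory.
        "virtual_memory_mb": ram_mb + swap_mb,
        # Swap must be at least as large as the shared memory segments.
        "swap_covers_sga": swap_mb >= sga_mb,
        # How much swap would have to be added to meet that minimum.
        "swap_shortfall_mb": max(0, sga_mb - swap_mb),
    }


# A made-up 1 GB machine with a 768 MB SGA and only 512 MB of swap.
too_small = swap_check(ram_mb=1024, swap_mb=512, sga_mb=768)

# The same machine with 2 GB of swap.
roomy = swap_check(ram_mb=1024, swap_mb=2048, sga_mb=768)
```

In the first case the check reports a 256 MB shortfall, and in the second the minimum is met with 3072 MB of total virtual memory.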
As Adrian Cockcroft says, "The important thing to realize about swap space is that it is the combined total size of every program running and dormant on the system that matters. When a system runs out of swap space it can be very difficult to recover. Sometimes you find that there is insufficient swap space left to login as root or run the commands needed to kill the errant process that is consuming all the swap space." (Cockcroft 3)

In theory, Solaris 2 changes the rules by adding the RAM and the disk space together, so if the system has enough RAM for the workload, "it can run with no swap disk. In practice common database applications that are sized to run in a few gigabytes of RAM will actually need many gigabytes of disk allocated as swap space." (Cockcroft 3) In the same article Cockcroft says, "The consequences of running out of swap space affect a larger number of users on a big server, so it is wise to allocate a lot more than you normally need to cope with any usage peaks. To start with, add twice as much disk as you have RAM." (Cockcroft 3)

(Tip - It is not worth making a striped metadevice to swap on - that would just add overhead and slow it down. There is also a limit of 2 gigabytes on the size of each swap partition, so striping disks together tends to make them too big.)

The relevant commands are /usr/ucb/ps alx (fields SZ or SIZE) and /usr/proc/bin/pmap.

% /usr/ucb/ps alx
 F   UID   PID  PPID CP PRI NI   SZ  RSS WCHAN    S TT     TIME COMMAND
 8  2595  1133  1130  0  48 20  988  360 modlinka S pts/4  0:00 -bin/csh

There is confusion about what the different versions of ps report. "The /bin/ps prints a field labelled SZ, but this is the resident set size in RAM -- printed as RSS by the /usr/ucb/ps. You need to use the SZ or SIZE field reported by /usr/ucb/ps alx in units of kilobytes to determine the amount of swap space used by the process." (Cockcroft 3)

Oracle's Mark Johnson adds the following: "I had thought the standard Oracle rule of thumb was 2 to 4 times physical memory (can be a bit less on very large memory systems). Smaller memory systems may want to use higher ratios of SGA size to physical memory size and higher swap space ratios. (I ended up using ratios of 1:1 and 1:4 for a very small Solaris for Intel system with surprisingly good results.)"

Hal Stern says, "So why do you need swap space if your SGA << phys mem? The short answer is that the "phys mem" in that calculation is the non-locked-down physical memory, and when you allocate an oracle SGA, you allocate intimate shared memory (ISM) that is taken out of the physical memory pool (ie, it gets locked down). so on a 1 Gbyte machine, you may think you're ok with a 256M SGA, leaving 700M+ for processes. BUT: the 256M SGA gets taken out of the available memory pool, so your maximum VM is only 700M+, and you could probably use the swap space....as the SGA/memory ratio goes up, this is even more true." (private letter from Stern)

Question 4 - Will a faster cpu help performance?

The question is not easy to answer. As Hal Stern noted, "Noticing that you're using 20 percent of the CPU doesn't mean anything until you know the kind of work that's using the cycles. If you're CPU-bound, then you have headroom to increase the workload by a factor of four or five. An I/O-bound job, however, that uses 20 percent of the CPU might be improved by adding disk spindles. As you increase the disk count and I/O load, to ease the bottleneck, you'll use more CPU to deal with the I/O setup, system calls, and interrupts from the additional work. You run the risk of morphing a disk problem into a CPU shortage. How do you know when relaxing one constraint pops another one into the foreground? Define the right relationships -- CPU time used per disk I/O tells you how much system time you eat up as you add disk load -- and measure with your tailored yardstick."
(Stern 1)

Preventing Kernel Memory Starvation

When Oracle is working very hard and the operating system is Solaris 2.3 or early Solaris 2.4, it is possible to have kernel memory allocation faults that can eventually lead to kernel memory starvation. A new memory allocator algorithm has been developed and integrated into Solaris 2.5.1 (the old allocator had paging thresholds that were too low, which caused kernel memory allocation failures on very large systems). The allocator has been back ported to rev 40 of the Solaris 2.4 jumbo patch and to a future rev of the 2.5 jumbo patch. No fix has yet been developed for Solaris 2.3. (Tip - large database users should upgrade to Solaris 2.4 or better).

In the past, Oracle customers could manually adjust paging thresholds. The actual value that needed to be set was proportional to, and depended upon, the amount of memory and the number of cpus on the system. In some cases decreasing maxusers and bufhwm would also mitigate the problem. The total allowable size for the kernel on the ultrasparc servers running 2.5 is now so large that kernel memory allocation problems on very large systems are virtually impossible. See the examples below. The crash output displaying kernel memory starvation is taken from a SparcServer 1000 running Solaris 2.3 with 1 GB of RAM and 8 cpus.

Kernel memory limits

Solaris 2.4:     Solaris 2.5:
sun4c   33MB     sun4c    33MB
sun4m   61MB     sun4m   100MB
sun4d  139MB     sun4d   251MB
                 sun4u  2525MB

$> kas crash 15
> map kernelmap
FREE: 2042   WANT: 1   SIZE: 2042
SIZE    ADDRESS
TOTAL NUMBER OF SEGMENTS 0   TOTAL SIZE 0
> kmastat
                       total bytes   total bytes
size       # pools     in pools      allocated     # failures
-----------------------------------------------------------------
small        6807      26138880      25677584      1989915
big          2652      75276288      73046528            0
outsize         -             -      18571264        45351

Crash is a very powerful tool that helps analyze kernel memory allocation failures. In the output, "TOTAL SIZE 0" indicates that no more free kernel memory exists.
The FREE field (2042) indicates that there is still plenty of memory in the user portion of the virtual address space. Carl of Sunsoft provides an explanation of kernel map scarcity under Solaris 2.3 and Solaris 2.4. "In the overwhelming majority of cases on large database servers, we have found that 64MB is overly generous for bufhwm in that it can be cut back by one-half (to 32MB) without too much of an impact on the cache hit ratio. What is usually in short supply on these machines is not the buffer cache but the amount of kernel heap (mapped by kernelmap) that remains for non-buffer cache usage. Limiting buffer cache growth to 32MB frees up an addition 32MB to the heap and has proven successful in avoiding kernelmap scarcity at a number of sites running large database applications. Kernelmap scarcity (or equivalently kernel heap scarcity as the size of the kernel heap is limited by the size of the address space the kernelmap can map) results in an extreme slowdown of processing in the systems. All of a sudden kernelmap becomes a scarce resource that every thread contends for and to exacerbate the situation the rate of release is slowed by the very same contention to the point that kernelmap turnover grinds down almost to the point of deadlock. Why 64MB's worth of kernelmap is inadequate for the largest database servers is unknown. The sites on which this has been a problem have been checked for kernelmap leakage and none has been found. There has also been a problem in the past with some kernel data structures being pre allocated from the heap and the size of this pre allocation being inappropriately scaled to physical memory. As it is fairly common now for machines to be equipped with 3GB of physical memory, this was not the right thing to do and did account for some kernelmap depletion headaches. But this particular bug has been fixed. 
With these two things discounted, the only conclusion is that modern database workloads are driving up peak transient demands for kernelmap to the 100MB level."

(Tip - For large databases running Solaris 2.4 or less, set bufhwm to 8000 on 4c, 4m, and 4d, or upgrade to Solaris 2.5, which has a large kernel map address space.)

Acknowledgements

I want to thank Sun performance gurus Adrian Cockcroft and Hal Stern for their contributions to this paper. UNIX architect Mark Johnson of Oracle and database expert Jim Skeen of Sunsoft provided comments on Oracle internals. Kernel architect Jeff Bonwick has added explanations and suggestions regarding kernel memory allocation and kernel memory starvation. SunService kernel engineer Paul Faramelli documented the Solaris tuning parameters, and SunService Technical Expert Steve O'Neil provided recommendations for tuning large Oracle databases on versions of Solaris that are not self tuning. Finally, I want to thank Uresh Vahalia, who gave me permission to quote at length from his wonderful book "UNIX Internals - The New Frontiers".

Disclaimer

The author alone is responsible for the contents of this paper. No one at Sun Microsystems, Sunsoft, SunService, or the Oracle Corporation has reviewed or approved the paper for completeness or accuracy in its published format, and nothing in the paper can be construed as the official policy of Sun Microsystems or the Oracle Corporation.

References

"UNIX Internals - The New Frontiers" by Uresh Vahalia, Prentice Hall 1996
"How the Solaris Kernel is Optimized for Oracle" by Mike Jaffee 1996
"Shared Page Table: Virtual Memory Enhancement for Data Sharing in UNIX" by H. Yoo
"Comparative analysis of Asynchronous I/O in Multithreaded UNIX" by Hyuck Yoo
"Help! I've lost my memory!" by Adrian Cockcroft, SunWorldOnline 1995 (1)
"What are the tunable kernel parameters for Solaris 2?" by Adrian Cockcroft (2)
"How does swap space work?" by Adrian Cockcroft, SunWorldOnline 1995 (3)
"We suggest creative ways to better your system performance" by Hal Stern
System Services Guide - Solaris 2.4 Manual, SunSoft, 1994
"The Slab Allocator: An Object-Caching Kernel Memory Allocator" by Jeff Bonwick