SUMMARY: Optimizing Disk throughput on a Sparc 1+ file server

From: Henrik Schmiediche (henrik@stat.tamu.edu)
Date: Thu Feb 11 1993 - 01:00:10 CST


      Hi,
Thanks to everyone who responded to my question concerning
"Optimizating Disk throughput on a Sparc 1+ file server."
Here is the original question with a summary/selection of
responses I received.

Original Post:
--------------
  
  Recently after one of our machines died, we reconfigured a Sparc 1+
  (with 2 large hard disks, SunOS 4.1[.0]) to act as a file and yellow
  page server for about 15 machines. This Sparc is now working its heart
  out keeping up with all the work being dumped on it. In the future we
  will be upgrading to a more powerful machine, but until then I would
  like to ask the net community for some input on how to optimize the
  disk I/O throughput on this machine.

    1) This Sparc has only 8 megabytes of memory. Will increasing the
       memory significantly increase performance?

    2) The daemons nfsd and biod handle the NFS clients. Does increasing
       their number help? The biod daemons do not seem to be accumulating
       any CPU seconds on our system.

    3) Any general configuration options we can apply to boost throughput?

Replies:
--------

1) This Sparc has only 8 megabytes of memory. Will increasing the
   memory significantly increase performance?

Perry_Hutchison.Portland@xerox.com:

8Mb is probably enough to keep a pure server (I assume you're not
running any user jobs on it) from paging/swapping; however, more might
help by allowing the disk cache to grow. Memory is cheap, but an even
cheaper thing to try (if you haven't already) is to build a smaller
kernel by removing unused devices and facilities.

pete@cosc.canterbury.ac.nz:

Probably. Type "vmstat 1" and watch the output for 5 or 10 minutes during
a normal-use period. If the "po" column under the "page" heading is non-zero,
then memory will help. If it is rarely zero, memory will help a lot.
If the "w" field under the "procs" section is non-zero, then you are
getting processes swapped out -- i.e. a SERIOUS memory shortage. Our server
had 40 Megs.
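
For reference, the check looks like this (run it during a normal-use period
and watch the columns named above):

        vmstat 1        # one-second samples; watch "po" under "page" and
                        # "w" under "procs" -- sustained non-zero values
                        # point to a memory shortage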

mr080216@moosehead.eng.canadair.ca:

Probably *VERY* significantly. Put as much memory as you can afford
in this system; this will help a lot. Although 16M would help, 32M or
more would be better. (SunOS uses as much memory as is available to
cache buffers, and with 15 clients you need a lot of cache.)

stern@sunne.east.sun.com:

all available memory is used for file cache, so adding memory
increases the size of the file page cache. adding memory is
generally a good thing for a file server. it depends upon how
many files are shared, or for how long a file gets used and re-paged
from the server, but this should help (it's cheap enough).

eckhard@ts.go.dlr.de:

Adding memory will definitely give a performance gain if most of the
NFS operations are reads. If writing is a significant part of the
NFS traffic (see nfsstat -s), then a PrestoServe board or eNFS
software will speed things up (both will work in the upgraded machine
as well if it's a Sun4c 4.1.X machine).

hanson@pogo.fnal.gov:

8MB of memory is not sufficient; add at least another 8MB, plus
another 8MB if you are going to run X applications.

**************************************************************************

2) The daemons nfsd and biod handle the NFS clients. Does increasing
   their number help? The biod daemons do not seem to be accumulating
   any CPU seconds on our system.

mr080216@moosehead.eng.canadair.ca:

Nfsd's are used to serve the clients. Biods run on the client (you may
be both a server and a client). If you check your clients' biods you should
see them accumulating time. (The biods are used to send read-ahead requests
to the server.)

stern@sunne.east.sun.com:

biod does the client-side NFS work. if you aren't mounting filesystems
on this machine, biods aren't doing any work. nfsd daemons do the
server-side work. you should add them to accept a larger incoming
load; generally 16 or so work well. run fewer if you have only
a few disks, since the number of daemons controls how many disk requests
you can have outstanding at once (adding daemons if they can't get to
the disks doesn't really help).

bernards@ecn.nl:

Pumping the number of nfsds up to 12 can fix things, but only if the machine
does not swap/page its own heart out. Increasing MAXUSERS and the ethernet
buffers will add some more juice.

rauls@usb.ve:

The nfsd number on a server can be increased if there is a significant
nfs load. The biod number only affects outgoing nfs mounts and doesn't
really affect a server. There is a tradeoff on increasing the number
of nfsd's - I don't remember the details. It turned out for us that
with ~20 clients (none diskless or dataless) the default number of
8 nfsd's was fine. Having 32 Meg of physical memory was the key
(although I am using a 4/490 as a server, I still believe memory is the
crucial element).

hanson@pogo.fnal.gov:

Edit /etc/rc.local and change "nfsd 8" to "nfsd 24"; that should be sufficient
to handle any load.
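
The stock nfsd startup line in /etc/rc.local looks roughly like this (the
exact surrounding text may differ slightly between 4.1.x releases):

        if [ -f /usr/etc/nfsd ]; then
                nfsd 8 & echo -n ' nfsd'        # change 8 to 24 here
        fi

The change takes effect at the next reboot.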

You can check whether you are getting network overflows with the following
command:

# netstat -s | grep overflow | grep -v '0 socket overflows'

If you are getting more than, say, 500, then increase the nfsd number as
described above and reboot.

aad@lovecraft.siemens.com:

It would actually hurt.

poc@shaddam.usb.ve:

The rule of thumb is "two nfsd daemons per mountable drive". You might
gain a little by having a few more, but your basic problem is lack of
memory. As regards "biod", that's only useful in NFS clients, not in
servers. If your server is also a client, then use it. If not, you're
just wasting memory.

herman@telecom.ti.com:

Use the following procedure to determine the proper number of nfsds
(document #3619 from the Sun Symptoms and Resolutions database). The
number of biods should only matter if your Sparc 1+ is used as a
client of some other server.

To optimize the number of nfsds, run netstat -s a few times on the
server when it is loaded, and look at the number of socket overflows
in the udp: category. If it is growing, run more nfsds. If it is
zero (0), you have enough. (If it is large but not changing, someone
probably recently ran spray on the network.)

Determining whether you have too many nfsds is harder, since it involves a
subjective assessment of the quality of NFS service seen by clients. Run uptime
on the server when it is loaded and record the load average. Increase
the number of nfsds, reload the server, and run uptime again. If the
load average increased significantly but the NFS response seen by the
clients did not improve at all, back down to the previous number of
nfsds.
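
If you want to watch that counter over time rather than running netstat by
hand, a trivial Bourne shell loop (the interval here is arbitrary) will do:

        #!/bin/sh
        # log the UDP socket overflow counter every 5 minutes; a count that
        # keeps growing under load means more nfsds are needed
        while :; do
                date
                netstat -s | grep 'socket overflows'
                sleep 300
        done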

**************************************************************************

3) Any general configuration options we can apply to boost throughput?

chris@cs.yorku.ca:

Get another SBus SCSI controller ... it should add some throughput and you can
re-use it in the next machine.

mr080216@moosehead.eng.canadair.ca:

You should probably rebuild your kernel, stripping out as many unused options
as you can and, most importantly, bumping "maxusers" up significantly (32, or
even 64, would be reasonable).
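
The usual SunOS 4.x recipe for that is roughly the following (as root;
"MYKERNEL" is just a placeholder name for the new configuration):

        cd /usr/kvm/sys/sun4c/conf      # kernel config directory on a sun4c
        cp GENERIC MYKERNEL
        # edit MYKERNEL: delete unused devices and options, and raise
        # the "maxusers" line (e.g. "maxusers 64")
        config MYKERNEL
        cd ../MYKERNEL
        make
        cp /vmunix /vmunix.old          # keep a bootable fallback kernel
        cp vmunix /vmunix
        # then reboot on the new kernel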

Perry_Hutchison.Portland@xerox.com:

If you can find another machine to dump some of the workload on, you
could try moving the NIS (YP) service elsewhere and let this machine
concentrate on NFS service.

stern@sunne.east.sun.com:

make sure "maxusers" is large - 128 or so on the server. make
sure you have installed the ufs_inactive() [patch 100259] patch,
to prevent any page cache thrashing problems. upgrade to 4.1.3
if you can.

pla_jfi@pki-nbg.philips.de:

1. Do NOT (!!) allow users to log in to file servers, and put absolutely no
graphics equipment on these machines. (We have a 490 with a screen, and
performance falls when a user is logged in and uses OpenWindows.)
2. A PrestoServe board is a good idea; it gives better write performance.

bernards@ecn.nl:

Don't use it as a workstation; OpenWindows in particular will eat CPU
and memory. Look at /usr/etc/pstat -T and the swap consumption, and try
to put up a perfmeter and watch page/swap activity. If it's more than
the memory available, it makes sense to add some SIMMs. The public-domain
top(1) program is also OK for testing.

cyerkes@jpmorgan.com:

  The first, MOST IMPORTANT recommendation I have is to buy and read the
O'Reilly book System Performance Tuning; it is of great help in teaching
you to analyze these things. You can see if you need more RAM by watching
swapping and paging stats, and you can watch iostat output to see which
disks are being beaten on and distribute the load across disks or across
disk controllers. You need to know WHICH resource you are running
short of, RAM or I/O bandwidth (CPU is unlikely, unless you are
running it as a compute server).

  A Sparc 1 class machine has more than enough power to be a file
server, but if you compute on it (route, run a graphics console, etc.)
you will slow it immensely. The Sparc 1's limitations are contexts (see
below) and the SBus. Putting a higher-performance, caching SCSI and
Ethernet card in it will enhance things, but it's more worthwhile on a
Sparc 2 or 10. I have no qualms about running a headless Sparc 1 as a
swap server for the diskless machines on an ethernet segment.

  A Sparc 1 has 8 contexts (i.e. the information for 8 jobs is available
to the MMU in cache; otherwise it goes out to slower memory). More
than 8 NFS daemons will hinder you. A load > 8 will slow the machine
more than proportionally.

 SunOS 4.1.2 has a number of bug fixes and minor tunings and is
recommended (with appropriate patches).

  You don't say what the other machines are, but I'll guess Suns or at
least Unix.

  If you are running apps from a file server, make them READ ONLY
exports if you can. This keeps the machine from reserving any space for
writes to these mounts. It also means you can DUPLICATE the trees on
another machine and use automount (Sun's if you must, AMD if you can) to
mount from whichever machine answers first. Areas like /usr/share
(man pages, etc.) and /usr/openwin/share (extra OpenWindows files) can
EASILY be exported read-only from a couple of machines, with automount
using a list of servers. I've set up machines where 1 in 5 exports these
to enhance speed.
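
As a sketch of that setup (hostnames here are made up; syntax is the usual
SunOS 4.x /etc/exports and automounter map format):

        # /etc/exports on each server holding a copy of the tree,
        # followed by "exportfs -a"
        /usr/share      -ro,access=ourhosts

        # direct automount map entry (e.g. in /etc/auto.direct, listed in
        # auto.master as "/-  /etc/auto.direct"); the automounter mounts
        # from whichever listed server responds first
        /usr/share      -ro     server1,server2:/usr/share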

  If you have a machine that's used less than others, make it an NIS
slave server, remembering that if IT goes down, anyone bound to it will
hang for up to 5 minutes until they rebind to another NIS server.

  Overall though, I'd say that learning how to recognize limits within
SunOS and Unix in general will help you most (and it's good on a
resume). I like to gather stats every X minutes (I'll use 30, usually)
and then have something to compare to later. As a statistician, you
can make all sorts of comparisons, 3D graphs, and so forth with
something like GnuPlot (a really useful and/or FUN program) to prove
any point you want (try graphing virtual memory use against disk hits
for a truly random graph).

abbott@ms.uky.edu:

if you are doing a lot of swapping, it sometimes helps to interleave the
swap over more than one drive (have swap partitions or swapfiles on
each disk). or better yet, get more memory so it doesn't have to wait
on swap. there are probably some kernel mods you can make, but someone
else will have to help you there.
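
A sketch of the interleaving part under SunOS 4.x (the device name is only an
example; use a real, otherwise-unused partition on the second disk):

        # add the second disk's swap partition to /etc/fstab:
        /dev/sd1b       swap    swap    rw      0 0

        # enable it immediately (at boot it is picked up by "swapon -a"):
        swapon /dev/sd1b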

trinkle@cs.purdue.edu:

     The biod processes are client-side processes; the nfsd processes are
the server-side daemons. The biod processes do read-ahead/write-behind
buffering on the client side.

     This may not be your problem, but we found NIS (YP) to be a
substantial load on our systems as delivered by Sun. The fault is in
the library routines that request information. There are two problems
in particular. One, the services database, is (oft-repeated) utter
stupidity on Sun's part. A complete description follows.

     The other is a problem with the UNIX design of the "group"
database implementation. The database (/etc/group) was designed as a
flat file before the days of NFS (obviously). It is keyed on group
(either group name or gid). There is no access to the data keyed by
user (i.e. in what groups is user x). This means that initgroups()
(called at login time to initialize the user's groups, when cron or at
runs a job, etc.) must scan all entries checking for the user. This
was not too bad when the group database was a local file. The entire
file, or large portions of it, would be buffered in memory, and
subsequent reads (getgrent()) did not cost much. However, with NIS, each
getgrent() is a very expensive operation. The client sends a request
over the network to the server, the server must find out where it left
off for that client, stepping through each entry in the database (DBM
file), and then return the correct answer over the network to the
client. If you have a large group database, this can be very time
consuming. What I did was to build an additional NIS database called
group.bymember. It is keyed on the user, and the value is the list of
numeric gids of which the user is a member. This makes initgroups()
essentially one NIS request, rather than possibly hundreds (in our
case).
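
The map-building details are not included here, but the inversion itself is
simple; a hypothetical standalone sketch (the real thing would be a makedbm
target added to /var/yp/Makefile) looks something like:

        # print "user  gid,gid,..." pairs from the group file; piping this
        # through makedbm would produce a group.bymember map
        awk -F: '{
                n = split($4, u, ",")
                for (i = 1; i <= n; i++) {
                        if (gids[u[i]] == "")
                                gids[u[i]] = $3
                        else
                                gids[u[i]] = gids[u[i]] "," $3
                }
        }
        END {
                for (user in gids)
                        print user, gids[user]
        }' /etc/group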

     When we installed our first 7 Sun 4/110s (real dogs by today's
standards), they swamped our NIS server running on a Sun 3/260. After
I replaced the getservby*() and initgroups() routines, there was no
problem. Over the years we added dozens of much faster SPARC clients
without any problems. We eventually moved the NIS server to another
SPARC, but mainly for added reliability (the Sun3 was crashing
frequently).

     If you have source and want the initgroups() replacement/patches
and the getservby*() patches, let me know.

trinkle@cs.purdue.edu

     Ever since we converted to SunOS 4.0, we have had a bad problem
rebooting diskless clients. For about the first 5-10 minutes after
reboot, it is almost impossible to log in. Even localdisk machines
exhibit the same behavior, but usually a bit faster. Incredibly
enough, this problem is due almost entirely to a big mistake made in
the getservbyname() library routine.

     This problem actually started some time back in SunOS 3.2 or 3.4
(maybe earlier). There used to be two YP maps for services
entries, one called services.byname and one called services.byport
(or maybe bynumber; it is not that important). One was keyed by
port/proto and the other was supposed to be keyed by name/proto.
Unfortunately, the /usr/etc/yp/Makefile target for building
services.byname had a mistake in it that caused the key to be
incorrect. It was either just the name or port/proto; I don't
remember exactly. I corrected the Makefile and it was just fine.
However, it was not Sun code that had the problem with the YP map
having the incorrect key; it was a uVAX-II running YP. It turns out
that the Sun getservbyname routine was roughly

        if (ypmatch("name/proto", "services.byname")) {
                return success;
        } else {
                /* quietly falls back to a slow YPPROC_FIRST/YPPROC_NEXT scan */
                while ((se = getservent()) != NULL) {
                        if (match(se))
                                return success;
                }
        }

so the routine was quietly failing each time and resorting to a
painfully slow YPPROC_FIRST, YPPROC_NEXT loop of getting service
entries.

     Well, under SunOS 4.0 they got rid of this problem in a clever
way - they removed the initial part of the if statement, just used the
while loop, and did away with the services.byname map. Ah, you say,
there is a services.byname map. Well, that is what they call it, but
it is NOT services.byname, it is services.byport. The initial
analysis of this brilliant design decision is that no one at Sun could
figure out how to make the Makefile target generate a correct map!

     Unfortunately, when a machine boots, most of its YP traffic is
doing getservbyname on every single service for which it starts a
daemon or for which inetd listens. By the way, this same behavior
also exists in getrpcbyname(), but it was not worth building a new map
for a file that only has 30 entries. Also, this is not used as much
as getservbyname(). getservbyname() is also used every time you do an
rsh, rlogin, telnet, ftp ... well, a lot of stuff.

     I have a patch to /var/yp/Makefile and
/usr/src/lib/libc/net/getservent.c to actually make and use a
services.byname map. Note, however, two gotchas. First, the comment
field of the /etc/services line must begin with a "# " because the #
must be a separate field to awk. Second, note that I call the new map
services.byport instead of the correct services.byname. This is
because other vendors have blindly adopted Sun's code without looking
at it very closely, and consequently they also use the services.byname
map to do the getservbyport() routine. To prevent them from looping
trying to do getservbyport() calls, it is best to keep the old map the
same.

     If you are a source site and want the patches, please let me
know. I would also be very happy to have the getservent.o installed
in the libc.pic on UUNET, but I don't know about the legalities. I
hope Sun will take the patches, and also allow us to somehow or other
distribute correct binaries. This fix makes rebooting every time
processes get hung in device wait a bit easier to tolerate (but not
much).

     For those who might be interested, I also made a hack patch to
etherfind so that it will print out the YP map name for
YPPROC_{MATCH,FIRST,NEXT} RPC packets (when using -r). Again, if you
are a source site and interested in these patches, please let me know.

cal@soac.bellcore.com:

first, a 1+ is sufficient to support a surprising amount of NFS traffic.
increasing memory will help substantially since the entire space will be used
by the OS to cache file pages. the more memory you have the more pages are
cached and since memory access times are about an order of magnitude faster
than disk access times ... Next I would remake the kernel with MAXUSERS
equal to 128. this bumps up the size of some internal structures
(e.g. inode/cylinder data) that will improve response time. the nfsd's are
used to provide file service (server side) and the biod's are used to
optimize the client side. so, on a server you could increase the nfsd's
from 8 to 10,12,16 etc. and you will see some improvement. the next thing
to do is get the fastest disks (i.e. avg transfer+latency) you can afford
since this can easily become the bottleneck. also, try to have swap on a
separate bus and disk so that paging activity will not contend with regular
disk access. this may mean buying another SCSI controller (~$1000 list).

Also, you might consider upgrading to 4.1.3 and/or applying the UFS/NFS jumbo
patches (100173-09/100623-03).

lastly, you should get a copy of Hal Stern's Managing NFS & NIS as it is
a very readable text that describes behavior and aids troubleshooting.

hanson@pogo.fnal.gov:

But the main favor you can do yourself is putting more memory
in the system. A 1+ makes a pretty good server. If your users
aren't working the server really hard, this may be sufficient for your
needs. A faster CPU will probably not help you at all. Issues
like enough memory and splitting disk load across multiple
disks (and maybe multiple controllers) will help a LOT more than
a faster processor.

hanson@pogo.fnal.gov:

Make sure you are not running the small generic kernel. You may be running
out of processes if you increase the number of nfsds too high.

# pstat -T      will tell you how many processes you are using and how
                much swap space you are using

poc@shaddam.usb.ve:

Read Hal Stern's book "Managing NFS and NIS", O'Reilly and Associates
1991 (around $30).

poc@usb.ve:

I can't see using a SUN for ANYTHING with only 8 MB of memory.
At the absolute least upgrade to 16MB, preferably 24MB.

I don't really know where the benefit/cost tradeoff comes in--
keep in mind that on the one hand you have a machine with
the slowest possible disk I/O (async SCSI), but on the other
hand you will be limited by the ethernet bandwidth as long
as no user actually runs anything on the server.

Upgrading to 4.1.2 would probably be a good idea--it works
well for us (I don't have any experience with 4.1.3 on a
Sparc1+ yet).

In your case I don't think the number of daemons matters.

peb@sandoz.ueci.com:

Check out the Nutshell Handbook on System Performance and Tuning.
This will help with general configuration information as well as help
estimate the amount of memory you should add.

mcgrew@cs.rutgers.edu:

Most (4.X) kernel-tuning is for increasing table sizes -- mostly in
response to having more users running more processes, not for nfs
stuff. ... in sum, add more memory. The improvement will be striking.

--
Henrik Schmiediche, Dept. of Statistics, Texas A&M, College Station, TX 77843
E-mail: henrik@stat.tamu.edu  |  Tel: (409) 845-9447   |  Fax: (409) 845-3144


