Folks,
I'd like to take a little time to describe a problem that was
troubling us for a couple of months until we finally tracked it down
yesterday afternoon.
We have a Sun 4/470 running SunOS 4.1_PSR_A. About two months ago,
we finally had enough users logging in that we started to run out of pseudo
terminals. No problem, right? Just make a few more. My boss did just
that, he made a few more...and a few more...and more...
Unfortunately, making the new ptys didn't help at all. After
about 35 or so users were logged in, subsequent users logging in were
getting the message ``all network ports in use'' or ``rlogind: out of
ptys''. We knew something was wrong here, since we knew there were plenty
of ptys (96 at the time). So, after some digging around in the old
sun-managers messages, we had attributed the problem to the old pty disease
where there were problems in the way some ptys were released, causing them
to hang. We tried and tried to get Sun to help us. The only thing
we heard from them was that the ``pty disease'' problem was supposedly
fixed after SunOS 4.0. They were of no other help. They essentially
denied that there was a problem by saying ``We've never heard of that
problem.'' And I never heard back from them.
Well, yesterday, my boss (Phil Draughon) and I decided it was time
to figure out what was going wrong. A few days ago, I had suggested that
we try increasing the value of MAXUSERS in the kernel config file. While
neither Phil nor I could think of a reason that it would help, we
thought we'd try it anyway. It was set at 32; we increased it to 128.
Before building the new kernel, I had been using netstat -m to keep an eye
on things and had noticed that once the number of queues (under streams
allocation) reached the maximum value displayed (200), you could
no longer log in; you'd get the ``out of ptys'' error. Kill a few
processes to get the number of queues below 200 and you could log in again.
Seemed strange. It turned out to be a wild goose chase. We installed and
rebooted with the new kernel and nothing changed...we had the exact same
problems, except this time, the number of queues went up to 203.
The next thing we thought of was the possibility of having too many
files open. We checked the kernel and decided that the values were
plenty high enough.
Because of Phil's previous experience with the problem, we knew
that the open()s in telnetd and rlogind were failing. Our question was
``WHY??!!??''. Unfortunately, whenever telnetd and rlogind have an open()
fail, they incorrectly assume that the reason for the open() failing is
that you're out of ptys. This is not -always- the case. In our case, we
had 48 more. So, we started trying to open some of the ptys. When we
tried to open anything above /dev/ptyrf, the 48th pty, we were rather
surprised to get the error ``No such device or address''. Great, time to go
digging into the kernel again. Well, after a bit of digging around, we
found it. In the file /sys/os/tty_ptyconf.c there is this conspicuous
line:
#define NPTY 48 /* crude XXX */
We couldn't believe it! Sun actually hard-coded into the kernel
the maximum number of ptys you could have, and they didn't even make it a
configurable option. At the very least, this should be in the config file.
After several minutes of cursing Sun under our breath, we upped it to 128 and
rebuilt a new kernel and rebooted. Problem solved.
I can't believe Sun wasn't able to tell us how to fix this. With a
little more research, we found that this is carried over from the old BSD
code, but under BSD they set it to 32.
I must say, we're rather disappointed with Sun's software support
after this little fiasco.
===============================================================================
Christopher D. Nims Chris_Nims@nwu.edu
Distributed Systems Services
Academic Computing and Network Services: Northwestern University