SUMMARY: RSH / RCMD Socket Errors on Solaris 8, Solaris 2.

From: Tim Chipman <chipman_at_ecopiabio.com>
Date: Tue Feb 26 2002 - 10:47:30 EST
A **BIG** thanks goes out to those who responded promptly, leading me
straight to a painless solution. (Casper Dik, Andy Lee, Frank Smith).

Consensus was:

-do some tcp tuning/monitoring ("netstat -an" gives a feel for the state
of connections, i.e., whether they are in use or sitting in a wait state)

-tweak the tcp stack to decrease the timeout value from (the default) 4
minutes to 1 minute for sockets:

ndd -set /dev/tcp tcp_time_wait_interval 60000  [solaris 8]
ndd -set /dev/tcp tcp_close_wait_interval 60000 [solaris 2.6]

(NOTE that these tweaks do *not* persist across reboots; I've created a
"kludge" script, /etc/rc2.d/S99_ndd_tcp_kludge, which sets these
parameters at system boot.)
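
A boot script along these lines would do the job. This is a sketch only,
not the actual contents of S99_ndd_tcp_kludge: the tcp_wait_param helper
and the NDD variable are made-up names for illustration, and the release
strings 5.6 and 5.8 assume that "uname -r" reports Solaris 2.6 and
Solaris 8 that way.

```shell
#!/bin/sh
# Sketch of a boot-time script such as /etc/rc2.d/S99_ndd_tcp_kludge.
# tcp_wait_param and NDD are illustrative names, not from the real script.

NDD=${NDD:-/usr/sbin/ndd}

# Print the ndd parameter name for a given "uname -r" release string.
# Only the two releases discussed in this thread are handled.
tcp_wait_param() {
    case "$1" in
        5.6) echo tcp_close_wait_interval ;;   # Solaris 2.6
        5.8) echo tcp_time_wait_interval ;;    # Solaris 8
        *)   echo "" ;;                        # anything else: no parameter
    esac
}

release=`uname -r`
param=`tcp_wait_param "$release"`

# Only touch the TCP stack if the release is known and ndd is present.
if [ -n "$param" ] && [ -x "$NDD" ]; then
    "$NDD" -set /dev/tcp "$param" 60000        # 60000 ms = 1 minute
fi
```

On any other release the script does nothing, which seems safer than
guessing at a parameter name.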

It seems that rsh makes use of "reserved" ports (i.e., < 1024), of which
there are only ~400 available. With a socket timeout of 4 minutes, and
most of my rsh jobs taking < 20 seconds, I was typically saturating the
400 available reserved ports: most were sitting idle waiting to time
out, yet still blocking new rsh sessions from being established. Hence
the "socket: All ports in use" type error messages.
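
As a back-of-envelope check on that explanation: if N jobs run in
parallel, each lasts about d seconds, and every finished session ties up
its reserved port(s) for the wait interval w, then roughly N * w / d
ports are stuck waiting at any moment. The sketch below just does that
arithmetic. The 5-second job duration and one port per session are
assumptions for illustration (the only figure given is "< 20 seconds"),
and ports_in_wait is a made-up helper name.

```shell
# ports_in_wait N JOB_SECS WAIT_SECS PORTS_PER_SESSION
# Rough steady-state estimate of reserved ports stuck waiting to time out.
# Integer arithmetic only; all inputs are whole numbers.
ports_in_wait() {
    expr "$1" \* "$3" \* "$4" / "$2"
}

ports_in_wait 14 5 240 1    # 4-minute wait, assumed 5-second jobs
ports_in_wait 14 5 60 1     # 1-minute wait, assumed 5-second jobs
```

With these assumed numbers the first call prints 672, well over the ~400
ports available, and the second prints 168.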

I've tweaked the parameters on my solaris 8, 2.6 boxes and now they both
run equally well (ie, no errors).
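
To see whether reserved ports are piling up in a wait state, the
"netstat -an" output can be tallied. The sketch below assumes the
Solaris layout where the local address is the first column in addr.port
form and the state is the last column. count_reserved_time_wait is a
made-up helper name, and it counts any local port below 1024, not just
ports used by rsh.

```shell
# Count TIME_WAIT sockets whose local port is in the reserved range.
# Reads "netstat -an" style output on stdin and prints a single number.
count_reserved_time_wait() {
    awk '$NF == "TIME_WAIT" {
        n = split($1, parts, ".")          # port is the last dotted field
        if (parts[n] + 0 < 1024) count++
    }
    END { print count + 0 }'
}

# Typical use on the box that issues the rsh commands:
#   netstat -an | count_reserved_time_wait
```

Running it before and after the ndd change, under the same load, should
show whether the shorter interval is actually freeing ports sooner.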

Another option mentioned by Casper Dik (in addition to the "ndd"
tweaking) was to use something like "ssh" instead of "rsh", which is not
limited to reserved ports .. hence no ~400 port limit ... hence this
entire kettle of worms becomes far less relevant. Initially, when
setting up this software, I had been concerned that ssh might introduce
noticeable delays, on the encrypting CPU in particular, but
preliminary benchmarks attempted today suggest this concern is almost
certainly unfounded, esp. if we force use of ciphers that are NOT
"the MOST robust available". Private-Public Key pairs let me have
"no-authentication" ssh commands being pushed in the same manner as rsh,
so it is a very strong candidate to replace rsh when we scale this thing
onto a higher number of parallel tasks (ie, from the current 14-at-once
to .. more .. later).

Finally: I was *TOTALLY* down the garden path with my tweaking of
pt_cnt, npty, etc etc. In this case, my google searches that took me
down this avenue were really a hindrance / distraction more than a
help. "Old Style" BSD ttys were simply not an issue in this case (and
apparently they are rarely an issue with most solaris apps .. )

Anyhow. A big thank-you to everyone for their help. I hope that the
archives of this posting help prevent others from making the same
mistakes I did :-)



---Tim Chipman




:original posting follows:

-=-=-=-=-=-=-=-=-=-=--=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-
Hi all,

I've got a bit of locally-developed software which is giving a fair bit
of grief.

I'm attempting to find how to resolve these errors, and after extensive
crawling through google and the list archives, all the apparently
obvious fixes still leave me in a bind, so I'm hoping someone can
comment on what else can be done with the situation.

Basic scoop is: The program spawns N parallel processes initially,
where each process will rsh a command through to another server,
capture the stdout it receives, parse the output, and save it to a
local file. As one process completes, another is spawned to ensure there
are always N parallel processes grinding away. For the moment, N=14, but
in the future I expect this to increase to 28 or higher.

The Problem:

Initially most critically observed on the Solaris 2.6 box running this
software, it would often throw error messages to the console:


socket: All ports in use


This appeared to be a function of how many concurrent rsh sessions the
solaris 2.6 box was happy with. A bit of reading suggested there were
some kernel parameters in /etc/system which could possibly be tuned to
alleviate the situation, including:

set pt_cnt=512 (it was already high ; this box is a sunray server with
plenty of users logged in with many term windows open).

set npty = 176  (didn't exist initially, was set and didn't seem to help
as much as expected ; also tweaked the /etc/iu.ap file, "ptsl" line, to
jibe with this npty setting.)

tried "pty_cnt=number" but apparently this isn't a legitimate variable
in solaris 2.6 kernel, only in 7 and 8 (?)

After making these changes (along with reconfig-reboot), the problem
still crops up for N > 4 (approx). I had given up hope on this solaris
2.6 box running this app, since we had a solaris 8 box that was behaving
better and could do the same thing. Or so I thought.

This AM, the solaris 8 box is now throwing similar errors in the same
circumstances. Error message is,


rcmd: socket: Cannot assign requested address


I've tweaked the "pty_cnt=number" setting in /etc/system, which seemed
the most obvious place to muck with the config for this, and there is no
joy. According to my reading on the problem (?) these resources are
dynamically managed on Solaris 8 (at least, far more so than for solaris
2.6), hence my hopes that the problem would not persist from this
environment.

Alas, it may not be the case. (?)

If anyone has some suggestions on other things I'm overlooking, it
certainly would be greatly appreciated..

Thanks!

---Tim Chipman
Received on Tue Feb 26 09:49:24 2002
