SUMMARY - rlogin fails

From: Klaus Hering (fisch@uni-paderborn.de)
Date: Mon Oct 22 1990 - 03:18:12 CDT


  Dear Sun-Managers,

A few days ago I described a problem concerning rlogin failures:

> The machine is a 3/260 running SunOS 4.0.3. Its name is "plato".
> An `rlogin` works insofar as I get a prompt from plato, but any
> keystroke is answered with an immediate logout; `ftp` works, as do
> `rcp`, `rdate` etc. `telnet` doesn't finish the login procedure
> and gets stuck.
>
> If I call `shelltool /usr/hosts/plato` from sunview's root menu the same
> phenomenon appears, but if I open several shelltools in turn, all but
> the first can be used normally.

Several people told me that this is a well-known problem that has been
raised many times here (called "pty disease"). Still, I don't feel bad
about reviving the discussion, as I received many requests for this
summary.

You older inhabitants of netland, please close your eyes to the
ignorance of us newcomers! :-)

The problem was due to background processes that were still attached
to some pseudo tty, or to other processes that had not cleanly detached
from it.

Sun knows about the problem (#1014706, or 1022058 for 386i's) and
wanted to have it fixed in 4.0.3 for Sun3 and Sun4 (but apparently
didn't!).

Many people presented simple workarounds:

* Open a window on the console and close it to an icon to keep that port
  busy, so the next rlogin gets a "healthy" tty.
* On nongraphical terminals (with the same intention) type
  rlogin plato & rlogin plato & sleep 2; rlogin plato
  (one extra backgrounded process just to make sure).

Brian Field wrote me:
> Log in to the offending machine until you get in, and determine the
> pty the other (failed) logins are attached to. Now look for any processes
> that are still attached to those pty's. Nuke 'em, and then try logging
> back in.

Another solution from Mark Wallen:
> Just find the lowest numbered pseudo tty not being
> used (rlogin picks the lowest numbered free one).
> You can use ls -lrt /dev/tty[pq]? and w and who
> to try to find the lowest ttyp? not being used.
> Then, as root, simply mv /dev/ttypN X/dev/ttypN
> and try rlogin again. You can mv back after the
> system is rebooted.
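
Mark's search for the lowest-numbered unused pseudo tty can be sketched in C. This is only an illustration of the bookkeeping, not something any of the respondents sent: the function names `pty_slave_name` and `lowest_unused_pty` are hypothetical, and the layout of two banks (p, q) with units 0-9 and a-f is an assumption based on the /dev/tty[pq]? pattern above. The list of busy ttys has to be collected by hand from ls, w and who, and a stuck pty may well not show up in who at all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed naming scheme: banks p and q, units 0-9 then a-f. */
#define NBANKS 2
#define NUNITS 16
static const char BANKS[] = "pq";
static const char UNITS[] = "0123456789abcdef";

/* Write the slave device name of the index-th pty into buf.
 * Returns 0 on success, -1 if index is out of range or buf is too small. */
int pty_slave_name(int index, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    if (index < 0 || index >= NBANKS * NUNITS)
        return -1;
    n = snprintf(buf, len, "/dev/tty%c%c",
                 BANKS[index / NUNITS], UNITS[index % NUNITS]);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Return the index of the lowest-numbered pty whose slave name is not in
 * the in_use list (short names such as "ttyp0"), or -1 if all are busy. */
int lowest_unused_pty(const char *const in_use[], int count)
{
    char name[16];
    int i, j, busy;

    for (i = 0; i < NBANKS * NUNITS; i++) {
        if (pty_slave_name(i, name, sizeof name) != 0)
            return -1;
        busy = 0;
        for (j = 0; j < count; j++)
            if (strcmp(name + 5, in_use[j]) == 0)   /* skip "/dev/" */
                busy = 1;
        if (!busy)
            return i;
    }
    return -1;
}
```

Passing the short names of the ttys that look busy gives the index of the candidate to move aside; `pty_slave_name` turns that index back into a path.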

Others told me to have users always redirect the output of their
background jobs to a logfile or /dev/null. Seth Robertson suggested
making everyone who submits background jobs use the 'background' program
which he wrote (available by anonymous ftp from sol.ctr.columbia.edu in
pub/sun/background.shar), so that the problem does not appear in the
first place.

Don Lewis sent me a program that resets the mode bits on ptys.
Here it is:

----- Begin Included Message -----

One of the mode bits on ptys sometimes gets set and never gets reset.
Compile the included program. If the broken pty is /dev/ttyp0, then
run the program as "fixpty p0".

# This is a shell archive. Remove anything before this line,
# then unpack it by saving it in a file and typing "sh file".
#
# Wrapped by thrush!del on Fri Oct 19 17:42:58 EDT 1990
# Contents: fixpty.c
 
echo x - fixpty.c
sed 's/^@//' > "fixpty.c" <<'@//E*O*F fixpty.c//'
#include <stdio.h>
#include <fcntl.h>
#include <sys/termios.h>
main(argc, argv)
int argc;
char **argv;
{
        int fd, zero = 0;
        char buf[BUFSIZ];

        if (argc != 2) {
                fprintf( stderr, "syntax: %s port\n", argv[0]);
                exit(2);
        }
        sprintf( buf, "/dev/pty%s", argv[1]);
        if( (fd = open( buf, O_RDONLY ) ) == -1 ) {
                fprintf( stderr, "%s: can't open port %s\n", argv[0], buf);
                exit(2);
        }
        if( ioctl( fd, TIOCREMOTE, &zero ) < 0 ) {
                fprintf( stderr, "%s: can't do TIOCREMOTE(0) on port %s\n", argv[0], buf);
                exit(2);
        }
        exit(0);
}
@//E*O*F fixpty.c//
chmod u=rw,g=r,o=r fixpty.c
 
echo Inspecting for damage in transit...
temp=/tmp/shar$$; dtemp=/tmp/.shar$$
trap "rm -f $temp $dtemp; exit" 0 1 2 3 15
cat > $temp <<\!!!
      25 85 534 fixpty.c
!!!
wc fixpty.c | sed 's=[^ ]*/==' | diff -b $temp - >$dtemp
if [ -s $dtemp ]
then echo "Ouch [diff of wc output]:" ; cat $dtemp
else echo "No problems found."
fi
exit 0

----- End Included Message -----

Joachim Holzfuss included a program from archive-server@titan.rice.edu
that is supposed to cure pty disease as well. Unfortunately this program
was too long to be included in this summary, but I'd be happy to send it
to anybody who asks.

Last but not least I received a message from John C. Hasley in which he
forwarded a mail containing a source code patch written by Andy Sherman.
Andy's mail is included at the end of this summary.

Thanks to

   Adam Stein <Stein.Wbst129@Xerox.com>
   Albert Cheng <acheng@ncsa.uiuc.edu>
   Don Lewis <del@mlb.semi.harris.com>
   Brian Field <field@cs.pitt.edu>
   John C. Hasley <hasley@andy.bgsu.edu>
   Sumant Hattikudur <hattikud@cpswh.cps.msu.edu>
   Brooke Jarrett III <jarrett%mpl@ucsd.edu>
   Lisa <lisa%beldar@ucsd.edu>
   Charles <mcgrew@aramis.rutgers.edu>
   Mark Wallen <mrwallen@UCSD.EDU>
   Jeff Nieusma <nieusma@eclipse.Colorado.EDU>
   Seth Robertson <seth@ctr.columbia.edu>
   Tony Silva <tsilva@aaec1.uucp>
   Joachim Holzfuss <xphyhofu@ddathd21.bitnet>

who took the time to respond.

This list is really great.
Be prepared to hear from me more often ! |-)

        Klaus.

---------
From: andys@ulysses.att.com
To: sun-managers@eecs.nwu.edu
Subject: pty driver bug - WITH FIX!
Date: Thu, 07 Jun 90 22:42:16 EDT
Message-Id: <9006072144.aa19640@delta.eecs.nwu.edu>

For a change, a solution rather than a problem:

The following bug report with analysis and fix was sent to both Sun
and Solbourne. This should clear up at least *some* of the hung
pty problems in the Sun bug list. One of my colleagues is pretty sure
that this may solve the problem of mysterious rdump hangs that he and
others on this list have been seeing.

Those of you with source can apply the patch and rebuild your kernel.
Get the driver source from /usr/src/sys/os, apply the patch and place
the result in /usr/sys/os. When I hear from Sun I will find out if I
can redistribute patched binaries. If anybody on the list knows the
legalities *FOR SURE* (a Sun employee maybe?) feel free to let me
know.

Cheers,

Andy Sherman/AT&T Bell Laboratories/Murray Hill, NJ
AUDIBLE: (201) 582-5928
READABLE: andys@ulysses.att.com or att!ulysses!andys
What? Me speak for AT&T? You must be joking!

------- Forwarded Message

To: hotline@sun.com
Subject: pty driver bug - WITH FIX!
Date: Thu, 07 Jun 90 19:48:50 EDT

This report applies to the pty driver in 4.0.3. It can be observed
in sun3 and sun4 architectures. I believe that the bug may still
exist in 4.1, but I haven't upgraded yet.

This may be the same bug described in Bug Reference Number 1014706.
The description in the synopsis indicated that nobody had figured out
how to reproduce the bug. The bug I report here may be reproduced at
will with the attached code.

PROBLEM:
^^^^^^^

The intermittent behavior I observed was that a pty with background
processes still attached to it would sometimes become unavailable to
new opens by in.rlogind, in.telnetd, or a window system. What the
user would observe is that all attempts to use rlogin, for example,
result in an immediate "Connection closed" message. I attempted to
add an additional vhangup call in in.rlogind just after it forks (as
is done in the 4.3-tahoe code) but that did not unstick the pty. The
death (natural or otherwise) of the background process attached to the
pty will always unstick it. What was puzzling about the problem was
that a background process (such as Ingres) could run happily for
*DAYS* on a pty with lots of login/logout activity and then suddenly
become stuck for no apparent reason.

I have discovered that the pty becomes stuck when it has a background
process attached to it, *AND* a user exits a shell attached to it with
"stty 0". This will do it every time, as you can verify with the
following program.

/* hangit.c */
#include <stdio.h>

/* Fork a child that sleeps for an hour. This is to be
 * used to test weird behavior of pty's when background
 * processes have open files, the "brain-dead pty" bug
 */

main()
{
  int pid;

  if ( 0 == ( pid = fork() ) ) {
    fprintf(stderr, "Going to sleep for an hour\n");
    sleep( (unsigned) 3600 );
    exit(0);
  }
  else if ( -1 == pid ) {
    perror("fork");
    exit(-1);
  }
  else exit(0);
}

If you compile the program into hangit, you can then recreate the
problem by doing the following from any shell in an rlogin or telnet
session.

$ nohup hangit & # nohup only required for sh, not ksh and csh
$ stty 0

WHAT THE BUG IS
^^^^^^^^^^^^^^^

The driver error is contained in the module /usr/sys/os/tty_pty.c.
When the TCSETAW ioctl (issued by /bin/stty) sets the speed to zero,
the pty flag PF_SLAVEGONE is set. This causes all further I/O to
return with an error, hence in.rlogind closes master and slave.
If there is no background process, the driver close routines are
called for both master and slave. This releases all of the slave side
streams queues associated with the pty. When the slave is opened
again, the lack of queues is the signal to reset parameters in the pty
structure, which resets PF_SLAVEGONE and restores a non-zero baud
rate. If background processes are attached to the slave, none of the
shell or rlogind closes bring its open count to 0, so the driver close
for the slave is never called, and queues are still attached to the
pty structure. When the slave is reattached by a subsequent rlogin,
the baud rate is still set to zero and PF_SLAVEGONE is still up, since
the driver open code is not called. All subsequent I/Os return with
errors and rlogind exits. Also any slave-side ioctl will see the baud
rate as zero and make *sure* that PF_SLAVEGONE stays asserted.
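
The sequence described above can be followed in a small user-space model. This is a toy sketch and not the SunOS driver: only the name PF_SLAVEGONE comes from the report, while the struct, the function names and the 9600 baud value are assumptions made for illustration. It shows why the flag survives when a background job keeps the slave open.

```c
#include <assert.h>

#define PF_SLAVEGONE 0x01   /* name from the report; the value is arbitrary */

struct toy_pty {
    int slave_opens;   /* open count on the slave side */
    int has_queues;    /* nonzero while slave streams queues are attached */
    int flags;
    int baud;
};

/* Slave open: the driver open code, which resets the state, runs only
 * when no queues are attached. */
void slave_open(struct toy_pty *p)
{
    if (!p->has_queues) {
        p->has_queues = 1;
        p->flags &= ~PF_SLAVEGONE;
        p->baud = 9600;            /* any non-zero rate will do */
    }
    p->slave_opens++;
}

/* Slave close: the queues are released only on the last close. */
void slave_close(struct toy_pty *p)
{
    assert(p->slave_opens > 0);
    if (--p->slave_opens == 0)
        p->has_queues = 0;
}

/* Models "stty 0": a speed of zero sets PF_SLAVEGONE. */
void stty_zero(struct toy_pty *p)
{
    p->baud = 0;
    p->flags |= PF_SLAVEGONE;
}

/* I/O fails while PF_SLAVEGONE is set. Returns 0 on success, -1 on error. */
int slave_io(const struct toy_pty *p)
{
    return (p->flags & PF_SLAVEGONE) ? -1 : 0;
}

/* One session that ends with "stty 0", followed by a second login on the
 * same pty. Returns the result of I/O in the second session. */
int second_login_io(int background_job)
{
    struct toy_pty p = {0, 0, 0, 0};

    slave_open(&p);                 /* login shell */
    if (background_job)
        slave_open(&p);             /* e.g. hangit keeps the slave open */
    stty_zero(&p);
    slave_close(&p);                /* shell exits */

    slave_open(&p);                 /* next rlogin gets the same pty */
    return slave_io(&p);
}
```

In this model `second_login_io(0)` succeeds because the last close releases the queues and the next open resets the state, while `second_login_io(1)` fails because the background job's open keeps the queues attached.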

THE FIX
^^^^^^^

This problem is fixed by a small change to the master side open
routine. There are two ways to solve the problem. I have elected to
mark the master as busy if the slave side is still busy, as indicated
by streams queues being attached to the pty structure. One could
emulate the BSD driver by just taking the opportunity to reset
PF_SLAVEGONE and restore the baud rate, but I believe that presents a
security hole, since the background processes, despite a vhangup, can
still open /dev/tty and write or read it, to the detriment of the new
session. The following diff applied to tty_pty.c will fix the
problem.

*** tty_pty.c.orig Thu Jun 7 19:09:47 1990
--- tty_pty.c Thu Jun 7 19:12:54 1990
***************
*** 618,631 ****
                                  /* XXX - should be EBUSY! */
          if (pty->pt_flags & PF_WOPEN)
                  wakeup((caddr_t)&pty->pt_flags);
! if (((q = pty->pt_ttycommon.t_readq) != NULL) &&
! ((q = q->q_next) != NULL)) {
                  /*
! * Send an un-hangup to the slave, since "carrier" is
! * coming back up.
                   */
! (void) putctl(q, M_UNHANGUP);
! (void) putctl1(q, M_CTL, MC_DOCANON);
          }
          pty->pt_flags |= PF_CARR_ON;
          pty->pt_send = 0;
--- 618,629 ----
                                  /* XXX - should be EBUSY! */
          if (pty->pt_flags & PF_WOPEN)
                  wakeup((caddr_t)&pty->pt_flags);
! else if (((q = pty->pt_ttycommon.t_readq) != NULL)) {
                  /*
! * Busy controller because slave still open somewhere
! * This avoids security hole in vhangup & /dev/tty.
                   */
! return(EIO);
          }
          pty->pt_flags |= PF_CARR_ON;
          pty->pt_send = 0;
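
The effect of the patched master open can be sketched the same way in user space. Again this is a toy model under stated assumptions, not driver code: `struct toy_master`, `master_open` and `allocate_pty` are hypothetical names, `slave_queues` stands in for a non-NULL pt_ttycommon.t_readq, and the allocation loop assumes that a daemon such as in.rlogind simply tries the next pty when a master open fails.

```c
#include <assert.h>
#include <errno.h>

struct toy_master {
    int master_open;    /* controller side already in use */
    int slave_queues;   /* slave side still has streams queues attached */
};

/* Patched master open: refuse the pty while the slave is still open
 * somewhere. Returns 0 on success or an errno value on failure. */
int master_open(struct toy_master *p)
{
    if (p->master_open)
        return EIO;     /* assumed: controller already open */
    if (p->slave_queues)
        return EIO;     /* the new check added by the patch */
    p->master_open = 1;
    return 0;
}

/* Assumed daemon behaviour: scan the ptys in order and take the first one
 * whose master opens. Returns its index, or -1 if none is available. */
int allocate_pty(struct toy_master *ptys, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (master_open(&ptys[i]) == 0)
            return i;
    return -1;
}
```

With the check in place, a pty that still has a background process attached is skipped, and the new session lands on the next free one instead of a slave with PF_SLAVEGONE set.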

------- End of Forwarded Message


