Sun-3/50 can't bolot - server problems? - SUMMARY

Date: Mon Mar 25 1991

[I know I misspelled "boot". I got more responses about that than
helpful information. I am leaving it that way so those who match
requests and summaries can find this. -tep]

The problem turned out to be hardware, but not the 3/50. I replaced:
        the transceiver (with a new tap)
        the computer, first with another 3/50, then with a SPARCstation IPC

The symptoms came down to the fact that no system tapped in that
location could receive a packet from any machine more than about 300
feet away on the cable! Machines on either side of the "ethernet
triangle" could talk to each other with no problems!

It seems that, on the morning the problems started, the building
and/or net was hit by lightning. It blew three power supplies and the
Ethernet interface on a system three buildings away (but on the same
Ethernet cable). They replaced the supplies and Ethernet interface
board on that system, but that system still couldn't see the net.
When they lifted the false floor, they could *smell* the remains of
the transceiver!

When they replaced that transceiver, all of the problems in our
building went away.

Boy am I confused!

   From: tots!tots.Logicon.COM!tep@ucsd.EDU
   Date: Wed, 20 Mar 91 15:16:11 PST
   Reply-To: ucsd!!tep
   X-Organization: Logicon, Inc., San Diego, California

   OK, it's been a long day, and I'm still stuck.

   Environment: one Sun 3/180 server, four 3/50 clients, SunOS 3.5.

   I came in this morning and my workstation (galt) was screenblanked and
   did not respond to anything (including L1-a). It behaved as though the
   server was down, but I checked the server (it was up) before I rebooted galt.

   The other three clients are fine, although I have *not* tried to
   re-boot them.

   When trying to boot, galt never got any response to his RARP request.
   I watched with etherfind -rarp on the server and I saw the requests
   from galt, but saw no responses to the RARP request.
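
For readers who have not looked at RARP on the wire, here is a minimal
sketch of the broadcast frame a diskless client sends, laid out per
RFC 903. The MAC address in the usage note is a made-up example (only
the 8:0:20 Sun prefix is real), and the helper names are mine, not
anything from SunOS. It only builds the bytes; it does not send them.

```python
import struct

ETHERTYPE_RARP = 0x8035      # EtherType assigned to RARP (RFC 903)
RARP_REQUEST = 3             # opcode 3 = "request reverse"
BROADCAST = b"\xff" * 6      # RARP requests go to the Ethernet broadcast address


def mac_to_bytes(mac: str) -> bytes:
    """Parse an /etc/ethers style address such as '8:0:20:1:2:3'."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6:
        raise ValueError("expected six colon-separated hex bytes")
    return raw


def build_rarp_request(client_mac: str) -> bytes:
    """Return the 42-byte Ethernet frame carrying a RARP request."""
    mac = mac_to_bytes(client_mac)
    eth_header = BROADCAST + mac + struct.pack("!H", ETHERTYPE_RARP)
    # hardware type 1 (Ethernet), protocol 0x0800 (IP), address lengths 6 and 4
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_REQUEST)
    # The client knows its own hardware address but not its IP address,
    # so both protocol-address fields are left as zeros.
    body += mac + b"\x00" * 4    # sender hardware / protocol address
    body += mac + b"\x00" * 4    # target hardware / protocol address
    return eth_header + body
```

Calling build_rarp_request("8:0:20:1:2:3") gives the request a server
should see. A reply has the same layout with opcode 4 and the target
protocol address filled in, and it must travel back across the cable
to the client, which is the half of the exchange that was missing here.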

   The portmapper, ypserver, ypbind, rarpd, inetd, rpc.lockd, rpc.statd,
   etc. were all running on the server. The /etc/services, /etc/servers,
   and /etc/rpc files are all over 1 month old.

   I have rebooted the server, replaced and un-replaced the inetd with an
   older version from one of the other servers (no effect).

   I checked /tftpboot; the dir is unmodified since before the last
   successful boots of the clients. The dir was last changed three months
   ago, and the clients have all booted in the last three days. The
   in.tftpd and the ndboot.* all show the distribution date (Nov 87).

   The /etc/nd.local file is also three months old. I can mount the
   client's root on the server; it fsck'ed OK.

   I started a rarpd on one of the other clients, and now the poor galt
   machine knows its internet address, but now the server fails to respond to
   the tftpboot requests! I now have 9 in.tftpd daemons on the server.
   Apparently the tftp daemon gets started, but never responds, and
   another daemon gets started when the client times out and re-requests.
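
The request that keeps spawning those daemons can be sketched as
follows. This assumes the usual Sun convention that the boot file under
/tftpboot is named after the client's IP address in uppercase hex; the
post itself does not state this. The address 192.9.200.1 and the
function names are hypothetical. The packet layout is the TFTP read
request (RRQ) from RFC 1350.

```python
import socket
import struct

TFTP_RRQ = 1   # TFTP opcode 1 = read request


def boot_file_name(ip: str) -> str:
    """Client IP address as eight uppercase hex digits (assumed Sun naming)."""
    return socket.inet_aton(ip).hex().upper()


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP RRQ: opcode, filename, NUL, transfer mode, NUL."""
    return (
        struct.pack("!H", TFTP_RRQ)
        + filename.encode("ascii") + b"\x00"
        + mode.encode("ascii") + b"\x00"
    )


# Hypothetical client address, for illustration only.
name = boot_file_name("192.9.200.1")
packet = build_rrq(name)
```

Each time the client gives up waiting and sends this packet again,
inetd sees a fresh datagram on the tftp port and starts another
in.tftpd. That would explain the pile of daemons if the server's
replies never reach the client.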

   Ypcat of the ethers and hosts maps shows everything A-OK. The ethers
   file changed 2 months ago, and some blank lines were removed from the
   host table this morning. (Restoring yesterday's host and ethers files,
   followed by re-making the yp maps, made no difference.)

   The server has old disks and crashed recently with no apparent damage.
   We see occasional "disk sequencer error" messages.

   The server was re-booted this morning to install a new kernel (more
   text table entries). I have the same problems with the old and new kernels.

   What has happened to the server that has caused it to lose the ability
   to boot this client? I am afraid to take the other clients down, as I
   doubt that they would reboot.

   I can't find any configuration errors; is it possible that some
   critical piece of software has become corrupt on disk? What is it that
   can make both rarpd and tftpboot fail, but in different ways? Remember
   that inetd *is* starting the tftpd, but tftpd cannot seem to respond
   and hangs.

   *Sigh* The network *might* be the computer :-(
