SUMMARY: NFS problem or something else?

From: Ralph Dell (RALPH@mail.co.catawba.nc.us)
Date: Tue Nov 25 1997 - 09:45:38 CST


The short answer is it was a network problem. Here is the original
question and then some nitty gritty details.

Help,
  I have a Sparc 1000, Solaris 2.5.1, recommended patches are fairly
current, using NIS.
  About 8:20 this morning users connected to the server through Sun
workstations or PCs with Exceed 5 lost their connections. Whatever they
were doing froze. After some initial investigation I rebooted the server
and that did not fix the problem. I can ping the workstations and they can
ping the server. Telnet and rlogin work from a PC/workstation to the
server. When I fire off Exceed from a PC I never get the CDE login
screen. If I reboot a workstation I can log in as root and see my NFS
mounted file systems. I can cd to an NFS directory, but if I do an "ls" I
get one message: "NFS server earth not responding still trying". No big
deal, I've seen that before. I've done an nfs.server stop and start. That
doesn't help. All my NFS daemons are running: mountd, lockd, nfsd, statd.
Have I missed any? There are no errors in the server's messages file.
  On a workstation (Ultra 140, Solaris 2.5.1) I have found this in
the messages file.
inetd[119]: yp_all - RPC clnt_call (transport level) failure: RPC: timed out.

Pings work both ways with name and address. The system crashed on
Friday because of a failure in my UPS, but we seem to have that resolved
and the system was running on the UPS all weekend without any
problems.

Any ideas on what my problem is?

I spent the bulk of the day on the phone with my local tech support, and
then with Sun's tech support, before we found a solution. I never got a
chance to read any replies till this morning but Glenn Satchell hit the nail
on the head.
After we switched the ethernet cable to another wall port we were
able to re-establish communications with the workstations/PCs. The
network is in another department's hands and the switches are MAG
ATM, 740 for the backbone and 280's for workgroups. When the
switches were reset last night one of them didn't come back and had to
be replaced.
  The conclusion that was arrived at yesterday was that there was a
hub, port, or cable problem. Traffic that didn't put a load on the
network, like ping, worked just fine, but when we tried to do something a
little more intensive, like use NIS or execute a command on an NFS file
system, the network couldn't handle it.
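
For anyone chasing the same symptom, here is a rough sketch of a host-side check. It is not what was actually run that day. Small pings can succeed while heavier traffic fails, so it can help to look at the interface error counters and to ping with larger packets. The helper name flag_iface_errors is hypothetical, and the column positions assume the Solaris `netstat -i` layout (Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue).

```shell
# Print interfaces whose input errors, output errors, or collisions are
# nonzero. Reads `netstat -i` style output on stdin.
# ASSUMPTION: Solaris column order, where Ierrs is field 6, Oerrs is
# field 8 and Collis is field 9. Adjust the field numbers elsewhere.
flag_iface_errors() {
    awk 'NR > 1 && NF >= 9 && ($6 > 0 || $8 > 0 || $9 > 0) {
        printf "%s ierrs=%s oerrs=%s collis=%s\n", $1, $6, $8, $9
    }'
}

# Usage on a live system (shown as comments, not executed here):
#   netstat -i | flag_iface_errors
#   ping -s earth 1472 20    # Solaris syntax: 20 packets of 1472 data bytes
```

Counters that keep climbing on one interface, or large pings that drop while small ones get through, would point toward the cable, port, or hub rather than NFS or NIS.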
  Along the long rocky road to an answer, some of what we did was:
nfs.server and nfs.client stop and start, verify that the NFS daemons
were running, stop and start yp and verify those daemons were running,
rpcinfo, dfmounts, and ypwhich. We connected to a workstation with
Exceed, telnetted workstation to workstation, and switched to another
ethernet card on the server. Everything we looked at on the server
looked good, and most of what we looked at or tried on the clients worked.
And no complaints in my server's messages file. I've learned a fair amount,
including not to assume the network is healthy because ping and telnet
work.
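
The "verify the daemons were running" step can be sketched roughly as below. This is an illustration under assumptions, not the exact commands used. The helper name missing_rpc_services is hypothetical, the service list is a guess at what an NFS/NIS server should register, and the parsing assumes the usual `rpcinfo -p` layout with the service name in the fifth column. Note that this check passed in my case, which is exactly why it does not rule out a network fault.

```shell
# List expected RPC services that do NOT appear in `rpcinfo -p` output
# read on stdin. Prints one missing service name per line.
# ASSUMPTION: the service list below is illustrative for an NFS/NIS
# server; on Solaris an awk that supports -v (nawk) is needed.
missing_rpc_services() {
    listing=$(cat)
    for svc in nfs mountd nlockmgr status ypserv; do
        # awk exits 0 if some line has this service in column 5.
        printf '%s\n' "$listing" |
            awk -v s="$svc" '$5 == s { found = 1 } END { exit !found }' ||
            echo "$svc"
    done
}

# Usage on a live system (shown as comments, not executed here):
#   rpcinfo -p earth | missing_rpc_services
```

Empty output means every listed service is registered. It says nothing about whether the network can actually carry the traffic.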
I'm done rambling; here are the answers I received.

From Glenn Satchell
Sounds like a router, hub or switch may be having a hard time. The
yp_all message is usually an indication that the network failed. Try
resetting your hub(s) or switching ports, etc.

From Joel Lee
You need to bring the NFS server earth up. If it is up, you need to start
nfsd there. I suppose you are not using automount, right? If that's the
case, it's natural that your users who use Exceed would probably hang
as well.

I may not have been clear in my original post; I was in a hurry. The NFS
server was always up and the NFS daemons were running. Ralph

From Artur Shnayder
Try increasing the file descriptor limit on the NIS server. You can just
add the following line to /etc/system:
set rlim_fd_cur=512


