SUMMARY: problems with client on 2nd ethernet controller

From: Colin Macleod (cmacleod@maths-and-cs.dundee.ac.uk)
Date: Fri Feb 21 1992 - 23:27:44 CST


I'll summarize this since Alastair Davie (ajdavie) who posted the
original query is now on holiday. The problem:
___________________________________________________________________________
> We are trying to move one of our diskless 3/60's from the first
> ethernet controller to the second ethernet controller on a sun 3/280.
> We already have a number of clients on this part of the net including
> another 3/60. The machine ( called rum ) suffers problems when trying
> to boot from the Ethernet, symptoms that appear are...
>
> It is very slow transferring boot.sun3.sunos4.1.1 from tftp server
> ( the sun 3/280 ) sometimes 5+ minutes.
....
>
> It then proceeds to boot as normal till it gets to ps -U in /etc/rc
> where it hangs and complains of
> NFS server spirit-gw not responding still trying
> etherfind at this point gives
> UDP fragment offset=7400, length=916 from rum to spirit-gw
> every 20-30 seconds.
>
> When this is commented out it continues to the end of /etc/rc.local
> where it hangs on ldconfig with the same complaint.
> etherfind at this point gives
> UDP fragment offset=7400, length=916 from rum to spirit-gw
> every 20-30 seconds.
>
> If this is commented out it finishes /etc/rc and then init hangs trying
> to update /etc/ttys once again complaining about spirit-gw not
> responding.
> etherfind at this point gives
> UDP fragment offset=1480, length=844 from rum to spirit-gw
> every 20-30 seconds.
>
> At this point you can open remote xterm's on rum and running ps shows
> that swapper and pagedaemon are in a state of non-interruptible device
> waits.
>
> We also tried swapping the ethernet address of our other 3/60 on this
> part of the net and rum over with the result that the other machine
> booted fine while thinking it was rum and the machine causing the
> problem still hung in the same places.
>
> rum still boots ok when reconnected to the first ethernet controller.
> The problem still occurs with the generic kernel.
> We even went as far as removing it as a client and reinstalling it,
> but still no joy.
> We are using SunOS 4.1.1
> We are running NIS, the NIS master is a 3/160, with everything set up
> as it should be.
>
> More thanks in advance,
> Alastair Davie
> ajdavie@uk.ac.dun.mcs
____________________________________________________________________________

The Solution:

Further investigation turned up:

rum# netstat -i
Name  Mtu   Net/Dest  Address    Ipkts  Ierrs  Opkts  Oerrs  Collis  Queue
le0   1500  campus    rum        13666  0      10463  1834   594     0
lo0   1536  loopback  localhost  141    0      141    0      0       0

- very high rate of output errors on le0 (1834 in 10463 output packets,
  about 17.5%), while input errors are zero.
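
For anyone wanting to check the same thing on their own machines, here is
a minimal sketch that works out the error rates from "netstat -i" text.
The sample is copied from the output above; the function name error_rates
is made up for this example, and it assumes whitespace-separated columns
under a single header row.

```python
# Sketch: per-interface error rates from `netstat -i` text.
# SAMPLE is copied from the output above; error_rates() is a
# hypothetical helper, not part of any SunOS tool.
SAMPLE = """\
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
le0 1500 campus rum 13666 0 10463 1834 594 0
lo0 1536 loopback localhost 141 0 141 0 0 0
"""


def error_rates(text):
    """Return {interface: (input_error_rate, output_error_rate)}."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rates = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        ipkts, opkts = int(row["Ipkts"]), int(row["Opkts"])
        # Guard against interfaces that have seen no traffic yet.
        in_rate = int(row["Ierrs"]) / ipkts if ipkts else 0.0
        out_rate = int(row["Oerrs"]) / opkts if opkts else 0.0
        rates[row["Name"]] = (in_rate, out_rate)
    return rates


if __name__ == "__main__":
    for name, (i, o) in error_rates(SAMPLE).items():
        print(f"{name}: input errors {i:.1%}, output errors {o:.1%}")
```

A one-sided result like this (output errors only) points at the
transmitting side of the interface rather than at general congestion.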

Spray to this machine from another was fine. Trying spray from the
problem 3/60 to a 4/20 (fast enough to catch all correct packets) gave:

rum# spray pimms
sending 1162 packets of lnth 86 to pimms ...
        in 1.5 seconds elapsed time,
        no packets dropped by pimms
        775 packets/sec, 65.1K bytes/sec
rum# spray -l 500 pimms
sending 200 packets of lnth 502 to pimms ...
        in 0.4 seconds elapsed time,
        no packets dropped by pimms
        480 packets/sec, 235.8K bytes/sec
rum# spray -l 1000 pimms
sending 100 packets of lnth 1002 to pimms ...
        in 0.2 seconds elapsed time,
        47 packets (47.00%) dropped by pimms
Sent: 490 packets/sec, 479.9K bytes/sec
Rcvd: 259 packets/sec, 254.3K bytes/sec
rum# spray -l 1500 pimms
sending 66 packets of lnth 1502 to pimms ...
        in 0.2 seconds elapsed time,
        66 packets (100.00%) dropped by pimms
Sent: 367 packets/sec, 538.8K bytes/sec
Rcvd: 0 packets/sec, 0 bytes/sec

So all short packets got through ok, but as length increases more get
corrupted or lost and at the maximum size of 1514 nothing gets through.
Trying this test to other destination machines gives the same result.
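
To make the trend easier to see, here is a rough sketch that pulls the
packet length and drop count out of a spray transcript. The transcript
lines are copied from the runs above; the regular expressions and the
name drop_table are my own and only assume the message wording shown
here.

```python
import re

# Only the lines that matter, copied from the spray runs above.
TRANSCRIPT = """\
sending 1162 packets of lnth 86 to pimms ...
        no packets dropped by pimms
sending 200 packets of lnth 502 to pimms ...
        no packets dropped by pimms
sending 100 packets of lnth 1002 to pimms ...
        47 packets (47.00%) dropped by pimms
sending 66 packets of lnth 1502 to pimms ...
        66 packets (100.00%) dropped by pimms
"""

SENT = re.compile(r"sending (\d+) packets of lnth (\d+)")
DROPPED = re.compile(r"(no|\d+) packets(?: \([\d.]+%\))? dropped")


def drop_table(text):
    """Return [(length, sent, dropped, drop_fraction), ...] per spray run."""
    rows = []
    sent = length = None
    for line in text.splitlines():
        m = SENT.search(line)
        if m:
            sent, length = int(m.group(1)), int(m.group(2))
            continue
        m = DROPPED.search(line)
        if m and sent is not None:
            dropped = 0 if m.group(1) == "no" else int(m.group(1))
            rows.append((length, sent, dropped, dropped / sent))
            sent = length = None
    return rows


if __name__ == "__main__":
    for length, sent, dropped, frac in drop_table(TRANSCRIPT):
        print(f"lnth {length:5d}: {dropped:4d} of {sent:4d} dropped ({frac:.0%})")
```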

In practice this means that all nfs writes of more than a few hundred
bytes hang because the data is fragmented into a stream of packets
all of which are 1514 bytes except the odd one at the end, which is
the only one the server sees. The processes trying to write get stuck
in device wait state.
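
The fragment arithmetic can be sketched as below. This is only an
illustration under stated assumptions: an IP header of 20 bytes with no
options, a 14-byte Ethernet header, and that the "length" etherfind
reported (916) includes the 20-byte IP header, so that the last fragment
carries 896 bytes of data. The names fragments and frame_len are made up
for the example.

```python
ETH_HEADER = 14   # destination + source address + type field
IP_HEADER = 20    # IPv4 header without options (assumed)
MTU = 1500        # le0 MTU as shown by netstat -i


def fragments(ip_payload_len, mtu=MTU):
    """Return [(offset, data_len), ...] for one IP datagram.

    ip_payload_len counts the UDP header plus the UDP data.
    """
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_frag = (mtu - IP_HEADER) // 8 * 8
    out = []
    offset = 0
    while offset < ip_payload_len:
        size = min(per_frag, ip_payload_len - offset)
        out.append((offset, size))
        offset += size
    return out


def frame_len(data_len):
    """Length of the Ethernet frame carrying one fragment (no checksum)."""
    return ETH_HEADER + IP_HEADER + data_len


if __name__ == "__main__":
    # Assumed datagram size: last fragment at offset 7400 with 896 data bytes.
    for offset, size in fragments(7400 + 896):
        print(f"offset={offset:5d} data={size:5d} frame={frame_len(size)}")
```

Under those assumptions the datagram splits into five full fragments,
each travelling in a 1514-byte frame, plus one short fragment at offset
7400. That short fragment is the only one etherfind showed arriving.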

tgsmith%com.sun.east.spdev@com.sun said: "... I suspect that
the ethernet interface in the client or the server may be wandering
off towards one end of the ethernet spec (jitter, signal level, or
timing). I have seen this happen before; two machines can't
communicate reliably because one (or both) are slightly out of spec.
You might want to check the xcvrs and/or cables but I have a feeling
that it may be the crystal driving one of the ethernet chips."

This sounds like a plausible explanation. ***BUT*** when we put this
machine back on its original net it would talk to everything there ok,
on the new net it had problems talking to any other machine. We tried
turning off all the bridge and repeater connections to other buildings
and terminating the ether cable halfway in case it was too long; nothing
made any difference.

Anyway, we got Sun to change the main board, and it's now working fine!

Thanks for other suggestions from:

miker@sbocc.com - pointed out that more info was needed than in original
message.

kevin%kalli%com.sun.aus.fourx@com.sun said:
"Now you know why Sun does not support this. First of all, the second
ethernet controller is not very fast (MB->VME adaptor and all that) and
there are a lot of things that assume the first listed interface for
their work.
It can be done, but I'd recommend using the first interface for clients
if at all possible."
- We know the performance is not ideal, but we find it useful and it
does normally work.

stern%com.sun.east.sunne@com.sun said:
"check to make sure you're getting the right IP address,
and that the client's /etc/hosts file has the right IP
address in it for itself (if you've moved it, it might
be wrong -- on the "old" net). looks like the client
is changing its mind about its IP address as soon
as it gets a kernel."
- We've learned these things the hard way in the past!
______________________________________________________________________________
Colin Macleod, Technical Officer, Phone: 0382-23181 x4839
Dundee University Maths & Computer Science Dept.
23 Perth Road, Dundee DD1 4HN, Scotland. EMail: cmacleod@uk.ac.dund.mcs


