SUMMARY: etherfind and loss of IP packets (partial)

From: J M Thompson (masato@access.digex.net)
Date: Tue Sep 14 1993 - 03:41:55 CDT


The jury is still out on this problem :-( but I wanted to summarize
what I have so far. This message is divided into the following
sections: original problem description, responses received to date,
some new information, and acknowledgements.

ORIGINAL DESCRIPTION

>I am trying to determine if an intermittent loss of IP packets
>could be caused by running the etherfind utility.
>
>The situation that occurred today is that shortly after starting
>an etherfind trace, we began experiencing intermittent loss of
>IP packets to and from the system that had the etherfind
>trace running. etherfind was started as follows:
>
>etherfind -i tr0 -v -x -l 256 -t \
>between hostname1 and ipaddrhost2 >file.out
>
>The external symptoms included:
>
>o Excessively long response time for applications running on the
> host with the etherfind trace.
>
>o ping commands to/from this system reported from 20 to 50 percent
> of the packets being lost. Example of the ping command is
>
> ping -s another.host 1000
>
>o The intermittent IP packet loss problem continued even after
> terminating the etherfind trace.
>
>The system configuration is SunOS 4.1.3 on a SUN690, single processor,
>with a token ring interface.
>
>At wits end, we rebooted the SUN690 and the problem went away.
>
>To further confuse the issue, we had run an etherfind trace on a
>DIFFERENT SUN690 without incident earlier in the day.
>
>Any help would be appreciated.

SUMMARY OF RESPONSES

Joel Shandelman writes:

>Although not documented [to my knowledge], it is recommended that the
>workstation acting as the sniffer/scope not monitor its own interface.
>Sun Advanced Admin concurs with this as well. This doesn't explain very
>well why the problem cleared up after a reboot but it still makes sense
>that a sniffer/scope shouldn't monitor its own actions.

Sumner K Hushing III writes:

>etherfind opens the interface in promiscuous mode, which grabs any old
>transaction that comes by. My experience with etherfind is that you
>must use it on a system other than the one you are debugging, since
>it will indeed affect operations. I'm surprised you had to reboot
>to recover, though. My 4.1.3 Sparc10's would recover as soon as I
>stopped etherfind.

In the situation described by the original posting, I was having
etherfind monitor its own interface. But I was able to also recreate
the symptoms of the problem by running etherfind on a third
box monitoring the traffic of two other boxes. (see NEW INFORMATION
section for more details)

Mike Raffety writes:

>It CAN ... if the host is already fairly busy, and/or there's a LOT of
>traffic for etherfind to capture.

I can't be certain in the case of the original problem, but in the
work to recreate the problem in a controlled environment, I was able
to get the problem to reappear while monitoring ping traffic between
two systems and at the same time invoking a telnet session from a PC.
Other than the usual system processes, etherfind was the only process
running on the system that was functioning as the monitor. And I don't
think 'ping -s hostname 1000 20' repeated after a three-second delay,
plus a single telnet session, is that heavy a load. (See the NEW
INFORMATION section for additional details.)

Hal Stern writes:

>you may be exhausting some kernel buffers when running
>etherfind, and when you're done the system doesn't
>recover because the buffers are leaked. check out
>the various patches for leaking mbufs and exhausting
>kernel memory on a 600MP.
>
>the fact that the problem is corrected after booting
>makes it appear that you're running out of mbufs.
>you run out, you start to drop packets. you'll
>use a *ton* of mbufs running etherfind because
>it goes and grabs every packet that it can

I searched the WAIS server at quake.think.com and reviewed the INDEX
file for /pub/sun-info/sun-fixes at sunsite.unc.edu, and found
references to the types of problems described above. According to the
write-ups I found, however, I should also have seen mbuf shortage
messages written to the console, or a panic. I did not encounter
either symptom in the original problem or in the attempts to recreate
it in a controlled environment.
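
One way to test the mbuf theory directly, rather than waiting for a
console message, would be to sample the mbuf counters on the monitoring
host while the trace runs. The following is only a sketch and was not
part of the tests described in this message. It assumes that
'netstat -m' on the host prints a line containing "requests for memory
denied", as BSD-derived versions of netstat do. The function name, the
log file name and the sampling interval are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a statistics command COUNT times, INTERVAL
# seconds apart, and print the lines matching PATTERN with a time stamp.
# Usage: sample_stats COUNT INTERVAL PATTERN COMMAND [ARGS...]
sample_stats() {
    count=$1
    interval=$2
    pattern=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]
    do
        stamp=`date`
        # Keep only the matching lines and prefix each with the time.
        "$@" 2>/dev/null | grep "$pattern" | while read line
        do
            echo "$stamp: $line"
        done
        i=`expr $i + 1`
        # Do not sleep after the final sample.
        if [ "$i" -lt "$count" ]
        then
            sleep "$interval"
        fi
    done
    return 0
}

# Assumed invocation: 20 samples, 15 seconds apart, appended to a log.
# sample_stats 20 15 "denied" netstat -m >>mbuf.log
```

If the "denied" count climbs while packets are being dropped and stays
high after etherfind is stopped, that would support the leaked-buffer
explanation. A count that stays flat would argue against it.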

NEW INFORMATION

Since the original problem occurred in the production environment,
work to recreate the problem has been carried out in a separate
environment for debugging purposes. In this environment I have

SysA - Sparc10, SunOS 4.1.3, token ring
SysB - Sparc10, SunOS 4.1.3, token ring
SysC - Sparc2, SunOS 4.1.3, token ring
PC1 - Compaq 486, MS/DOS 5.0, Windows 3.1, FTP, Inc. TCP/IP support,
      token ring

All of the above systems are on the same 16 Mbps token ring subnet.

I can get the problem symptoms to consistently reappear by doing
the following:

Execute the following on SysB

        while :
        do
                date | tee -a file.out
                ping -s SysC 1000 20 | grep "packet loss" | tee -a file.out
                sleep 3
        done

Start etherfind on SysA as follows:

        etherfind -i tr0 -v -x -l 256 -t between SysB SysC >trace.file &

AND start a telnet session from PC1 to SysB. Within 5 to 20 seconds of
issuing the telnet command, the ping command begins to report dropped
packets and response in the telnet session is poor.
Terminating etherfind does not clear the problem. As soon as
I reboot SysA, the ping command stops experiencing dropped packets.
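
For anyone repeating this test, the file.out log written by the loop on
SysB can be reduced to just the runs that lost packets. The following
is a minimal sketch, assuming each ping summary line has the form
"20 packets transmitted, 15 packets received, 25% packet loss" and is
preceded by the matching date line. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the ping runs in a log that lost packets.
# Assumes the log alternates a date line and a ping summary line such as
#   20 packets transmitted, 15 packets received, 25% packet loss
# Usage: lossy_runs LOGFILE
lossy_runs() {
    awk '
        /packet loss/ {
            # Find the field ending in "%"; adding 0 converts the
            # leading digits to a number and drops the sign.
            for (f = 1; f <= NF; f++)
                if ($f ~ /%$/) {
                    pct = $f + 0
                    if (pct > 0)
                        print stamp ": " pct "% loss"
                }
            next
        }
        # Any other line is taken to be the date stamp for the next run.
        { stamp = $0 }
    ' "$1"
}

# Assumed invocation:
# lossy_runs file.out
```

Lining these time stamps up against the times at which the telnet
session and etherfind were started and stopped makes it easier to see
which event the packet loss follows.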

To make it even more interesting, if instead of starting a telnet
session from PC1 to SysB, I start a telnet session from SysC to
SysB, *nothing happens*. The problem symptoms do not appear.

If I only run etherfind I do not experience any problems, and I use
the FTP, Inc. TCP/IP support software daily without any problems. The
problem seems to arise only when both are active. I am now pursuing
the problem with the respective software vendors.

ACKNOWLEDGEMENTS

I'd like to thank Joel Shandelman, Sumner K Hushing III, Mike Raffety
and Hal Stern for their responses.

-- 
Jim Thompson                     
email: masato@access.digex.net
daytime phone: 703-759-8252    


