SUMMARY: 4/690 crashing constantly

From: Peter Sivo (peter@key.amdahl.com)
Date: Wed Feb 24 1993 - 18:41:07 CST


The original posting:

> We have three 4/690s in house running SunOS 4.1.2, and now two of them
> since last week (over a 5 day period) have crashed 8 times! The error
> messages are all pretty similar. These messages (below) show up on the
> console, usually LEDs 11 and 6 are on solid on the CPU board (or just 6),
> and the system needs to be power cycled to bring it back to life:

> --------------------------------------------------------------------------
> Level 7 VME interrupt not serviced
> Level 6 VME interrupt not serviced
> Level 5 VME interrupt not servicedLevel 11 VME interrupt not servicedLevel
> 7 interrupt not serviced
> Level 13 VME interrupt not serviced
> ve
>
>
> OR
>
> Level 7 VME interrupt not serviced
> Level 6 VME interrupt not serviced
> Level 1 VME interrupt not serviced
>
> OR

> LeveLevel 6 VME interrupt not serviced
> l 5 VME interrupt not serviced
> VME dropped an INT-ACK cycle
> MMU sfsr=b36:P Bus Access Error on supv data fetch at level 3
> M-Bus Timeout Error

Sorry the SUMMARY took so long, but it took 2 solid weeks to find out
exactly what the problem was.

I got a lot of responses and appreciate them all. Three people responded
with the same suggestion, and it got me thinking. Here is Hal Stern's reply,
which summarizes it best:

> i wonder if this is a problem with long bursts from the SMD controller,
> given other VME bus activity. the xd driver has a throttle that
> controls how long the bursts are from the SMD controller, which
> is 32 words by default.
>
> crank this down to 16 and see if the problem goes away..
>
> # adb -k -w /vmunix /dev/mem
> xdthrottle?W0t16
> $q
> # reboot

> these interrupt not serviced messages usually point to (a) a device
> hogging the bus, so a write to a valid device never makes it out
> or (b) an improperly terminated bus, so that interrupts are posted
> but never seen by the CPU.

Jeff Kays also pointed out the .h file the variable is in:

> We had this same problem when we upgraded our 490s to 690s. We found
> them to be related to 3rd party hardware and software using
> the VME bus. First, we had Xylogics 7890 IPI disk controllers.
> In the release notes for the controllers it states that due
> to the number of longwords transferred by the controller across
> the VME, it can cause timeouts on the VME bus. I think the XL_THROTTLE
> constant defined in xlreg.h was 32. We set it down to 8 and that got
> us going the first day after the install.........

(On our system under SunOS 4.1.2, it was in xdreg.h. I am not sure whether
that was a misspelling, or whether it lived elsewhere in another SunOS
version.)
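
For anyone who wants to look at the value before patching it, here is a
minimal sketch. The helper name throttle_cmds is made up for illustration,
and it assumes the kernel symbol is xdthrottle, as in Hal's reply. It only
prints the adb commands: show the value stored in the kernel file, write the
new decimal value, then quit.

```shell
#!/bin/sh
# Hypothetical helper (not part of SunOS): print the adb commands that
# display the xdthrottle value stored in the kernel file, patch it to the
# given decimal value, and quit.
# Assumes the kernel symbol is named xdthrottle, as in Hal's reply.
throttle_cmds() {
    # ?D  prints the 4-byte value from the object file in decimal
    # ?W  writes a 4-byte value; 0tN means N is decimal
    # $q  exits adb
    printf 'xdthrottle?D\nxdthrottle?W0t%d\n$q\n' "$1"
}

# Example use (not run here): feed the commands to adb, then reboot.
#   throttle_cmds 16 | adb -k -w /vmunix /dev/mem
```

Since ?W patches /vmunix rather than the running kernel, the new value only
takes effect after the reboot.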

Anyway, that helped me put the pieces together, but the real question
was: everything had been stable since August, so why all the crashes now?
If the SMD controllers did hog the bus, how come we never saw this before?

The answer came after I drew up a timeline of what had changed since Xmas.

Turns out that we had installed an Intelligent Routing Hub from Alantec (ever
heard of one?) that basically segments traffic, decreases collisions, and
buffers traffic. This was for testing purposes, to see if we wanted to
purchase one.

During the 1.5 months it was installed, we ended up adding 150 more IPXs
to our network, rammed a few more Network Coprocessors into our 4/690s,
and loaded the latest rev of the SW that drives those Network Coprocessors.
The enhancements to their SW were incredible and performance increased
substantially.

Still no crashes. Then, on Feb 12th, we removed the Alantec hub at 4:30am
and all hell broke loose starting at 7:30am. That is when the first 4/690
crashed with the messages above, and over the next 6 days 2 of our busiest
4/690s would crash 18 times between them!

Sun had no idea. Grumman (HW support) had no idea and ended up replacing
X number of boards, CPUs, etc., but nothing worked.

IT WAS THE INSIGHT OF THIS GROUP and the above emails that helped me realize
that taking out the Alantec hub after such an increase in traffic finally
drove the SMDs and Xylogics controllers to show their "ugly bug".

The fix? We are now testing another hub (Sigma) that does the same thing,
and so far we have had no crashes. BTW, I tried throttling the SMDs down
to 8 and was still able to crash our server. It seems that the traffic load
we carry *and* the network coprocessors' need for the bus together cause
the servers to crash constantly.

So be it. At least I know what the problem is and can deal with it.

Many thanks to the 2 gentlemen above and the following individuals for their
insight and for pointing out other things that could be wrong
(i.e. HW problems: CPU, ROSS modules, etc.).

If I missed anyone, my apologies....I did lose some mail during the crashes..

Jeff Kays <jkays@msc.edu>
John A. Murphy <jam@philabs.Philips.Com>
<slezak@llnl.gov>
Mike Raffety <miker@il.us.swissbank.com>
Phill <phill2@hivnet.ubc.ca>
Cheryl Cato <clc8347@nigel.tamu.edu>

------------------------------------------------------------------------------

Peter Sivo
Amdahl/Advanced Systems
peter@key.amdahl.com


This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:07:30 CDT