SUMMARY: Sparc 10 dual CPU and SunOS 4.1.3_U1 problem

From: Simon Yeo (syeo@ntmtv.com)
Date: Tue Aug 15 1995 - 03:52:50 CDT


Here is my original question:
---------------------------------
We have a SPARCstation-10 Dual CPU server here that is running SunOS 4.1.3_U1.

The problem is that the server hangs completely every few weeks. It drops
off the network and does not respond to anything. Nothing is logged in
/var/adm/messages or on the console, so there is really no indication of what
might cause the crash.

I called Sun and all they could tell me was that SunOS 4.1.3_U1 is NOT
reliable on the Dual CPU machine and that THEIR solution is to install
Solaris on it.

Can you tell me what I can do to prevent the crash without having to
install Solaris?

-----------
<= The Solutions =>

There seem to be a number of opinions, which I have put into three main categories:

1) Install the following patches - 101508, 101784, 101954

2) Remove one of the CPUs

3) Upgrade to Solaris

Since I could do neither 2) nor 3) for various reasons, and both 101508 and
101784 had been installed before, I installed 101954. At this point I have to
wait and see if that patch is going to work. It usually takes a few weeks
before the machine hangs.

-----------------------------------------------------------------------------
--------- replies from people ------------------------
------------------------------------------------------

What patches are installed? I'd suggest the kernel jumbo patch, 101508,
and there is also one for ethernet interface hangs, but I don't have
the patch number handy. These might help.

Other options are to remove one processor :-(

----------
Unfortunately, that seems to be the consensus answer these days. You can
generally "get by", but dual CPU systems want Solaris. The good news is
that Solaris works better on MPU machines - is there some particular thing
holding you back from switching?

---------
We have the same problem here - SS10 with dual SuperSPARC II processors (75
MHz) running SunOS 4.1.3_U1 Rev B. Since upgrading to 4.1.3_U1 B from
4.1.3 A and having new CPUs installed a few weeks ago (they were 2x50 MHz
CPUs but seemed quite stable under 4.1.3 A) we have had two crashes (one of
which trashed our financial database that took a day and a half to
recover!!). Prior to the CPU upgrade I had most of the recommended patches
relevant to our site installed from the SunSolve disk for May 95 (except
for the international libc jumbo patch).

We get support through Computervision (used to be called PR1ME) who resell
for Sun here. They confirmed that our configuration is not supported by
Sun. They suggested I apply the recommended libc jumbo patch and sent me
two recommended patches with more recent revisions than those on the May 95
SunSolve disk.

The patches are 101508-10 (sun4m kernel jumbo patch) and 101784-04
(rpc.lockd/rpc.statd jumbo patch). Since then we haven't had a crash - BUT
- it's only been a week and a half!

They suggested the following course of action:

--------------------
If problems still persist, one of the following lines of action is suggested:
- upgrade machine to Solaris (where 2 * SM71s are supported)
- replace the SM71s with the 51s until the machine can be upgraded - speed
        reduced
- remove 1 SM71 leaving 1 SM71 (this is supported - speed reduced)
- run "as is" - random crashes _may_ occur

----------
Sun is being boneheaded about this. The problem you're having sounds as though
traffic is shutting down your LANCE chip, not, I repeat not, anything to do
with your CPUs. To prove this to yourself, try the following:

prompt> etherfind -ip

If this takes a small while and then you see output, then you've been afflicted
by the fact that Sun's ethernet drivers are not robust in the face of large
packets from IPX stacks, among other things.

You'll need to hunt for the patch which has the keywords "le0 hang", tell folks
the architecture and revision, and regen your kernel. You should be fine from
then on.

If your machine cannot be logged into from the console, then it's a different
problem. Talk to your Sun representative and ask them to send you the recommended
patches for ROSS HyperSparc CPUs. The logic here is that Sun supports dual CPUs
in a SunOS environment; the HyperSparc chip is proof. THOSE patches, plus some
jumbo patches dealing with asynchronous memory hangs, should do the trick.

Why do I know this? I've run 20 compute servers with dual Vikings, and 4 HyperSparc
CPUs for about a year and a half. Barring blowing a disk now and then, I've had no
troubles, and will move to Solaris when the flurry of patches for 2.5 has died
down to a dull roar. ;-D Hope that this helps. Peace.

---------
We have found that Sparc 10's will sometimes 'hard hang' during periods
of high network activity. This happens more frequently under SunOS
than Solaris, but it does happen with either OS. The symptom is that
you cannot 'break' (or L1-A) to get back to the monitor prompt -- you
have to power off and back on again to get it unwedged. If this is
your symptom, then you should try using a different ethernet interface
besides le0. This usually means adding an Sbus card (we use the FSBE/S
which also gives a SCSI connection).

------------
We have two SS10 4CPU machines that have/had the same problem. The
frequency varied from twice a day to once a month or so, depending on the
code. I personally believe that two CPUs get into a kernel deadlock, but
haven't proved it.

# Can you tell me what I can do to prevent the crash without having to
# install Solaris?
I bit the bullet and updated one of them... seems to have fixed it, but I
won't know for sure till it's been running some heavy code when the students get
back in late August.

----------
I can't tell you what is causing the crash. However, I have
3 Sparc 10s with 2 CPUs each. They all run 4.1.3_U1. Each has been
up for more than 400 days. Don't let Sun convince you that 4.1.X and
multiprocessor doesn't work.

-----------
We run SunOS 4.1.3 on a dual (50 MHz) CPU Sparc 10 and used to have this sort of
problem. Patches 100726-12 (or later, current version is -17) and 101408-01 did
the trick. 100726 is the sun4m kernel jumbo patch.
For SunOS 4.1.3_U1, I believe the equivalent patch to be 101508. The
latest revision I am aware of is 101508-10.
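
Since the replies quote patches as a base number plus a revision (101508-09
versus 101508-10, for example), it can help to check a hand-kept list of
installed patches against the revisions people recommend. The following is
only a minimal sketch: it assumes patch IDs always look like NNNNNN-RR, the
function names are made up for illustration, and the installed list has to be
typed in by hand (the script does not query the system in any way).

```python
import re

# Assumed patch ID format: six-digit base, dash, two-digit revision.
PATCH_ID = re.compile(r"^(\d{6})-(\d{2})$")


def parse_patch(patch_id):
    """Split an ID such as '101508-10' into ('101508', 10)."""
    match = PATCH_ID.match(patch_id.strip())
    if match is None:
        raise ValueError("not a patch ID: %r" % patch_id)
    return match.group(1), int(match.group(2))


def missing_or_outdated(installed, recommended):
    """Return the recommended IDs that are absent or at a lower revision."""
    # Remember the highest installed revision for each patch base number.
    have = {}
    for patch_id in installed:
        base, rev = parse_patch(patch_id)
        have[base] = max(rev, have.get(base, -1))

    todo = []
    for patch_id in recommended:
        base, rev = parse_patch(patch_id)
        if have.get(base, -1) < rev:
            todo.append(patch_id)
    return todo


if __name__ == "__main__":
    # Hypothetical installed list; the recommended revisions are the ones
    # quoted in the replies in this summary.
    installed = ["101508-09", "101784-04"]
    recommended = ["101508-10", "101784-04", "101954-06"]
    print(missing_or_outdated(installed, recommended))
```

With the hypothetical lists above, the script reports 101508-10 (installed
revision too old) and 101954-06 (not installed at all).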

--------------
One thing to try (if you haven't got them on already) is loading patches:

101508-09 & 101954-06

Both of these fix various hangs.

Also there is a patch for 4.1.3 (but NOT 4.1.3_U1) which fixes a
known hard hang problem with the SM51 CPU module. This has apparently
NOT been obsoleted by 4.1.3_U1. It's patch 101408-01;
however, I doubt that they'll consider porting it to 4.1.3_U1 if
the only system type experiencing the problem is an unsupported
configuration.

----------------
We have a similar problem with 690's, but they do a "watchdog reset".

Sun gave us the same information and we have never solved it. The best
we have done is to enable an automatic reboot on a watchdog reset in the
eprom.

--------------
Does it truly crash, or just not respond to packets on its ethernet interface?
The Sun interfaces will occasionally wedge if they detect packets on the network
larger than 4096 bytes; try running diags on the card and continuing on; that
always resets ours...

> I called Sun and all they could tell me was that SunOS 4.1.3_U1 is NOT
> reliable on the Dual CPU machine and that THEIR solution is to install
> Solaris on it.

Their solution to most 4.x problems is to install 5.x. :)
------------


