Summary : Ultra SPARC1 Creator shuts down automatically

From: Sangamesh Biradar HCMC Vietnam (sang@hcmc.geoquest.slb.com)
Date: Wed Jan 22 1997 - 14:09:49 CST


Hello BBers,

Please accept my sincere thanks for your valuable information given within
3 hours of my question on SunBB.
This is incredible help from all over the world.

All of you pointed to the same area, the CPU fan, either directly or
indirectly through /var/adm/messages, which says precisely the following:
----------------------------------------------------------------------------
 Jan 21 21:07:10 myhost unix: WARNING: THERMAL WARNING DETECTED!!!
 Jan 21 21:07:20 myhost syslogd: going down on signal 15
----------------------------------------------------------------------------
This is caused by a defective CPU fan (mounted on top of the CPU inside the
pizza box) supplied by Sun.
The system shuts itself down when it detects over-temperature.
This leaves clear log messages in /usr/adm/messages.
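
A minimal sketch of the check most respondents suggested, written as a
shell function. The name thermal_warnings is made up for illustration, and
the pattern assumes the message text quoted above. The log path is passed as
an argument so the function can be pointed at /var/adm/messages or at a copy.

```shell
#!/bin/sh
# thermal_warnings FILE
# Print every line of FILE that mentions a thermal condition, ignoring case.
# Hypothetical helper; the pattern is based on the message quoted in the post.
thermal_warnings() {
    grep -i 'thermal' "$1"
}

# Typical use on the affected host (path taken from the post):
#   thermal_warnings /var/adm/messages
```

If nothing is printed, the log holds no thermal message and the fan is less
likely to be the cause.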

This is a well-known problem with Sun systems, and a replacement may be
obtained from your nearest Sun office.
GeoQuest users should go via Mr. Don P. Koenig at GQ-Houston.

Once again many thanks for sharing your experiences and knowledge.
Best Wishes,
Sangamesh
Systems & IT Manager
Schlumberger - GeoQuest
Vietnam, Cambodia & Laos
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SEE THE DIRECT RESPONSES BELOW
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
- John Stoffel <jfs@fluent.com>
Check that the fan on the CPU is working correctly. You can do this
by just popping the top and looking. If it's not working, you will
need to call sun and have them send you a replacement fan. This is a
known problem of the early Ultra 1 boxes, I've had over 10 fans die on
me so far.

- Jens Fischer <jefi@kat.ina.de>
Have a look at your /var/adm/messages file. I'm quite sure you will
find a message like "Thermal problem detected". This happens if the
internal fan mounted directly upon the CPU is not running (or at
least is not running fast enough). We have replaced 20% of these
fans in our Ultra systems within the last 6 months.
You should make sure to get a "new" model if you get a replacement.
The new models actually have the same part number and revision as
the old ones, but you can distinguish them by counting the blades.
The old ones have 5 blades, the new ones have 7.

- baldma@aur.alcatel.com (Mark A. Baldwin)
Check /var/adm/messages and I bet you will find some warnings concerning
"thermal" conditions. Basically, the fan that sits right on top of
the CPU in the machine is not working. When it gets too hot the machine
shuts itself down. Call sun and get a replacement fan.

- "Trevor Paquette" <tpaquett@aec.ca>
Look at /usr/adm/messages for any information.

- nate@lscpdx.latticesemi.com (Nate Nicholson)
Check your /var/adm/messages* files. See if you have any "Thermal
Shutdown" messages. We have approximately 25 Ultras. At least half of
them have had problems with their CPU fan. We have seen two failure
modes. With the first mode, the CPU fan just starts to howl. It gets
louder and louder, until you replace it. With the second mode of
failure, the CPU fan just stops spinning, the CPU overheats, and the
machine auto shuts down. It always leaves a message in /var/adm/messages
when it does this.
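
Since the advice here is to check /var/adm/messages* rather than only the
current log, a per-file count can show when the warnings started. This is
only a sketch: count_thermal is a hypothetical helper name, and the pattern
simply looks for the word "thermal" in any case.

```shell
#!/bin/sh
# count_thermal FILE...
# For each readable log file given, print "<file>: <count>", where <count>
# is the number of lines mentioning "thermal" (case-insensitive).
# Hypothetical helper, not a Sun-supplied tool.
count_thermal() {
    for f in "$@"; do
        [ -r "$f" ] || continue                  # skip missing or unreadable files
        n=$(grep -ci 'thermal' "$f" || true)     # grep -c prints 0 when nothing matches
        echo "$f: $n"
    done
}

# Typical use (the glob also picks up rotated logs):
#   count_thermal /var/adm/messages*
```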

- Jay Lessert <jayl@latticesemi.com>
We've currently got 17 Ultra1/170* hosts and had one develop a bad power
supply when it was about two months old. The symptoms were very much like
your description. Replacing the power supply would be the only fix; you
may still be under warranty.

The only other thing I can think of would be a bad fan over the CPU
module; we've had three of these die so far (Sun must have gone to the lowest
bidder on these fans) and the system shuts itself down when it detects
over-temperature. This leaves clear log messages in /usr/adm/messages,
though, and so is probably not your problem.

- James Ashton <James.Ashton@keating.anu.edu.au>
Have you checked /var/adm/messages for messages? It sounds to me like
the fan mounted directly on the CPU heatsink has failed and is running
slowly or not at all. If so, the message will look like:

    Dec 30 13:07:16 myhost unix: WARNING: THERMAL WARNING DETECTED!!!
    Dec 30 13:07:44 myhost syslogd: going down on signal 15

We've had two fans fail in three months on the same machine and the
hardware guy claims it's a known problem and that the replacement fans
are supposedly more reliable. What's the good of reliable silicon with
no moving parts when the CPU depends on an unreliable fan! Anyway, if
you are seeing this problem, I'd suggest you leave the machine off
until the fan is replaced or you could damage it.

- Casper Dik <casper@holland.Sun.COM>
Check /etc/power.conf. Perhaps the system is configured for autoshutdown.
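
One way to follow this suggestion is to print any autoshutdown entry from
the file. This is a sketch only: show_autoshutdown is a made-up helper name,
and it assumes an active entry is a line that begins with the keyword
autoshutdown. The exact field layout should be checked against power.conf(4)
on your release.

```shell
#!/bin/sh
# show_autoshutdown FILE
# Print the lines of FILE that start with "autoshutdown"; commented-out
# lines (starting with "#") are not matched. Prints nothing when there is
# no such entry. Hypothetical helper; the line layout is an assumption.
show_autoshutdown() {
    grep '^autoshutdown' "$1" || true
}

# Typical use (path taken from the post):
#   show_autoshutdown /etc/power.conf
```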

- sanjay@aur.alcatel.com (Sanjay), ellen@aur.alcatel.com (Ellen Spoonamore)
Check the /var/adm/messages files for any errors. Most of the Ultra
SPARC 1 machines that have done this in my workplace did so because of faulty
fans. The CPU overheats and the machine automatically shuts off before the
heat causes major damage. Crack open your unit, turn it on, and make sure
that all the fans are working properly; if not, call your vendor and ask
for a replacement fan.

- Ross Stocks <ROSS.STOCKS.PSD36651@nt.com>
Sounds like unreliable power. Check your power source (consider a UPS). If
there is no problem there, consider replacing the system's power supply.

- renan@cenpes.petrobras.gov.br (Renan Martins Baptista)
Just verify that rstatd is running. When Solaris configures its boot files,
there is no place where it starts that daemon. That is a flaw. The new
Ultra keyboard boot controller device depends on that daemon in order to
perform the boot.

Read the Ultra 1 hardware reference to become familiar with that
new keyboard boot control. To solve your problem:

Just type the command:

/usr/lib/netsvc/rstat/rpc.rstatd

To keep it from happening again, put that command in your preferred boot file.

I think that what is going on is as follows:

Every time the machine shuts down, it loses the pointer that links the
rstatd daemon to the keyboard controller. So, every time it shuts down,
since the daemon is not started automatically, the problem will
return.

Try to do the following:

1. Synchronize your machine and halt it:

   sync <enter>
   sync <enter> (this second sync is pure superstition)
   halt

2. Boot the machine with a reconfiguration boot:

   boot -r

3. Log in as root:

   edit the file /etc/rc2.d/S20sysetup

   at the end of the file, put the line:

   /usr/lib/netsvc/rstat/rpc.rstatd

4. Synchronize it again and reboot again with a reconfiguration boot
   (boot -r), and keep it under observation for a long period.
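
Step 3 can be made repeatable, so that the line is not added twice if the
edit is run again. The following is a sketch under stated assumptions:
append_once is a hypothetical helper, it assumes a grep that accepts the
POSIX -q, -x and -F options, and the startup file should be backed up before
it is changed.

```shell
#!/bin/sh
# append_once LINE FILE
# Append LINE to FILE unless an identical line is already present.
# Hypothetical helper; requires a POSIX grep (-q quiet, -x whole line,
# -F fixed string).
append_once() {
    if ! grep -qxF -e "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}

# Typical use, as root (file and command taken from step 3):
#   append_once /usr/lib/netsvc/rstat/rpc.rstatd /etc/rc2.d/S20sysetup
```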
------------------------------


