SUMMARY: SS1000E controller crisis

From: Seth Rothenberg (SROTHENB@montefiore.org)
Date: Tue Mar 28 2000 - 14:31:40 CST


I want to thank everyone who was kind enough to answer
my recent questions "SS1000E controller crisis" and "SS1000E Reboot".

The first conclusion is, design a system that can run
with failures until your regular Field Engineer is on duty.
Whether the alternate needs to be more experienced
is a contract issue I can raise - but I cannot control.
Second, have a terminal available so you can diagnose problems.
Third, have e-mail/web access so you can e-mail sun-managers right away :-)

Summary....

Four Sun Managers came up with the tests that took Sun nearly 20 hours to do...
It turns out that the system was prompting on /dev/ttya for instructions.

It seems that factory-programmed Controller boards for SS1000(E)
have two settings that are very important:
a) Use on-board NVRAM for hostid (which of course is wrong for us)
b) Use /dev/fb0 if present for console (which of course is right for us)
The instructions our FE got from SunSolve caused both of these to
be inverted - i.e., correct hostid from Motherboard 0 (we assume),
but communicate with /dev/ttya first.

Because we have a modem on /dev/ttya, and we also have
a monitor, the system came up only enough to ask "which one?"
but it only asks on /dev/ttya! The modem does not know how to answer.

For the transcript and the sunman index: the symptom is that the system
appears unable to boot, and the orange light remains solid.

When the FE unplugged the modem, the system booted correctly.

Lesson 2 - Plug a terminal into /dev/ttya.
In our next platform, /dev/ttya will actually be connected to
a terminal concentrator, and there will be a terminal on it also.

Our regular Sun FE later told me that the procedure is to boot with
only a keyboard and a monitor connected. Unplugging the modem,
which is part of the correct procedure, tells the system to use fb0.

Another conclusion is, always have a backup plan so that
you can call it quits for the day - and KICK THE FE OFF PREMISES.
After many hours, we took a break, and left a replacement FE working on it.
While we were gone, motherboard 0 was changed.
I was not there, I was not told about this until....
The following day, we had a watchdog reset running on the new motherboard.
Our regular FE came out, changed Motherboard 0 AND cpu 0,
and that seems to have fixed it...almost a week has gone by without problems.

Seth Rothenberg

Thanks to:
<John.Julian@galegroup.com>

Ted Meng <mengx@nielsenmedia.com> who wrote
>>>Booting the system one board at a time (pulled all boards
>>>out and add them back one by one) sorted out a bad board.
(This would have worked, because we only had one modem hooked up)

Dave Evans <daevans@us.oracle.com> who wrote
>>>If the screen doesn't come up try STOP-N. One of the boards may
>>>be set to /dev/ttya instead of the head.
(I will try the STOP-N sometime. This probably would have shown us something)

LEIF.H.ERICKSEN@msg.ameritech.com who wrote
>>>you need an ascii terminal or laptop with a free serial port to hook to the
>>>A port of the 1st sys bd. turn key from off to lightning bolt. if
>>>nothing comes out and you are sure the connection is good. slide out the
>>>first cpu and connect to the second bd. if you try booting cdrom use
>>>solaris 2.5.1 or better in order to see the ssa's. off the cd run
>>>format|read|analyse to check the boot disk.
(Again, this would have removed the board with the modem on it)

Val Popa <vpopa@tsg.eds.xerox.com> who identified
that the system is prompting for instructions....

>>>Your system probably has 2 system boards. If this is the case bear in
>>>mind the following facts:

>>>The system master (sys. brd. in slot 0) is a standard system board.
>>>However, it is NOT recommended to swap the s.b. master for the purpose of
>>>troubleshooting. The system master is configured to fulfill minimum
>>>requirements, and if swapped with another brd. of lesser configuration,
>>>results may be misleading.

>>>A. If PROM rev = or < 2.11
>>>Uses master-nvram scheme to select the system master board. In multiple
>>>board configs., when the system is powered on the first time OBP may
>>>prompt you to select a sys. brd. to become the master. This prompt will
>>>appear ONLY IF:
>>> -The system does not recognize ANY brd. as the master
>>> -More than one brd. is recognized as being qualified to function
>>> as the master.
>>>
>>>B. If PROM = or > 2.13
>>>Uses auto-master scheme instead of master-nvram scheme to select the
>>>system master board and then nvram info. is automatically propagated to all brds.
>>>If there is no system hardware error, OBP will select the POST master
>>>board ( or the lowest board that has a functional CPU) as the system master brd.
>>>The auto-master scheme ignores the status of nvram master or slave.
>>>If any system hardware error occurs after power on, then OBP will use
>>>the master-nvram scheme to select a system master as described in A.
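
The selection rules Val describes can be modelled roughly as follows. This is
only a sketch of the description above, not OBP's actual code: the Board
fields, the select_master name, and the use of None to mean "OBP prompts on
the console" are assumptions made for illustration. PROM 2.12 is rejected
because the description does not cover it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Board:
    slot: int            # backplane slot number
    nvram_master: bool   # NVRAM marks this board as master (assumed flag)
    cpu_ok: bool         # board has at least one functional CPU


def parse_rev(rev: str) -> Tuple[int, ...]:
    """Turn a PROM revision string such as '2.11' into (2, 11)."""
    return tuple(int(part) for part in rev.split("."))


def master_nvram(boards: List[Board]) -> Optional[int]:
    """Scheme A: exactly one NVRAM master wins; otherwise OBP prompts."""
    masters = [b.slot for b in boards if b.nvram_master]
    return masters[0] if len(masters) == 1 else None


def select_master(prom_rev: str, boards: List[Board],
                  hardware_error: bool = False) -> Optional[int]:
    """Return the master slot, or None where OBP would prompt (assumed)."""
    rev = parse_rev(prom_rev)
    if rev <= (2, 11):
        return master_nvram(boards)
    if rev >= (2, 13):
        if hardware_error:
            # Scheme B falls back to scheme A after a hardware error.
            return master_nvram(boards)
        # Simplification: lowest slot with a functional CPU.
        working = [b.slot for b in boards if b.cpu_ok]
        return min(working) if working else None
    raise ValueError("PROM revision not covered by the description: " + prom_rev)
```

With two boards that both claim to be master, select_master("2.11", boards)
returns None, which would correspond to the kind of prompt described above,
while the same boards under PROM 2.13 with no hardware error yield slot 0.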

"Buddy Lumpkin" <blumpkin@ijapan.com> wrote:
>>>The most common reason for watchdog resets is having too many sbus devices
>>>on a single board. Or just overloading any bus on the system. Make sure to
>>>spread them across as many of the boards as possible. If you look at the
>>>specs, the backplane can handle a lot, but it is EXTREMELY easy to overload
>>>these things ( we proved it in the Solaris Performance Management class, in
>>>fact an SS1000 was the case study!).
>>>
>>>So the goal is, spread the cards across several boards paying no attention
>>>to what is *pretty*.
(We might have too many SBUS cards, but that is not being addressed now....
it will be addressed in our next hardware platform.)
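
Buddy's advice amounts to a round-robin placement. A minimal sketch, assuming
hypothetical card names and board numbers (none of these come from our actual
configuration), and ignoring per-card bandwidth, which a real layout would
have to weigh:

```python
from typing import Dict, List


def spread_cards(cards: List[str], boards: List[int]) -> Dict[int, List[str]]:
    """Deal SBus cards across system boards one at a time, round-robin."""
    if not boards:
        raise ValueError("need at least one system board")
    placement: Dict[int, List[str]] = {board: [] for board in boards}
    for index, card in enumerate(cards):
        # Card 0 goes to the first board, card 1 to the next, and so on.
        placement[boards[index % len(boards)]].append(card)
    return placement


# Hypothetical example: five cards over two boards.
layout = spread_cards(["scsi-a", "scsi-b", "net-a", "net-b", "fb"], [0, 1])
```

No board ends up with more than one card above any other, which is the
"spread them out, never mind what looks pretty" rule in its simplest form.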

Original Question:

>Our primary server has been in [Sun]'s hands for 13 hours and no luck.
>I wonder if anyone has seen the following?
>
>We had disk controller errors, called Sun.
>Field Engineer brought a new controller. (1am)
>The first new controller did not boot. (3am)
>They got a new one delivered. (5am)
>It booted with the wrong hostid. Field Engineer called backline
>support, got a procedure. Now board 2 doesn't boot (7am).
>
>Now, an additional engineer is here with a 3rd replacement,
>and the system won't boot...the screen driver never comes up.
>Orange light on the front is solid. (2pm)


