I want to thank everyone who was kind enough to answer
my recent questions "SS1000E controller crisis" and "SS1000E Reboot".
The first conclusion is, design a system that can run
with failures until your regular Field Engineer is on duty.
Whether the alternate needs to be more experienced
is a contract issue I can raise - but I cannot control.
Second, have a terminal available so you can diagnose problems.
Third, have e-mail/web access so you can e-mail sun-managers right away :-)
Four Sun Managers came up with the tests that took Sun nearly 20 hours to do...
It turns out that the system was prompting on /dev/ttya for instructions.
It seems that factory-programmed Controller boards for SS1000(E)
have two settings that are very important:
a) Use on-board NVRAM for hostid (which of course is wrong for us)
b) Use /dev/fb0 if present for console (which of course is right for us)
The instructions our FE got from SunSolve caused both of these to
be inverted - i.e., take the correct hostid from Motherboard 0 (we assume),
but communicate with /dev/ttya first.
Because we have a modem on /dev/ttya, and we also have
a monitor, the system came up only enough to ask "which one?"
but it only asks on /dev/ttya! The modem does not know how to answer.
For the transcript and the sunman index: the symptom looks as though the system
cannot boot, and the orange light remains solid.
When the FE unplugged the modem, the system booted correctly.
Lesson 2 - Plug a terminal into /dev/ttya.
In our next platform, /dev/ttya will actually be connected to
a terminal concentrator, and there will be a terminal on it also.
Our regular Sun FE later told me that the procedure is to boot with
only a keyboard and a monitor connected. Unplugging the modem,
which is part of the correct procedure, tells the system to use fb0.
Another conclusion is, always have a backup plan so that
you can call it quits for the day - and KICK THE FE OFF PREMISES.
After many hours, we took a break, and left a replacement FE working on it.
While we were gone, motherboard 0 was changed.
I was not there, and I was not told about this until....
The following day, we had a watchdog reset running on the new motherboard.
Our regular FE came out, changed Motherboard 0 AND cpu 0,
and that seems to have fixed it...almost a week has gone by without problems.
Ted Meng <email@example.com> who wrote
>>>Booting the system one board at a time (pulled all boards
>>>out and add them back one by one) sorted out a bad board.
(This would have worked, because we only had one modem hooked up)
Dave Evans <firstname.lastname@example.org> who wrote
>>>If the screen doesn't come up try STOP-N. One of the boards may
>>>be set to /dev/ttya instead of the head.
(I will try the STOP-N sometime. This probably would have shown us something)
LEIF.H.ERICKSEN@msg.ameritech.com who wrote
>>>you need an ascii terminal or laptop with a free serial port to hook to the
>>>A port of the 1st sys bd. turn key from off to lightning bolt . if
>>>nothing comes out and you are sure the connection is good. slide out the
>>>first cpu and connect to the second bd. if you try booting cdrom use
>>>solaris 2.5.1 or better in order to see the ssa's. off the cd run
>>>format|read|analyse to check the boot disk.
(Again, this would have removed the board with the modem on it)
Val Popa <email@example.com> who identified
that the system is prompting for instructions....
>>>Your system probably has 2 system boards. If this is the case bear in
>>>mind the following facts:
>>>The system master (sys. brd. in slot 0) is a standard system board.
>>>However, it is NOT recommended to swap the s.b. master for the purpose of
>>>troubleshooting. The system master is configured to fulfill minimum
>>>requirements, and if swapped with another brd. of lesser configuration,
>>>results may be misleading.
>>>A. If PROM rev = or < 2.11
>>>Uses master-nvram scheme to select the system master board. In multiple
>>>board configs., when the system is powered on the first time OBP may
>>>prompt you to select a sys. brd. to become the master. This prompt will
>>>appear if:
>>> -The system does not recognize ANY brd. as the master
>>> -More than one brd. is recognized as being qualified to function
>>> as the master.
>>>B. If PROM = or > 2.13
>>>Uses auto-master scheme instead of master-nvram scheme to select the
>>>system master board and then nvram info. is automatically propagated to all brds.
>>>If there is no system hardware error, OBP will select the POST master
>>>board ( or the lowest board that has a functional CPU) as the system master brd.
>>>The auto-master scheme ignores the status of nvram master or slave.
>>>If any system hardware error occurs after power on, then OBP will use
>>>the master-nvram scheme to select a system master as described in A.
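
The two schemes Val describes can be read as a small decision procedure.
The following is only a rough sketch of that reading in Python, not the
actual OBP logic: the Board fields, the function names, and the use of
None to mean "OBP prompts the operator" are all hypothetical, and the
message does not say what happens for PROM revisions between 2.11 and 2.13.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Board:
    """Hypothetical model of one system board (field names are assumptions)."""
    slot: int
    nvram_master: bool         # NVRAM marks this board as the master
    cpu_ok: bool               # board has a functional CPU
    post_master: bool = False  # board acted as the POST master


def select_by_nvram(boards: List[Board]) -> Optional[int]:
    """Scheme A: exactly one NVRAM master wins; otherwise prompt (None)."""
    masters = [b.slot for b in boards if b.nvram_master]
    return masters[0] if len(masters) == 1 else None


def select_master(prom_rev: Tuple[int, int],
                  boards: List[Board],
                  hardware_error: bool) -> Optional[int]:
    """Return the master slot, or None when the operator would be prompted."""
    if prom_rev <= (2, 11):
        return select_by_nvram(boards)
    if prom_rev >= (2, 13):
        if hardware_error:
            # Scheme B falls back to scheme A after a hardware error.
            return select_by_nvram(boards)
        post = [b.slot for b in boards if b.post_master]
        if post:
            return post[0]
        working = [b.slot for b in boards if b.cpu_ok]
        return min(working) if working else None
    raise ValueError("PROM revision not covered by the description")
```

Read this way, our hang matches the case where the function returns None:
the prompt is issued, and on our system it went to /dev/ttya, where only
the modem was listening.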
"Buddy Lumpkin" <firstname.lastname@example.org> wrote:
>>>The most common reason for watchdog resets is having too many sbus devices
>>>on a single board. Or just overloading any bus on the system. Make sure to
>>>spread them across as many of the boards as possible. If you look at the
>>>specs, the backplane can handle a lot, but it is EXTREMELY easy to overload
>>>these things ( we proved it in the Solaris Performance Management class, in
>>>fact an SS1000 was the case study!).
>>>So the goal is, spread the cards across several boards paying no attention
>>>to what is *pretty*.
(We might have too many SBUS cards, but that is not being addressed now....
it will be addressed in our next hardware platform.)
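
For planning that next platform, Buddy's advice amounts to dealing the
SBus cards out across the system boards in turn rather than filling one
board first. A minimal sketch in Python, assuming all cards load the bus
about equally (which may well not be true); the card names and the
function name are made up for illustration.

```python
from typing import Dict, List


def spread_cards(cards: List[str], board_slots: List[int]) -> Dict[int, List[str]]:
    """Deal cards out to boards round-robin, ignoring what looks tidy."""
    if not board_slots:
        raise ValueError("need at least one system board")
    layout: Dict[int, List[str]] = {slot: [] for slot in board_slots}
    for i, card in enumerate(cards):
        # Card i goes to board i modulo the number of boards.
        layout[board_slots[i % len(board_slots)]].append(card)
    return layout


if __name__ == "__main__":
    # Hypothetical card list for a two-board system.
    print(spread_cards(["scsi-a", "scsi-b", "fddi", "fb", "serial"], [0, 1]))
```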
>Our primary server has been in [Sun]'s hands for 13 hours and no luck.
>I wonder if anyone has seen the following?
>We had disk controller errors, called Sun.
>Field Engineer brought a new controller. (1am)
>The first new controller did not boot. (3am)
>They got a new one delivered. (5am)
>It booted with the wrong hostid. Field Engineer called backline
>support, got a procedure. Now board 2 doesn't boot (7am).
>Now, an additional engineer is here with a 3rd replacement,
>and the system won't boot...the screen driver never comes up.
>Orange light on the front is solid. (2pm)
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:14:05 CDT