I want to thank everyone who was kind enough to answer
my recent questions, "SS1000E controller crisis" and "SS1000E Reboot".
The first conclusion: design a system that can keep running
with failures until your regular Field Engineer is on duty.
Whether the alternate FE needs to be more experienced
is a contract issue I can raise but cannot control.
Second, have a terminal available so you can diagnose problems.
Third, have e-mail/web access so you can e-mail sun-managers right away :-)
Summary:
Four Sun Managers came up with the tests that took Sun nearly 20 hours to do...
It turns out that the system was prompting on /dev/ttya for instructions.
It seems that factory-programmed Controller boards for SS1000(E)
have two settings that are very important:
a) Use on-board NVRAM for hostid (which of course is wrong for us)
b) Use /dev/fb0 if present for console (which of course is right for us)
The instructions our FE got from SunSolve caused both of these to
be inverted - i.e., the correct hostid from Motherboard 0 (we assume),
but communication with /dev/ttya first.
Because we have a modem on /dev/ttya, and we also have
a monitor, the system came up only enough to ask "which one?"
but it only asks on /dev/ttya!  The modem does not know how to answer.
For the transcript and the sunman index: this looks as though the system
cannot boot, and the orange light remains solid.
When the FE unplugged the modem, the system booted correctly.
Lesson 2 - Plug a terminal into /dev/ttya.
In our next platform, /dev/ttya will actually be connected to 
a terminal concentrator, and there will be a terminal on it also.
Our regular Sun FE later told me that the procedure is to boot with
only a keyboard and a monitor connected.  Unplugging the modem,
which is part of the correct procedure, tells the system to use fb0.
Another conclusion is, always have a backup plan so that 
you can call it quits for the day - and KICK THE FE OFF PREMISES.
After many hours, we took a break, and left a replacement FE working on it.
While we were gone, motherboard 0 was changed.  
I was not there, I was not told about this until....
The following day, we had a watchdog reset running on the new motherboard.
Our regular FE came out, changed Motherboard 0 AND cpu 0, 
and that seems to have fixed it...almost a week has gone by without problems.
Seth Rothenberg
Thanks to:
<John.Julian@galegroup.com>
Ted Meng <mengx@nielsenmedia.com> who wrote
>>>Booting the system one board at a time (pulled all boards
>>>out and add them back one by one) sorted out a bad board.
(This would have worked, because we only had one modem hooked up)
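As a rough illustration of the one-board-at-a-time approach, here is a
minimal Python sketch. The function name find_bad_board and the boots
callback are hypothetical; boots stands in for a person powering the
system on and reporting whether it came up with the given boards installed.

```python
def find_bad_board(boards, boots):
    """Add boards back one at a time, in the order given.

    Returns the first board whose addition stops the system from
    booting, or None if the system boots with every board installed.
    `boots` is a hypothetical callback: it takes the list of installed
    boards and returns True if the system comes up.
    """
    installed = []
    for board in boards:
        installed.append(board)
        if not boots(installed):
            return board
    return None
```

In practice each call to boots is a manual power cycle, so this takes at
most one boot attempt per board.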
Dave Evans  daevans@us.oracle.com   who wrote
>>>If the screen doesn't come up try STOP-N. One of the boards may
>>>be set to /dev/sttya instead of the head.
(I will try the STOP-N sometime.  This probably would have shown us something)
LEIF.H.ERICKSEN@msg.ameritech.com  who wrote 
>>>you need an ascii terminal or laptop with a free serial port to hook to the
>>>A port of the 1st sys bd.  turn key from off to lightning bolt .   if
>>>nothing comes out and you are  sure the connection is good.  slide out the
>>>first cpu and connect to the second bd.  if you try booting cdrom  use
>>>solaris 2.5.1 or better in order to see the  ssa's.   off the cd run
>>>format|read|analyse to check the boot disk.
(Again, this would have removed the board with the modem on it)
Val Popa <vpopa@tsg.eds.xerox.com>  who identified
that the system is prompting for instructions....
>>>Your system probably has 2 system boards. If this is the case bear in 
>>>mind the following facts:
>>>The system master (sys. brd. in slot 0) is a standard system board. 
>>>However, it is NOT recommended to swap the s.b. master for the purpose of 
>>>troubleshooting. The system master is configured to fulfill minimum 
>>>requirements, and if swapped with another brd. of lesser configuration, 
>>>results may be misleading.
>>>A. If PROM rev = or < 2.11
>>>Uses master-nvram scheme to select the system master board. In multiple 
>>>board configs., when the system is powered on the first time OBP may 
>>>prompt you to select a sys. brd. to become the master. This prompt will 
>>>appear ONLY IF:
>>>	-The system does not recognize ANY brd. as the master
>>>	-More than one brd. is recognized as being qualified to function
>>>	 as the master.
>>>
>>>B. If PROM = or > 2.13
>>>Uses auto-master scheme instead of master-nvram scheme to select the 
>>>system master board and then nvram info. is automatically propagated to  all brds.
>>>If there is no system hardware error, OBP will select the POST master 
>>>board ( or the lowest board that has a functional CPU) as the system master brd.
>>>The auto-master scheme ignores the status of nvram master or slave.
>>>If any system hardware error occurs after power on, then OBP will use 
>>>the master-nvram scheme to select a system master as described in A.
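To make the two schemes above easier to follow, here is a small Python
model of the selection rules as Val describes them. It is only a sketch of
the quoted text, not the actual OBP logic. The function name, the
parameters, and the idea of passing the NVRAM master candidates as a list
are my assumptions. The quoted text does not say what PROM 2.12 does, so
the sketch refuses to guess.

```python
def select_master(prom_rev, nvram_masters, auto_choice, hardware_error=False):
    """Model of the master-board selection described in the quoted text.

    prom_rev       -- (major, minor) tuple, e.g. (2, 11)
    nvram_masters  -- board numbers recognized as qualified masters (assumed input)
    auto_choice    -- the POST master board, or the lowest board with a
                      functional CPU (assumed to be known to the caller)
    hardware_error -- True if a system hardware error occurred after power on

    Returns the selected board number, or None where OBP would prompt
    the operator to choose a master.
    """
    if prom_rev >= (2, 13) and not hardware_error:
        # Scheme B, auto-master: NVRAM master/slave status is ignored.
        return auto_choice
    if prom_rev <= (2, 11) or prom_rev >= (2, 13):
        # Scheme A, master-nvram (also the fallback for B on hardware error).
        if len(nvram_masters) == 1:
            return nvram_masters[0]
        # No board, or more than one board, recognized as master: OBP prompts.
        return None
    raise ValueError("PROM revision not covered by the quoted description")
```

The None case is the one that bit us: the prompt came out on /dev/ttya,
where only a modem was listening.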
"Buddy Lumpkin" <blumpkin@ijapan.com> wrote:
>>>The most common reason for watchdog resets is having too many sbus devices
>>>on a single board. Or just overloading any bus on the system. Make sure to
>>>spread them across as many of the boards as possible. If you look at the
>>>specs, the backplane can handle a lot, but it is EXTREMELY easy to overload
>>>these things ( we proved it in the Solaris Performance Management class, in
>>>fact an SS1000 was the case study!).
>>>
>>>So the goal is, spread the cards across several boards paying no attention
>>>to what is *pretty*.
(We might have too many SBUS cards, but that is not being addressed now....
it will be addressed in our next hardware platform.)
Original Question:
>Our primary server has been in [Sun]'s hands for 13 hours and no luck.
>I wonder if anyone has seen the following?
>
>We had disk controller errors, called Sun.
>Field Engineer brought a new controller. (1am)
>The first new controller did not boot. (3am)
>They got a new one delivered. (5am)
>It booted with the wrong hostid.  Field Engineer called backline
>support, got a procedure.  Now board 2 doesn't boot (7am).
>
>Now, an additional engineer is here with a 3rd replacement,
>and the system won't boot...the screen driver never comes up.
>Orange light on the front is solid.  (2pm)