SUMMARY: Watchdog reset

From: Scott Gargash (scott@plab.dmll.cornell.edu)
Date: Wed Mar 03 1993 - 14:40:36 CST


  I sent mail to the list Monday with the following question:

> Last monday we experienced a power dropout in our lab. Of course, it
>caused everything to crash. Everything seemed to come back up (well, the
>superblock was corrupted on one drive but fsck fixed it). In the last week,
>twice I have come in in the morning to find our Sun3 at the PROM prompt
>with the message watchdog reset. One time I typed B and it booted fine, but
>the other time it failed its memory check. I had to cycle power to bring it
>back to life.
> Questions:
>1) What exactly is the watchdog?
>2) Is this behaviour related to the power dropout?
>3) Can this be fixed?
> Thanks.

I received 10 replies. It seems that the watchdog is a timer that checks
whether instructions are being executed within a certain amount of time. If
it times out, it assumes there is a problem and resets the CPU. People
suggested a variety of things that could cause this behavior, mostly bad
memory or other hardware trouble. Thanks to all who replied,
and especially to Koper Zangocyan <cc_koper@rcvie.co.at>, who sent a copy
of a Sun Tech Bulletin. I'm not sure what I'm going to do about the problem
except hope it doesn't reappear (it's not worth throwing money at a Sun3/160).
Thanks.

-------------------------------------------------------------------------
>From polk@ece.nps.navy.mil

>It's funny that you wrote about a "watchdog
>reset." I just got one and I am formatting my
>disk. My boss said it was a hardware problem
>that has something to do with the bus.
>
>I don't know the answer to your problem but, if
>this formatting doesn't work it probably means
>that we have a serious problem.

--------------------------------------------------------------------------
>From shandelm@jpmorgan.com
>
>I knew of a problem with 3/260 and 3/280's whereby if the ethernet cable was
>reattached while the machine was up, it would cause a watchdog reset. I forget
>the circumstances which cause this. My memory is a bit foggy now since the
>3/2xx days. I hope this helps somewhat.

--------------------------------------------------------------------------
>From glk@pls.amdahl.com

>I have an old Sun 3 that displays similar behavior. I don't know exactly what
>the watchdog is. The way it was explained to me, the machine gets itself into
>a confused state, at which point it decides it wants to start over with a clean
>slate. It seems reasonable to assume the power problem probably had something
>to do with it. My guess is memory problems, judging from your message and also
>my experience with my Sun 3.

--------------------------------------------------------------------------
>From cc_koper@rcvie.co.at

>Regarding your question about watchdog resets, I have found the following.
>It may not be exactly the information you need, but I am sure it will help.
>We also have had Sun-3's and they were unstable, especially wrt power.
>
>
>-----------------------------------------------------------------------------
>
>Collection: Software Technical Bulletins
>Document: 1056
>
>Year: 1989
>Month: February
>Title: Watchdog Resets for Kernel Debuggers
>-------------------------------------------------------------------------------
>
>Watchdog Resets for Kernel Debuggers
>-------- ------ --- ------ ---------
>
>This article is a discussion of watchdog resets for kernel debuggers. To
>understand these procedures you must be familiar with kadb, 680x0 assembler,
>and have an understanding of 680x0 stack frames. Sun-4 systems are also
>discussed.
>
>What is a Watchdog Reset?
>---- -- - -------- -----
>
>The Sun-2, Sun-3, and Sun-4 system CPUs are capable of halting under certain
>conditions. This can occur with the 680X0-based Sun-2 and Sun-3 machines when
>executing a halt (stop) instruction in either of the following circumstances:
> A bus error is received at times when the CPU is unable to handle it, due to
> stack problems.
>
> When the stack is rendered unusable as a result of other circumstances, the
> 680X0 cannot issue a bus error.
>
>The Sun-4 SPARC chip only halts when it receives a synchronous trap while
>already servicing a trap, a situation which rarely occurs. Note that
>asynchronous traps (such as interrupts) will not cause a halt.
>
>The watchdog reset message is produced by the ROM monitor. The detection of
>watchdog resets is less accurate on some machines than others; sometimes the
>ROM decides they are power-on resets.
>
>By default, a system stops when a watchdog reset occurs. An EEPROM option can
>cause an automatic reboot. The most likely kernel bugs which cause watchdog
>resets either overflow or trash the interrupt stack; some hardware problems can
>also cause a watchdog reset.
>
>Watchdog resets occur when processor hardware stops, as follows:
> Sun-2 and Sun-3 software can stop with a stop #2?00 instruction.
>
> Software can cause a 68020 to stop by causing a bus error during exception
> processing of a bus error, address error, reset, or certain portions of an
> RTE instruction.
>
> Sun-4 machines will generate a watchdog reset if a synchronous trap occurs
> while already servicing a trap.
>
> The most likely kernel bugs which cause watchdog resets either overflow or
> trash the interrupt stack. Some hardware problems can also cause a watchdog
> reset.
>
>In order to recover from these conditions, Sun has built what is called a
>`watchdog timer' into its systems. If no instructions are executed within a
>certain amount of time (for whatever reason) a timer expires, and we reset the
>CPU so the system can take steps to get running again.
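
To make the timer idea in the paragraph above concrete, here is a toy model in
Python. It is only a sketch: the class name, the tick-based timeout, and the
numbers are invented for illustration and say nothing about how the Sun
hardware actually implements its watchdog.

```python
class WatchdogTimer:
    """Toy watchdog: signals a reset if no instruction runs for too long."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks  # hypothetical timeout, in clock ticks
        self.idle_ticks = 0

    def instruction_executed(self):
        # Any executed instruction shows the CPU is alive, so restart the count.
        self.idle_ticks = 0

    def tick(self):
        # Called for each clock tick with no instruction executed.
        # Returns True if this tick makes the timer expire.
        self.idle_ticks += 1
        if self.idle_ticks >= self.timeout_ticks:
            self.idle_ticks = 0  # the reset restarts the count as well
            return True
        return False


def simulate(activity, timeout_ticks=3):
    """activity: sequence of booleans, True if an instruction ran on that tick.

    Returns the list of tick numbers at which the watchdog reset the CPU.
    """
    dog = WatchdogTimer(timeout_ticks)
    reset_ticks = []
    for tick_number, ran in enumerate(activity):
        if ran:
            dog.instruction_executed()
        elif dog.tick():
            reset_ticks.append(tick_number)
    return reset_ticks
```

A halted CPU corresponds to a long run of False values in `activity`; as long
as instructions keep arriving before the timeout, the timer never fires.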
>
>What is the System State After a Watchdog Reset?
>---- -- --- ------ ----- ----- - -------- -----
>
>The ROM monitor will have an accurate picture of most of the processor state at
>the time of the crash. The details are written up in the PROM monitor's trap.s
>module. The ROM monitor attempts to preserve the processor registers and so
>forth, but the following information will be lost:
> The Interrupt Stack Pointer (ISP).
>
> The PC.
>
> The Status Register (SR), including the supervisor/user flag, the `use
> master stack pointer' flag, and the interrupt level.
>
> Segment map entry 0 (0th pmeg).
>
> Page map entry for g_resetaddr (g_resetmap).
>
>Gathering Information for Analysis
>--------- ----------- --- --------
>
>If a watchdog reset occurs at some random time, perform the following:
>
> Use g4 to get a kernel stack trace.
>
> Use g0 to get a dump (this occasionally fails after a watchdog reset).
>
>Using kadb to Debug a Watchdog Reset
>----- ---- -- ----- - -------- -----
>
>kadb is useful when debugging a reproducible watchdog reset. When using kadb to
>debug a watchdog reset, the following occurs:
> The kadb registers will be wrong.
>
> Note the boot PROMs, as it is possible the boot PROM's registers will be
> invalid.
>
> Symbolic addresses will be constant, with or without kadb.
>
> Dynamically allocated kernel storage will move.
>
>To get a stack trace from kadb, perform the following.
>
> a6 is the C frame pointer; it will usually be located somewhere near the
> stack. If you check around where a6 points, you can usually find a
> frame-link address.
>
> If you can find a frame-link address, addr$c will produce a useful stack
> trace.
>
>Identifying the Causes of Watchdog Resets on Sun-2s and Sun-3s
>----------- --- ------ -- -------- ------ -- --- -- --- --- --
>Most watchdog resets involve interrupt stack problems of some sort, such as
>overflow, trashing, or unmapping. Here are some hints for identifying these on
>Sun-2 and Sun-3 machines.
>
>To identify the cause of a watchdog reset, one generally needs a reproducible
>case. kadb can be used to obtain such a case. Therefore, load kadb and cause
>the crash. Then, use the PROM monitor to list all the registers, and copy the
>listed registers down. Finally, start kadb with g fd00000.
>
>The overall strategy of this procedure is to determine the location of the last
>stack frame. Once that is available, you use addr$c to get a stack trace,
>which will tell you what is active at the time.
>
>Check a6 against the range eintstack-2k <= a6 < eintstack. If it is some value
>that is wildly out of range, the stack was probably trashed. In this case,
>refer to `Finding Your Place in a Trashed Stack', below. If the value is
>reasonable, but near the low end of the range, refer to `Checking for Stack
>Overflow', below.
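
The range check on a6 described above can be written out as a small Python
sketch. The function name, the result strings, and especially the `low_margin`
cutoff for "near the low end" are assumptions made for illustration; the
bulletin gives no number for that. For simplicity the sketch also treats any
out-of-range value as trashed, where the bulletin says "wildly out of range".

```python
STACK_GUARANTEED = 0x800  # the 2k of interrupt stack the bulletin says is guaranteed


def classify_a6(a6, eintstack, low_margin=0x100):
    """Classify a frame pointer against eintstack-2k <= a6 < eintstack.

    low_margin is a hypothetical threshold for "near the low end of the range".
    """
    low = eintstack - STACK_GUARANTEED
    if not (low <= a6 < eintstack):
        return "out of range: stack probably trashed"
    if a6 - low < low_margin:
        return "near low end: check for stack overflow"
    return "reasonable"
```

With a made-up `eintstack` of 0x4000, `classify_a6(0x3C00, 0x4000)` falls in
the middle of the range, `classify_a6(0x3810, 0x4000)` sits just above the low
end, and `classify_a6(0, 0x4000)` is out of range.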
>
>If trying to read the stack gives you an error message, the stack was probably
>unmapped.
>
>Finding Your Place in a Trashed Stack
>------- ---- ----- -- - ------- -----
>
>If a6 is zero or some small value, try working from the highest-level routine
>downward.
>
>If a6 is some unusually large value, try searching the stack for that value
>using the following commands:
> eintstack-800,800/L unusually-large-value
> eintstack-7fe,7fe/L unusually-large-value
>
>If these commands find some matches, try the following:
> found-addr+4/p
>
>For the matches which show valid routine names, look on the stack for other
>pointers of the form intstack+something. Use these as arguments to $c; this
>may bring you to a valid stack frame.
>
>If the above commands fail to find an appropriate match, the problem requires
>further, independent investigation outside the scope of this article.
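
The search step in this section can be pictured with a short Python sketch
that scans a copy of the last 2k of the interrupt stack for a 32-bit value.
This is not what adb does internally; the function name and the idea of
working on a byte dump are assumptions. My reading of the two search commands
(starting at eintstack-800 and eintstack-7fe) is that they cover longwords at
both 2-byte alignments, so the sketch steps through the buffer 2 bytes at a
time.

```python
import struct


def find_longword(stack_bytes, value, base=0):
    """Return the addresses in stack_bytes where the 32-bit value appears.

    stack_bytes: raw bytes dumped from the stack (hypothetical input).
    base: address of stack_bytes[0], e.g. eintstack - 0x800.
    The 680x0 is big-endian, so the value is packed most significant byte first.
    """
    needle = struct.pack(">I", value)
    hits = []
    # Step by 2 so that longwords on either even alignment are found.
    for offset in range(0, len(stack_bytes) - 3, 2):
        if stack_bytes[offset:offset + 4] == needle:
            hits.append(base + offset)
    return hits
```

Each address returned plays the role of found-addr above; the longword at
found-addr+4 is what the bulletin suggests printing to see whether it names a
valid routine.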
>
>Checking for Stack Overflow
>-------- --- ----- --------
>
>Take the value of a6 obtained from the PROM monitor and enter the following
>command: a6-from-prom-monitor$c
>
>This will usually produce a valid stack trace. Look at the prefix code of the
>last routine named, and find the size of the routine's stack frame (on a 68020,
>this will be an argument to the linkw instruction). Then enter the following:
>eintstack-addr of last stack frame+size of last stack frame = x
>
>If this number is more than 0x800, you have a stack overflow.
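
As a worked example of the arithmetic above, here is the same calculation in
Python. The function names and the sample addresses are made up for
illustration; only the formula and the 0x800 limit come from the bulletin.

```python
STACK_GUARANTEED = 0x800  # 2k limit used by the bulletin's overflow test


def interrupt_stack_usage(eintstack, last_frame_addr, last_frame_size):
    # eintstack - addr of last stack frame + size of last stack frame
    return eintstack - last_frame_addr + last_frame_size


def has_overflowed(eintstack, last_frame_addr, last_frame_size):
    # More than 0x800 bytes in use means the stack overflowed.
    usage = interrupt_stack_usage(eintstack, last_frame_addr, last_frame_size)
    return usage > STACK_GUARANTEED
```

For instance, with a hypothetical eintstack of 0x4000, a last frame at 0x3810
and a frame size of 0x20, the result is 0x4000 - 0x3810 + 0x20 = 0x810, which
is more than 0x800 and so counts as an overflow.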
>
>Interrupt Stack Sizes
>--------- ----- -----
>
>On Sun-2 and Sun-3 machines, the interrupt stack varies in size from 2k to 10k;
>you are guaranteed 2k. On Sun-4 machines, the interrupt stack varies in size
>from 4k to 12k; you are guaranteed 4k. The interrupt stack begins at the first
>page boundary following intstack.
>
>The stack-defining code in locore.s is deceptive, as it appears that the stack
>size is 2k plus the page size. Actually, the first part of the stack is
>write-protected, since it follows the kernel in memory. For further details,
>refer to the locore.s manual page.
>
>*******************************************************************************
>
>ONLINE SUPPORT SYSTEM (OSS),
>
> Software Technical Bulletin (STB),
>
> Produced by: Technical Information Services (TIS)
>
>Copyright (c) 1989, Sun Microsystems, Inc. All Rights Reserved. No part of
>this work covered by copyright hereon may be reproduced or used in any form or
>by any means -- graphic, electronic, or mechanical, including photocopying,
>recording, taping, or information storage and retrieval systems -- without
>permission of the copyright owner.
>
>RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the Government is
>subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights
>in Technical Data and Computer Software clause at DFARS 52.227-7013 and in
>similar clauses in the FAR and NASA FAR Supplement.

--------------------------------------------------------------------------------
>From @a.gec-epl.co.uk:dunstan@gec-epl.co.uk
>
>As far as I know, it is a software watchdog. The message "Watchdog
>Reset" usually means that the kernel has found some inconsistency in
>the hardware it is addressing. This would stack up with your saying
>that it failed memory tests.
>
>It is possible that some small corruption still exists in your root
>filesystem - and it might be worth your while to reinstall the OS, but
>it sounds like hardware to me.

--------------------------------------------------------------------------------
>From hendefd@tech.duc.auburn.edu
>
>We experienced multiple watchdog resets with our 4/280. It also had
>occasional problems with memory check and it turns out that we had a
>bad memory board. Since it's real difficult and expensive to get a
>new one or have one repaired, we moved the faulty board to the top of
>the memory stack and ran fine for about 3 months. Now, of course it
>died completely, but didn't cause a reset.

-------------------------------------------------------------------------------
>From ups!upstage!glenn@fourx.Aus.Sun.COM

>A watchdog reset occurs when the system panics for some reason, and
>then while it is handling the panic it panics again. Since it hasn't
>finished responding to the first one it cannot continue and gives a
>watchdog reset. Usually these are caused by a hardware failure. It
>sounds like it's time to call in the repair man, as you may need a new
>cpu or memory I think.
>
>Sometimes you can see the panic message by using dmesg after the system
>has booted (this won't work if you had to power cycle because dmesg
>prints out the kernel's message buffer from memory).

--------------------------------------------------------------------------------
>From ups!kalli!kevin@fourx.Aus.Sun.COM
>
>A watchdog reset is caused when the CPU halts. A timer goes off to
>make sure the system doesn't just hang, hence the name watchdog. This
>is generally caused by double bus errors on the 68020 machines. That
>in turn is usually because somebody hosed the stack (as in a badly
>written driver or system code), or the machine is a sick puppy and the
>memory isn't giving the right answers all the time.
>
>> 2) Is this behaviour related to the power dropout?
>
>If it wasn't happening before, probably.
>
>> 3) Can this be fixed?
>
>Have you run the diagnostics on the board yet? They are not exhaustive,
>but they sometimes catch problems...


