Partial-SUMMARY: Hung locked thread under Solaris 2.6

From: Davin Milun (milun@cse.Buffalo.EDU)
Date: Mon Mar 01 1999 - 15:39:57 CST


I do not have this problem resolved yet.

But with some gracious help from Kevin Sheehan (u-kevin@megami.veritas.com),
I managed to do a bit more poking with adb, and came up with a stack
backtrace of the hung thread:
   0xfc165a28: cachefs_dir_complete+0x1f8
   0xfc165ad0: cachefs_async_populate_dir+0x8c
   0xfc165bf0: cachefs_async_populate+0x384
   0xfc165c58: cachefs_do_req+0x94
   0xfc165d50: cachefs_async_start+0x260
   0xfc165db0: thread_start+4
   0xfc165e20: cachefs_async_start

So it clearly looks like a cachefs bug, and I'm going to have someone here
at the University report it to Sun.

(Before you ask, we do have the current cachefs patch installed: 105693-05)
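
The reading of the backtrace above can be checked mechanically. What follows
is only a minimal sketch, not part of the adb session: it parses the frames
exactly as listed and counts how many symbols share a name prefix. The
prefix rule (text before the first underscore) and all function names here
are assumptions made for illustration.

```python
import re
from collections import Counter

# Backtrace text copied from the adb output quoted in this message.
BACKTRACE = """\
0xfc165a28: cachefs_dir_complete+0x1f8
0xfc165ad0: cachefs_async_populate_dir+0x8c
0xfc165bf0: cachefs_async_populate+0x384
0xfc165c58: cachefs_do_req+0x94
0xfc165d50: cachefs_async_start+0x260
0xfc165db0: thread_start+4
0xfc165e20: cachefs_async_start
"""

# One frame: hex address, colon, symbol name, optional "+offset".
FRAME_RE = re.compile(r"^\s*(0x[0-9a-fA-F]+):\s+(\w+)(?:\+(\w+))?\s*$")


def parse_frames(text):
    """Return (address, symbol, offset) tuples; offset is None if absent."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1), 16), match.group(2), match.group(3)))
    return frames


def symbol_prefix(symbol):
    """Assumed grouping rule: the text before the first underscore."""
    return symbol.split("_", 1)[0]


def prefix_counts(frames):
    """Count frames per symbol prefix."""
    return Counter(symbol_prefix(symbol) for _, symbol, _ in frames)


frames = parse_frames(BACKTRACE)
counts = prefix_counts(frames)

if __name__ == "__main__":
    for prefix, count in counts.most_common():
        print(prefix, count)
```

Six of the seven frames carry the cachefs prefix; the remaining one is
thread_start, which is consistent with the conclusion drawn above.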

Kevin's other suggestion was to avoid cachefs:
>I guess the old joke about "doctor, it hurts ... don't do that" would be
>all I can suggest here. Dunno enough about cachefs internals to have
>a clue what it could be, so it's probably filing a bug with Sun.
>
>I don't work for Veritas, but Uniq license a product known as SRFS that
>does a superset of the job cachefs does. Our object is to have a complete
>up-to-date copy on the remote machine of the local data. I guess a useful
>side effect for your situation is that you would have all the data locally
>too. I suspect with a .EDU address, that this is not an option :-)

Thanks.
Davin.

Original question: (with some raw data removed)
>
>We're having a problem with the system that is currently our webserver.
>It's a SS10 with two SM52 modules, so it's a quad processor system.
>It's running Solaris 2.6, pretty much patched up current (105181-12, and
>*lots* of others, which I'll list below somewhere).
>
>This system has been in service for a long time here, but only became our
>web server about 2.5 weeks ago. It's running Apache 1.3.4; and also the
>latest wuftp-2.4.1-beta18-vr13. And some of the web directories are cachefs
>mounted from remote NFS servers.
>
>Before I get into all the details, here's the overview and problem:
> Twice last week the system got itself into a state where
> something is stuck, locked in kernel mode: load stays above 1.0; mpstat
> and xcpustate show one CPU in 100% system mode (but which CPU it is does
> change, every few minutes usually).
>
>I rebooted it Wednesday morning to fix it, but it did not go down cleanly at
>all. It killed most processes, but then panic'ed with a 'mutex adaptive
>exit' (or something close to that), and then could not sync disks
>(apparently an infinite loop of "[3] [3] [3]..."). To break that, I had to
>unplug and replug the keyboard. At the ok prompt I typed "sync" and it went
>into a fast loop outputting something about a processor (level 4?? level
>14??) interrupt not serviced. And I needed to power-cycle to get the system
>back.
>
>That was the first time it happened. I'd hoped that it was a once-off fluke.
>But then late Friday afternoon I noticed that it had happened again. :-(
>
>It doesn't seem to be causing any real problems for the system - but the fact
>that it doesn't reboot cleanly is a big concern, because we want the system
>to be able to come up without intervention, just in case.
>And if it's actually completely tying up a CPU, that's wasteful anyway.
>
>
>More details:
>
>So, I've been poking at the running system using iscda and crash.
>
>By watching xcpustate or mpstat, I can tell which CPU is currently being
>tied up by it. So I then run iscda, and see what's running on that CPU.
>
>And currently it's always the same thread: fc165e80
>
>So I examine that thread with crash, and get the following:
>> thread -f fc165e80
...
>So it's a thread that belongs to sched!! :-(
>
>Any suggestions?
>Does this relate to any known bug?
>Any advice on how to diagnose this any further?
>
>Now for some more system details:
>
>SunOS xxxxxx.cse.Buffalo.EDU 5.6 Generic_105181-12 sun4m sparc SUNW,SPARCstation-10
>
>OpenBoot 2.12
>
>128MB memory.
>About 800Meg swap space, spread over 3 disks.
>
>Patches installed:
> 105160-01 105181-12 105189-01 105210-18 105214-01 105216-03 105223-01
> 105284-23 105338-14 105356-07 105357-02 105375-09 105377-03 105379-05
> 105393-07 105397-02 105400-01 105401-20 105403-01 105405-01 105407-01
> 105416-01 105426-01 105464-01 105472-01 105486-01 105490-07 105492-02
> 105497-01 105516-01 105518-01 105528-01 105529-01 105552-02 105558-03
> 105562-03 105564-02 105566-06 105568-12 105572-02 105591-05 105600-07
> 105604-05 105615-04 105618-01 105621-09 105630-01 105633-16 105637-01
> 105651-02 105654-03 105665-03 105667-02 105669-04 105686-02 105693-05
> 105703-08 105705-01 105718-02 105720-06 105722-01 105724-01 105736-01
> 105742-01 105743-01 105746-01 105755-07 105757-01 105776-01 105778-01
> 105780-01 105786-07 105792-03 105795-05 105797-05 105798-02 105800-05
> 105802-07 105836-01 105837-02 105845-01 105847-01 105926-01 106040-10
> 106049-01 106112-03 106123-04 106125-06 106193-03 106222-01 106226-01
> 106235-02 106242-02 106257-04 106271-05 106301-01 106415-01 106439-02
> 106448-01 106522-01 106735-04 106828-01
>
>/etc/system additions:
> set priority_paging=1
> set ufs_ninode = 10000
> set ncsize = 30000
> set maxpgio=180
>
>If any more information would help, just ask.
>
>Thanks again.
>Davin.
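
The step in the original question of watching mpstat to see which CPU is
pinned in system mode could be scripted. This is a rough sketch only: it
assumes the mpstat report has a header line beginning with "CPU" and a
column headed "sys", and finds columns by name. The function name and the
sample numbers are made up for illustration and are not data from the
affected machine.

```python
def busiest_sys_cpu(mpstat_text):
    """Return (cpu_id, sys_percent) for the CPU with the highest 'sys' value.

    Only the most recent report block is considered: each header line
    starting with "CPU" resets the running maximum.
    """
    header = None
    best = None
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            header = fields
            best = None  # a new report block starts here
            continue
        if header is None or len(fields) != len(header):
            continue  # skip anything that is not a data row
        row = dict(zip(header, fields))
        cpu_id, sys_pct = int(row["CPU"]), int(row["sys"])
        if best is None or sys_pct > best[1]:
            best = (cpu_id, sys_pct)
    return best


# Made-up numbers for illustration only; not captured from a real system.
SAMPLE = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    3   0   12  210  110 150    4    2    5   0   320   6   4  1  89
  1    2   0   10   15    5 140    3    2    4   0   300   5   3  0  92
  2    0   0    8   12    2  20    1    1    9   0    10   0 100  0   0
  3    1   0    9   14    4 130    2    1    3   0   280   4   2  1  93
"""

if __name__ == "__main__":
    print(busiest_sys_cpu(SAMPLE))
```

Since the stuck thread moves between CPUs every few minutes, the check would
need to be repeated on fresh mpstat output each time before looking at that
CPU with iscda.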

-- 
Davin Milun    E-mail:  milun@cse.Buffalo.EDU     milun@acm.org
               Fax:     (716) 645-3464
               WWW:     http://www.cse.buffalo.edu/~milun/


