SUMMARY: ps does not complete on 2.6

From: Jonathan.Loh@BankAmerica.com
Date: Thu Feb 05 1998 - 19:14:51 CST


     The original question follows, along with some of the suggestions.
None of the suggestions worked, so in the end I just rebooted the machine,
which cleared it up. Most people suggested I run truss on ps.

     On one of our Sparc20's, we are doing time machine testing. It has
2.6 loaded on it, with patches. The current date is set to Jan 4 2000. We
are starting to see a peculiarity, where 'ps -ef' never completes, it just
hangs. When we try to break out it will not let us. I've also noticed
that I can log on remotely but not from the console, again it just hangs.
And I can't kill it because I can't get the PID!!
That sounds reminiscent of a problem I saw on a SPARCcenter running
2.4 a couple of years ago....
Might see if "top" will run on the machine.
Might try running the "ps" via "truss" (to see what it's doing)....
david

--
David Wolfskill                          david@xtend.net
Pager: (408) 989-8067                    david@pager.xtend.net

alf@alpha.orion.it also suggested running truss on ps.
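
On the "can't get the PID" part of the question: one possible workaround, sketched here only as an illustration, is to start the command from a small script that records the PID itself and stops waiting after a timeout. This assumes Python 3 is available on the box; the function name start_and_watch and the 10-second timeout are made up for the example. It does not un-hang anything, and a process stuck inside the kernel may still ignore kill.

```python
import subprocess
import sys


def start_and_watch(cmd, timeout):
    """Start cmd, wait at most `timeout` seconds, and return (proc, returncode).

    returncode is None if the command had not finished in time. The process
    is left running in that case; proc.pid tells you which one it is.
    """
    # Output is discarded so a full pipe can never be the reason for a stall.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        returncode = None
    return proc, returncode


if __name__ == "__main__":
    # Default to the command from the original question.
    command = sys.argv[1:] or ["ps", "-ef"]
    proc, rc = start_and_watch(command, timeout=10)
    if rc is None:
        print(f"still running after 10s, PID {proc.pid}")
    else:
        print(f"finished with exit status {rc}")
```

With the PID in hand you can at least point truss or kill at the right process from another login.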

After I rebooted, everything was all right. I also got this response from staylor@cbis.com:

If you are running Oracle, the following may have some relevance...if not, I am still interested in finding out what other feedback you received. We are seeing a similar problem here on 2.5 and 2.5.1 with Oracle and Budtool running. It is an intermittent problem that occurs only when we checkpoint the cache before starting a backup on the filesystem holding the Oracle binaries. We had an open call to Sun from about a year ago, and from crash dumps we sent to them, they were able to narrow it down to the Oracle listener process ..... (excerpt from the Sun engineer's note to me follows)

"I just spoke to engineering about the ps problem. What he said was that the third party drivers were not following the lock protocol. There was a call to cv_wait_sig() from dm_cond_wait(). You are not supposed to do this if you are holding a lock. What happens is that ps ends up just waiting for the lock."

If we catch the problem before someone does a "ps", and kill the listener, then all is well. Oracle Support has not been able to recreate the problem, so we are on their back burner. Even internally, we are unable to recreate it on other database servers that are not backed up via Budtool. At this point, we are still trying to find the exact conditions to recreate the problem so we can pin down the correct vendor.
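
The lock problem in the Sun engineer's note can be pictured with a loose user-space analogy. This is a sketch only, not the kernel code: the names driver_lock, wakeup, misbehaving_driver and ps_like_reader are invented for the example, and Python threads stand in for the driver and for ps. One thread takes a lock and then goes to sleep on a condition while still holding it; a second thread that needs the same lock cannot make progress until the sleeper is woken.

```python
import threading


def demo(timeout=0.5):
    """Return (reader_blocked_while_driver_sleeps, reader_ok_after_wakeup)."""
    driver_lock = threading.Lock()      # stands in for the lock ps needs
    wakeup = threading.Condition()      # what the driver sleeps on
    holding = threading.Event()         # set once the driver holds the lock
    woken = threading.Event()           # the event the driver is waiting for

    def misbehaving_driver():
        with driver_lock:               # takes the lock ...
            holding.set()
            with wakeup:
                # ... and sleeps on the condition while still holding it.
                wakeup.wait_for(woken.is_set)

    def ps_like_reader():
        # A real ps would wait here with no timeout, which is the hang.
        # The timeout exists only so this demo can report and move on.
        got = driver_lock.acquire(timeout=timeout)
        if got:
            driver_lock.release()
        return got

    driver = threading.Thread(target=misbehaving_driver, daemon=True)
    driver.start()
    holding.wait()                      # make sure the driver has the lock

    blocked = not ps_like_reader()      # reader cannot get the lock

    # Deliver the wakeup; the driver then leaves and releases the lock.
    woken.set()
    with wakeup:
        wakeup.notify_all()
    driver.join()

    free_after = ps_like_reader()       # now the lock is available again
    return blocked, free_after


if __name__ == "__main__":
    print(demo())
```

In the analogy, killing the listener plays the role of the wakeup: once the sleeper goes away, the lock is released and the reader gets through.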

Now, we are not running Oracle, but we are running Sybase, so the problems might be the same. But since this happened after


