SUMMARY: Help! Sparc 2 - SunOS 4.1.4 periodically hangs

From: Don Lewis (gdonl@gv.ssi1.com)
Date: Sat Sep 16 1995 - 05:57:00 CDT


It appears this problem was caused by building a new libc.so file (to
include the BIND resolver) with a new minor version number but not
creating the matching libc.sa file (by copying the old one). When new
application binaries were linked, they did not get statically linked
with the exported initialized data that resides in the .sa file, but
instead referenced the dynamically linked copy in the .so file. The
Sun Answerbook says that this can result in performance degradation, but
it looks like it may also tickle a subtle kernel bug when the pages
containing this data are modified at run time.
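
A minimal sketch of a check for this condition, assuming the SunOS 4.x
naming scheme lib<name>.so.<major>.<minor> with a matching
lib<name>.sa.<major>.<minor>. The function name missing_sa_files and the
version numbers in the sample list are made up for illustration. Only
libraries that already have at least one .sa file are checked, since many
shared libraries have no .sa counterpart at all.

```python
import re

# Matches SunOS 4.x style names such as libc.so.1.9 or libc.sa.1.9.
_LIB_RE = re.compile(r"^(lib.+)\.(so|sa)\.(\d+)\.(\d+)$")


def missing_sa_files(filenames):
    """Return the .sa names missing for existing .so versions.

    Hypothetical helper: a library is only checked if at least one
    .sa file for it is present in the listing.
    """
    so_versions = {}  # base name -> set of (major, minor)
    sa_versions = {}
    for name in filenames:
        m = _LIB_RE.match(name)
        if not m:
            continue
        base, kind, major, minor = m.groups()
        target = so_versions if kind == "so" else sa_versions
        target.setdefault(base, set()).add((int(major), int(minor)))

    missing = []
    for base, versions in so_versions.items():
        if base not in sa_versions:
            continue  # this library never had a .sa file
        for major, minor in sorted(versions - sa_versions[base]):
            missing.append("%s.sa.%d.%d" % (base, major, minor))
    return sorted(missing)


# Sample listing with illustrative version numbers: a rebuilt libc.so
# with a bumped minor version and no matching libc.sa.
sample = ["libc.so.1.9", "libc.sa.1.9", "libc.so.1.10", "libdl.so.1.0"]
print(missing_sa_files(sample))
```

On a real system the listing would come from the library directory
(for example os.listdir("/usr/lib")), and any name reported would be a
candidate for the "copy the old .sa file" fix described above.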

This system seemed to be more susceptible to these hangs when ncsize and
ninode were set even higher (hanging about once a week). I've currently
got these set to ncsize=16000 and ninode=18000 and the current uptime
is four weeks. The system was last rebooted to muck with the hardware
after a two or three week uptime. It's looking cured to me :-)

On Jun 13, 10:42pm, Don Lewis wrote:
} Subject: Help! Sparc 2 - SunOS 4.1.4 periodically hangs
}
} I'm having problems with periodic system hangs with one of our SPARC 2
} machines. In the latest episode, I tried to su, and the su command hung.
} In another window, I tried to strace the su command, and the strace command
} didn't report what system call the su command was executing. I attempted
} to ^C and ^Z the su command and it did not respond. It also did not respond
} to kill -9. When I attempted to trace the su command again, the console
} froze. From another workstation, the system responded to ping, but not to
} telnet, rup, etc. It responded to L1-A, so I tried the continue command
} which didn't bring it back to life, so I gave up and did an L1-A followed by the
} sync command to force a panic. The machine had been up about four weeks.
} I've searched SunSolve and turned up nothing.
}
} In the hardware department, it has a Weitek CPU upgrade, 48 MB of RAM, a
} fully loaded SCSI bus with 4 disks, two tapes, and a fairly lightly used
} Pioneer six disk CDROM changer (using the stock sr driver). It has a
} Telebit WorldBlazer on one of the zs ports.
}
} It's running SunOS 4.1.4 with the following patches:
} 102264-02
} 102373-01
} 102394-01
} 102422-01
} 102433-01
} These system hangs also occurred with a heavily patched 4.1.3U1 kernel.
}
} It's been tuned according to the Sun Performance Tuning Overview
} white paper:
} maxusers 32
} ninode 12000
} ncsize 12000
} lotsfree 0x100
} fastscan 0xc00
} slowscan 0x64
} maxslp 0x80
} maxpgio 0x32
} nbuf 0x70
} The ninode and ncsize parameters are set to these values in an attempt
} to increase the dnlc hit rate. When this machine was running 4.1.3U1
} and ninode and ncsize were set to larger values, the system hangs
} occurred approximately weekly.
}
} It's running OW3_414 + various patches.
}
} This machine is our NIS master and is an NFS client and server.
}
} Other software of note:
} INN
} CAP using NIT to speak ethertalk
} popper
} sendmail 8.6.11
} AMD automounter
} xntpd
} mushtool (using Sunview over Openwin)
} xautolock (using the Sunview lockscreen over Openwin)
}
} Another SPARC 2 with the Weitek CPU running the same kernel, with a
} regular CDROM instead of the changer, lacking the DAT, not running
} INN or popper, and generally less heavily loaded is completely stable.
}
} The only thing suspicious that I see is that the system hangs seem to
} occur when the "locked" memory reported by top approaches 8000K. Right
} after reboot, the locked memory is about 5800K with this kernel. It
} seems to increase fairly rapidly at first to 6300K then drift more slowly
} upwards. I've cannibalized the innards of top in order to track this
} over time and try to correlate it with anything. I'm somewhat suspicious
} of a kernel memory leak, since one of the 4.1.3U1 crashes revealed a bunch
} of processes hung in kmem_alloc().
}
} This crash dump hasn't been too informative (various sleeping processes
} deleted):
}
} F UID PID PPID CP PRI NI SZ RSS WCHAN STAT TT TIME COMMAND
} 80003 0 0 0 0 -25 0 0 0 R ? 0:00 swapper
} 20088001 0 1 0 0 5 0 52 16 child I ? 1:33 /sbin/init -
} 80003 0 2 0 0 -24 0 0 0 R ? 2:58 pagedaemon
} 88001 0 58 1 0 1 0 68 136 R ? 0:35 portmap
} 88001 0 63 1 0 1 0 144 180 R ? 1:06 ypserv
} 88001 0 65 1 0 1 0 64 68 select I ? 0:02 ypxfrd
} 88001 3 67 1 0 1 0 36 68 select I ? 0:01 ypbind
} 88001 0 69 1 0 1 0 40 28 select I ? 0:00 rpc.ypupdate
} 88001 0 77 1 0 1 0 36 108 R ? 5:12 in.routed
} 88001 0 95 1 0 -5 0 64 160 R ? 9:50 syslogd
} 88001 0 153 1 0 1 -4 108 332 R < ? 17:31 /usr/etc/amd
} 88001 0 183 1 0 1 0 68 156 select I ? 0:07 rpc.mountd
} 488001 0 186 1 0 1 0 28 0 R ? 13:11 (nfsd)
} 88001 0 187 186 0 1 0 28 0 R ? 13:19 (nfsd)
} 88001 0 188 186 0 1 0 28 0 R ? 13:25 (nfsd)
} 88001 0 189 186 0 1 0 28 0 R ? 13:49 (nfsd)
} 88001 0 190 186 0 1 0 28 0 R ? 13:35 (nfsd)
} 88001 0 191 186 0 1 0 28 0 R ? 13:28 (nfsd)
} 88001 0 192 186 0 1 0 28 0 R ? 13:42 (nfsd)
} 88001 0 193 186 0 1 0 28 0 R ? 13:27 (nfsd)
} 88001 0 197 1 0 1 0 16 0 nfs_dnlc I ? 3:19 (biod)
} 88001 0 198 1 0 1 0 16 0 nfs_dnlc I ? 3:19 (biod)
} 88001 0 199 1 0 1 0 104 76 select I ? 0:00 rpc.lockd
} 88001 0 200 1 0 1 0 52 44 select I ? 0:00 rpc.statd
} 88001 0 201 1 0 1 0 16 0 nfs_dnlc I ? 3:18 (biod)
} 88001 0 202 1 0 1 0 16 0 nfs_dnlc I ? 3:18 (biod)
} 80488001 0 216 1 0 1 0 156 84 R ? 1:05 sendmail: ac
} 80001 0 225 1 0 1 0 16 24 select I ? 6:29 screenblank
} 20088001 0 226 1 0 1 0 504 100 R ? 2:08 /usr/etc/snm
} 20088001 0 227 1 0 1 0 48 84 R ? 0:10 /usr/etc/syc
} 80488001 6 241 1 0 -5 02712 960 R ? 966:30 /usr/lib/new
} 88001 0 327 1 0 1 0 120 208 R ? 89:35 /usr/etc/aar
} 88001 0 329 1 0 1 0 88 168 R ? 53:34 /usr/etc/ati
} 88001 0 332 1 0 1 0 80 108 R ? 0:06 /usr/etc/sni
} 88001 0 334 1 0 1 0 160 112 R ? 0:40 /usr/etc/auf
} 88001 0 336 1 0 1 0 116 144 R ? 1:03 /usr/etc/lws
} 88001 0 338 1 0 1 0 116 144 R ? 1:01 /usr/etc/lws
} 88001 0 340 1 0 1 0 116 144 R ? 1:04 /usr/etc/lws
} 88001 0 342 1 0 1 0 160 112 R ? 0:48 /usr/etc/lws
} 80001 0 347 1 0 -5 0 12 8 R ? 677:09 update
} 488001 0 350 1 0 1 0 52 184 R ? 8:21 cron
} 488001 0 357 1 0 1 0 48 108 R ? 4:28 in.rwhod
} 88001 0 359 1 0 1 0 52 100 R ? 4:33 inetd
} 20488001 0 16387 359 0 25 0 56 356 R ? 0:00 popper
} 20088001 0 16388 359 0 25 0 28 180 R ? 0:00 popper
} 20088001 6 16389 241 0 25 0 56 164 R ? 0:00 /usr/etc/in.
} 88001 0 17075 1 0 1 01592 932 R ? 15:03 in.named
} 20088001 0 19986 359 0 1 0 48 120 R ? 5:40 rpc.rstatd
} 2000804124603 21800 21799 0 25 06484 3212 R co202:58 /usr/openwin
} 2040820124603 21825 1 0 15 0 272 200 R co 7:56 xautolock -l
} 2000800124603 21835 1 0 1 0 152 472 R co 21:19 perfmeter -W
} 2000800124603 21838 1 0 1 0 152 444 R co 14:56 perfmeter -W
} 2000800124603 21842 1 0 1 0 132 404 R co 0:30 clock -Wp 44
} 2000800124603 21850 1 0 1 0 280 644 R co 3:19 xcalentool -
} 20108001 0 16386 16382 3 5 0 24 224 child I p6 0:00 trace -p 163
} 2008800124603 5974 1 0 1 01248 1460 R p8 3:53 mushtool
} 88201 0 16788 1 0 15-12 120 224 R < p8 0:31 /usr/etc/xnt
} 2000800124603 9697 345 0 1 0 644 832 R pf135:57 /usr/local/b
} 20408011 0 16352 10789255 1 0 36 316 R q2 2:57 su
}
}
} sunrise gdonl 163> adb -k vmunix.0 vmcore.0
} physmem 2fc6
} 0t16352$<setproc
} $c
} pid 16352
} _panic(0xf8142e57,0xf8142e57,0x0,0x0,0x0,0x0) + 6c
} 0t359$<setproc
} $c pid 359
}
} sw_bad(?)
} _swtch(0x90800aa2,0xf8299460,0x904000a2,0x0,0x3,0x908000a1) + 80
} _sleep(0xf8152d68,0x1a,0x400000,0xa00,0x1a,0xf829bcc4) + 1a0
} _select(0xffbfffff,0x88001,0xf84f3fe0,0x0,0xf84f3fe0,0xf84f4000) + 4cc
} _syscall(0xf84f4000) + 3b4
} 0t347$<setproc
} $c pid 347
}
} sw_bad(?)
} _swtch(0x90900ae4,0xf8152db0,0x904000e4,0x0,0x3,0x1) + 80
} _sleep(0xf8152db0,0x14,0xf8121c00,0xa00,0x14,0xf829ac9c) + 1a0
} _biowait(0xff070474,0xbc5f0,0xfce024c0,0xfce025a0,0x0,0x904000e5) + 30
} _bwrite(0xff070474,0xfdef3000,0x0,0x38,0x8,0x0) + c4
} _ufs_sbwrite(0xfde4d538,0x1,0xff070474,0xf81476e8,0xf81476e8,0xfdef3000) + e0
} _ufs_notclean(0xfce62718,0xfde4d000,0x0,0x0,0x0,0xff070474) + f8
} _iupdat(0xff15b778,0x1,0x1101,0xfde4d000,0x0,0x0) + 40
} _syncip(0xff15b778,0x100,0x0,0x4,0x1101,0x0) + 14c
} _update(0xf8158b00,0x104e,0xff15b778,0xff256250,0x0,0xff15b778) + 2e0
} _ufs_sync(0x0,0xf8132408,0xf8136400,0xf81570e0,0xf8132408,0x2) + 4
} _sync(0xf8462fe0,0x120,0xf812fdd8,0xf812fef8,0xf8463000,0xf81323c8) + 3c
} _syscall(0xf8463000) + 3b4
}
} This doesn't seem to be an mbuf leak:
}
} sunrise gdonl 164> netstat -m vmunix.0 vmcore.0
} 927/1088 mbufs in use:
} 147 mbufs allocated to data
} 17 mbufs allocated to packet headers
} 289 mbufs allocated to socket structures
} 331 mbufs allocated to protocol control blocks
} 8 mbufs allocated to routing table entries
} 133 mbufs allocated to socket names and addresses
} 2 mbufs allocated to interface addresses
} 4/28 cluster buffers in use
} 164 Kbytes allocated to network (73% in use)
} 0 requests for memory denied
} 0 requests for memory delayed
} 0 calls to protocol drain routines
}
} streams allocation:
} cumulative allocation
} current maximum total failures
} streams 47 55 3307 0
} queues 153 185 12983 0
} mblks 1026 1028 1381024762 0
} dblks 1026 1027 82518516 0
} streams buffers:
} external 0 0 0 0
} within-dblk 23 107 12890624 0
} size <= 16 4 154 10074880 0
} size <= 32 30 154 4126926 0
} size <= 128 971 972 31485376 0
} size <= 512 0 40 8382531 0
} size <= 1024 0 14 7838400 0
} size <= 2048 0 43 7719639 0
} size <= 8192 0 2 127 0
} size > 8192 0 0 0 0
}
}
} I'm stumped at this point. Does this look familiar to anyone out
} there?
}
} --
} Don "Truck" Lewis Silicon Systems
} Internet: gdonl@gv.ssi1.com 138 New Mohawk Road
} Phone: +1 916 478-8284 FAX: +1 916 478-8251 Nevada City, CA 95959
}-- End of excerpt from Don Lewis

                        --- Truck


