SUMMARY: Disk contention problems

From: Rob McMahon <Rob.McMahon_at_warwick.ac.uk>
Date: Thu Nov 29 2007 - 10:03:20 EST

Thanks for all your suggestions, there were some good points in there.  
Thanks to Ric Anderson,  John Leadeham,  Tobias Nutt, Joe Fletcher, 
Bhaskar G, Pawel Osiczko, and Grzegorz Bakalarski for their prompt 
replies, and apologies for the late summary.  It's only today I'm 
completely happy, and it's involved moving a bunch of data into the SAN 
(it needed doing anyway).

Suggestions were that having a UFS filesystem 95% full is a bad idea in 
the first place, because of the overhead of looking for free inodes / free 
data blocks.  Also, UFS doesn't do so well on filesystems with millions 
of files.  Hence the move of a chunk of data to the SAN.
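
For anyone wanting to keep an eye on this, here is a minimal sketch in 
Python that reports how full a filesystem is in both data blocks and 
inodes.  The function names and the 90% threshold are my own assumptions, 
not something from the replies, and the block figure counts the reserved 
(minfree) space as free, so it can read a little lower than df's capacity 
column.

```python
import os


def usage_percent(used, total):
    """Return used/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * used / total if total else 0.0


def fs_fullness(path):
    """Return (block_pct, inode_pct) for the filesystem holding `path`."""
    st = os.statvfs(path)
    block_pct = usage_percent(st.f_blocks - st.f_bfree, st.f_blocks)
    inode_pct = usage_percent(st.f_files - st.f_ffree, st.f_files)
    return block_pct, inode_pct


def is_too_full(path, threshold=90.0):
    """True if either blocks or inodes exceed `threshold` percent.

    The 90% default is an arbitrary assumption; pick whatever suits.
    """
    block_pct, inode_pct = fs_fullness(path)
    return block_pct > threshold or inode_pct > threshold


if __name__ == "__main__":
    blocks, inodes = fs_fullness("/")
    print("blocks: %.1f%%  inodes: %.1f%%" % (blocks, inodes))
```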

Fragmentation can apparently still be an issue; the only real cure for 
that would be a dump | restore, a messy option when you're talking about 
500 GB of data.

If a controller has actually failed, this can trigger the array to 
switch to write-through mode, clobbering performance.  In my 
case `show cache-param' still showed `mode: write-back', but it's 
definitely worth checking.

UFS can throttle writes when the write rate is high, and the throttle is 
tweakable.

A failed / failing drive can hurt performance.  All my drives were good.

UFS journalling is important, and was turned on.

The optimisation mode can make a big difference.  Think before you 
create a volume, because you can't change it later!  I have mine 
optimised for random access, which seems about right for a mail spool.

There were also a couple of comments that the 3510 isn't a great 
performer in the first place, plus suggestions to check for bad memory 
and to make sure the firmware's up to date.  I'm a happy bunny at the 
moment, and firmware upgrades mean more downtime, so I'm going to 
schedule that for Christmas.

Anyway, I finally seem to have got it sorted, and it appears to have 
been due to the controllers being in a dodgy state, i.e. this:

sccli> show redundancy-mode
 Primary controller serial number: 8040592
 Primary controller location: Lower
 Redundancy mode: Active-Active
 Redundancy status: Failed
 Secondary controller serial number: 8009331
sccli>
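
Since that "Failed" status turned out to be the culprit, it may be worth 
checking for automatically.  A minimal sketch in Python that parses 
output in the format shown above; how the text is captured from sccli is 
left out, the function names are hypothetical, and "Enabled" is treated 
as the healthy value only because that is what my array reports now that 
it is behaving.

```python
SAMPLE = """\
sccli> show redundancy-mode
 Primary controller serial number: 8040592
 Primary controller location: Lower
 Redundancy mode: Active-Active
 Redundancy status: Failed
 Secondary controller serial number: 8009331
sccli>
"""


def parse_redundancy(text):
    """Collect the 'Key: value' lines of the output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip prompt lines and anything that is not a key/value pair.
        if line.startswith("sccli>") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def redundancy_ok(text):
    """True only when the reported redundancy status is 'Enabled'."""
    return parse_redundancy(text).get("Redundancy status") == "Enabled"


if __name__ == "__main__":
    info = parse_redundancy(SAMPLE)
    print(info["Redundancy status"], redundancy_ok(SAMPLE))
```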

On the suggestion of a guy from Sun, I tried

sccli> unfail

The Redundancy status changed to Scanning, and then to Detected, and 
then I lost one of my LUNs.  Bugger.  Then he suggested

sccli> reset controller

and the machine panicked and came back to single-user because of loss of 
metadb quorum.  Bugger, bugger.  I should have known better; I would 
have seen that coming if I hadn't been panicking myself.

Anyway, I shut the machine down, power-cycled the array, waited for the 
array to look healthy, and brought the machine back up.  Redundancy 
status is now "Enabled", asvc_t is a tenth of what it was, throughput 
(kw/s) is 2-3 times what it was, and all's well with the world again.
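
To watch for a recurrence, here is a rough sketch in Python that parses 
iostat rows in the column layout quoted in the original problem below 
and flags devices whose asvc_t is over a limit.  It assumes each row 
arrives on a single line, as iostat prints it; the function names and 
the 100 ms default are arbitrary assumptions.

```python
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
           "wsvc_t", "asvc_t", "%w", "%b"]


def parse_iostat_row(line):
    """Return a dict for one data row, or None for headers and junk."""
    parts = line.split()
    if len(parts) != len(COLUMNS) + 1:
        return None
    try:
        values = [float(p) for p in parts[:-1]]
    except ValueError:
        return None  # e.g. the header line itself
    row = dict(zip(COLUMNS, values))
    row["device"] = parts[-1]
    return row


def slow_devices(lines, asvc_limit=100.0):
    """Return device names whose asvc_t exceeds `asvc_limit` (ms)."""
    slow = []
    for line in lines:
        row = parse_iostat_row(line)
        if row is not None and row["asvc_t"] > asvc_limit:
            slow.append(row["device"])
    return slow


if __name__ == "__main__":
    sample = [
        "r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device",
        "71.8 258.8 618.7 4334.6 0.0 26.0 0.0 78.7 0 100 "
        "c6t600C0FF0000000000855613BE6F2D900d0",
    ]
    print(slow_devices(sample, asvc_limit=50.0))
```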

Thanks again everybody,

Rob

The original problem:

Rob McMahon wrote:
> I've got a machine here which has recently (over the last few weeks) 
> degenerated into being unusable at times.  It's a V890 running Solaris 
> 10, cyrus-imap (2.2.13) and squirrelmail.  The mail partitions are on a 
> 3510 FC, 500GB a piece, and RAID 5.  The filesystems are UFS, and the 
> problematic one is 95% full. When it becomes unusable, iostat shows the 
> asvc_t times hitting 1000, 2000 or more.  %b is pinned at 100% all the 
> time.  %w hits 60% on the one partition.  At quiet times I don't seem to 
> get better than:
>
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>    71.8  258.8  618.7 4334.6  0.0 26.0    0.0   78.7   0 100 c6t600C0FF0000000000855613BE6F2D900d0
>    35.6  129.6  322.3 2068.4  0.0  0.0    0.0    0.0   0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp1
>    36.2  129.2  296.5 2266.2  0.0  0.0    0.0    0.0   0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp3
>
> which is lower throughput than I'd expect.  Truss shows creates, renames 
> and fdsyncs (which cyrus-imap seems to like using a lot) taking seconds.
>
> sccli does show
>
> sccli> show redundancy-mode
>  Primary controller serial number: 8040592
>  Primary controller location: Lower
>  Redundancy mode: Active-Active
>  Redundancy status: Failed
>  Secondary controller serial number: 8009331
> sccli>
>
> and I have a call in about that with Sun, although they seem to be 
> arguing about maintenance levels as normal.
>
> Really, I'm a bit desperate out here, and I'd like to hear any 
> suggestions or pointers to things I might not have thought about.
>
> Any input gratefully received.
>
> Thanks,
>
> Rob
>
>   

-- 
E-Mail:	Rob.McMahon@warwick.ac.uk		PHONE:  +44 24 7652 3037
Rob McMahon, IT Services, Warwick University, Coventry, CV4 7AL, England
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers