SUMMARY: I/O resource controls?

From: joe fletcher <joe_fletcher_at_btconnect.com>
Date: Wed Nov 24 2010 - 09:25:51 EST
Hi,

It's been a while coming, so apologies for the delay. We've not been able to pin
down a definitive explanation, but all our testing confirms our suspicion that
SATA is the cause of the problems. We ran some tests with zone clones on SAS
disks and saw an immediate improvement with respect to the stalling behaviour.
Based on that we converted one of the DL380s to use external EMC SAN storage,
moved the zones from the SATA drives to the SAN, and now we can thrash several
zones to death simultaneously and nothing chokes. Peak throughput is down on a
per-zone basis, so individual jobs run slower; however, we can now run multiple
parallel jobs with no freezing up, so overall we have a win.
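
For anyone facing a similar move, the relocation itself was unremarkable; in
outline it looks something like the following (pool, device and zone names here
are illustrative, not our real ones):

    # new pool on the EMC LUN, then relocate each halted zone onto it
    zpool create sanpool c3t0d0              # device name illustrative
    zoneadm -z zoneA halt
    zoneadm -z zoneA move /sanpool/zones/zoneA
    zoneadm -z zoneA boot

zoneadm's move subcommand copies the zonepath across and updates the zone
configuration, so nothing inside the zone itself needs touching.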

Incidentally, I had some contact from a couple of people using x4500 boxes who
reported similar issues on SATA-based configs. As with our setup, things are
fine up to a point, beyond which there is a sharp drop in usability.



Cheers



Joe

====================== originally.. ===================================

Looking for some insights on a performance issue. Platform is an HP DL380 G6,
dual quad-core, 64GB RAM, 2x 146GB disks hardware-mirrored for the system disk,
plus 4x 1TB SATA drives as an additional logical drive, also hardware-mirrored.
Controller is a P400i with 512MB BBWC. The server houses 4 zones. The 2TB
volume forms the basis of a zpool, and each zone sits in a ZFS directory on
that pool. The zones will run a BI app (SAS) which is numerically and I/O
intensive. What we're seeing is that when one SAS job gets busy, the whole
system locks up while it's doing its disk work. For example, we have zones A
through D. A kicks off a job. Someone else, either in the global zone or in one
of the other child zones, runs anything (w, ls, date) and can wait up to 30s to
get output and a prompt back. Trying to run jobs in 2 zones simultaneously
causes run times to extend markedly.
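
For clarity, the layout is essentially this (device and zone names below are
illustrative):

    # the 2TB hardware-mirrored logical drive presented by the P400i
    zpool create dpool c0t1d0
    zfs create dpool/zones
    zfs create dpool/zones/zoneA             # likewise for B, C and D
    # each zone's zonepath points at its own directory/dataset on the pool
    zonecfg -z zoneA 'set zonepath=/dpool/zones/zoneA'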

The disks are pushing 200MB/s+ sustained once they get busy (peak observed at
a shade under 300MB/s). %busy is 100%, blocking is 0, and service times are
around 25ms. Drivers are latest and greatest. The read/write cache ratio on the
controller is 25%:75%. Overall CPU usage is <15%. Things get even worse if we
try to do some network transfers at the same time (e.g. scp). Machines are on
1000Base-T Ethernet.
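
For the record, the figures above come from the usual suspects, along the lines
of (intervals illustrative):

    iostat -xnz 10           # %b, queue lengths and asvc_t per device
    sar -d 10 6              # %busy, avque, avwait, avserv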

I've run some tools like the rather funky zilstat.ksh, which indicate that ZFS
itself isn't struggling.
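
That was run roughly as follows (interval/count illustrative); nothing alarming
showed up in the per-interval ZIL activity:

    ./zilstat.ksh 10 6       # 10-second samples of ZIL traffic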

I'm aware obviously that the arrangement I've built means there are common
disks, controllers and so on servicing all the zones.

What does seem unusual is the way everything blocks, even things in the global
zone which ought not to be causing any significant I/O contention.
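
For what it's worth, microstate accounting should show whether those stalled
commands are actually sleeping on I/O rather than short of CPU; something like:

    prstat -mL 10            # per-thread microstates; look at SLP/LAT vs USR/SYS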

Essentially it looks like we can thrash the disks via a single thread and get
nothing else done while it's doing it.
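
The effect is reproducible with something as crude as this (file sizes and
paths illustrative):

    # from the global zone: one fat sequential writer inside zone A...
    zlogin zoneA dd if=/dev/zero of=/tmp/bigfile bs=1024k count=32768
    # ...meanwhile trivial commands elsewhere take seconds to return
    ptime ls /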

I'm in the process of building a comparative system using HBAs/SAN instead of
internal RAID, and also comparing ZFS and Veritas, to see if we can isolate a
specific element as the problem. Will update with the results.

Anyone got any suggestions in the meantime?