SUMMARY (sort of): horrible ZFS performance on a pool of 2 LUNs, awesome with 1 LUN

From: Andrey Dmitriev <admitriev_at_mentora.biz>
Date: Tue Aug 05 2008 - 17:36:17 EDT
Well, after eliminating every possible hardware issue, the problem turned out to be software, specifically space utilization on the underlying ZFS pool. As soon as we deleted about 2TB worth of data, the file system started performing fine again.
 
We deleted the beast/customer1/db file system (we had it on tape, just like the others), and right away we were back to 150-300 MB/sec.
 
If anyone has a clue as to why this might have been an issue, or can point me at a relevant ZFS mailing list, I'd greatly appreciate it. For reference, here is how space usage on the pool looked (every dataset down to the pool's last ~130G of free space):
 
Filesystem                size   used  avail  capacity  Mounted on
beast                     130G    37K   130G        1%  /mnt/backup1
beast/customer1           130G    29K   130G        1%  /mnt/backup1/customer1
beast/customer1/bacula    222G    93G   130G       42%  /mnt/backup1/customer1/bacula
beast/customer1/db        2.0T   1.8T   130G       94%  /mnt/backup1/customer1/db
beast/customer1/fs        2.1T   1.9T   130G       94%  /mnt/backup1/customer1/filesystem
beast/customer5           130G    29K   130G        1%  /mnt/backup1/customer5
beast/customer5/bacula    221G    92G   130G       42%  /mnt/backup1/customer5/bacula
beast/customer5/db        130G    25K   130G        1%  /mnt/backup1/customer5/db
beast/customer5/fs        172G    42G   130G       25%  /mnt/backup1/customer5/filesystem
beast/bacula              130G    15M   130G        1%  /mnt/backup1/bacula
beast/bacula/spool        130G    34K   130G        1%  /mnt/backup1/bacula/spool
beast/customer6           130G    29K   130G        1%  /mnt/backup1/customer6
beast/customer6/bacula    210G    81G   130G       39%  /mnt/backup1/customer6/bacula
beast/customer6/db        3.7T   3.6T   130G       97%  /mnt/backup1/customer6/db
beast/customer6/fs        130G    25K   130G        1%  /mnt/backup1/customer6/filesystem
beast/customer2           133G   3.6G   130G        3%  /mnt/backup1/customer2
beast/customer2/bacula    1.5T   1.4T   130G       92%  /mnt/backup1/customer2/bacula
beast/customer2/db        194G    65G   130G       34%  /mnt/backup1/customer2/db
beast/customer2/fs        221G    92G   130G       42%  /mnt/backup1/customer2/filesystem
beast/customer4           130G    29K   130G        1%  /mnt/backup1/customer4
beast/customer4/bacula    1.3T   1.2T   130G       90%  /mnt/backup1/customer4/bacula
beast/customer4/db        1.6T   1.5T   130G       92%  /mnt/backup1/customer4/db
beast/customer4/fs        130G    25K   130G        1%  /mnt/backup1/customer4/filesystem
beast/customer3           130G    26K   130G        1%  /mnt/backup1/customer3
beast/customer3/bacula    2.8T   2.6T   130G       96%  /mnt/backup1/customer3/bacula
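For anyone who hits the same wall later: how close the pool is to full, and which datasets are eating the space, can be checked with the standard commands below. This is only a sketch against the pool named above, not output from our box:

    # Overall pool capacity -- ZFS write performance degrades badly as free space runs out
    zpool list beast

    # Per-dataset usage, sorted by space used (largest at the bottom),
    # to find deletion candidates like customer1/db above
    zfs list -r -o name,used,avail,refer,mountpoint -s used beast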

Original Post:
                                           capacity     operations    bandwidth
pool                                     used  avail   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
beast                                   14.1T   366G      0    155      0  3.91M
  c7t6000402002FC424F6CF5317A00000000d0  7.07T   183G      0     31      0  16.2K
  c7t6000402002FC424F6CF5318F00000000d0  7.07T   183G      0    124      0  3.90M 

I get pretty consistent results like this: I can only write to the pool at about 3 MB/sec right now, whereas I used to get about 300-400 MB/sec.
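(For reference, per-vdev numbers like the block above can be sampled with zpool iostat; the 30-second interval below is just an example, not necessarily what was used for that capture.)

    # Per-vdev operations and bandwidth for the pool, sampled every 30 seconds
    zpool iostat -v beast 30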

Each member is a RAID5 of nine 1TB disks; the members are not mirrored.

I have another group that I created on the same array (NexSAN SATABeast) using only 2 disks (a mirror). I am able to push that to 60 MB/sec, which is fine.
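A crude way to compare the two groups is a large sequential write into each mount point. The sketch below uses made-up file names and a hypothetical mount point for the mirror group, and it is not how the numbers above were measured:

    # Sequential write test into the slow pool; /dev/zero compresses to almost
    # nothing, so only use this if compression is off on the target dataset
    dd if=/dev/zero of=/mnt/backup1/ddtest.bin bs=1024k count=4096

    # Same test against the 2-disk mirror group (mount point is hypothetical)
    dd if=/dev/zero of=/mnt/mirrortest/ddtest.bin bs=1024k count=4096

    # Clean up afterwards
    rm /mnt/backup1/ddtest.bin /mnt/mirrortest/ddtest.bin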

We have tried rebooting switches, hard-setting all ports to 2Gb, eliminating controllers (each controller presents all LUNs), eliminating ports on the fibre card, and direct-attaching the machine to the array, yet I consistently get the same (crappy) results on one LUN and decent results on the other.

Interestingly, I see reads in the 30-40 MB/sec range per member all the time, but writes consistently suck.

We had network maintenance the day before, which was also rolled back.

Does anyone have _any_ clue on how to troubleshoot this further?
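For completeness, here is the kind of basic health check worth running before blaming the fabric. These are standard Solaris/ZFS commands, but this is only a sketch, not output we captured:

    # Pool and device health, plus any checksum/IO error counters
    zpool status -v beast

    # Per-device soft/hard/transport error counters from the driver
    iostat -En

    # Any faults the fault manager has already diagnosed
    fmadm faulty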


 
FOLLOW UP:
 
I do see a problem (this was captured with 30-second iostat intervals):

                 extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    6.3    0.0    3.3  0.0  0.0    0.0    0.6   0   0 c7t6000402002FC424F6CF5318F00000000d0s0
    0.0   25.1    0.0  805.3  0.0  0.0    0.0    1.4   0   2 c7t6000402002FC424F6CF5317A00000000d0s0


    1.4    6.9    3.4    3.3  0.0  0.0    0.0    0.5   0   0 c7t6000402002FC424F6CF5318F00000000d0s0
    1.4   26.1    3.4  822.5  0.0  0.0    0.0    1.4   0   2 c7t6000402002FC424F6CF5317A00000000d0s0

   73.0   11.6 4476.2  165.0  0.0  2.9    0.0   34.1   0  17 c7t6000402002FC424F6CF5318F00000000d0s0
   76.4   19.3 4727.3  487.2  0.0  2.6    0.0   27.1   0  17 c7t6000402002FC424F6CF5317A00000000d0s0

However, I do not understand why ZFS is spreading writes across the LUNs in such a lopsided manner; e.g., why the write operation counts to the two LUNs are of the same order, yet the number of KB written differs so substantially.
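If you want to see that split at the block-I/O layer rather than through zpool iostat, a DTrace io-provider one-liner along these lines will sum bytes written per device; the 30-second tick is my own choice (run as root):

    # Sum bytes written per device, printed and reset every 30 seconds
    dtrace -n '
    io:::start
    /!(args[0]->b_flags & B_READ)/
    {
            @bytes[args[1]->dev_statname] = sum(args[0]->b_bcount);
    }
    tick-30s
    {
            printa(@bytes);
            trunc(@bytes);
    }'

Note the device names will come out as the kernel's statistics names (e.g. sdN) rather than the c7t... device paths.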

Also, we did try swapping cables.
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers