Partial Summary: Climbing CKSUM errors after zpool online

From: Ivan Fetch <ifetch_at_du.edu>
Date: Thu Jan 15 2009 - 10:43:50 EST
Hello Sun Managers,


    We received additional info from Sun, which I'd like to pass along.


    When you offline a device (zpool offline poolName device), ZFS still
"tracks" that device. In our case, the device disappeared and then reappeared
(we took the array down for maintenance), causing ZFS to report CKSUM errors
once the device was onlined again in ZFS (zpool online).

    The recommended action for future work (where the array will be
taken offline) is to detach the legs of our ZFS mirrors, then offline the
device in ZFS.  I'm not sure whether you can offline a device that has
already been detached from the mirror, or whether there would be any point
in doing so.  Detach with something like:

zpool status # Record which devices each mirror is made of, for later
zpool detach poolName device
# Take the array offline; once it's back online:
zpool attach poolName existingDevice device # This is why you need the zpool status output

    We fixed the climbing CKSUM errors by detaching and then re-attaching
those legs of our mirrors.
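
    The detach / re-attach steps can be wrapped in a small script. This is
only a minimal sketch: the function names and the DRY_RUN switch are made up
for illustration, and the pool and device names are placeholders. By default
it only prints the zpool commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not part of ZFS.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Before maintenance: remove one leg from its mirror.
detach_leg() {      # usage: detach_leg pool device
    run zpool detach "$1" "$2"
}

# After maintenance: attach the leg to the device that stayed in the pool.
reattach_leg() {    # usage: reattach_leg pool existingDevice device
    run zpool attach "$1" "$2" "$3"
}
```

    For example, "detach_leg poolName device" before the array goes down, and
"reattach_leg poolName existingDevice device" once the LUNs are visible again.
Set DRY_RUN=0 to actually run the commands.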

    I'd like to get more definitive info on when zpool offline / online is
appropriate, and why the CKSUM errors kept climbing after we onlined the
devices. This is something we'll probably experiment with further, and
keep asking Sun about.
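
    For that kind of experimenting, it helps to see quickly which devices
have a non-zero CKSUM counter. The following is a rough sketch, not a tested
tool: the function name is made up, and it assumes the usual "NAME STATE READ
WRITE CKSUM" column layout of the zpool status config table.

```shell
#!/bin/sh
# Sketch only: reads "zpool status" text on stdin and prints "name count"
# for every config row whose CKSUM column (5th field) is not "0".
cksum_errors() {
    awk '
        # The header row marks the start of a config table.
        $1 == "NAME" && $5 == "CKSUM"     { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0              { in_config = 0 }
        # Pool, mirror and device rows all carry the three counters.
        in_config && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

    Run it as "zpool status poolName | cksum_errors" at intervals and compare
the output to see whether the counters are still climbing.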


Thanks,

Ivan.



  On Wed, 7 Jan 2009, Ivan Fetch wrote:

> Hello Sun Managers,
>
>
>   We've been working on a weird ZFS issue, and not getting very far with 
> Sun.
>
>   We needed to relocate a storage array, so "zpool offlined" the second half 
> of mirrors on multiple machines.  Once the array was back online, and we 
> verified the LUNs were seen in the OS, we did "zpool online" for each of the 
> previously offlined LUNs.
>
>   The first LUN took about 35 minutes to resilver, and the mirror was fine; 
> no errors in "zpool status."  Subsequent mirrors reported resilver completed 
> in a matter of seconds, and we got quite a few CKSUM errors (in one case, a 
> few thousand in 12 hours), but no read or write errors.
>
>   We're experiencing this identical issue on three boxes so far; a couple of
> them are:
>
> 5.10 Generic_127127-11 sun4v sparc SUNW,SPARC-Enterprise-T2000
>
> 5.10 Generic_127111-06 sun4v sparc SUNW,Sun-Fire-T200
>
>
>   Sun's answer is to "Just upgrade the kernel, a lot of ZFS bugs have been 
> fixed, but only upgrade to 137137-06 as later kernels will introduce other 
> ZFS issues."
>
>   We ended up detaching, then re-attaching the second leg of the mirrors, 
> and all of them resilvered and do not have CKSUM errors. We will probably end 
> up doing this on our remaining ZFS boxes but would like to match our symptoms 
> with a particular bug / resolution / patch, and have more complete answers.
>
>   I've found a few similar cases on the ZFS Discuss list, but no resolutions
> there.
>
>
>   Has anyone else run into this issue?
>
>
> Thanks,
>
> Ivan.
>
>
> ---
> Ivan Fetch
> University of Denver
> Computer Operations, University Technology Services
> 303-871-3092
Received on Thu Jan 15 10:46:55 2009
