SUMMARY: Replacing A1000 batteries...online

From: Denes, Daniel (BG-IT 51) <daniel.denes_at_bankgesellschaft.de>
Date: Wed Sep 22 2004 - 10:31:50 EDT
I asked for best practices and the supported method for replacing A1000 batteries
(with a focus on whether or not it is wise to do it online). I received a number of
very helpful replies, which I summarize below. First, apologies for the
very late summary; unfortunately our vendor took more than 2 months to provide us
with the 4 necessary batteries 8^(

===== Responses (email addresses removed for spam prevention) =====
Albert Musin wrote:
Hi,

I did it online as described in the RM instructions (Recovery Guru).
It was OK.
-------------------------------------------------
ponder wrote:
While Sun still claims it is impossible, we have changed about 20 batteries
in half a day without turning the cabinets off. Don't forget to reset the
battery age (can't remember the command at the moment, check the osa docs). And
of course this procedure is completely unsupported by Sun...
-------------------------------------------------
Sudesh Shetty told me:
Hi Daniel,
You can replace the battery online, as I have done before on an
A1000 connected to a production server.
-------------------------------------------------
Valuable insight from Sun came from Tony Walsh:
It has never been a Sun engineering supported procedure to "hot swap" batteries
in an A1000 array (rumoured to be litigation related..). The Sun statement of
support states that a battery exchange should be made with all power removed
from the array. I have found that this policy can often lead to much more
drastic failures if the array has other issues that are not apparent (or not
checked) before power is removed. It also gives disk drives a chance to cool
down (even for a short period) and makes it much more likely that a drive may
not spin up when power is reapplied. Not to be deterred by any of this, the
approach I usually use is to give the customer the choice, advise that data
loss is a possibility, and then use the following process:
- Determine if the host can be shut down to the OK> prompt. If it can, this
will stop all IO to the array and prevent data loss. At this point I just
"warm swap" the battery and make sure it begins charging.
- If the host must remain active, the next level of data safety is to unmount
all the file systems on the host(s), wait 10-15 minutes to ensure the cache is
flushed, then exchange the battery and make sure it starts charging.
- If all filesystems must remain mounted, I then turn off all "write through"
caching using the RM6 command line or GUI, wait at least 10-15 minutes to
ensure the cache is flushed, and then change batteries.

I have never yet had a data loss event or any other electrical misadventure,
but it must be said again, the risk is all in your hands as the batteries are
not designed as a "hot swap" item.

In your case, as the batteries are already well beyond their "use-by" date,
the cache "write through" has already been turned off, so I would just pull
each battery in turn and replace it with a new one (replace the first battery
before removing the next one). The only caution I would give is to make the
change during a low-activity period, just in case there is an electrical
misadventure.

One other warning: make sure the battery has been recharged within the last 6
months, or the recharge operation in the array may take too long and then
cause the "new" battery to be marked bad again.
-------------------------------------------------
Gene Beaird has a different practice:
I have been working with A1000s since about 2000, and have never heard of
doing the battery change online.  We have always scheduled an outage,
powered the system off and swapped the batteries.  Should have maybe tried
it on one of our standby systems when I had the chance, but never did.  I
guess the potential power surge when you pull the battery out can fry some
of the electronics on the system.
-------------------------------------------------
Tim Chapman pointed out that:
For what it is worth, a semi-recent summary to the list reiterates the
requirement, "turn off, swap battery, power on, issue 'battery age
reset' command" ...

http://www.sunmanagers.org/pipermail/summaries/2002-July/002014.html

maybe others will suggest otherwise ?
-------------------------------------------------
Tod Sandman sent me this very helpful recipe:
I always replace live by first turning off caching.  I too heard
rumors etc. and so recently verified my procedure with our Sun
onsite guy.  For what it's worth, he says yep, that's the way to
do it.  Let me know if you hear otherwise.

Here are my typical steps:

## Define these 2:
device=c1t5d0s0; luns=0,1,2,3,4

## To check battery:
  raidutil -c ${device} -B
## To replace battery:
  ## Turn off caching if it is still enabled:
    raidutil -c ${device} -w off ${luns}
  ## Replace the battery.
  ## Reset battery age:
    raidutil -c ${device} -R
  ## Re-enable cache after replacing battery:
    raidutil -c ${device} -w on ${luns}
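
The recipe above can be wrapped in a small script so that the commands are
printed for review before anything touches the array. This is only a sketch:
the raidutil invocations are copied verbatim from the recipe and have not been
checked against anything else, while DRY_RUN, run() and the function names are
made up for illustration and are not part of RM6.

```shell
#!/bin/sh
# Sketch only: wraps the raidutil steps from the recipe above.
# DRY_RUN, run() and the function names are illustrative, not part of RM6.

device=${device:-c1t5d0s0}   # controller device, as in the recipe
luns=${luns:-0,1,2,3,4}      # LUNs whose cache setting is changed
DRY_RUN=${DRY_RUN:-1}        # 1 = only print the commands (default)

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

battery_status() { run raidutil -c "$device" -B; }
cache_off()      { run raidutil -c "$device" -w off "$luns"; }
reset_age()      { run raidutil -c "$device" -R; }
cache_on()       { run raidutil -c "$device" -w on "$luns"; }

# Steps before the physical swap.
pre_swap()  { battery_status && cache_off; }
# Steps after the new battery is in.
post_swap() { reset_age && cache_on && battery_status; }
```

With the block sourced into a shell, pre_swap prints the two commands to run
before pulling the battery and post_swap the three to run afterwards; setting
DRY_RUN=0 executes them instead of only printing them.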
-------------------------------------------------
Dirk Boenning summarized like this:
Hello,

Sun: You have to shut down the whole system.

Experience: Be sure to disable the cache. Change the battery and enable the
cache again. Done a couple of times without any problem.
-------------------------------------------------
Gene Siepka knew:
You can definitely do it online. Our local Sun Field Engineer says it's like an
urban legend that you must power off...
I've done it on an A1000 just 2 months ago. No problem.
-------------------------------------------------
My original mail:
Hello List,

we have a couple of A1000 arrays with 8 and 12 disks. All of their batteries
exceeded their lifetime long ago and are to be considered dysfunctional.
As we hope to gain performance from a working cache, we want to replace them;
however, as they are attached to HA systems, we would clearly prefer to do that
"online", i.e. without shutting them down. I recall it used to be a supported
procedure to replace the batteries like that, but I heard rumors that Sun has
backed off from that. Does anyone have insights, experiences, rumours (s)he
could share on that topic?
-------------------------------------------------
What I did:
- used rm6 (shame on me for using a GUI) to turn the cache off (of course it
had been disabled for years already; the oldest of the replaced batteries dated
back to 1998)
- got myself a console window to see kernel output, and also had a look at
messages
- pulled the old battery online (but, as recommended, at a time when it would
hurt the fewest users if something bad happened). Pulling it out proved to be
somewhat difficult for two of them because the adhesive labels on top peeled
and rolled up as I dragged the canister out; I had to use some force to get
them out the last few centimeters.
- slid the new one in
- watched console and messages; eventually 3 messages per battery showed up,
looking like this:
raid: [ID 702911 user.error]
Sense=700006000000009800000000A0000000000000000000000000000000000000000000800
000082C000000000000000000000000000B053154383032303130303720202020202003010300
00010000000000000000000000000000000000000000000000000005000000000000000000000
0000000000000000000000000000000000000000000000009A412A93039313730342F30393339
353400000000000000
No need to panic, though; it's just this one message.
- reset the battery age, as suggested by Tod, with raidutil -c ${device} -R
- turned caching on again (with rm6 again, which told me right away that the
cache was temporarily disabled while the batteries were charging).
- checked cache status 20 min later, it was on and working
- tested performance with iozone; the improvement was about a factor of 2.5
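
To confirm that only the expected raid messages turned up after a swap, they
can be counted in the syslog file. A minimal sketch, assuming the messages
carry the same ID string as the sample above and land in /var/adm/messages
(the Solaris default); count_raid_sense is a made-up helper name.

```shell
# Count RAID sense-data messages in a syslog file.
# count_raid_sense is an illustrative name; the ID string is taken
# from the sample message above and may differ on other setups.
count_raid_sense() {
    logfile=${1:-/var/adm/messages}
    # -F: match the string literally, -c: print the number of matching lines
    grep -cF 'raid: [ID 702911 user.error]' "$logfile"
}
```

Running count_raid_sense before and after each swap and comparing the two
numbers should show an increase of about 3 per battery, going by what I saw.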

Thanks everyone for the insights.

Daniel
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Wed Sep 22 10:31:45 2004
