SUMMARY: Poor Man's HA Help Needed

From: Michael Cunningham (archive@securityinsight.com)
Date: Wed Jun 07 2000 - 09:25:28 CDT


Folks,

I got a great response to this question: plenty of good info. Some people
sent scripts, some provided step-by-step instructions, etc. There was too
much good material to summarize, so I have included the good bits from all
the emails below.

Thanks go out to:

bergman@panix.com
"Seth Rothenberg" <SROTHENB@montefiore.org>
"Rasana Atreya" <rasana_atreya@hotmail.com>
mike.marcell@cntcorp.com
carl.staroscik@barclayscapital.com
Birger Wathne <Birger.Wathne@getronics.no>
mark@sowerby.org
"Tom Jones" <tjones@statesman.com>
mgherman@ozemail.com.au
"Bill Armand" <barmand@flash.net>
John DiMarco <jdd@cs.toronto.edu>

        -----Original Message-----
        From: Michael Cunningham [SMTP:archive@securityinsight.com]
        Sent: Saturday, June 03, 2000 7:58 AM
        To: sun-managers@sunmanagers.ececs.uc.edu
        Subject: Poor Man's HA help needed

        Folks,

        I am currently building an HA system using two Ultra 2s
        with three A1000s, dual attached. I purchased Veritas Volume
        Manager and Veritas File System so I could have everything fully
        journaled and avoid an fsck in case I have to fail over. Now I am
        trying to figure out how to get the disk groups to fail back and
        forth between the systems. My boss wouldn't fork out the bucks for
        the Veritas HA product, so I am writing custom Perl scripts to
        handle the monitoring and failover.

        How do I make the disk groups fail back and forth? Do I set up all
        the drives on one system, then add the other system and pull in
        the disk configuration somehow? Does anyone know of any web sites
        that describe how to do something like this? Anyone have any
        sample scripts or basic how-tos? Anything would be helpful.

        Basically I am looking for: how do I set up Volume Manager on both
        systems to allow me to move the disk groups back and forth between
        the systems?

        How do I actually fail the disks back and forth? I have read a
        bit about import and deport on the Veritas site.

        Any trouble spots I should watch out for? I know I have to be
        very careful about who controls what, and when. Is an fsck still
        possible even using VxFS? How can I avoid it at all costs?

        I am looking at about 200 GB of data mirrored across all three
        arrays, so an fsck would be a disaster :(

        I will summarize, of course.

        Thanks.. Mike

-----------------------------------------------------------------------------

I did the exact same thing with two E450s and two A1000s for a customer last
year. Basically, you configure everything on one server, with your boot disk
in rootdg and the A1000 drives in another diskgroup (we called it A1000). I
have attached scripts that were manually executed to unmount the volumes on
one server and then mount them on the other. Keep in mind that the
SCSI_INITIATOR_ID must be different between the two servers.

Feel free to email me with any questions you may have. I would recommend
that you check SunSolve for the SCSI_INITIATOR_ID change, as well as the Sun
Cluster docs on docs.sun.com. I've also picked up info from deja.com.
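
For reference, the usual way to make that change (from memory; verify the
exact procedure on SunSolve, since the global setting affects every SCSI bus
on that host) is from the OpenBoot prompt on one of the two servers, leaving
the other at the default ID of 7:

ok setenv scsi-initiator-id 6
ok reset-all

If only one controller is shared, SunSolve also describes changing the ID
for just that controller through an nvramrc script instead of globally.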

-----------------------------------------------------------------------------

You can do...

vxdg import -C disk_group
vxrecover -sb -g disk_group
fsck -F vxfs /dev/vx/rdsk/disk_group/volume
mount /volume

To bring the volumes online to the new host.

It would be best to unmount and deport the diskgroup from the other host;
however, in a failover situation that may not be possible.

In any scripts that you write, ENSURE that the disks are not in use by the
other host, or it is going to get pretty ugly.
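
One way to enforce that in a script (just a sketch; "othernode" and "datadg"
are placeholder names, and a ping test alone will not catch every case) is
to refuse a forced import while the peer host still answers:

#!/bin/sh
# Sketch: do not force-import the diskgroup while the peer still responds.
PEER=othernode
DG=datadg

if ping $PEER 5 >/dev/null 2>&1; then
        echo "$PEER still answers - not importing $DG here" >&2
        exit 1
fi

vxdg import -C $DG
vxrecover -sb -g $DG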

Given that VxFS is a journalling filesystem, replaying the logs should take
significantly less time than a full fsck.

The only other thing that comes to mind is that you would not want to mount
these filesystems automatically on either of the hosts on boot, as that
could again lead to awkward situations.
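
For example, a vfstab entry for one of the shared volumes might look like
this on both hosts (the device and mount point names are made up), with "no"
in the mount-at-boot field:

/dev/vx/dsk/datadg/vol01  /dev/vx/rdsk/datadg/vol01  /data01  vxfs  2  no  -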

As a note, I have never tried what you are looking to do, but the import /
deport works fine, as I am using it in another situation.

Good luck, and I'd strongly suggest that you make sure you have enough time
to test thoroughly before the system goes into production.

----------------------------------------------------------------------------

Have a look at www.high-availability.com; their product RSF-1 may well
meet your needs and be affordable. It is very easy to administer with
rc-style scripts, and it has a good framework for the nodes to monitor each
other's status, not just via the network (you wouldn't want a network glitch
to trigger a failover and a split-brain situation).

----------------------------------------------------------------------------

What you have to do, from memory, so there may be holes (a shell sketch of
the whole sequence follows the list):

On host surrendering service:
- unexport file systems
- unmount file systems
  - Kill processes using the file systems if necessary
  - May need to stop/start lockd and statd to get rid of file locks before
    the file systems will unmount
- deport disk group

On host taking over service:
- Make sure the other host has given up file systems cleanly or is down
- import disk group
- fsck file systems (since these are logging file systems, the fsck should
  normally consist of rolling the transactions in the log forward or
  backward; you will almost never see a full fsck)
- mount file systems
- export file systems
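
A minimal Bourne shell sketch of that whole sequence (as mentioned above;
the diskgroup and mount point names are placeholders, the volume names are
assumed to match the mount point basenames, and the lockd/statd handling is
left as a comment):

#!/bin/sh
# Sketch only: release the service on one host, take it over on the other.
DG=datadg
FSLIST="/data01 /data02"

release()
{
        unshareall                      # stop exporting the file systems
        for fs in $FSLIST
        do
                fuser -ck $fs           # kill anything still using it
                # may need to stop/start lockd and statd here (see above)
                umount $fs
        done
        vxvol -g $DG stopall
        vxdg deport $DG
}

takeover()
{
        vxdg import -C $DG
        vxrecover -sb -g $DG
        for fs in $FSLIST
        do
                # normally just a log replay on VxFS, not a full fsck
                fsck -F vxfs /dev/vx/rdsk/$DG/`basename $fs`
                mount $fs
        done
        shareall                        # re-export the file systems
}

case "$1" in
release)  release ;;
takeover) takeover ;;
*)        echo "usage: $0 release|takeover" >&2
          exit 1 ;;
esac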

----------------------------------------------------------------------------

Assuming that you have cabled the disks to both boxes, you can just do what
VCS does: vxdg deport the disk group on one server and vxdg import it on the
other. If the live server crashes, vxdg import on the backup server and then
fsck all filesystems. You'll have to be very careful that you don't end up
with the disk group live on both servers, as you'll get corruption and
probably a panic.

---------------------------------------------------------------------------

HA is hard to get right! Your boss is naive if he expects you to write Perl
scripts to auto-failover. However, manual failover is doable by unmounting
from one machine and then mounting on the other (you can't mount on both
without risking corrupting the filesystem). This can be made transparent to
NFS clients if you mount from a virtual IP address and move that IP address
to the other machine.
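
On Solaris that virtual IP is just a logical interface, so the move is
roughly this (the interface name and address are made up for the example):

# On the machine giving up the service:
ifconfig hme0:1 down
ifconfig hme0:1 unplumb

# On the machine taking over:
ifconfig hme0:1 plumb
ifconfig hme0:1 192.168.10.10 netmask 255.255.255.0 up

NFS clients with hard mounts against that address should then simply
retransmit until the new server answers on it.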

---------------------------------------------------------------------------

Yes, this can be done. I'm going to implement VCS on my servers soon, but
that will replace our current setup, which is all a bunch of csh scripts.
It's actually not bad, as it's easy to debug, but I'm not a big fan of csh.
Anyway, I can send them to you tomorrow.

Basically, you'll need to license both servers for VxFS and VxVM, as if
they were separate servers. Then make sure that both systems can see the
storage array, via format. Create your volumes and mount points on systemA,
and create the same mount points on systemB. If you put them in the vfstab,
make sure that they're marked as NO for mount at boot on both.

As far as failing over goes: in your scripts, you basically want to import
the diskgroup on sysA, then mount each of the filesystems, start any apps
and so forth. You'll of course have to have error trapping and verify that
all mounts are present and so on. For the failover, you basically reverse
what you just did: stop all applications that are running from the mounts,
verify that nothing is present or running on these filesystems
(fuser -ck FILESYSTEM), unmount the filesystems, then deport the disk group.
Once this is done, the other server can run the same startup script to
import the disk groups, and so on. Hopefully, you're familiar with Veritas
administration, as you'll need to know all of the command lines, but that's
pretty easy.
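
As an example of that error trapping, here is a small sh fragment (the mount
point list is just a placeholder) that refuses to start the applications
until every filesystem is actually mounted:

#!/bin/sh
# Sketch: check each expected mount point against the output of mount
# before starting anything that depends on it.
FSLIST="/data01 /data02 /apps"

for fs in $FSLIST
do
        mount | grep "^$fs on " >/dev/null || {
                echo "$fs is not mounted - aborting application startup" >&2
                exit 1
        }
done

echo "All filesystems mounted, starting applications..."
# start the applications here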

Obviously, both systems need to know about each other, and have ways of
verifying which state they're in, who has control of the disks, all disks
actually get unmounted/mounted, and so on.

Since our scripts are not that robust (they were a legacy long before anyone
in my group came along), we have 24x7 operators who are available to
manually issue the failover commands. But automating it isn't really that
much more difficult. Once you can successfully flop back and forth with your
scripts, it's just a matter of writing another script to monitor the systems
and, based on your conditions, run the failover script. This is really all
VCS does. It's really just a pretty script to do the monitoring, although it
is very cool and has lots of bells and whistles. VCS uses ethernet
heartbeats to test for system uptime. If you have multiple interfaces (or a
quad card), you can do this very easily too. Connect an ethernet cable from
one system to the next, give the interfaces IP addresses/hostnames (hb1,
hb2, etc.), and do a ping from one system to the other on a set schedule,
verifying that the ping comes back. That's really all the heartbeats do. You
can just as easily use your public interfaces as the heartbeats. After that,
you have to set your conditions on when to fail over, how critical each
failure is, and so on.
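
A very rough sh sketch of that kind of monitor (the hostnames, threshold,
and failover script path are all made up for the example):

#!/bin/sh
# Sketch: ping the peer's heartbeat addresses on a schedule; after several
# consecutive misses on every interface, run a (placeholder) failover script.
PEERS="hb1 hb2"         # peer hostnames on the cross-connected interfaces
LIMIT=5                 # consecutive misses before declaring the peer dead
MISSES=0

while true
do
        ALIVE=no
        for h in $PEERS
        do
                if ping $h 5 >/dev/null 2>&1; then
                        ALIVE=yes
                fi
        done

        if [ "$ALIVE" = yes ]; then
                MISSES=0
        else
                MISSES=`expr $MISSES + 1`
        fi

        if [ $MISSES -ge $LIMIT ]; then
                echo "peer unreachable $LIMIT times in a row - failing over" >&2
                /etc/scripts/takeover.sh        # placeholder failover script
                exit 0
        fi

        sleep 10
done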

--------------------------------------------------------------------------

I had a similar situation at my previous company: two E450s with a
dual-attached A5000. The only thing was that I did not have to do hot
failovers. However, my ideas might be of help:

#!/bin/sh
#
########
#
# Script to either:
# - deport the yantra diskgroup from machine yantra03 (production), so it
#   may be imported on machine yantra02 (backup), or
# - deport the yantra diskgroup from machine yantra02 (backup), so it may
#   be imported on machine yantra03 (production).
#
# Rootdg cannot be moved from one system to another system; so to use two
# hosts with one SSA box, there must be other diskgroups involved.

while true
do
        echo ''
        echo 'To deport diskgroup yantra (make unavailable) type: d'
        echo 'To import diskgroup yantra (make available) type: i\n'
        echo 'Please make selection:'

        if read SELECTION
        then
                case $SELECTION in
                        d) echo '\nPreparing to deport diskgroup yantra.'
                           break ;;
                        i) echo '\nPreparing to import diskgroup yantra.'
                           break ;;
                        *) echo '\nInvalid response! Script being aborted.'
                           exit 0 ;;
                esac
        fi
done

while true
do
        echo ''
        echo 'WARNING if deporting!!!\n'
        echo 'This script will make the database unavailable on this machine.'
        echo 'Shut down the database and Yantra BEFORE running this script.'
        echo 'Remember to run the corresponding script on the alternate machine.\n'
        echo 'Continue? [y,n]: '

        if read RESPONSE
        then
                case $RESPONSE in
                        n|N) echo 'Diskgroup yantra unchanged.\n'
                             exit 0 ;;
                        y|Y) break ;;
                          *) echo 'Invalid response! Script being aborted.\n'
                             exit 0 ;;
                esac
        fi
done

if test "$SELECTION" = d
then
        # Prepare to fail over to the backup machine yantra02
        umount /db/data01
        umount /db/data02
        umount /db/data03
        umount /db/data04
        umount /app/yantra
        umount /oracle
        umount /db/data05
        umount /backup
        umount /db/archive
        umount /db/export
        umount /db/data06
        umount /db/data07
        umount /db/data08
        umount /db/data09

        vxvol -g yantra -o verbose stopall
        vxdg deport yantra

        cp /etc/vfstab /etc/vfstab.save
        cp /etc/vfstab.nodb /etc/vfstab

        echo 'Remember to remove VxVM processes from root crontab.\n'
else
        # Fail back to the production machine yantra03

        vxdg import yantra
        vxvol -g yantra -o verbose startall

        mount /db/data01
        mount /db/data02
        mount /db/data03
        mount /db/data04
        mount /app/yantra
        mount /oracle
        mount /db/data05
        mount /backup
        mount /db/archive
        mount /db/export
        mount /db/data06
        mount /db/data07
        mount /db/data08
        mount /db/data09

        cp /etc/vfstab /etc/vfstab.save
        cp /etc/vfstab.db /etc/vfstab

        echo 'Remember to setup VxVM processes in root crontab.\n'
        echo 'See /etc/scripts/cron-schedule-yantra03'
fi

exit 0
--------

Of course, the problem with this script is that everything is hard-coded,
but it should give you the general idea.
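
One way to make it less hard-coded (again just a sketch, with made-up
defaults) is to take the diskgroup and the action as arguments and drive the
mounts from a single list:

#!/bin/sh
# Sketch: the same idea as the script above, parameterized.
DG=${1:-yantra}                                 # diskgroup name
ACTION=$2                                       # d (deport) or i (import)
FSLIST="/db/data01 /db/data02 /app/yantra /oracle"   # trim or extend as needed

case "$ACTION" in
d)      for fs in $FSLIST; do umount $fs; done
        vxvol -g $DG -o verbose stopall
        vxdg deport $DG
        ;;
i)      vxdg import $DG
        vxvol -g $DG -o verbose startall
        for fs in $FSLIST; do mount $fs; done
        ;;
*)      echo "usage: `basename $0` diskgroup d|i" >&2
        exit 1
        ;;
esac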

---------------------------------------------------------------------

I don't have any experience with Veritas filesystem or volume management
(we use SDS), but I do have experience with Veritas FirstWatch HA.
The capabilities it provides require a lot of tuning... enough that
you may break even by writing them yourself. The hard part is
reliably recognizing when the remote system has died, etc.

You might look for information in the Linux arena on a package
called "HeartBeat". It should be portable enough to run on Solaris,
and it should provide the most crucial part of the HA package for you.

---------------------------------------------------------------------

On a real Sun cluster (2.2, a very, very, very simplified picture), all the
disk configuration information is stored in a private region (private to the
cluster software only, not used for data) on the shared disk hardware. There
are references in the Veritas config. file on each node that point to that
private region. In the simplest case, a disk group is only mounted on one
machine at a time. When the current master node fails (and the disk group is no
longer mounted), another node detects the failure, mounts the private region
specified, then completes the disk group import based on the data it finds in
the private region.

=> Any trouble spots I should watch out for? I know I have to be
=> very careful about who controls what, and when. Is an fsck still
=> possible even using VxFS? How can I avoid it at all costs?

Let's see....
        not detecting a failure

        too sensitive a detection (ping-pong from one server to another)

        split-brain problems (both (N) servers think they are the master)

        corruption of shared configuration data
        
        contention for "master" status at startup (particularly in the case
        when multiple servers start up simultaneously--such as after a power
        failure)

        problems with starting applications after a failover, and preservation
        of the state of application data (different from the accurate,
        uncorrupted storage of the data itself... for example, a cluster may
        take 30 seconds to fail over, but the Oracle app may take two hours
        to roll back the logs that were saved so it's in the correct state)

There are a number of white papers and technical docs on docs.sun.com or
sunsolve.sun.com describing the design and administration of a cluster. While
the references are to Sun's solution, you can gain a lot from their analysis of
the problems.


