SUMMARY: TSM and Sun Cluster, or how to create a resource that is a script in Sun Cluster

From: Markus Mayer <>
Date: Tue Sep 02 2008 - 10:42:43 EDT
In the end the only reply I got was from our Sun partner, Martin Preßlaber,
and thankfully, through several further suggestions of his, we found an answer.

To get a script into the cluster framework, specifically in our case one that
starts and stops TSM's dsm scheduler, several steps were needed.  The most
critical for me was to stop following the TSM manual, which insists that all
scripts for starting and stopping the TSM scheduler, plus all configuration
files, *must* be on shared storage.  This simply doesn't work.

The dsm.opt file for each TSM node (note that a TSM node is different from,
and *not*, a cluster node!) can and generally should be on shared storage,
mainly for consistency.  The scripts for starting, stopping and probing the
TSM services, however, need to be local and present on every node at all
times.  This availability of the scripts is what the cluster framework needs
in order to add the resource into the cluster.  If the script wasn't available
on all nodes when I tried to create the resource, cluster spat the dummy...
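
As a quick sanity check before creating the resource, something like the
following can be run on each node.  This is only a sketch; the name
/etc/init.d/tsm-webdata is a hypothetical stand-in for your own
start/stop/probe script:

```shell
#!/bin/sh
# Verify the start/stop/probe script is present and executable on this
# node before attempting "clrs create".  SCRIPT is a hypothetical name;
# substitute your own.
SCRIPT=${SCRIPT:-/etc/init.d/tsm-webdata}
if [ -x "$SCRIPT" ]; then
        echo "ok: $SCRIPT is present and executable"
else
        echo "missing: $SCRIPT - clrs create would fail from this node"
fi
```

Run it on every cluster node (e.g. over ssh); only when all nodes report ok
will the resource creation succeed.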

After setting up the scripts and manually testing the TSM client to make sure
the configuration is correct on all nodes, it is possible to add a new
resource to the cluster of type SUNW.gds (a generic data service).  To add the
scripts as a GDS resource into the cluster, the following command does the
job:
# clrs create -g www-rg -t SUNW.gds \
    -p Start_command="/etc/init.d/ /zones/webdata/tsm/dsm.opt start" \
    -p Probe_command="/etc/init.d/ webdata probe" \
    -p Stop_command="/etc/init.d/ webdata stop" \
    -p Network_aware=false webdata-backup-rs

So in this example, the script /etc/init.d/ is on
local storage on all nodes and is identical across all nodes.  The script is
below.  The file /zones/webdata/tsm/dsm.opt is on shared storage and switches
between nodes in the event of a failover.  When the resource group starts on a
different node, the script is run and the resource comes online.  Curiously,
the dsmcad daemon process doesn't need to be killed in the event of a
failover; the cluster framework seems to take care of this, killing the
process and allowing a clean failover.  Also, making the resource not network
aware removed the need for a logical hostname for the resource group.
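
Once the resource has been created, the failover behaviour can be confirmed by
switching the resource group by hand and watching the resource state.  A
sketch, assuming hypothetical node names node1 and node2 (substitute your
own):

# clrs status webdata-backup-rs
# clrg switch -n node2 www-rg
# clrs status webdata-backup-rs

The resource should go offline on the old node and come online on the new one,
with dsmcad running only where the resource group is online.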

The script to start, stop, and probe the dsm client is below.  It could
definitely be done better, however it works.  From what I've noticed, it may
also be possible to start and stop the scheduler process, dsmc, directly with
the script.  I haven't tried this, however I'm sure it would work.  Note that
I include this script for informational purposes only; I don't promise that
it will work for you ;-)


#!/bin/sh
# Generally, we should start up with something like this:
# /opt/tivoli/tsm/client/ba/bin/dsmcad -optfile=/path/to/dsm.opt

# set the necessary environment variables so that TSM doesn't vomit
export LC_CTYPE
export LANG
export LC_LANG
export LC_ALL

# work out which argument is the command and which the config file
# (start is called with the dsm.opt path first; stop and probe are not)
case "$1" in
/*)
        DSM_CONFIG="$1"
        COMMAND="$2"
        ;;
*)
        COMMAND="$2"
        ;;
esac

# now check what we want to do.
case "$COMMAND" in
start)
        # echo "starting"
        # There has to be a better way to do this test.......
        if test ! -f "$DSM_CONFIG" ; then
                echo "Config file $DSM_CONFIG does not exist, exiting."
                exit 1
        fi
        export DSM_CONFIG
        # Check if there is already a dsmcad process running; if so,
        # ignore the start command
        PS=`ps -ef | grep -v grep | grep -v vi | grep -v probe | \
                grep -v zoneadmd | grep -c "$DSM_CONFIG"`
        if test "$PS" -eq 1 ; then
                echo "dsmcad is already started for $DSM_CONFIG, will not start another."
                ps -ef | grep -v grep | grep -v vi | grep -v probe | \
                        grep -v zoneadmd | grep "$DSM_CONFIG"
                exit 0
        elif test "$PS" -gt 1 ; then
                echo "Seems to be too many dsmcad processes running for $DSM_CONFIG, please check it."
                exit 1
        fi

        /opt/tivoli/tsm/client/ba/bin/dsmcad -optfile="$DSM_CONFIG"
        if test $? -ne 0 ; then
                echo "Failed to start the dsm scheduler, exiting"
                exit 1
        fi
        ;;
stop)
        # echo "stopping"
        # For the most part, we ignore a stop command, as dsmcad should work
        # out itself that it has to stop its child process when the directory
        # with its configuration isn't available.
        exit 0
        ;;
probe)
        # echo "probing"
        # WARNING: The following would produce a bug if "vi" is in the config
        #          file path.  So make sure you avoid it, OK?
        PS=`ps -ef | grep -v grep | grep -v vi | grep -v probe | \
                grep -v zoneadmd | grep -c "$DSM_CONFIG"`
        if test "$PS" -gt 0 ; then
                # echo "Found $PS processes"
                exit 0
        else
                echo "Found no processes"
                exit 1
        fi
        ;;
*)
        # otherwise an invalid command was received, vomit.
        echo "options { start | stop | probe }"
        exit 1
        ;;
esac

So I hope I've written something that is useful.  If anyone has questions,
feel free to contact me.


On Thursday 14 August 2008, 17:07 Markus Mayer wrote:
> Hi all,
> I've been pulling my hair out on this one for a few days now, even with
> support from our Sun partner, we have not come up with a solution.
> I have a cluster, Sun Cluster 3.2 on two V445s, with five resource groups,
> each containing its own zpool and a number of zones.  Each zpool and the
> zones are configured as resources within the group, as is necessary for
> cluster.
> Each resource group is configured for failover operations.  From the
> cluster view, everything works as it should.
> Enter the desire to make a backup with TSM.  Backup services will be run
> from the global zone.  According to the TSM manual (IBM TSM for UNIX and
> Linux Backup-Archive Clients Installation and User's Guide, pages 543-549),
> we need a separate TSM node for each shared disk resource in order to back
> up the shared resources.  This is configured.  Each TSM client node will
> back up only the data on the shared disks within its resource group.
> From the client side, cluster, we need a simple script that runs as a
> resource within the resource group.  This script meets the requirements of
> cluster, having exit values of 0, 100 and 201 depending on circumstances,
> and the functions start, stop, and probe.  As required by TSM, this script
> resides on shared storage that switches between nodes, in our case its own
> zfs file system on the zpool.  When a failover occurs, the script should be
> started
> (backup service/resource brought online) in the same way that any other
> resource within the group would be started or brought online.
> Therein lies the problem.  How can I define a resource that is a simple
> shell script or program, which should then be added to an existing resource
> group in cluster?  It sounds simple enough, but it would seem it's not
> so...
> Our Sun partner gave me the following link to follow, which I did.
> In short, it says enable SUNW.gds (already done), create a resource group
> that will contain the resource and failover service itself, create a
> logical hostname, then the resource.  This is where some confusion comes in
> for me.
> I already have resource groups defined, one being comms-rg containing two
> resources, comms-storage-rs and commssuite-zone-rs.  The "backup" resource,
> named for example comms-backup-rs, from my point of view should then come
> into this resource group.  If I try to add a logical hostname to this
> resourcegroup, I get an error:
>   # clreslogicalhostname create -g comms-rg commslhname
>   clreslogicalhostname:  commslhname cannot be mapped to an IP address.
> So as suggested by our Sun partner, I tried adding an IP address for the
> logical host name and putting it in the /etc/inet/hosts files on both
> nodes. The result was:
>    # clreslogicalhostname create -g comms-rg commslhname
>    clreslogicalhostname:  specified hostname(s) cannot be hosted by any
>    adapter on wallaby
>    clreslogicalhostname:  Hostname(s): commslhname
> getent returned valid information on both nodes.
>    # getent hosts
>   commslhname commslhname.nowhere.nothing.invalid
> OK, so it seems that I have to define a new resource group especially for
> this one resource which contains one simple script, which makes no sense to
> me because I already have a resource group into which the resource should
> go. Why then can't I add this new script as a resource in an existing
> resource group?  The problem here is too, that I need to define an
> additional resource group for every other resource group that I have,
> currently five, meaning a total of ten resource groups, all of which need
> affinities in order to correctly fail over and start the resources.
> Additionally, the backup resource needs, according to the manual, to have
> network resources defined, and a port list defined, although it needs only
> to start a shell script.
> It seems much more complicated than it should be.  I find nothing else in
> the documentation about this, but it has to be simple, I can't imagine that
> it could be so complicated....
> The alternative, should such a resource definition not be possible, is to
> have a TSM client in every zone, and one in the global zone of each node.
> This is however not what I'm looking for.
> Could it be that I'm barking up the wrong tree here?  Does anyone have any
> suggestions as to how I can achieve this?
> Thanks
> Markus
> _______________________________________________
> sunmanagers mailing list
Received on Tue Sep 2 10:45:20 2008

This archive was generated by hypermail 2.1.8 : Thu Mar 03 2016 - 06:44:12 EST