Summary redundant systems

From: phil hoff (pah2824@kramden.nyu.edu)
Date: Mon Dec 07 1992 - 15:32:21 CST


These are the replies I received to my original posting about redundant systems. Just about all of them require using the Fusion product. There was very little feedback from actual Fusion users.
For those who did reply, thanks.

>I work for a large bank that needs to have a fully redundant system.
>We have NIS and about 100 sparc 2's NFS mounting /apps and /home
>off a 470, called primary_A. We now have another 470 as a hot spare,
>called secondary_B. However, if primary_A goes down, we have to bring
>up secondary_B as the primary machine, change the NIS master, change
>our automount maps and reboot the clients. If you have any experience
>with setting up a redundant system, particularly Fusion, you can either
>write to me and I will summarize the responses or just speak your piece.
>Thanks,
>Phil Hoff
>pah2824@kramden.acf.nyu.edu

You didn't specify what kind of disks you have, IPI or SCSI, so
I'll try and answer for both.
To use our product "High Availability for Sun" you would have
to:
        1. If you have IPI disks:
                - Purchase the dual port kit from Sun
                  It's not supported on the 470 but I believe it works
           If you have SCSI disks:
                Either:
                        Upgrade to a machine that supports Sun's
                        OpenBoot prom.
                Or
                        Purchase a RAID box with dual controllers.
                        Data General and Peripheral Devices both have
                        them.
        2. Purchase some extra network interface cards.
           You'll need two for each of the 470's.

Alternatively, if you're using SCSI disks this may be a good time
to upgrade to a 670 or even an SS10.

This hardware configuration is appropriate regardless of whether
you purchase our product.

Once the hardware is in place you can then install our software.
This will allow transparent failover of both NFS and NIS.
For NIS you don't have to wait for ypbind to time out before
NIS reconnects to a slave server.

There will be no changes to the automount maps, and the clients
do not have to be rebooted after a failover.

If you need more info, please contact me, Charlie Rich or
Harvey Rand (Sales Director) at Fusion.

Regards

Tim Hunt
tim@fsg.com (212)285-8001

The best way to set up a redundant system is to have dual-ported
disks. This way the second machine can take over for the first.
If the data is important, mirror it.

--
% Brenda Parsons                  
% Currently at Prospect Electricity
% 10 Smith Street, Parramatta 2150, Australia
% +61 2 635 0300              e-mail:   parson@coulomb.pcc.oz.au        

s5udtg@fnma.COM (Doug Griffiths) writes:

>>To: lim@telerobotics.jpl.nasa.gov (David Lim)
>>Subject: Re: Having mirror servers on a network
>>
>>We have just purchased two Sun 690 MP servers and are in the process of
>>bringing them up. What would really be nice is if the machines can become
>>mirrors, so that if one server goes down, all the client machines do not have
>>to go down. We are using NIS, automount, and storing common utilities on the
>>server (e.g. gnu emacs). As an added complication, the two servers are
>>on different subnets, but the machines will be setup as gateway to each other.

>The answer is Sun's IPI Dual Port 1.0 software & DiskSuite 1.0 software, and
>to hire Fusion Software (Sun recommended) to write your fail-over scripts.
>We are running 6 sets of High Availability servers (4/690's) here, with varying
>success. Apparently, we were one of the first sites to contract Sun (Fusion) for
>this type of setup.

Fusion is now selling this software as a product "High Availability for Sun". For more information please contact me, or our sales director Harvey Rand (harvey@fsg.com).

Our typical configuration has both machines on the same sub-net, and we switch IP addresses when moving services from one machine to another so that the clients do not have to be reconfigured.
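
The address-switching idea above can be illustrated with a toy model. This is only a sketch in Python, not Fusion's implementation: the class name ServicePair and the address 192.0.2.10 are hypothetical, and a real takeover would reconfigure a network interface rather than update a variable.

```python
class ServicePair:
    """Toy model: clients use one service address; the pair decides who holds it."""

    def __init__(self, service_ip, primary, standby):
        self.service_ip = service_ip  # address the clients are configured with
        self.owner = primary          # host currently answering on service_ip
        self.standby = standby

    def fail_over(self):
        # Move the service address to the standby; the old owner becomes standby.
        self.owner, self.standby = self.standby, self.owner

    def resolve(self, ip):
        # Which host a client reaches when it contacts the service address.
        if ip != self.service_ip:
            raise KeyError(ip)
        return self.owner


# Hypothetical address; host names borrowed from the original question.
pair = ServicePair("192.0.2.10", "primary_A", "secondary_B")
before = pair.resolve("192.0.2.10")
pair.fail_over()
after = pair.resolve("192.0.2.10")
print(before, "->", after)
```

The clients keep using the same address before and after the switch, which is why no client reconfiguration is needed in this scheme.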

>>- The two servers would have their own set of files (e.g. their own identical
>>  copies of gnuemacs). No cross mounting between the two machines

>Not possible, unless you set up two different copies on one machine. Remember,
>you may own two 4/690's, but in the dual-port configuration you have only
>ONE pseudo machine.

Obviously, to dual-port, the two machines need to be located together. However, if the machines are not located together you could have identical file systems on each machine, which would be used interchangeably. I would suggest that they be mounted read-only to be sure they stay in sync.

>>- The two servers would have their own set of NIS files, i.e. each one will be
>>  a NIS master. The reason for this is the automounter consults the NIS for
>>  the machine to mount from. Thus it would be setup such that the NIS files
>>  on server1, would automount from server1, and NIS files on server2
>>  would automount from server2. Thus if client1 was picking up NIS from
>>  server1, it would mount server1 directories. If server1 goes down, client1
>>  will automatically pick up server2's NIS, and then eventually the automounter
>>  will mount server2's directories. Is this doable in practice? Or would I
>>  have to manually reboot every client associated with server1? I know that
>>  switching NIS servers is relatively quick, and automounted directories, if
>>  not used will be unmounted after a few minutes. But if a user was in a
>>  automounted directory, then would the client simply hang?

We can do a high availability configuration for NIS. The big advantage is that you don't have to wait for the timeouts before ypbind switches to a new server.

>- how about mail? Can mail be centrally located in this scheme? If client1
>  has server1 as its central mailhost, if server1 goes down will client1 go
>  down? Is it possible to soft mount the /var/spool/mail directory to allow
>  client1 to continue to run?

Using dual ported disks, /var/spool/mail can be mounted and exported so that the clients can continue to run uninterrupted.

Hope this helps

Tim Hunt - Tim@fsg.com Fusion Systems Group, Inc.

Overview of the High Availability for Sun Product:

High Availability for Sun (HA for Sun) addresses the need for systems reliability and continued operations in the event of a loss of critical software, communications, or hardware services. HA for Sun protects against single point of service failure on Sun supported and Sun third party supported IPI and SCSI configurations using dual porting and dual hosting disk technology, respectively. In the event of a loss of service on the primary server (e.g., RDBMS - Sybase/Oracle/Informix, NFS, NIS, or gateway), a resumption on the alternate server provides uninterrupted service and data access to network clients. The dual porting (or dual hosting) facility permits disks on a failed machine to be switched w

tware to an alternate machine. Client applications can then reconnect to the services on the alternate machine.

High Availability for Sun consists of six main functions:
        1) Failure Detection
        2) Message Routing
        3) Statistics Collection
        4) "failover" Initiation and Management
        5) Recovery Control
        6) Systems Administration and Configuration

NFS "failover": High Availability for Sun also provides full high availability services for NFS. Resources exported by a primary server are transparently made available by the alternate server after the "failover".

Failure Detection: Failure Detection enables High Availability for Sun to monitor and react to various service failures on the primary machines. For example, a detection of the failure of the database server engine may be distinguished from a failure of the machine or another software service such as an Open Server based gateway service. For each such failure, a user specified action is carried out. If the primary machine fails, then full "failover" procedures will be initiated; however if the database server engine fa

he user can specify in the High Availability for Sun configuration file whether the service should be restarted or whether full "failover" to the alternate machine should occur.

Message Routing: Message Routing allows the following message interactions: 1) High Availability for Sun agents can make "keep alive" checks to determine if the software service is alive on the primary; 2) High Availability for Sun agents can send status messages to the High Availability for Sun daemon on the primary, informing it if the respective services are still alive; 3) The primary High Availability for Sun daemon exchanges bidirectional "keep alive" messages with the High Availability for Sun daemon on the secondary machine. For example, an agent may send a specific SQL message to the database server on the primary machine in order to determine if the RDBMS service is still alive. The agent then informs the High Availability for Sun daemon about the status of the service. The High Availability for Sun daemon routes a periodic heartbeat to the secondary daemon as long as all critical services are running.
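
The heartbeat between the two daemons can be pictured with a small timeout check. This is a sketch only, written in Python: the class name HeartbeatMonitor and the 5 second timeout are assumptions for illustration and do not come from the product description.

```python
class HeartbeatMonitor:
    """Standby-side view: treat the primary as lost after `timeout` seconds of silence."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_seen = now

    def beat(self, now):
        # Called whenever a keep-alive message arrives from the primary daemon.
        self.last_seen = now

    def primary_lost(self, now):
        return (now - self.last_seen) > self.timeout


# Timestamps are passed in explicitly so the example is deterministic.
monitor = HeartbeatMonitor(timeout=5.0, now=0.0)   # assumed timeout value
monitor.beat(now=2.0)
alive_at_6 = not monitor.primary_lost(now=6.0)     # 4 s of silence: still alive
lost_at_8 = monitor.primary_lost(now=8.0)          # 6 s of silence: considered lost
```

Because the primary daemon only sends the heartbeat while all critical services are running, silence on the standby side covers both a dead machine and a dead service.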

Status Monitoring: Statistics Monitoring involves the accumulation of relevant status information within the High Availability for Sun product. This information can be accessed via the Command and Control Interface (fmscc).

"Failover" Initiation: "Failover" initiation enables High Availability for Sun to start the "failover" process by which the standby machine takes over all database operations of the primary machine. This action will involve the switch over of disks from the primary to the standby, and the start up of database and related services on the alternate platform. The user specifies the "failover" action for each service in the High Availability for Sun Configuration File.

Recovery Control: Recovery Control involves those actions which High Availability for Sun takes when the failed machine is brought back on line. The key element of this process is that the failed primary machine now assumes the role of the standby machine and will not be considered the primary machine. In this respect, a systems administrator would have to inform High Availability for Sun that the recovered system is now the standby machine.
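
The role change described under Recovery Control can be sketched as a tiny state model. This is an illustration in Python, not the product's interface: the class HostPair and its method names are hypothetical.

```python
class HostPair:
    """Toy model of primary/standby roles across a failover and a recovery."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby  # None while no standby is available

    def fail_over(self):
        if self.standby is None:
            raise RuntimeError("no standby available to take over")
        failed = self.primary
        # The standby becomes primary; the standby slot stays empty for now.
        self.primary, self.standby = self.standby, None
        return failed

    def rejoin_as_standby(self, host):
        # The administrator registers the recovered machine as the new standby.
        if host == self.primary:
            raise ValueError("host is already the primary")
        self.standby = host


hosts = HostPair("primary_A", "secondary_B")
failed = hosts.fail_over()
hosts.rejoin_as_standby(failed)
print(hosts.primary, hosts.standby)
```

The recovered machine does not reclaim the primary role on its own; it only re-enters the pair when someone registers it as the standby.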

Command and Control Interface: The Command and Control Interface function (fmscc) enables the system administrator to configure, manage and monitor the High Availability for Sun system from any character-based terminal (or X11 xterm window) on the network. In particular, the system administrator can inhibit or initiate "failover" and monitor High Availability for Sun status.

High Availability for Sun Software Architecture: The central High Availability for Sun process on each machine is a daemon process that exchanges a periodic "keep alive protocol" with the High Availability for Sun process on the other server machine. When loss of the keep alive packet is detected on the primary, the alternate machine takes control. Loss of heartbeat can occur when the primary fails, or when the primary High Availability for Sun daemon detects loss of a software service on the primary. In the latter case, depending on information in t

h Availability for Sun configuration file, the primary High Availability for Sun will either stop sending the heartbeat, or restart the service a designated number of times before failing over.
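
The "restart a designated number of times before failing over" behaviour can be sketched as a simple decision function. This is a Python illustration under assumptions: the function name, the action strings, and the limit of 2 restarts are hypothetical and are not taken from the product's configuration file format.

```python
def handle_service_failure(restarts_so_far, max_restarts):
    """Return the action the primary daemon takes when an agent reports a dead service."""
    if restarts_so_far < max_restarts:
        return "restart"
    # Restart budget exhausted: stop the heartbeat so the alternate takes control.
    return "stop_heartbeat"


# Four consecutive failures of the same service with an assumed limit of 2 restarts.
actions = [handle_service_failure(n, max_restarts=2) for n in range(4)]
print(actions)
```

Setting the limit to zero in this sketch corresponds to stopping the heartbeat on the first failure of the service.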

"Loss of primary service" is detected by agent processes that communicate with the primary High Availability for Sun process. The system administrator monitors High Availability for Sun status via the console messages sent by the High Availability for Sun process, or by messages sent by High Availability for Sun to all Command and Control processes connected to it. To dump High Availability for Sun status or send it commands, an instance of the Command and Control process must be started to connect

h Availability for Sun, accept operator input and pass it to High Availability for Sun.

Summary: Fusion's High Availability for Sun provides a fully automated solution for both Server and Services based failures, supporting either dual ported IPI drives or dual hosted SCSI devices. The first release, version 1.0, entered restricted release on June 29th 1992. Release 1.1 is scheduled for September 8, 1992. Subsequent releases later this year will include an Open Look administrative interface and a SunNet Manager interface. In the first quarter of 1993, HA version 2.0 will be released containin

tonal capabilities including: "N-way" or symmetrical failover, "Per-Service" failover, and a Client Side Failover library.

For more information or technical documentation please e-mail me, or phone @212-285-8001, fax 212-285-8705

Many thanks to the Fusion folks for a quick reply.


