BigAdmin System Administration Portal
Feature Article

Installing and Configuring Sun Cluster 3.1 09/04 Software for High-Availability Applications

Amy Rich, June 2005

Abstract: This article explains how to set up a Sun Cluster 3.1 09/04 environment for highly available services.


Introduction to the Sun Cluster Environment

Today's production environments often require 24x7 availability for critical services such as mail, LDAP, and web and database servers. Often the time it takes to perform one clean reboot is more downtime than is acceptable for the entire year. The solution to this problem is to design an environment where services can migrate between individual servers so that a reboot of a single machine never causes service downtime. Sun's approach to building these highly available services combines the Sun Cluster software with appropriately sized servers, shared storage, and a suitable network topology.

The goal of the cluster is to maintain near 100 percent uptime while ensuring data integrity. More than one highly available application can be run on a cluster at a time, either sharing resources with other applications, or running on separate nodes with their own resources. The Sun Cluster environment is not only a way of tightly coupling multiple machines and resources, but also a mechanism by which these hardware and software components can be managed. Various types of cluster configurations are possible, and choosing the right configuration for your site requires knowing a bit about Sun Cluster itself as well as your applications.

A Sun Cluster 3.1 09/04 environment consists of two to sixteen SPARC processor-based servers running version 8 or 9 of the Solaris Operating System, or x86 servers running Solaris 9 09/04 or later (x86 configurations are currently limited to two nodes). The environment also requires the Sun Cluster software and one or more applications with software agents and fault monitors. Each node needs one or more public network interfaces as well as two or more private network interfaces (either directly cabled or connected via a private switch) for communication between cluster nodes. Most cluster installations also include shared storage in the form of directly attached SCSI or Fibre Channel arrays, or arrays attached through SAN switches.


Sun Cluster Concepts

Understanding how the Sun Cluster environment works depends on comprehending a few basic concepts, such as application modes, quorum, and SCSI reservation. Learning how your applications interact with the underlying operating system and how the cluster manages hardware will help you determine what configuration might work best for your site.

Cluster Application Modes

Clustered applications can run in failover or scalable mode depending on their implementation. In failover mode, the application only runs on one node at a time. If the controlling node fails, then the application and any other associated resources are passed to another node which was previously in standby. A scalable application runs on more than one node at the same time without any failover. Not every application can operate in scalable mode, since the application must implement its own write-locking mechanism so that data does not become corrupted on any shared storage devices. Parallel databases can be considered a specific type of scalable application. They run on each node without failover and handle different queries and often parallel queries on the same database. Currently, Sun Cluster 3.1 09/04 only supports Oracle 8i OPS and Oracle RAC (9i and 10g) in such a configuration.

During normal operation, each cluster node regularly sends out heartbeat information across the private network to let every other node know about its health. In the event that one or more nodes fail to successfully transmit heartbeat information to other nodes in the cluster, the cluster removes those machines and continues without them. Any applications or resources associated with the removed nodes are failed over if appropriate. The cluster safeguards against data corruption by using the principle of quorum and fencing.

Cluster Topologies

Cluster configurations come in four types: clustered pairs, Pair+N, N+1, and multi-ported N*N scalable. In a clustered pair topology, one or more pairs of nodes are each physically connected to some external storage shared by the pair. This configuration consists of an even number of nodes (N), two to sixteen, and a minimum of N/2 external shared storage devices. A clustered pair with only two nodes is the only topology supported by the Solaris OS for x86 platforms. The following figure shows two pairs with two storage devices each:

Image of clustered pair configuration

A Pair+N topology has one set of nodes directly attached to shared storage and N number of additional nodes which are not connected to any storage. In this case, any access of the shared storage on nodes which are not directly attached happens through one of the directly attached nodes via the private cluster network. The following figure shows a Pair+N configuration, where N is two. When nodes 3 or 4 access the storage, they must do so through nodes 1 or 2.

Image of Pair+N configuration

In an N+1, or star topology, primary and secondary nodes do not need to be configured identically. Some number of primary nodes are all active, and one secondary node, which acts as a failover for the N other primaries, can be either active or passive. The secondary node is the only node which is connected to all shared storage devices. The following figure shows three primary nodes (N), each running their own applications with one secondary node as the failover:

Image of an N+1 configuration

In an N*N topology, all nodes are connected to all shared storage devices in the cluster. The following figure shows four nodes, each connected to two shared storage devices:

Image of an N*N configuration

Global Devices in the Cluster

When describing the possible cluster topologies, we briefly mentioned shared external storage, but we didn't discuss how storage in general is seen within the cluster. The Sun Cluster global devices feature provides simultaneous access to all storage devices from all nodes, whether they are local devices or shared external devices. Node W can see the tape drive on Node X, the CD-ROM drive on Node Y, and the SVM root file system on Node Z. Disks are the only supported multi-pathed global devices, so CD-ROM and tape drives are not considered highly available.

When the cluster forms, it automatically assigns unique (to the cluster) IDs to each device within it under the /dev/did namespace using the device ID (DID) pseudo driver. The DID driver probes all nodes of the cluster and builds a table of device names, assigning each a major and minor number. Each node accesses a device using this unique global device number instead of using traditional Solaris IDs so that there is no confusion between a local disk named c0t0d0 and c0t0d0 on another cluster node. Using this unique ID is especially important for multi-host disks which may appear as c3t0d0 to one host and c2t1d0 to another. The DID driver assigns a unique name, such as d10, that the nodes would use instead, giving each node a consistent mapping to the multi-host disk. Because of this, care must be taken when using SVM so that the /global file system on each node does not have the same SVM disk device name.
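The mapping idea can be illustrated with a small Python model. This is only a sketch of the concept, not how the DID driver is implemented: the function name, the disk identifiers, and the simple d1, d2, ... numbering are all assumptions made for the example.

```python
from itertools import count


def build_did_table(paths):
    """Assign one cluster-wide DID name per physical disk.

    `paths` is an iterable of (node, local_name, disk_id) tuples, where
    disk_id is a hypothetical identifier unique to the physical disk.
    Returns {did_name: [(node, local_name), ...]}.
    """
    numbers = count(1)
    did_for_disk = {}
    table = {}
    for node, local_name, disk_id in paths:
        if disk_id not in did_for_disk:
            # First time this physical disk is seen: give it the next name.
            did_for_disk[disk_id] = f"d{next(numbers)}"
            table[did_for_disk[disk_id]] = []
        table[did_for_disk[disk_id]].append((node, local_name))
    return table


if __name__ == "__main__":
    paths = [
        ("node1", "c0t0d0", "local-disk-A"),   # node1's local disk
        ("node2", "c0t0d0", "local-disk-B"),   # same local name, different disk
        ("node1", "c3t0d0", "shared-disk-C"),  # multi-host disk seen from node1
        ("node2", "c2t1d0", "shared-disk-C"),  # the same disk seen from node2
    ]
    for did, seen_as in build_did_table(paths).items():
        print(f"/dev/did/rdsk/{did}", seen_as)
```

In this model the two local c0t0d0 disks receive different DID names, while the multi-host disk receives a single name no matter which controller path each node uses.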

Understanding Quorum

To form a cluster and offer services, the nodes in a cluster must first reach quorum. The quorum equation states that a cluster needs at least the total number of configured votes, divided by two (remainders are discarded), plus one (Q = TCV/2 + 1). If a cluster cannot reach quorum, then it does not form. The individual cluster nodes do not boot fully, but wait until enough votes are available to reach quorum. If a running cluster loses quorum, the affected nodes panic and try to reboot (assuming auto-boot? is set to true on those nodes). Machines can be booted outside the cluster by issuing a boot -x from the OBP, but no cluster services will be available on these machines.

The key to understanding quorum is learning how votes are assigned and counted. Each node in a configured cluster has one (1) quorum vote. Each shared storage device configured as a quorum device has a number of votes equal to the number of nodes connected to it minus one (QD = TCD - 1). Ownership of a quorum device is assigned to one controlling node based on SCSI reservations.
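The two formulas can be written as a short Python sketch. The function names are made up for illustration and are not part of any Sun Cluster tool; the arithmetic follows the definitions above, taking the count in the quorum device formula to be the number of nodes attached to the device.

```python
def quorum_required(total_configured_votes: int) -> int:
    """Q = TCV/2 + 1, with the remainder of the division discarded."""
    return total_configured_votes // 2 + 1


def quorum_device_votes(connected_nodes: int) -> int:
    """A quorum device carries one vote fewer than the nodes attached to it."""
    return connected_nodes - 1


if __name__ == "__main__":
    node_votes = 2                                  # one vote per node
    device_votes = quorum_device_votes(2)           # device attached to both nodes
    total = node_votes + device_votes
    print("total configured votes:", total)
    print("quorum required:", quorum_required(total))
```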

By doing some simple math, it's easy to see that a two-node cluster must have a quorum device to continue operating if one node fails. Once installed, a two-node cluster under Sun Cluster must have a quorum device for this very reason.

Quorum required to operate:

Q = TCV/2 + 1 = (2)/2 + 1 = 2

Votes if one node fails: 1 (less than the 2 required, so the surviving node cannot continue)

When you introduce a quorum device, the equation changes. This Sun Cluster configuration, shown in the following figure, is one of the most common.

Image of two-node + one quorum device configuration

Quorum required to operate:

Q = (2 + 1)/2 + 1 = 2

Votes if one node fails: (1 + 1) = 2

Below is a quorum example for a more complex cluster configuration.

Image of three node + one quorum device configuration

Note in this example that the quorum device is connected to three nodes (N) and therefore has two (N-1) votes. The same quorum formula still applies, though.

Quorum required to operate:

Q = (3 + 2)/2 + 1 = 3

Votes if one node fails: (2 + 2) = 4

Votes if two nodes fail: (1 + 2) = 3

Votes if just the QD fails: (1 + 1 + 1) = 3

Votes if any node plus the QD fails: (1 + 1) = 2 (less than the 3 required, so the cluster loses quorum)
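The failure cases for the three-node example can also be enumerated mechanically. The Python sketch below is a simplified model with hypothetical names: it assumes a single quorum device and that the surviving nodes can always take ownership of it, which the real cluster decides through SCSI reservations.

```python
from itertools import combinations


def failure_scenarios(node_count: int, qd_votes: int):
    """Yield (failed_components, votes_left, has_quorum) for every way of
    losing one or two components of a cluster with one quorum device."""
    components = {f"node{i}": 1 for i in range(1, node_count + 1)}
    components["qd"] = qd_votes
    total = sum(components.values())
    needed = total // 2 + 1            # Q = TCV/2 + 1, remainder discarded
    for k in (1, 2):
        for failed in combinations(components, k):
            left = total - sum(components[name] for name in failed)
            yield failed, left, left >= needed


if __name__ == "__main__":
    for failed, left, ok in failure_scenarios(node_count=3, qd_votes=2):
        state = "keeps quorum" if ok else "loses quorum"
        print(f"{'+'.join(failed):12} fails: {left} vote(s) left, {state}")
```

With three nodes and a two-vote quorum device, the only one- or two-component failures that drop below quorum in this model are those that take out a node together with the quorum device.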

As a note of warning, when allocating quorum devices, always use the minimum number possible to achieve quorum, or the health of the cluster will depend on the health of the shared disks configured as quorum devices. In the case where only one of the configured quorum devices is necessary for cluster operation, the cluster will fail unnecessarily if one of the unneeded quorum devices fails. Also, never let the number of quorum device votes exceed the number of node votes, or you run the risk of enabling two separate clusters to form independently (which is known as "split brain"). In this case, both clusters will compete for traffic on the public network, and the data held by the two will drift out of sync.

Understanding SCSI Reservations

Another mechanism that protects data integrity within the cluster in conjunction with the quorum principle is SCSI reservations. When the cluster forms, one node takes responsibility for any quorum devices by using SCSI reservations. With SCSI 3-capable storage where there are more than two paths to the storage, this reservation is accomplished by each available cluster node registering a key by writing it to the disk. The controlling node is tagged as the owner, and the other nodes are tagged as capable of becoming the owner. If a node fails, the remaining nodes remove the failed node's key from the disk, and it is no longer eligible to own the quorum device. Once the failed node recovers and rejoins the cluster, its key is re-registered.

In the event that the controlling node leaves the cluster, the remaining eligible nodes compete to gain control of the quorum devices. If the cluster is not cleanly shut down and nodes go down individually, the last node down must be the first one up, because it is the only node eligible to control the quorum devices. If the controlling node was the only eligible node in the cluster when the cluster rebooted and the controlling node cannot come back up, the administrator must boot one machine outside the cluster and make manual changes to the cluster configuration database so that it may achieve quorum without control of the quorum device. Once the modified machine is able to form the cluster, machines in the cluster will re-register their keys with the quorum device.

The reservation keys can be read from the quorum device by using the pgre command for SCSI 2-capable storage and the scsi command for SCSI 3-capable storage where the nodes have more than two paths to the storage. If the global device name of the quorum device is /dev/did/rdsk/d4, for example, then you could retrieve the keys from the disk in the following way:

SCSI 2: pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
SCSI 3: scsi -c inkeys -d /dev/did/rdsk/d4s2

The quorum device owner can be determined using the following commands:

SCSI 2: pgre -c pgre_resv -d /dev/did/rdsk/d4s2
SCSI 3: scsi -c inresv -d /dev/did/rdsk/d4s2

Pre-installation Planning and Tasks

With an understanding of how Sun Cluster works behind the scenes, you should have some idea about the kind of cluster configuration you'd like to implement. Before installing and configuring the cluster, though, there are some things to consider first.

Configuring Solaris Volume Manager Software

To control shared disks, Sun Cluster supports either VERITAS Volume Manager (VxVM) or Sun's volume manager, called Solstice DiskSuite (SDS) under the Solaris 8 OS and Solaris Volume Manager (SVM) under the Solaris 9 OS. SDS/SVM is cluster aware, and Solaris 9 09/04 SVM supports parallel writes for Oracle RAC, so you can go with these products unless you need other features that VxVM offers. This article discusses using the Solaris 9 OS and SVM since several other documents already cover configurations with VxVM.

General practice for most production Solaris 9 OS installations is to mirror the boot disk using SVM. Instructions on how to use SVM can be found in the Solaris Volume Manager Administration Guide. When planning out an SVM installation for a cluster, there are a few more things to consider beyond keeping a slice for the metadbs. For detailed information, read the Planning Volume Management section of the Sun Cluster Software Installation Guide for Solaris OS.

The first consideration beyond normal SVM configuration is the /globaldevices file system, later renamed to /global/.devices/node@nodeid (where nodeid represents the number that is assigned to a node when it becomes a cluster member). This 512 Mbyte file system must be its own partition under SVM and must have a unique metadevice name throughout the cluster. If the partition underlying /globaldevices is named d5 on machine node1, it can't be named d5 on any other node within the cluster. If the names do clash, the global namespace directories will fail to cross mount when the cluster tries to form. As well as being unique in the cluster, each metadevice name must not clash with any name created by the cluster DID driver.
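A planned layout can be checked for both kinds of clash before installation. The following Python sketch is a hypothetical planning aid, not a Sun Cluster utility; the function name and the input structures are assumptions for the example.

```python
from collections import defaultdict


def find_globaldevices_clashes(metadevice_by_node, did_names=()):
    """Return {metadevice_name: description_of_problem}.

    `metadevice_by_node` maps a node name to the metadevice that backs
    /globaldevices on that node. `did_names` lists DID names (such as
    "d10") that the cluster already uses.
    """
    users = defaultdict(list)
    for node, metadevice in metadevice_by_node.items():
        users[metadevice].append(node)

    problems = {}
    for metadevice, nodes in users.items():
        if len(nodes) > 1:
            problems[metadevice] = "used on several nodes: " + ", ".join(sorted(nodes))
        elif metadevice in did_names:
            problems[metadevice] = f"clashes with a DID name (on {nodes[0]})"
    return problems


if __name__ == "__main__":
    plan = {"node1": "d5", "node2": "d5", "node3": "d10"}
    for name, problem in find_globaldevices_clashes(plan, {"d10", "d11"}).items():
        print(name, "->", problem)
```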

Secondly, all cluster nodes require identical /kernel/drv/md.conf files to prevent errors and loss of data. This file must be modified from the standard SVM installation to support additional disk sets and metadevices that can be created on the host. Generally, set the variable md_nsets to one more than the expected number of disk sets needed for the entire cluster (up to 32). Set the nmd variable to the highest expected metadevice or volume name (up to 8192) in the cluster to ensure uniqueness. A change to either of these values on a running system requires a reconfiguration reboot, so it's best to set them with some future growth in mind. Since each of these possible devices has an associated memory cost, though, do not set these variables too much higher than necessary.
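The sizing advice can be sketched in Python as follows. The helper names are hypothetical, and the rendered line assumes the usual md.conf property layout; compare it with the md.conf file shipped on your systems before copying anything.

```python
MAX_NSETS = 32      # upper limit for md_nsets given in the text
MAX_NMD = 8192      # upper limit for nmd given in the text


def md_conf_values(expected_disk_sets: int, highest_metadevice_number: int):
    """Return (md_nsets, nmd) following the sizing advice above."""
    md_nsets = expected_disk_sets + 1      # one more than the disk sets planned
    nmd = highest_metadevice_number        # highest metadevice name expected
    if md_nsets > MAX_NSETS:
        raise ValueError(f"md_nsets {md_nsets} exceeds {MAX_NSETS}")
    if nmd > MAX_NMD:
        raise ValueError(f"nmd {nmd} exceeds {MAX_NMD}")
    return md_nsets, nmd


def md_conf_line(md_nsets: int, nmd: int) -> str:
    """Render the two values as a single md.conf property line (assumed layout)."""
    return f'name="md" parent="pseudo" nmd={nmd} md_nsets={md_nsets};'


if __name__ == "__main__":
    nsets, nmd = md_conf_values(expected_disk_sets=3, highest_metadevice_number=1024)
    print(md_conf_line(nsets, nmd))
```

Whatever values you settle on, the resulting file has to be identical on every node.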

Each shared storage volume must exist as part of a metaset. Starting with Solaris 9 09/04, multi-owner disk sets are also supported, meaning that multiple nodes can own and write to shared disks. This is the simultaneous write functionality required by Oracle OPS/RAC. For more information, read the Solaris Volume Manager for Sun Cluster section of the Solaris Volume Manager Administration Guide.

Installing Patches

Before installing and configuring the cluster environment, you may need to install patches or upgrade firmware on your hardware. Determine your requirements by using the PatchPro tool from http://patchpro.sun.com/. Once at this site, click the Sun Cluster link and describe your cluster environment. There are four buttons below the description area which will generate patch lists for Solaris (pre-installation), Sun Cluster, Post-Install, and Data Services (if you selected additional data services in your cluster description). Any patches listed in the Solaris pre-install group must be installed to the listed minimum revision before installing the Sun Cluster software.

Once you have a running cluster, Sun recommends installing patches on the cluster in a very specific way. For a complete understanding of how patches and firmware upgrades should be applied to cluster systems, read Chapter 8: Patching Sun Cluster Software and Firmware in the Sun Cluster System Administration Guide for Solaris OS.

Configuring Network Interfaces

Each cluster node must have at least one public network adapter and two private cluster network adapters for redundant heartbeat transmission. It's also suggested that each node have more than one public network adapter for maximum redundancy in the event of a network adapter hardware failure. Because each network interface must have its own Ethernet address, make sure to set local-mac-address? to true in the OBP.

The public network adapters are configured under IP Multipathing (IPMP) for automatic failover and load spreading. In the case where an IPMP group contains only one adapter (for example, only one interface on the public network), the adapter requires only one IP. In the case where the IPMP group has multiple network adapters (for example, the private network or multi-homed public networks), each adapter requires one primary IP plus a test IP.

These test IP addresses cannot be used by normal applications because they are not highly available, so they are marked with the ifconfig flag -failover. The test interfaces are also marked as deprecated because they should not be used as a source address for outbound packets.

For example, say that you have a node named node1 which has a primary public interface of hme0 and a secondary public interface of qfe1. In order to set up IPMP to perform load spreading and automatic failover, you'd assign a name to a test IP for hme0 (node1-hme0-test) and a test IP for qfe1 (node1-qfe1-test). Modify /etc/hostname.hme0 to include an IPMP group (I chose nafo0 for historical reasons, but the IPMP group can be called anything in Sun Cluster 3.1 09/04) and the hme0 test interface:

node1 group nafo0 up
addif node1-hme0-test group nafo0 netmask + broadcast + -failover deprecated up

Now create the test interface for qfe1 as well by creating an /etc/hostname.qfe1 file:

node1-qfe1-test group nafo0 netmask + broadcast + -failover deprecated up
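If you maintain many nodes, the two file layouts above can be generated rather than typed. This Python sketch simply reproduces the lines shown in this article for a given node, adapter, and group; the function name is hypothetical, and the test address names follow the node-adapter-test pattern used here.

```python
def hostname_file(node, adapter, group, primary=False):
    """Return the lines for /etc/hostname.<adapter> in the style used above."""
    test_name = f"{node}-{adapter}-test"
    test_opts = f"group {group} netmask + broadcast + -failover deprecated up"
    if primary:
        # The primary adapter carries the node's own address plus a test address.
        return [f"{node} group {group} up", f"addif {test_name} {test_opts}"]
    # A secondary adapter carries only its test address.
    return [f"{test_name} {test_opts}"]


if __name__ == "__main__":
    for adapter, primary in (("hme0", True), ("qfe1", False)):
        print(f"# /etc/hostname.{adapter}")
        print("\n".join(hostname_file("node1", adapter, "nafo0", primary)))
```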

For additional information on configuring IPMP, read the IP Network Multipathing Administration Guide. Also review the "IP Network Multipathing Groups" section of Sun Cluster Configurable Components under the Sun Cluster 3.1 09/04 Software Collection for Solaris OS.

If a cluster has only two nodes, the heartbeat interfaces can be cross connected between the two machines. If the cluster has more than two nodes, two switches (or, less optimally, two VLANs on one switch) must be used to connect the nodes. The cluster framework installer asks about cluster transport junctions if there are more than two nodes in the cluster or if you choose the Custom configuration. Even if there are only two nodes, it's wise to configure cluster transport junction names in case you move to using switches later.


Installing and Configuring Sun Cluster Software

Now that we've covered the basic important concepts behind Sun Cluster and you have a machine that's patched and connected to both the public and private cluster networks, you're ready to install the cluster software itself. First, download the cluster software and agent software from the Sun Cluster page or from the Cluster software CD-ROM and unzip the archive files.

Creating an Administrative Console

If you're going to use a machine outside the cluster as an administrative console, add the SUNWccon and SUNWscman packages to the administrative machine. These include the programs ctelnet, cconsole, crlogin, and others. They allow the administrator to type commands into one window and have them echo on multiple cluster node windows. In fact, the c- tools can be used on any machine, not just cluster nodes. One BigAdmin tech tip explains how to use the c- tools over ssh for those who want a more secure connection. There's also a similar tool called clusterssh available from SourceForge.

To configure the cluster console tools, create an /etc/clusters file and add the cluster name followed by the names of each node. If using a terminal concentrator or system service processor, also create an /etc/serialports file and add the hostname of each node, hostname of the console access device, and serial port number to which the host is connected.
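As an illustration of the two file layouts just described, the Python sketch below renders one /etc/clusters line and the matching /etc/serialports lines. The cluster name, host names, and port numbers are placeholders, and the whitespace-separated field layout is an assumption drawn from the description above; check the man pages for these files on your administrative console.

```python
def clusters_entry(cluster_name, nodes):
    """One /etc/clusters line: the cluster name followed by its node names."""
    return " ".join([cluster_name, *nodes])


def serialports_entries(console_ports):
    """One /etc/serialports line per node: node, console access device, port.

    `console_ports` maps node name -> (console_device_hostname, port).
    """
    return [f"{node} {device} {port}"
            for node, (device, port) in console_ports.items()]


if __name__ == "__main__":
    print(clusters_entry("webcluster", ["node1", "node2"]))
    ports = {"node1": ("tc1", 5002), "node2": ("tc1", 5003)}  # placeholder values
    print("\n".join(serialports_entries(ports)))
```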

Installing Sun Cluster Framework Software

On each node in the cluster, install the Sun Web Console packages. This is accomplished by changing directory to Solaris_sparc/Product/sun_web_console/2.1 and running the setup shell script.

Now run Solaris_sparc/Product/sun_cluster/Solaris_9/Tools/scinstall on one node to install the cluster framework software itself. Choose the option Install a cluster or cluster node from the Main Menu. From the Install Menu, choose the option Install just this machine as the first node of a new cluster. The scinstall program will ask you a number of questions about your cluster. On the scinstall instructions page (How to Configure Sun Cluster Software on All Nodes), Tables 2-2 and 2-3 list the questions and default answers for the Typical and Custom install methods.

Once the first node finishes, run this on any additional nodes to have them join the cluster:

Solaris_sparc/Product/sun_cluster/Solaris_9/Tools/scinstall

From the Install Menu, choose the menu item Add this machine as a node in an existing cluster.

You can also opt to install all nodes at once if you've enabled root rsh or ssh access from the first node where you are running scinstall. To do this, choose the option Install all nodes of a new cluster from the install menu on the first node on which you run scinstall. Yet another option is installing all of the cluster software with Solaris JumpStart via a flash archive along with the OS -- see How to Install Solaris and Sun Cluster Software (JumpStart).

Configuring the Cluster

To enable you to run the newly installed binaries and read the associated man pages without using full paths, modify your PATH to include /usr/cluster/bin/, and modify your MANPATH to include /usr/cluster/man/.

When the first cluster node is installed, the cluster enables installmode and is not completely configured. With installmode enabled, the first cluster node sponsors additional nodes until quorum can be reached. Recall that in a two-node cluster, a quorum device must be configured before the cluster will operate. Therefore, in a two-node cluster, the cluster does not disable installmode until a quorum device is successfully configured.

The scsetup program resets the installmode variable to disabled and performs post-installation configuration tasks such as configuring quorum devices. The scsetup program can be run from any node in the cluster, but it's important that all nodes have joined the cluster before resetting installmode.

The cluster also needs to keep time closely synced. The cluster installation adds an ntp configuration file called /etc/inet/ntp.conf.cluster which lists a number of private cluster names as peers. Remove the peer names that are not present in your cluster so that ntp does not try to sync against nonexistent hosts. If you're already running ntp in your environment and want to keep your cluster synced with the rest of the machines on your network, it is best to set up one cluster node to sync from another machine on the network and then have the rest of the cluster nodes peer among themselves.
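Pruning the unused peers can be scripted. The Python sketch below assumes the peer lines name private hosts of the form clusternodeN-priv, where N is the node ID; verify that against your own /etc/inet/ntp.conf.cluster before relying on it. The function name is hypothetical, and the sample text stands in for the real file.

```python
import re

# Matches lines such as "peer clusternode3-priv" (assumed naming pattern).
PEER_LINE = re.compile(r"^\s*peer\s+clusternode(\d+)-priv\b")


def prune_ntp_peers(config_text: str, node_ids) -> str:
    """Drop peer lines whose node ID is not in `node_ids`.

    Every other line is passed through unchanged.
    """
    keep = set(node_ids)
    kept_lines = []
    for line in config_text.splitlines():
        match = PEER_LINE.match(line)
        if match and int(match.group(1)) not in keep:
            continue
        kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"


if __name__ == "__main__":
    sample = (
        "peer clusternode1-priv prefer\n"
        "peer clusternode2-priv\n"
        "peer clusternode3-priv\n"
    )
    print(prune_ntp_peers(sample, node_ids=[1, 2]), end="")
```

Work on a copy of the file and review the result before installing it on each node.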

Data Services

You're now ready to install the Data Services packages, those programs which monitor various applications and handle stopping, starting, and migrating them on the cluster. Make sure that you have downloaded the sc-agents-3_1_904-sparc.zip file for Sun Cluster, and unpack it. Run the scinstall program and choose Add support for new data services to this cluster node. Specify the directory you just unpacked to load all of the agents therein. Once you install the Data Services, you are ready to perform application-specific configuration.

How you configure application services will depend on the applications you choose, but they all involve configuring some sort of resource group. Resource groups usually contain a logical host resource (virtual IP address for the resource), a data storage resource, and one or more application resources. For example, a failover web server resource group would contain the virtual IP assigned to the web site, the global file system used by the web server, and an application resource that starts, stops, and monitors the web server. For information on configuring each type of pre-defined Data Service, please read the individual Sun Cluster Data Service guides. Once your resource groups are set up and online, your cluster is ready for use.


Important Man Pages

The Sun Cluster manual pages are available as part of the Sun Cluster Reference Manual for Solaris OS. Of particular interest are the following administrative man pages:

  • cconsole(1M), ctelnet(1M), crlogin(1M) - multi-window, multi-machine, remote console, login and telnet commands
  • sccheck(1M) - check for and report on vulnerable Sun Cluster configurations
  • scconf(1M) - update the Sun Cluster software configuration
  • scdidadm(1M) - global device identifier configuration and administration utility wrapper
  • scinstall(1M) - install Sun Cluster software and initialize new cluster nodes
  • scrgadm(1M) - manage registration and unregistration of resource types, resource groups, and resources
  • scsetup(1M) - interactive cluster configuration tool
  • scshutdown(1M) - shut down a cluster
  • scstat(1M) - monitor the status of Sun Cluster
  • scswitch(1M) - perform ownership and state change of resource groups and disk device groups in Sun Cluster configurations
