BigAdmin System Administration Portal

Sun Java System Messaging Server Message Store Disaster Recovery Techniques

Joe Sciallo, October 2007

This article describes general message store backup and restore policies for Sun Java System Messaging Server as well as how to recover from specific message store problems with the following:

  • An individual mailbox

  • Multiple mailboxes on a single partition (or an entire partition)

  • Multiple mailboxes across multiple partitions (or an entire message store host)

  • Multiple mailboxes across multiple message store hosts

This article was developed jointly by system administrators at the University of Wisconsin (Madison) and Messaging Server message store developers at Sun.


Overview of Messaging Server Architecture

A Messaging Server deployment consists of several classes of functionality, including an LDAP directory, Message Transfer Agents (MTAs), Messaging Multiplexors (MMPs), and back-end message stores. You can design the LDAP directory, MTAs, and MMPs for redundancy so that if a single host or even multiple hosts are unavailable, the service continues to function without a noticeable outage for end users. The back-end message store layer is different: if a message store is unavailable, users on that store experience an outage. Accordingly, to ensure that service can be restored after a failure, you need to plan for and document procedures for handling message store issues.

For more information on Messaging Server logical architecture, see Understanding the Two-tiered Messaging Architecture in Sun Java Communications Suite 5 Deployment Planning Guide.


Overview of Message Store Architecture

The message store contains the user mailboxes for a particular Messaging Server instance. The size of the message store increases as the number of mailboxes, folders, and log files increases. You can control the size of the store by specifying limits on the size of mailboxes (disk quotas), by specifying limits on the total number of messages allowed, and by setting aging policies for messages in the store.

The message store itself consists of a number of mailbox databases and the user mailboxes. The mailbox databases consist of information about users, mailboxes, partitions, quotas, and other message store related data. The user mailboxes contain the user's messages and folders. Mailboxes are stored in a message store partition, an area on a disk partition specifically devoted to storing the message store. Message store partitions are not the same as disk partitions, though for ease of maintenance, use one disk partition for each message store partition.

Mailboxes such as INBOX are located in the store_root directory. For example, a sample directory path might be:

store_root/partition/primary/=user/53/53/=mack1

For more information on the message store architecture, see Message Store Directory Layout in Sun Java System Messaging Server 6.3 Administration Guide.
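
The sample path above can be assembled from its parts. The following is a minimal sketch, not part of the product: mailbox_dir is a hypothetical helper, and it assumes that the hashed portion of the path (53/53/=mack1 in the sample) is supplied in exactly that form, for instance as obtained from the hashdir command.

```shell
# mailbox_dir STORE_ROOT PARTITION HASHED_USER
# Hypothetical helper: joins the pieces of an on-disk mailbox path using the
# layout of the sample path in this article.
# HASHED_USER is assumed to look like "53/53/=mack1".
mailbox_dir() {
    store_root=$1
    partition=$2
    hashed_user=$3
    printf '%s/partition/%s/=user/%s\n' "$store_root" "$partition" "$hashed_user"
}
```

Calling mailbox_dir store_root primary 53/53/=mack1 prints the sample path shown above. Verify the result against your own store layout before relying on it.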


Implementing a Message Store Backup and Restore Policy

Message store backup and restore is one of the most important administrative tasks for your message store deployment. You must implement a backup and restore policy for your message store to ensure that data is not lost if problems such as the following occur:

  • System crashes

  • Hardware failure

  • Accidental deletion of messages or mailboxes

  • Problems when reinstalling or upgrading a system

  • Natural disasters (for example, earthquakes, fire, hurricanes)

  • Migrating users

You have the following options for backing up and restoring the message store. You need to understand the pros and cons of these solutions to make the proper choice for your deployment.

  • The command-line utilities imsbackup and imsrestore. These utilities have a high cost in terms of CPU usage and IOPS, but they enable you to parallelize backup processes and to segregate partitions across multiple backup channels. The default is one backup channel; you can configure up to ten. The most popular reason for choosing these utilities is that you can restore services while the system is live. For example, after a true catastrophic failure, you can bring up a blank message store, allow new mail to start arriving for your users to read, and then restore the old mail over the course of a few hours or days.

  • A snapshot or other block-based backup solution. These solutions have a much lower impact on the server than the imsbackup utility does. However, when restoring services, you must complete the entire restore before you can make the services live.

For more information on developing a message store backup and restore policy for your site, see Creating a Mailbox Backup Policy in Sun Java System Messaging Server 6.3 Administration Guide.


Troubleshooting the Message Store

While message store data is fairly robust, on rare occasions there might be message store data problems. Messaging Server writes error messages to the default log file indicating message store problems. Beginning with Messaging Server 6.3, the software almost always fixes message store data problems transparently. In rare cases, the system writes an error message in the log file when you need to run the reconstruct utility.

Use the following techniques as part of a general approach to troubleshooting the message store:

  • Whenever the stored process is restarted, check the default log for any error messages or instructions to run store repair commands (such as reconstruct -m). Log files are located in the /opt/SUNWmsgsr/data/log directory, unless you have configured them differently. Error messages are likely to be displayed when you stop and start the message store. Follow instructions that appear in the log. For more information, see Repairing Mailboxes and the Mailboxes Database in Sun Java System Messaging Server 6.3 Administration Guide.

  • Try to isolate the problem to a single mailbox, a specific partition, an entire host, or a dependency. Then, if necessary, you can use the procedures in the next section, Recovering From Message Store Problems.
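
The log check described in the first technique above can be sketched as a small shell function. This is an illustration under stated assumptions: scan_store_log is a hypothetical name, the caller supplies the path to the log file, and the search terms (reconstruct and error) are a guess at what is worth surfacing, not a list of actual Messaging Server message formats.

```shell
# scan_store_log LOGFILE
# Hypothetical helper: prints, with line numbers, every line of LOGFILE that
# mentions "reconstruct" or "error" (case-insensitive).
# Returns 2 if the file cannot be read; otherwise returns grep's exit status
# (1 means no matching lines were found).
scan_store_log() {
    logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    grep -n -i -E 'reconstruct|error' "$logfile"
}
```

Run it against the default log in your log directory (/opt/SUNWmsgsr/data/log unless configured otherwise) each time the stored process restarts, and read the matching lines in context before acting on them.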


Recovering From Message Store Problems

What You Should Know Before Attempting to Recover Mailboxes

The mailbox recovery procedures in this section assume that you are familiar with:

  • Your own message store deployment and configuration, including Messaging Server hosts, layout of partitions, directory paths to the Messaging Server software, and so on

  • Locating a user account on disk by using the hashdir command

  • Messaging Server commands, such as reconstruct, mboxutil, start-msg, stop-msg, and so on

  • Checking Messaging Server and operating system logs

  • Message Store theory of operations, as described in Message Store Startup and Recovery in Sun Java System Messaging Server 6.3 Administration Guide

Note - Beginning with Messaging Server 6.0, when users log in and their mailbox is not in the folder.db file, the system automatically enters the mailbox in the folder.db file, along with other recovered data. Error correction has also been added to correct the store.idx file upon user login, so in most cases the folder works as expected. If you think the store.idx file has been corrupted, you can run the reconstruct -r command on the folder at this point. You will still want to run the reconstruct command to repair serious problems.

To Recover an Individual Mailbox

This task describes how to use the reconstruct command to recover individual mailboxes, and what the differences are between Messaging Server 6 and iPlanet Messaging Server 5.

Beginning with Messaging Server 6.0, you can immediately start the message store, and users may log in while the reconstruct command runs in the background. The stored process attempts to do what is necessary by checking the database for log accumulation and so forth. If there is a problem, the stored process either replaces the database with a snapshot, or removes the database so the administrator does not need to deal with it.

If you have an iPlanet Messaging Server 5 deployment, before recovering an individual mailbox or a small number of mailboxes, consider whether to allow users on the system before you resolve the problem mailboxes. In most cases, this means allowing a significant portion of users to work while a small number of mailboxes are “broken.” At first that might seem desirable, but consider whether the problems on the individual accounts might cause a large-scale outage to recur, and whether this risk outweighs the benefit of giving the majority of users access to the system before you complete repairs.

Before You Begin

The message store must be running. Users can be logged in and using the message store.

  1. Identify and reconstruct the accounts with problems:
    reconstruct -r

    where

    -r

    Repairs the spool areas of all mailboxes within the user partition directory.

    Run without a mailbox argument, the command covers the entire store. With the -n option, as in the second example below, reconstruct looks for inconsistencies but does not make changes. To specify a single partition or a pattern, refer to the following examples:

    reconstruct -p partition -r
    
    reconstruct -n -r user/a\*/INBOX

    In the latter example, the pattern matches only the inboxes of users whose IDs begin with a. Run a similar reconstruct command for each letter.

  2. Use the mboxutil command to determine if the mailbox is missing from mail system databases.

    If yes, go to Step 3. If no, run the following command:

    reconstruct -m -p partition -u userid

    The message store should now be able to function correctly, including for this user.

  3. If the mailbox is missing from mail system databases, perform the following:
    1. Create the mailbox.
      mboxutil -c user/userid/INBOX
    2. Create the other non-system folders.
      mboxutil -c user/userid/foldername

      This creates an empty store.idx and database entry.

    3. Run the following command to populate the store.idx files.
      reconstruct -rf user/userid

      The -f option causes reconstruct to ignore index count errors.

    The message store should now be able to function correctly, including for this user.
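
The per-letter check in Step 1 and the mailbox re-creation in Step 3 can be sketched as two shell functions that only print the commands, so that an administrator can review them before running anything. The function names are hypothetical. The sketch assumes that user IDs begin with a lowercase letter (IDs that begin with digits or other characters are not covered) and that folder names contain no spaces or shell metacharacters.

```shell
# print_inbox_checks
# Hypothetical helper: prints one check-only (-n) reconstruct command per
# leading letter a-z, following the pattern of the Step 1 example.
print_inbox_checks() {
    for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        # \\ in the printf format emits a single backslash, which keeps the
        # shell from expanding * when the printed command is run later.
        printf 'reconstruct -n -r user/%s\\*/INBOX\n' "$letter"
    done
}

# print_mailbox_recreate USERID [FOLDER...]
# Hypothetical helper: prints the Step 3 commands in order: create INBOX,
# create each named folder, then rebuild the store.idx files.
print_mailbox_recreate() {
    userid=$1
    shift
    printf 'mboxutil -c user/%s/INBOX\n' "$userid"
    for folder in "$@"; do
        printf 'mboxutil -c user/%s/%s\n' "$userid" "$folder"
    done
    printf 'reconstruct -rf user/%s\n' "$userid"
}
```

For example, print_mailbox_recreate mack1 Sent Drafts prints four commands: three mboxutil -c commands followed by reconstruct -rf user/mack1. Printing instead of executing is a deliberate choice here, because the article recommends operating on the store cautiously.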

To Recover Multiple Mailboxes on a Single (or Entire) Partition

Before You Begin

Notify users that the message store needs to be stopped and restarted.

  1. Shut down the message store and all other mail tasks to release any DB locks.
    stop-msg
  2. Restart the message store.
    start-msg store

    The message store attempts to initialize and begins processing the transaction logs. When the message store makes progress, you can start up the rest of the Messaging Server daemons (by using the commands start-msg or start-msg imap, start-msg http, and so forth), but do so gradually.

  3. If the message store cannot initialize properly, it automatically rolls back to the best store backup candidate (databases and transaction logs) and instructs you to run the reconstruct -m command.

    This step is necessary to account for the transactions dumped by the rollback. If the rollback version has multiple transaction logs, allow a couple of minutes before gradually starting Messaging Server processes. Then run the reconstruct command.
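
The stop, store-first start, and gradual start of the remaining services can be sketched as follows. This is a sketch under assumptions, not a supported script: run and staged_restart are hypothetical names, the fixed delay is only a stand-in for an administrator confirming in the logs that the store is making progress, and stop-msg and start-msg are assumed to be on the PATH. By default the sketch runs in dry-run mode and only prints the commands.

```shell
# run CMD [ARG...]
# Prints the command when DRY_RUN is 1 (the default in this sketch);
# executes it when DRY_RUN is set to any other value.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# staged_restart DELAY SERVICE...
# Stops all services, starts only the store, then starts each named service
# in turn, pausing DELAY seconds before each one.
staged_restart() {
    delay=$1
    shift
    run stop-msg || return 1
    run start-msg store || return 1
    for service in "$@"; do
        run sleep "$delay"
        run start-msg "$service" || return 1
    done
}
```

With DRY_RUN left at its default, staged_restart 60 imap http prints six lines, beginning with stop-msg and start-msg store. If the log instructs you to run reconstruct -m after a rollback, do that as described in Step 3 instead of relying on the delay alone.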

To Recover Multiple Mailboxes on Multiple Partitions (or on an Entire Message Store Host)

Recovering multiple partitions is similar to recovering a single partition, except that you must choose whether to restore all the partitions at once or run a separate command for each partition.

  • With multiple partitions, choose between the following two commands.
    reconstruct -m
    
    or
    
    reconstruct -p first_partition -m
    
    reconstruct -p second_partition -m
    
    ...
    
    reconstruct -p last_partition -m

    The advantage of the first option is that you have only one process to monitor. The second option completes in less time because the processes run in parallel, but you have multiple processes to run and monitor.
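
The second option, one reconstruct process per partition running in parallel, can be sketched as a shell function that starts each process in the background and then waits for all of them. The function name and the RECONSTRUCT variable are hypothetical conveniences introduced for this sketch; the partition names are whatever your deployment uses.

```shell
# Command to invoke for each partition. Overridable so that the function can
# be tried out without touching a real store.
RECONSTRUCT=${RECONSTRUCT:-reconstruct}

# reconstruct_partitions PARTITION...
# Starts "reconstruct -p PARTITION -m" in the background for every partition
# named, waits for all of them, and returns 1 if any process failed.
reconstruct_partitions() {
    pids=""
    for partition in "$@"; do
        "$RECONSTRUCT" -p "$partition" -m &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

The return value gives a single pass or fail signal, but you still need to review the output of each process individually, which is the monitoring cost that the comparison above describes.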

To Recover Multiple Mailboxes on Multiple Message Store Hosts

If multiple hosts are having problems, the problem is most likely in the underlying infrastructure. Possibilities include (but are not limited to) the following:

  • Enterprise storage

  • LDAP directory

  • Power or environment

  • Network

  • Operating system and patches

  1. Check the mail system's log, by default located in the /opt/SUNWmsgsr/data/log directory.

    An error message might indicate where the underlying problem resides.

  2. Otherwise, check the system logs. (On Solaris OS, start with /var/adm/messages.)

    Resolve any problems with the underlying infrastructure before attempting to repair the mail system. Once you have resolved those problems, use To Recover Multiple Mailboxes on Multiple Partitions (or on an Entire Message Store Host) to recover individual hosts.


Troubleshooting a Hung Database

Transaction log buildup is one symptom of a hung database caused either by excessive deadlocking, or by an orphaned lock created by a process exiting unexpectedly, including a killed process.

Starting with Messaging Server 6.2, log file accumulation is monitored and warning messages are printed. The watcher program registers and reports crashed processes that can leave orphaned locks. In addition, an automatic restart option is available.

Note - Planned future product features will enable the message store to start immediately and find user-owned folders before the reconstruct -m command is run.

New in Messaging Server 6.3 is a program that verifies the snapshots. This addresses the case where database corruption is spread from the active database to the snapshots. If you are running Messaging Server 6.3, database snapshots are configured automatically through the imdbverify utility. In this way, you automate database recovery. You can configure imdbverify to run more frequently if desired.

Caution - It is very unlikely that the database will become corrupted. Do not try to manually operate on the database except as a last resort. If you do end up manually attempting to fix the database, be sure to report this when working on the problem with the Sun Support Center.

To Identify a Hung Database

  1. Do you have any of the following symptoms?
    • Only one message store is having a problem.

    • Messages have built up in the ims-ms channel. (Use the imsimta qm summarize command to view what is happening on the system.)

    • Users hang on one message store. Review the logs that show progress information.

    • Message Store tasks, such as peruser checkpoint and nightly expire, have not completed.

  2. Check the message store DB and lock information.
    • Messaging Server 6.3: Run the imcheck -s command to check the message store DB and lock information.

    • Prior to Messaging Server 6.3: Run the db_stat command. First determine the location of the store temporary directory (shown here as /var/path/store/tmp) with the configutil -o store.dbtmpdir command. If this value is not set, the default is the store database directory, msg_path/data/store/mboxlist/. Then switch to the Messaging Server user (for example, mailsvr) and run the db_stat command:

      su msg_user
      
      msg_path/lib/db_stat -t -h /var/path/store/tmp

    At this point, you will see multiple transaction log files of the form log.000000xxxx.

  3. Check for persistent DB locks that may indicate an orphaned lock. Run the check from Step 2 multiple times and see whether any locks fail to clear.
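
The repeated check in Step 3 can be sketched as a shell function that runs a status command several times and keeps only the output lines that appear, unchanged, in every run. This is an illustration under assumptions: persistent_lines is a hypothetical name, the status command is supplied by the caller, and the article does not show the output format of imcheck -s or db_stat, so you need to judge for yourself which of the surviving lines describe locks.

```shell
# persistent_lines RUNS DELAY CMD [ARG...]
# Runs CMD a total of RUNS times, DELAY seconds apart, and prints the lines
# of the first run that appear verbatim in every later run.
persistent_lines() {
    runs=$1
    delay=$2
    shift 2
    keep=$(mktemp) || return 1
    cur=$(mktemp) || return 1
    "$@" > "$keep"
    i=1
    while [ "$i" -lt "$runs" ]; do
        sleep "$delay"
        "$@" > "$cur"
        # Keep only the lines that match a whole line of the new output.
        grep -F -x -f "$cur" "$keep" > "$keep.new"
        mv "$keep.new" "$keep"
        i=$((i + 1))
    done
    cat "$keep"
    rm -f "$keep" "$cur"
}
```

On Messaging Server 6.3, for example, persistent_lines 5 30 imcheck -s would compare five runs taken 30 seconds apart. Lines that contain changing counters drop out of the result, and lines that stay identical, which may include a lock that never clears, remain.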


