Sun Java System Messaging Server Message Store Disaster Recovery Techniques
Joe Sciallo, October 2007
This article describes general message store backup and restore policies for Sun Java
System Messaging Server as well as how to recover from specific message store
problems with the following:
An individual mailbox
Multiple mailboxes on a single partition (or an entire partition)
Multiple mailboxes across multiple partitions (or an entire message store host)
Multiple mailboxes across multiple message store hosts
This article was developed jointly by system administrators at the University of Wisconsin
(Madison) and Messaging Server message store developers at Sun.
A Messaging Server deployment consists of several classes of functionality, including an LDAP
directory, Message Transfer Agents (MTAs), Messaging Multiplexors (MMPs), and back-end message stores. You
can design the LDAP directory, MTAs, and MMPs for redundancy such that if
a single host or even multiple hosts are unavailable, the service continues to
function without a noticeable performance outage for end users. The back-end message store
layer is different in that if a message store is unavailable, users on
that store experience an outage. Accordingly, to ensure the restoration of service in
the case of a failure, you need to plan for and document procedures
for handling message store issues.
The message store contains the user mailboxes for a particular Messaging Server instance.
The size of the message store increases as the number of mailboxes, folders,
and log files increases. You can control the size of the store by
specifying limits on the size of mailboxes (disk quotas), by specifying limits on
the total number of messages allowed, and by setting aging policies for messages
in the store.
The message store itself consists of a number of mailbox databases and the
user mailboxes. The mailbox databases consist of information about users, mailboxes, partitions, quotas,
and other message store related data. The user mailboxes contain the user's messages
and folders. Mailboxes are stored in a message store partition, an area on
a disk partition specifically devoted to storing the message store. Message store partitions
are not the same as disk partitions, though for ease of maintenance, use
one disk partition for each message store partition.
Mailboxes such as INBOX are located in the store_root directory. For example,
a sample directory path might be:
Implementing a Message Store Backup and Restore Policy
Message store backup and restore is one of the most important administrative tasks
for your message store deployment. You must implement a backup and restore policy
for your message store to ensure that data is not lost if
problems such as the following occur:
System crashes
Hardware failure
Accidental deletion of messages or mailboxes
Problems when reinstalling or upgrading a system
Natural disasters (for example, earthquakes, fire, hurricanes)
Migrating users
You have the following options for backing up and restoring the message store.
You need to understand the pros and cons of these solutions to
make the proper choice for your deployment.
The command-line utilities imsbackup and imsrestore. These utilities have a high cost in terms of CPU usage and IOPS, but they enable you to parallelize backup processes and to segregate partitions across multiple backup channels. The default is one backup channel; you can increase this to as many as ten backup channels. The most popular reason for choosing these utilities is that you can restore services while the system is live. For example, in the case of a true catastrophic failure, you can bring up a blank message store, allow new mail to start arriving so that users can read it, and then restore the old mail over the course of a few hours or days.
A “snap-shotting” or other block-based backup solution. These solutions have a much lower impact on the server than the imsbackup utility does. However, when restoring services, you must do a complete restore before you can make the services live.
While message store data is fairly robust, on rare occasions there might be
message store data problems. Messaging Server writes error messages to the default log
file indicating message store problems. Beginning with Messaging Server 6.3, the software almost
always fixes message store data problems transparently. In rare cases, the system writes
an error message in the log file when you need to run the
reconstruct utility.
Use the following techniques as part of a general approach to troubleshooting the
message store:
Whenever the stored process is restarted, check the default log for any error messages or instructions to run store repair commands (such as reconstruct -m). Log files are located in the /opt/SUNWmsgsr/data/log directory, unless you have configured them differently. Error messages are likely to be displayed when you stop and start the message store. Follow instructions that appear in the log. For more information, see Repairing Mailboxes and the Mailboxes Database in Sun Java System Messaging Server 6.3 Administration Guide.
Try to isolate the problem to a single mailbox, a specific partition, an entire host, or a dependency. Then, if necessary, you can use the procedures in the next section, Recovering From Message Store Problems.
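The log check described above can be scripted. The following is a minimal sketch, not a supported tool. It assumes that the log is plain text and that any repair instruction names the reconstruct utility literally; the actual message wording is not specified in this article. The function name scan_store_log is hypothetical.

```shell
#!/bin/sh
# Sketch: list log lines that mention a store repair command.
# Assumption: the log is plain text and the instruction names the
# "reconstruct" utility literally; the real message format may differ.
scan_store_log() {
    log_file=$1
    # -i: case-insensitive match, -n: prefix each hit with its line number
    grep -in 'reconstruct' "$log_file"
}
```

You would pass the path of the default log file in your log directory (by default under /opt/SUNWmsgsr/data/log). The function prints nothing and returns a nonzero status when no line matches.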
What You Should Know Before Attempting to Recover Mailboxes
The mailbox recovery procedures in this section assume that you are familiar with:
Your own message store deployment and configuration, including Messaging Server hosts, layout of partitions, directory paths to the Messaging Server software, and so on
Locating a user account on disk by using the hashdir command
Messaging Server commands, such as reconstruct, mboxutil, start-msg, stop-msg, and so on
Checking Messaging Server and operating system logs
Note - Beginning with Messaging Server 6.0, when users log in and their mailbox is
not in the folder.db file, the system automatically enters the mailbox in the
folder.db file, along with other recovered data. You can run the reconstruct -r
command on the folder at this point if you think the store.idx file has
been corrupted. Otherwise, the folder will work as expected. Beginning with Messaging Server
6.0, error correction has been added to correct the store.idx file simply
upon user login. Nevertheless, you will still want to run the reconstruct command
to repair serious problems.
To Recover an Individual Mailbox
This task describes how to use the reconstruct command to recover individual mailboxes,
and what the differences are between Messaging Server 6 and iPlanet Messaging Server
5.
Beginning with Messaging Server 6.0, you can immediately start the message store, and
users may log in while the reconstruct command runs in the background. The
stored process attempts to do what is necessary by checking the database for
log accumulation and so forth. If there is a problem, the stored process
either replaces the database with a snapshot, or removes the database so the
administrator does not need to deal with it.
If you have an iPlanet Messaging Server 5 deployment, before recovering an individual
mailbox or small number of mailboxes, consider whether to allow users on the
system before you resolve the problem mailboxes. In most cases, this means allowing
a significant portion of users to work while a small number of mailboxes
are “broken.” At first that might seem desirable, but give thought as to
whether or not the problems on the individual accounts might cause a large
scale outage to recur and if this risk outweighs the benefits of letting
the majority of users have access to the system before you complete repairs.
Before You Begin
The message store must be running. Users can be logged in and
using the message store.
Identify and reconstruct the accounts with problems:
reconstruct -r
where
-r
Repairs the spool areas of all mailboxes within the user partition directory.
The following command runs across the entire store and looks for inconsistencies but
does not make changes. To specify a single partition, or pattern, refer to
the following examples:
In the latter example, run such a reconstruct command for each letter. This
checks only inboxes.
Use the mboxutil command to determine if the mailbox is missing from mail
system databases.
If yes, go to Step 3. If no, run the following command:
reconstruct -m -p partition -u userid
The message store should now be able to function correctly, including for this
user.
If the mailbox is missing from mail system databases, perform the following:
Create the mailbox.
mboxutil -c user/userid/INBOX
Create the other non-system folders.
mboxutil -c user/userid/foldername
This creates an empty store.idx and database entry.
Run the following command to populate the store.idx files.
reconstruct -rf user/userid
The -f option causes reconstruct to ignore index count errors.
The message store should now be able to function correctly, including for this
user.
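The steps for a mailbox that is missing from the mail system databases can be collected into a dry-run helper. The following sketch only prints the mboxutil and reconstruct commands shown above so that you can review them before running anything. The function name plan_mailbox_recovery and the sample user ID and folder names are hypothetical.

```shell
#!/bin/sh
# Sketch: print (do not run) the recovery commands for one user.
# Usage: plan_mailbox_recovery USERID [FOLDER ...]
plan_mailbox_recovery() {
    userid=$1
    shift
    # Create the INBOX entry
    echo "mboxutil -c user/$userid/INBOX"
    # Create each non-system folder passed as an argument
    for folder in "$@"; do
        echo "mboxutil -c user/$userid/$folder"
    done
    # Populate the store.idx files; -f ignores index count errors
    echo "reconstruct -rf user/$userid"
}
```

For example, plan_mailbox_recovery jdoe Sent Drafts prints four commands: one for INBOX, one for each of the two folders, and the final reconstruct. Printing rather than executing keeps the helper safe to try on a live host.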
To Recover Multiple Mailboxes on a Single (or Entire) Partition
Before You Begin
Notify users that the message store needs to be stopped and restarted.
Shut down the message store and all other mail tasks to release any
DB locks.
stop-msg
Restart the message store.
start-msg store
The message store attempts to initialize and begins processing the transaction logs. When
the message store makes progress, you can start up the rest of the
Messaging Server daemons (by using the commands start-msg or start-msg imap, start-msg http, and
so forth), but do so gradually.
If the message store cannot initialize properly, it automatically rolls back to the
best store backup candidate (DBs and Transaction Logs), and instructs you to run
the reconstruct -m command.
This step is necessary to account for the transactions dumped by the rollback.
If the rollback version has multiple transaction logs, allow a couple of minutes
before gradually starting Messaging Server processes. Then run the reconstruct command.
To Recover Multiple Mailboxes on Multiple Partitions (or on an Entire Message Store Host)
Recovering multiple partitions is similar to recovering a single partition, except that you
must choose whether to restore all the partitions at once or run a
separate command for each partition.
With multiple partitions, choose between the following two commands.
The advantage of the first option is that you have to
monitor one process. The second option will complete in less time than running
only one process because the processes are able to run in parallel, but
you have multiple processes to run and monitor.
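The parallel approach can be sketched as one background job per partition, followed by a wait on each job. This is an illustration under stated assumptions, not the exact commands for your deployment: REPAIR_CMD is a placeholder for the per-partition command you chose, and by default it only echoes its argument so that nothing is modified. The function name run_per_partition and the partition names are hypothetical.

```shell
#!/bin/sh
# Sketch: run one job per partition in parallel and wait for all of them.
# REPAIR_CMD is a placeholder (an assumption): set it to the per-partition
# command you chose. By default it only echoes, so nothing is modified.
REPAIR_CMD=${REPAIR_CMD:-echo repair}

run_per_partition() {
    pids=""
    for partition in "$@"; do
        # Start each job in the background and remember its process ID
        $REPAIR_CMD "$partition" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        # wait returns the exit status of that background job
        wait "$pid" || failed=$((failed + 1))
    done
    echo "jobs failed: $failed"
    [ "$failed" -eq 0 ]
}
```

Calling run_per_partition primary secondary starts two jobs at once and reports how many exited with an error, which addresses the monitoring burden noted above: you watch one summary line instead of several terminals.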
To Recover Multiple Mailboxes on Multiple Message Store Hosts
If multiple hosts are having problems, the problem is most likely in the
underlying infrastructure. Possibilities include (but are not limited to) the following:
Enterprise storage
LDAP directory
Power or environment
Network
Operating system and patches
Check the mail system's log, by default located in the /opt/SUNWmsgsr/data/log directory.
An error message might indicate where the underlying problem resides.
Otherwise, check the system logs. (On Solaris OS, start with /var/adm/messages.)
Transaction log buildup is one symptom of a hung database caused either by
excessive deadlocking, or by an orphaned lock created by a process exiting unexpectedly,
including a killed process.
Starting with Messaging Server 6.2, log file accumulation is monitored and warning messages
are printed. The watcher program registers and reports crashed processes that can leave
orphaned locks. In addition, an automatic restart option is available.
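On releases without built-in monitoring, a simple count of transaction log files can serve as a rough check for accumulation. The sketch below assumes that the log files are named log. followed by digits and that you pass in the directory that holds them in your deployment. The function name count_txn_logs is hypothetical, and what count is abnormal is a site-specific judgment.

```shell
#!/bin/sh
# Sketch: count transaction log files (named log.<digits>) in a directory.
# Assumption: the argument is whatever directory holds the store database
# logs in your deployment; this article does not fix that location.
count_txn_logs() {
    db_dir=$1
    count=0
    for f in "$db_dir"/log.[0-9]*; do
        # The glob stays literal when nothing matches, so test existence
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running the function periodically and comparing the results shows whether the number of log files is growing or holding steady.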
Note - Future planned product features enable the message store to start immediately, and find
user-owned folders before the reconstruct -m command is run.
New in Messaging Server 6.3 is a program that verifies the snapshots. This
addresses the case where database corruption is spread from the active database to
the snapshots. If you are running Messaging Server 6.3, database snapshots are verified
automatically through the imdbverify utility. In this way, you automate database recovery. You
can configure imdbverify to run more frequently if desired.
Caution - It is very unlikely that the database will become corrupted. Do not try
to manually operate on the database except as a last resort. If you
do end up manually attempting to fix the database, be sure to
report this when working on the problem with the Sun Support Center.
To Identify a Hung Database
Do you have any of the following symptoms?
Only one message store is having a problem.
Messages have built up in the ims-ms channel. (Use the imsimta qm summarize command to view what is happening on the system.)
There are users that hang on one message store. Review the logs that show progress information.
Message Store tasks, such as peruser checkpoint and nightly expire, have not completed.
Check the message store DB and lock information.
Messaging Server 6.3: Run the imcheck -s command to check the message store DB and lock information.
Prior to Messaging Server 6.3: Run the db_stat command. First check the location of the /var/path/store/tmp directory with the configutil -o store.dbtmpdir command. If this value is not set, the default value is the store database directory, msg_path/data/store/mboxlist/. Once you have the location of the store/tmp directory, run the db_stat command, first switching to the Messaging Server user (for example, mailsvr):
su msg_user
msg_path/lib/db_stat -t -h /var/path/store/tmp
At this point, you will see multiple transaction log files of the
form log.000000xxxx.
Check for persistent DB locks that may indicate an orphaned lock. Run the check multiple
times and see whether any locks fail to clear.
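Comparing repeated lock checks by eye is error-prone, so it can help to save each run to a file and compute what they have in common. The following sketch prints the lines that appear in every saved capture; those lines are candidates for orphaned locks. It assumes that each capture is a text file with one lock per line and that a given lock is printed identically from run to run, which might require filtering the real output first. The function name persistent_lines is hypothetical.

```shell
#!/bin/sh
# Sketch: print the lines that appear in every saved capture of lock output.
# Assumption: one lock per line, printed identically between runs; real
# output may need filtering (for example, removing counters) first.
persistent_lines() {
    total=$#
    for capture in "$@"; do
        # De-duplicate within a capture so each file counts a line once
        sort -u "$capture"
    done |
        sort |
        uniq -c |
        awk -v n="$total" '$1 == n { sub(/^ *[0-9]+ /, ""); print }'
}
```

For example, after saving three runs to run1.txt, run2.txt, and run3.txt, persistent_lines run1.txt run2.txt run3.txt lists only the entries present in all three. An empty result suggests that the locks are clearing normally.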
For More Information
See the following topics for more information on message store backup, recovery, and
troubleshooting: