Troubleshooting the Message Transfer Agent for Sun Java System Messaging Server
K. Caudill, S. Hjorth, and K. Hubner, August 2007
This article describes how to troubleshoot the Message Transfer Agent (MTA) for Sun
Java System Messaging Server, specifically message buildup in channels, including the TCP channels
(tcp_local and tcp_intranet) and the ims-ms channel. This article treats troubleshooting as a
process of learning about the MTA so that you can identify signs of
problems and can then respond with possible solutions.
Products covered by this article are:
Sun Java System Messaging Server 6
iPlanet Messaging Server 5
Note - The information in this technical note describes Sun Java System Messaging Server 6
commands. Where appropriate, this article mentions iPlanet Messaging Server 5 commands.
Understanding Message Flow Through the MTA and Channels
Before you begin to troubleshoot why messages build up in a channel queue,
or understand if an actual problem might be occurring on your system, you
must first understand how mail messages flow through the MTA.
The MTA can be thought of as Messaging Server's central brain, responsible for
message routing. The MTA takes in messages through SMTP sessions from other systems
and decides what to do with those messages. The first stop for a
message is the MTA SMTP server, which executes programs to handle the
SMTP session. Based on numerous configuration possibilities, the SMTP server processes the message, which
could include message blocking, address changing, or channel enqueueing. A channel is a
message connection with another system or destination.
Actually, the sequence of events is a bit more complex. The dispatcher spawns
the SMTP server by listening on port 25 (or whichever port is
defined for the SMTP server in the dispatcher.cnf file). When the dispatcher detects an
attempt to connect to port 25, it starts an SMTP process to handle
the incoming connection. The SMTP server typically decides whether to accept or reject
the message based on numerous configuration possibilities. During the SMTP dialogue, the MTA
machinery must decide what to do with the message. However, the PORT_ACCESS mapping
table works with the dispatcher rather than the SMTP server to allow or
block access to certain ports such as the SMTP port (port 25).
The focus of this article is the decision to route the message
to a channel where the message is to be enqueued. Once enqueued to
a channel, the message's destination could be another server (either on the Internet
or on your company's intranet), a remote message store, a specific domain name, a
channel for extra processing, such as virus filtering, or a local message store.
When a channel contains messages but is not delivering them, the messages build
up in the channel's message queue. The message queues are directories, located by
default in the msg-svr-base/data/queue/channel/ directory. The MTA holds the messages for future delivery
until the situation preventing their delivery is resolved.
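As a rough illustration of inspecting the queue directories directly, the following Python sketch counts the files beneath each channel directory. It assumes that every channel has its own directory under the queue root and that message files may sit in subdirectories below it. The function name count_queued_files is hypothetical, and the queue root is whatever path your installation uses. This is not a replacement for the imsimta qm commands.

```python
import os
from collections import Counter


def count_queued_files(queue_root):
    """Return a Counter mapping channel directory name -> number of files below it.

    Assumption: each immediate subdirectory of queue_root is one channel queue.
    """
    counts = Counter()
    for channel in sorted(os.listdir(queue_root)):
        channel_dir = os.path.join(queue_root, channel)
        if not os.path.isdir(channel_dir):
            continue
        # Count regular files at any depth below the channel directory.
        counts[channel] = sum(len(files) for _, _, files in os.walk(channel_dir))
    return counts
```

Calling count_queued_files with the queue root and printing the result of most_common() gives a quick view of which channel directories hold the most files. On a system with a very large backlog, walking the directories adds disk load of its own, so use it sparingly.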
The specific channels discussed in this technical note are the TCP channels
tcp_local and tcp_intranet and the ims-ms channel. The tcp_local channel is responsible for
routing messages to the Internet, while the tcp_intranet channel is responsible for delivering messages
to remote message stores on your company's intranet. The tcp_intranet channel also routes
messages to any intermediary internal systems on their way to another system. The
ims-ms channel is responsible for delivering messages to the local message store.
Though this might be counterintuitive, in general, message buildup in a message queue
is not a problem. The MTA is designed to handle message buildup. Internet
mail (SMTP) is a store-and-forward mail system. Internet mailers are designed to store
the messages that cannot yet be delivered. Outgoing SMTP messages cannot always
be delivered immediately. Network problems, problems on remote hosts, problems
with remote users' mailboxes, and so forth, are common cases that cause the
MTA to hold and not deliver messages. MTAs, including the MTA belonging to
Messaging Server, therefore, can store and retry outgoing messages. A common case is
that message store users are over quota, so their messages
wait for retry. From the MTA design point of view, this case
is the same as being unable to deliver a message because of
a network problem. The MTA handles the overquota situation with the same underlying
mechanisms: the messages are stored in queues where they will be retried for
delivery. Thus, seeing a buildup of messages in queues is not necessarily a
cause for alarm.
The important question is when queue buildup indicates a problem that
needs troubleshooting. The next sections explain the main causes of queue buildup and
how to troubleshoot and resolve such situations.
Main Causes of Message Queue Buildup
In general, four situations can cause messages to build up in the
MTA message queues:
Performance problems. These problems occur when your system is unable to “keep up.” Performance problems manifest themselves by high system CPU load and disk I/O utilization, and individual MTA related processes running with very high CPU utilization. In such situations, perform an in-depth look at the entire system and tune it to reduce load. Performance problems are not discussed here because how to tune your Messaging Server deployment is outside the scope of this article.
Destination host problems. Regardless of what is causing problems on the destination host (networking problems, system problems, and so on), this situation can either cause all delivery threads to stop, so that no valid email gets through, or cause a large amount of email to be queued. Stopped delivery threads show up as many active messages waiting to be processed while the system CPU load is low. Queued mail shows up as many queued messages with long delivery attempt histories.
Problems with Messaging Server itself. An example of a Messaging Server problem is when the stored process has an orphan database lock for a user account, resulting in message buildup for users in the ims-ms channel.
Configuration issues. Two common configuration issues are: (1) the job controller ignoring queued email, or email being queued because of overquota accounts; and (2) email being queued because of slow directory response, which is common for big mailing lists.
The following sections describe both a general approach to troubleshooting channels and specific
tips for destination host problems, problems with Messaging Server itself, and configuration issues.
Troubleshooting TCP Channels
This section describes a general approach for determining if a problem exists when
you have message buildup in TCP message queues.
To Troubleshoot TCP Channels
You must enable logging for the channels that you want to troubleshoot. The
amount of logging that you set (log level) depends on your situation.
When you suspect that a situation exists where more than the normal store-and-forward
aspect of message buildup in message queues is occurring, inspect the channel queues.
Use the imsimta qm summarize command. For more information, see imsimta qm.
If you have a large backlog of messages, use the imsimta qm messages channel
command instead because the imsimta qm summarize command can greatly impact your system.
The imsimta qm messages command lists the destination hosts for which messages are queued in
the specified channel. The imsimta qm messages command also lists how many messages are waiting for
their next scheduled retry (delayed messages) as contrasted with how many are ready
to be retried now (active messages). When you use the imsimta qm messages command, you
must specify a channel name (a wildcard is not valid input). For example:
In this example, 2,000 messages are waiting for their next scheduled retry to
be delivered to example.com, and 3,000 are waiting for their next scheduled retry
for delivery to sesta.com.
Note - The Messaging Server 6.3 release removed the imsimta qm messages command. However, Messaging Server
6.3 does contain a new, useful imsimta qm jobs command to help understand why messages are
not being delivered. You can also use the equivalent command in Messaging Server
6.3, imsimta qm summarize -hosts.
Output from the imsimta qm command is also helpful for the situation where messages
have failed and are waiting to be retried. If many messages exist in
a channel queue, but most of them are delayed, the problem is most
likely with the remote domain. See Tips for Dealing with Destination Host Problems for information on how to work
around this problem.
Messaging Server 6.3. A message can be in the process of being tried by a job, on a channel waiting to be tried, or on a channel waiting to be retried. You might see a zero (0) in the active messages column of the output of the messages command because the tally includes messages which have not been tried yet and those which were previously delayed and are now ready to be tried again. In Messaging Server 6.3, you can see the messages being retried with the new jobs command.
iPlanet Messaging Server 5. Use top -to channel or top -domain_to channel to analyze what is occurring on a particular channel.
Look for trends on your system.
For example, when most of the mail is destined for one remote domain,
check the status of that remote domain. Additionally, look in the mail.log_current file to
determine what has happened recently when you tried to send mail to
that remote domain. See Tips for Dealing with Destination Host Problems for information on how to work around this
problem.
Examine the delivery attempt history for some messages to determine if these messages
are all non-delivery notifications for spam, which was not deliverable.
Use the imsimta qm dir -toaddress command to select a group of messages. Then
use this information to look at the delivery attempt history of some of
the messages. (You use the sequence numbers from the dir listing). If these messages
are all non-delivery notifications for spam, which was not deliverable, determine how those
original spam messages got into the system. Verify that the messages are spam
by using the imsimta qm subcommands dir, read, and history and route the non-delivery
notifications through a different outbound channel to prevent them from choking the normal
tcp_local channel queue.
For example, use the notificationchannel and dispositionchannel keywords to specify an alternate
process channel to queue delivery status notifications and message disposition notifications, respectively. Then
you use source-specific rewrite rules to direct messages from these process channels to a
particular tcp_* channel set to only use a few processes or threads. For
more information, see Source-Channel-Specific Rewrite Rules ($M, $N).
Verify that the master process for the channel is started.
The tcp_* channels use the smtp_client process. To find out which process is
associated with which channel in order to dequeue, see the master_command parameter in
the associated channel block in the job_controller.cnf file. See Examples of Use for more information.
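The active-versus-delayed comparison used earlier in this procedure can be sketched as a small Python helper. It takes per-host counts that you have already collected from the queue listing and flags hosts whose queued mail is mostly delayed, which the procedure treats as a sign of a remote-domain problem. The function name, the 0.8 ratio, and the minimum of 100 messages are illustrative assumptions, not values from Messaging Server. The delayed counts for example.com and sesta.com come from the example in the text, while the active counts and the siroe.com entry are made up for the sketch.

```python
def mostly_delayed_hosts(host_counts, threshold=0.8, min_total=100):
    """Return (host, total) pairs whose queue is dominated by delayed messages.

    host_counts maps host name -> (active, delayed).
    threshold and min_total are arbitrary illustrative cut-offs.
    """
    flagged = []
    for host, (active, delayed) in host_counts.items():
        total = active + delayed
        if total >= min_total and delayed / total >= threshold:
            flagged.append((host, total))
    # Largest backlog first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


counts = {
    "example.com": (0, 2000),  # delayed count from the text; active count assumed
    "sesta.com": (0, 3000),    # delayed count from the text; active count assumed
    "siroe.com": (40, 2),      # hypothetical host with mostly active mail
}
print(mostly_delayed_hosts(counts))
```

Hosts that the helper flags are candidates for the workarounds in Tips for Dealing with Destination Host Problems. Hosts with mostly active mail point instead toward the local system not keeping up.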
Tips for Dealing with Destination Host Problems
When you have determined that messages are queued to an unavailable remote host,
you have two options:
Create a new channel for the host. If this host is consistently a problem, all future email will go to this new channel. For existing messages that are enqueued, you can either wait for the problem with the destination host to be resolved or delete the messages from the queue.
Increase the number of delivery threads for the channel, or set a ceiling on the number of queued messages that will trigger a new thread or process to start. See the max_client_threads parameter in the channel option file and the threaddepth channel keyword, respectively.
A situation that occurs with Messaging Server itself is when messages build up
in the ims-ms channel. The four cases where the ims-ms channel shows
a buildup of messages in the queue are:
IMAP_MAILBOX_LOCKED. You might see this error in a message file that is briefly in the queue area. The error repeats in a message file in the queue area only if the mailbox remains locked for an extended period. The job controller retries delivery of these messages after short delays until either the message gets delivered or a different error is encountered.
IMAP_MAILBOX_BADFORMAT, IMAP_MAILBOX_NOTSUPPORTED. The mailbox is most likely corrupted. This case rarely occurs. You might want to use the reconstruct command for these cases. See reconstruct for more information.
IMAP_IOERROR. The message store is most likely corrupted or otherwise inaccessible. This case occurs very rarely.
IMAP_QUOTA_EXCEEDED. The user or users are over quota. This case is the most common and is discussed in the remainder of this section.
Note - If the channel gets a permanent delivery failure error, then the message is
immediately bounced and does not remain in the ims-ms queue area.
To troubleshoot the ims-ms channel, use the following high-level approach:
Perform a similar investigation as you would for tcp_* channels (see Troubleshooting TCP Channels) by using the imsimta qm summarize command to view what is happening on the system.
Use the imsimta qm history command to examine the message IDs to detect if different types of messages exist.
A message's file name starting with ZZ indicates that it has not been tried yet. The message file name is a counter starting at ZZ and decremented (ZY, ZX, and so on) each time the message is tried, fails, and is reenqueued for retry. For a ZZ*, no history exists.
In general, but not always, when you have non-ZZ*, non-.HELD files in the queue area, you have the IMAP_QUOTA_EXCEEDED case. (The frequency with which you see IMAP_MAILBOX_LOCKED conditions probably depends upon user and email client characteristics. This condition is more common with users who like to receive and move around lots of large attachments but it should typically occur rarely.)
For a site that enforces quota, probably most of the non-ZZ*, non-.HELD messages in the ims-ms queue area are there because of the recipient user being over quota. To verify this case, run the imsimta qm command with the history subcommand. You should see over quota in the history of the overquota messages.
Note - In iPlanet Messaging Server 5.2, the imsimta qm top command was enhanced to have more sorting options.
Look for Q status messages in the mail.log_current file pertaining to the ims-ms channel.
When you see “mailbox is busy” and Q status in the mail.log_current file, then the message is put back on the queue to be retried later as per the job controller's scheduling and the backoff keyword on the channel.
If you do not find Q status messages, check that the ims_master process is running. If there are errors in its log file, the ims_master process could be hung. Use the imsimta process command to verify running processes.
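The file-name convention described in this procedure (a ZZ prefix for messages not yet tried, a decremented prefix after each failed attempt, and a .HELD suffix for held messages) can be turned into a quick tally. The following Python sketch classifies file names only and does not read the messages or their history. The function names and the sample file names are hypothetical.

```python
from collections import Counter


def classify_queue_file(name):
    """Classify a queue file name as 'held', 'untried', or 'retried'."""
    upper = name.upper()
    if upper.endswith(".HELD"):
        # Check the suffix first: held files can also start with ZZ.
        return "held"
    if upper.startswith("ZZ"):
        return "untried"
    # Any other prefix (ZY, ZX, ...) means at least one failed attempt.
    return "retried"


def tally(names):
    """Count how many file names fall into each class."""
    return Counter(classify_queue_file(n) for n in names)


# Hypothetical file names, for illustration only.
print(tally(["ZZ0abc.00", "ZY0def.00", "ZX0ghi.00", "ZZ0jkl.HELD"]))
```

A large "retried" count in the ims-ms queue area is the pattern the text associates with overquota recipients. Confirm the cause with the history subcommand rather than relying on file names alone.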
If your troubleshooting determines that users are overquota, use the following procedure.
To Resolve Overquota Mailboxes
Inform users that they need to delete messages to return their mailbox to
underquota status.
In lieu of informing users to delete messages, increase their quota.
Reduce the time that mail is queued for overquota accounts before being bounced
back as overquota.
See the store.quotagraceperiod configutil parameter. If you don't want to queue email for
overquota accounts (and bounce the message straight back), set this parameter to 0
(no grace period). This parameter is available in iPlanet Messaging Server 5 as
well.
(Messaging Server 6 only). Enable the local.store.overquotastatus configutil parameter to enable quota
enforcement before messages are enqueued in the MTA and to prevent the MTA
from filling up with messages.
Enable this setting on all front-end MTA systems and back-end mailstore systems.
About the ims_master Process Shutting Down and Starting Up
At times you might see messages in the imta log file that
the ims_master process is shutting down and starting up:
[30/Aug/2006:17:05:05 -0400] learn ims_master[19736]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) shutting down
[30/Aug/2006:17:05:20 -0400] learn ims_master[28310]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) starting up
[30/Aug/2006:17:07:24 -0400] learn ims_master[28310]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) shutting down
[30/Aug/2006:17:07:32 -0400] learn ims_master[28380]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) starting up
[30/Aug/2006:17:19:31 -0400] learn ims_master[28380]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) shutting down
These notices reflect normal operation and do not indicate a problem. As
with all channel jobs, ims-ms channel jobs shut down occasionally based on either
having nothing to do or based on timing out (getting old). Then the
job controller restarts new jobs as needed.
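To see how long each ims_master job lived, you can pair the "starting up" and "shutting down" notices by process ID. The following Python sketch does this for the notices shown above. It assumes that each log entry occupies a single line in the layout shown, and the function name job_lifetimes is hypothetical. Other log layouts would need a different pattern.

```python
import re
from datetime import datetime

# Matches notices of the form shown in the text, one entry per line.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\S+\s+ims_master\[(?P<pid>\d+)\]:"
    r".*\b(?P<event>starting up|shutting down)\s*$"
)

NOTICE = "General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006)"
SAMPLE = [
    f"[30/Aug/2006:17:05:05 -0400] learn ims_master[19736]: {NOTICE} shutting down",
    f"[30/Aug/2006:17:05:20 -0400] learn ims_master[28310]: {NOTICE} starting up",
    f"[30/Aug/2006:17:07:24 -0400] learn ims_master[28310]: {NOTICE} shutting down",
    f"[30/Aug/2006:17:07:32 -0400] learn ims_master[28380]: {NOTICE} starting up",
    f"[30/Aug/2006:17:19:31 -0400] learn ims_master[28380]: {NOTICE} shutting down",
]


def job_lifetimes(lines):
    """Return {pid: seconds from 'starting up' to 'shutting down'} for complete pairs."""
    started = {}
    lifetimes = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        pid = int(match["pid"])
        if match["event"] == "starting up":
            started[pid] = when
        elif pid in started:
            # Shutdowns with no recorded start (such as pid 19736) are skipped.
            lifetimes[pid] = (when - started.pop(pid)).total_seconds()
    return lifetimes


print(job_lifetimes(SAMPLE))
```

For the sample notices, the two complete jobs lived about two minutes and about twelve minutes, which is consistent with jobs exiting when idle or after timing out.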
Configuring the MTA
You might want to increase the number of processes the job controller can
start for the tcp_local, tcp_intranet, or other tcp_* channels, or increase the number
of threads each of those processes will start. You might also want to
give the tcp_local channel its own pool. If the number of queued messages
(total across all queues) exceeds 100,000, increase the value of the
MAX_MESSAGES setting in the job_controller.cnf file. See Job Controller Configuration File for more information.
Additional Information About the MTA
This section contains additional information to help you understand MTA operations.
What Are .HELD Messages?
If the MTA detects that messages are bouncing between servers or channels, delivery
is halted and the messages are stored in a file with the
suffix .HELD in the msg-svr-base/data/queue/channel directory. Typically, a message loop occurs because
each server or channel thinks the other is responsible for delivery of the
message. You need to manually fix these .HELD messages with the imsimta process_held command.
There is an unfortunate collision of terminology and concepts between .held messages
and the hold channel. And worse still, the command to process .held messages
is called release, whereas the command to process messages on the hold
channel is called process_hold.
You use the hold channel to hold messages of a recipient temporarily prevented
from receiving new messages. For example, you might be moving a user's mailbox
and want to hold new incoming messages. The hold channel is located
in the msg-svr-base/queue/hold directory. Messages are written to this queue as ZZxxx.held
files. Because the job controller doesn't “see” these .held files, they are not dequeued
for delivery. You release these files with the imsimta qm release command, and the
reprocess daemon reprocesses them.
Messaging Server makes use of MAX_*_RECEIVED_LINES options that you set in the option.dat file
to determine when a message is put into the .HELD state. The most
relevant options and their default values are:
MAX_LOCAL_RECEIVED_LINES=10
MAX_RECEIVED_LINES=50
MAX_TOTAL_RECEIVED_LINES=100
Once a message has looped through the MTA enough to accumulate MAX_RECEIVED_LINES
header lines indicating the local MTA, then the message becomes .HELD.
You can cause the MTA to immediately recognize that it has connected to
itself, rather than waiting to accumulate MAX_LOCAL_RECEIVED_LINES local Received: headers, by specifying
the loopcheck keyword on the appropriate channel(s) in the imta.cnf file.
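To check how close a suspect message is to these limits, you can count its Received: header lines. The following Python sketch uses the standard email module and the default values quoted above. Matching a local Received: line by looking for the local host name as a substring is a simplifying assumption, the function name and the host name mta.example.com are hypothetical, and the sketch does not reproduce the MTA's own loop-detection logic.

```python
from email import message_from_string

# Default values quoted in the text.
MAX_LOCAL_RECEIVED_LINES = 10
MAX_RECEIVED_LINES = 50


def count_received(raw_message, local_host):
    """Return (total, local) counts of Received: header lines.

    A header counts as local if it mentions local_host (crude substring match).
    """
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    local = sum(1 for value in received if local_host.lower() in value.lower())
    return len(received), local


raw = (
    "Received: from a.example.org by mta.example.com; Mon, 1 Jan 2007 00:00:02 +0000\n"
    "Received: from mta.example.com by mta.example.com; Mon, 1 Jan 2007 00:00:01 +0000\n"
    "Received: from x.example.net by a.example.org; Mon, 1 Jan 2007 00:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
total, local = count_received(raw, "mta.example.com")
print(total, local, local >= MAX_LOCAL_RECEIVED_LINES, total >= MAX_RECEIVED_LINES)
```

A message whose local count keeps growing between delivery attempts is a likely loop candidate even before it reaches the configured limit.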
For More Information
Use the following to aid in troubleshooting the MTA: