Traditionally, when a hardware or software fault occurred on a Solaris
system, a message would usually be logged to the appropriate device specified
in /etc/syslog.conf, and the rest of the diagnosis and repair was
left to the administrator. The Solaris 10 OS, which is available for preview
through the Software Express for Solaris program, introduces Predictive Self-Healing technology. Predictive Self-Healing is a newly designed, cohesive architecture and methodology for
automatically diagnosing, reporting, and handling software and hardware fault
conditions. This new technology lessens the time required to debug a hardware
or software problem and provides the administrator and Sun Technical Support
with detailed data about each fault. The architecture consists of an event management protocol, the fault manager, and the component that handles software faults, the Solaris Service Manager.
The Solaris Fault Manager
When a hardware fault occurs, predictive self-healing augments traditional
syslog messages by issuing binary telemetry events that are then
correlated by underlying software. The underlying software then automatically
diagnoses the fault, notifies the administrator, and takes corrective action
when possible. Sun's fault manager also provides a fault code and directs the
administrator to the corresponding knowledge base article at http://www.sun.com/msg/ when appropriate.
The first implementation of Sun's fault manager covers various SPARC CPU,
memory, and I/O bus nexus components. A later release is scheduled to include modules for the Solaris OS on x86 platforms.
The system administrator's primary interaction with the Sun fault manager
happens through the fault manager daemon, fmd(1M).
fmd(1M) starts at boot time and forks into the background (see
the fmd(1M) man page for complete details) and continues to
monitor the running system. When a component produces an error, the fault
management system handles the error and then correlates the error report data
with previous error reports and other related information in order to diagnose
and react to the underlying fault. Once diagnosed, the fault manager assigns
the problem a Universally Unique Identifier (UUID), which distinguishes the
problem across any set of systems. When possible, fmd(1M) will
then initiate steps to self-heal the failed component. The
fmd(1M) program will also log the fault to syslogd
and/or notify the administrator when appropriate.
The Fault Managed Resource Identifier
The Fault Managed Resource Identifier (FMRI) identifies a resource within
the fault manager for the purpose of fault and error event propagation. The
fault manager naming scheme, of which the FMRI is a URI subclass, is based on
the URI syntax defined in RFC 2396. The FMRI syntax has an arbitrary number
of different schemes, each naming a tree of related resources. FMRIs can be
represented as URI strings or component name-value pairs. For example, the
FMRI for DIMM 0 in memory bank 1 of memory module 2 on system board 3 of
domain A of a Sun Fire 15K server could be represented as the following URI string. (Note: Line should not be broken in actual use.)
Name          Value                                                        Type
authority     chassis-id: 138A2036 product-id: SunFire15000 domain-id: A
system-board  3                                                            uint32_t
cpu-module    2                                                            uint32_t
memory-bank   1                                                            uint32_t
dimm          0                                                            uint32_t
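To make the relationship between the URI form and the name-value form concrete, the following minimal Python sketch splits an FMRI string into a scheme, an authority, and a list of name-value pairs. The function name parse_fmri is hypothetical, and the splitting rules are a simplification inferred from strings such as cpu:///cpuid=1/serial=1107C270C8A. It is not the fault manager's own parser.

```python
def parse_fmri(fmri):
    """Split an FMRI URI string into scheme, authority, and name-value pairs.

    Simplified sketch: assumes the form scheme://authority/name=value/...
    """
    scheme, sep, rest = fmri.partition("://")
    if not sep:
        raise ValueError("not a scheme://... FMRI: %r" % fmri)
    # Everything up to the next slash is the (possibly empty) authority.
    authority, _, path = rest.partition("/")
    pairs = []
    for component in path.split("/"):
        if not component:
            continue
        name, _, value = component.partition("=")
        pairs.append((name, value))
    return {"scheme": scheme, "authority": authority, "pairs": pairs}
```

For example, parse_fmri("hc:///component=Slot 1") yields the scheme hc, an empty authority, and the single pair (component, Slot 1).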
The fault manager associates one of the following states with every
FMRI:
ok: The resource is present and in use and has no known
problems.
unknown: The resource is not present or not usable but has no
known problems. This might indicate the resource has been disabled or
unconfigured by an administrator.
degraded: The resource is present and usable, but one or more
problems have been diagnosed.
faulted: The resource is present but is not usable because
one or more unrecoverable problems have been diagnosed. The resource has been
disabled to prevent further damage to the system and requires human
intervention.
Fault Manager Command-Line Tools
The Solaris implementation of the fault manager includes several command-line tools to observe and modify the behavior of fmd(1M) and
its modules. The most common tools that the administrator will use are the
fmadm(1M), fmdump(1M), and fmstat(1M)
tools.
The fmadm(1M) utility can view, load, and unload modules and
view and update the resource cache. It provides system administrators with a
way to display every resource that fmd(1M) believes to be
faulty. The most common fmadm(1M) subcommands (see the
fmadm(1M) man page for complete details) are:
config: Display the configuration, including the module name,
version, and description of each component module.
faulty [-ai]: Display the list of resources currently
believed to be faulted. The FMRI, resource state, and UUID of the diagnosis are listed for each resource. By default, the fmadm faulty command only lists output for resources that are currently present and faulty. If the -a option is
specified, all resource information cached by the fault manager is listed,
including information for components no longer present in the system. If the
-i option is specified, the persistent cache identifier for each
resource in the fault manager is shown instead of the most recent state and
UUID.
load path: Load the specified module. The specified path must be an absolute path and refer to a module present in one of the defined directories for
modules.
unload module: Unload the specified module. The module name
is that specified in the fmadm config output. The fault manager
usually loads and unloads modules automatically based on the system
configuration, so this command should be seldom used.
rotate errlog | fltlog: Schedule a rotation of the specified
fault manager log file. The log files are automatically rotated by an entry
in the logadm(1M) configuration file that uses this
subcommand.
The fmdump(1M) program enables the system administrator to view
any log files associated with fmd(1M) and retrieve specific
details of any diagnosis. By default the fmdump(1M) command shows
the fault log, but will show the error log if given the -e
command-line switch. The fmdump(1M) command can also take command
line options to select only certain events (see the fmdump(1M)
man page for complete details):
-c class: Select events that match the specified class.
-t time: Select events that occurred on or after the specified
time.
-T time: Select events that occurred on or before the specified
time.
-u UUID: Select events that match the specified UUID.
Increasingly verbose output can be obtained for any command by specifying
-v or -V.
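The way these selection options combine can be sketched as a small Python helper that assembles an fmdump argument list without running it. The helper name build_fmdump_args and its parameter names are hypothetical. The lowercase -u spelling for UUID selection follows the fmdump invocation shown in the example fault message later in this article.

```python
def build_fmdump_args(error_log=False, event_class=None, since=None,
                      until=None, uuid=None, verbosity=0):
    """Return an argv list for fmdump built from the options described above."""
    args = ["fmdump"]
    if error_log:
        args.append("-e")            # error log instead of the default fault log
    if verbosity == 1:
        args.append("-v")
    elif verbosity >= 2:
        args.append("-V")
    if event_class is not None:
        args += ["-c", event_class]  # events matching this class
    if since is not None:
        args += ["-t", since]        # events on or after this time
    if until is not None:
        args += ["-T", until]        # events on or before this time
    if uuid is not None:
        args += ["-u", uuid]         # events matching this diagnosis UUID
    return args
```

The resulting list could be handed to subprocess.run on a Solaris system. Building the list separately keeps the option logic easy to check on any platform.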
The fmstat(1M) program is designed to report the statistics of
the fault management system. If the -m module argument is given,
fmstat(1M) reports statistics kept by the specified
module. If -m is not specified,
fmstat(1M) reports the following statistics for each module (see
the fmstat(1M) man page for complete details):
module: The name of the module as reported by fmadm
config
ev_recv: The number of events received by the module
ev_acpt: The number of events accepted by the module as
relevant to a diagnosis
wait: The average number of events waiting to be examined by
the module
svc_t: The average service time, in milliseconds, for events
received by the module
%w: The percentage of time that there were events waiting to
be processed
%b: The percentage of time that the module was busy
processing events
open: The number of active cases owned by the module
solve: The number of cases solved by the module since it was
loaded
memsz: The amount of dynamic memory currently allocated by
the module
bufsz: The amount of persistent buffer space currently
allocated by the module
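As an illustration of how these columns might be consumed by a script, the Python sketch below parses whitespace-separated fmstat-style text into one dictionary per module. The sample text, the module names in it, and the helper names are all made up for the example, and the real fmstat output layout may differ in spacing or units.

```python
FMSTAT_COLUMNS = ["module", "ev_recv", "ev_acpt", "wait", "svc_t",
                  "%w", "%b", "open", "solve", "memsz", "bufsz"]

# Hypothetical sample; module names and values are invented for illustration.
SAMPLE_FMSTAT = """\
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
example-diagnosis 12 3 0.0 1.5 0 75 1 2 3.0K 0
example-agent 4 0 0.0 0.2 0 1 0 0 0 0
"""


def parse_fmstat(text):
    """Return a list of {column: value} dicts, one per module row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full data row.
        if len(fields) != len(FMSTAT_COLUMNS) or fields[0] == "module":
            continue
        rows.append(dict(zip(FMSTAT_COLUMNS, fields)))
    return rows


def busy_modules(rows, threshold=50.0):
    """Names of modules whose %b (busy percentage) meets the threshold."""
    return [row["module"] for row in rows if float(row["%b"]) >= threshold]
```

Values are kept as strings because columns such as memsz may carry unit suffixes.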
An Example of the Predictive Self-Healing Fault Manager
Once a CPU fault has occurred, the administrator might see this message on
the console and logged to syslog:
SUNW-MSG-ID: SUN4U-8000-6H, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sun Oct 17 14:15:50 PDT 2004
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: myhost
EVENT-ID: 64fe6c23-12b7-ccd1-f0a7-b531941738f8
DESC: The number of errors associated with this CPU has exceeded acceptable levels.
Refer to http://sun.com/msg/SUN4U-8000-6H for more information.
AUTO-RESPONSE: An attempt will be made to remove the affected CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump
-v -u <EVENT_ID> to identify the CPU.
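A script that watches the console log could pull the two identifiers out of such a message before looking up the knowledge article or running fmdump. The Python sketch below shows one way to do this with regular expressions. The function name extract_fault_ids is hypothetical, and the patterns are inferred only from the sample message above.

```python
import re

# Two lines copied from the sample fault message above.
SAMPLE_MESSAGE = """\
SUNW-MSG-ID: SUN4U-8000-6H, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-ID: 64fe6c23-12b7-ccd1-f0a7-b531941738f8
"""


def extract_fault_ids(message):
    """Return (msg_id, event_id) from a fault message; None where absent."""
    msg_id = re.search(r"SUNW-MSG-ID:\s*([A-Z0-9-]+)", message)
    event_id = re.search(r"EVENT-ID:\s*([0-9a-fA-F-]{36})", message)
    return (msg_id.group(1) if msg_id else None,
            event_id.group(1) if event_id else None)


def followup_command(event_id):
    """The fmdump invocation recommended in the REC-ACTION line."""
    return ["fmdump", "-v", "-u", event_id]
```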
The CPU state changes from ok to faulted, the
processes using that CPU are terminated, and the CPU is taken offline. The
state of the CPU can be viewed by using the psrinfo(1M)
command:
psrinfo
0 on-line since 09/27/2004 16:57:30
1 faulted since 10/17/2004 14:15:50
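For monitoring purposes, the faulted processors can be picked out of this output mechanically. The Python sketch below assumes the line layout shown above (processor ID, then state, then the rest), and the function name faulted_cpus is hypothetical.

```python
def faulted_cpus(psrinfo_output):
    """Return the IDs of processors that psrinfo reports as faulted."""
    faulted = []
    for line in psrinfo_output.splitlines():
        fields = line.split()
        # Expect "<id> <state> since <date> <time>"; ignore anything else.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == "faulted":
            faulted.append(int(fields[0]))
    return faulted
```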
Run the fmdump(1M) command listed in the fault message,
using the EVENT-ID for more information on the fault. The output shows that
CPU 1 has a problem and the component in Slot 1 needs replacing.
The text Slot 1, indicating the location of the defective part,
can be found silk screened on the motherboard.
fmdump -v -u 64fe6c23-12b7-ccd1-f0a7-b531941738f8
TIME UUID SUNW-MSG-ID
Oct 17 14:15:50.1630 64fe6c23-12b7-ccd1-f0a7-b531941738f8 SUN4U-8000-6H
100% fault.cpu.ultraSPARC-III.l2cachedata
FRU: hc:///component=Slot 1
rsrc: cpu:///cpuid=1/serial=1107C270C8A
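The fields an administrator usually needs from this output are the suspected fault class, the field-replaceable unit (FRU), and the affected resource. The Python sketch below extracts them from text laid out like the sample above. The function name parse_fmdump_verbose is hypothetical, and the parsing rules are inferred from this one sample rather than from a documented format.

```python
def parse_fmdump_verbose(text):
    """Extract certainty, fault class, FRU, and resource from fmdump -v text."""
    result = {"certainty": None, "fault_class": None, "fru": None, "rsrc": None}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("FRU:"):
            result["fru"] = line[len("FRU:"):].strip()
        elif line.startswith("rsrc:"):
            result["rsrc"] = line[len("rsrc:"):].strip()
        else:
            # A suspect line looks like "100% fault.cpu....".
            fields = line.split(None, 1)
            if len(fields) == 2 and fields[0].endswith("%"):
                result["certainty"] = int(fields[0][:-1])
                result["fault_class"] = fields[1]
    return result
```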
Once a replacement CPU is delivered, the defective CPU from Slot
1 can be replaced and re-enabled.
The Solaris Service Manager
To better handle software faults, Sun has redesigned the way it starts and
monitors services. Instead of the traditional /etc/init.d
startup scripts, many programs in the Solaris 10 OS have been converted to use
the service management facility (smf) of the Solaris Service Manager to
start, stop, modify, and monitor programs. The service manager is also used
to identify software interdependencies and ensure that services are started in
the correct order. Should a service, such as sendmail, suddenly die, the
service manager automatically verifies that all of the requirements for the
sendmail service are running and respawns the necessary programs. When a
hardware fault occurs and hardware is offlined, the service manager can
restart any programs under service manager control that needed to be stopped
to remove the hardware from service.
Each service under the control of the service manager is controlled by an
XML configuration file, called a manifest, that defines the name of the
service, the type, any dependencies, and other important information. These
manifests are stored in a repository and can be viewed and modified by the
repository daemon, svc.configd(1M). The repository is read by the master restarter daemon, svc.startd(1M), which evaluates the
dependencies and initiates the services as needed. Traditional inetd services
are now part of the service manager as well. Any of the inetd services can be
enabled, disabled, or restarted via the same mechanism as any other service
manager-enabled program.
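The idea of evaluating dependencies to find a valid start order can be illustrated with a topological sort. The Python sketch below uses the standard graphlib module (Python 3.9 or later). The service names and the dependency map are invented for the example, and nothing here is meant to describe how svc.startd(1M) is actually implemented.

```python
from graphlib import TopologicalSorter


def start_order(dependencies):
    """Return services in an order where each one follows its dependencies.

    dependencies maps a service name to the set of services it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map, for illustration only.
EXAMPLE_DEPENDENCIES = {
    "sendmail": {"network", "name-services"},
    "name-services": {"network"},
    "network": set(),
}
```

A dependency cycle makes static_order raise graphlib.CycleError, which is a reasonable way for a sketch like this to reject an inconsistent set of manifests.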
Service Manager Command-Line Tools
The service manager is made up of a number of programs, some of which are
meant to be used by the administrator to view and manage services and service
properties. These commands include: svcadm(1M),
svcprop(1), svcs(1), and svccfg(1M).
Additionally, the commands inetconv(1M) and
inetadm(1M) exist to help transition traditional inetd services
and manage them in the service manager framework.
The svcadm(1M) command allows the activation, deactivation,
and state manipulation of service instances in the service configuration
repository. Modification of these properties causes the responsible delegated
restarter to take action to move the service instance into the appropriate
state. If the service is not delegated, the master restarter performs these
functions. The -v switch prints verbose information to standard
output. Valid subcommands of svcadm(1M) are:
disable [-t] [FMRI | pattern]: Disables the service instance
specified by the operands. If the -t option is specified, the
instance is disabled only temporarily and reverts to its previous enabled
setting upon reboot.
enable [-rt] [FMRI | pattern]: Enables the service instance
specified by the operands. If the -r option is specified, the
instance is enabled and its dependencies are recursively enabled. If the
-t option is specified, the instance is enabled only temporarily and reverts
to its previous enabled setting upon reboot.
refresh [FMRI | pattern]: Refreshes the service instance
specified by the operands. The instance should re-read its configuration.
restart [FMRI | pattern]: Restarts the service instance
specified by the operands.
delegate restarter_FMRI [FMRI | pattern]: Changes the
restarter assignment for the given instances to the restarter specified by
restarter_FMRI. Re-delegating an instance to svc.startd(1M) returns it to
the master restarter. Re-delegation requires a
restart operation to take effect. Not all restarters support the same
underlying application model. Therefore, not all potential delegations result
in a functioning service instance.
mark [-It] instance_state [FMRI | pattern]: Moves the service
instance specified by the operands to the specified instance_state, either
degraded or maintenance. A service must be in the online state to be placed
in the degraded state. If the -I option is specified, the
service instance is moved into the specified state immediately. If the
-t option is specified, the change is temporary and persists
only for the lifetime of the current system instance. The temporary
option is not available for the degraded state.
milestone [-d] milestone_FMRI: Moves the system to the
specified milestone. All services that the given milestone does not depend on
(directly or indirectly) are temporarily disabled. If the -d
option is specified, the given milestone becomes the default final milestone,
and persists across reboots.
clear [FMRI | pattern]: For a service in the maintenance
state, bring the service instance specified by each operand to the
uninitialized state, such that it can be brought back online. For a service
placed in the degraded state by the mark subcommand, bring the service back to
the online state.
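The state rules stated for the mark and clear subcommands can be summarized as a tiny state model. The Python sketch below encodes only what the descriptions above say: mark accepts degraded or maintenance, degraded requires an online service, and clear moves maintenance to uninitialized and degraded to online. The function names are hypothetical, and real restarters apply further rules that are not modeled here.

```python
def mark(current_state, target_state):
    """Return the state after 'svcadm mark', or raise if the request is invalid."""
    if target_state not in ("degraded", "maintenance"):
        raise ValueError("mark accepts only degraded or maintenance")
    if target_state == "degraded" and current_state != "online":
        raise ValueError("only an online service can be marked degraded")
    return target_state


def clear(current_state):
    """Return the state after 'svcadm clear', or raise if clear does not apply."""
    if current_state == "maintenance":
        return "uninitialized"   # from here the service can come back online
    if current_state == "degraded":
        return "online"
    raise ValueError("clear applies only to maintenance or degraded services")
```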
The svcprop(1) program prints values of properties in the
service configuration repository. Properties are selected by -p
options and FMRI operands. By default, when a single property is selected,
its values are printed separated by spaces on a single line. The following
options are supported:
-c: Retrieves the current property values, without
composition.
-f: Designates properties by their FMRIs. Implies option -t.
-p [name/]name: Prints values of the named property or
property group for each of the property groups, instances, or services
specified by the operands.
-q: Quiet. Produces no output.
-s snapshot: Uses the named snapshot to retrieve the
specified property or property group. If the given property group is not
present in the snapshot, the current property values are examined.
-t: Uses the multi-property output format.
-v: Verbose. Prints error messages for nonexistent
properties, even if option -q is also used.
-w: Waits for the selected property group or property to
change before printing anything.
The svcs(1) command displays information about service
instances as recorded in the service configuration repository. The
svcs(1) command has three different forms.
The first form prints one-line status listings for service instances
specified by the arguments. Each instance is listed only once. With no
arguments, all enabled service instances, even if temporarily disabled, are
listed. The second form of the command prints one-line status listings for
the dependencies or dependents of the service instances specified by the
arguments. The third form prints detailed information about specific services
and instances. The options accepted across the three forms
are:
-?: Displays an extended usage message, including column
specifiers.
-a: Also selects disabled service instances.
-d: Lists the services or service instances upon which the
given service instances depend.
-D: Lists the service instances that depend on the given
services or service instances.
-H: Omits the column headers.
-l: Displays all available information about the selected
services and service instances, with one service attribute displayed for each
line. Information for different instances is separated by blank lines.
-o col[,col]...: Prints the specified columns. Each
col should be a column name.
-p: Lists processes associated with each service instance. A
service instance may have no associated processes. The process
ID, start time, and command name (PID,
STIME, and CMD fields from ps(1)) are
displayed for each process.
-R instance_FMRI: Selects service instances that have the
specified service instance as their restarter.
-s col: Sorts output by column. col should be a
column name. Multiple -s options behave additively.
-S col: Sorts by col in the opposite order as
option -s.
-v: Displays verbose columns: STATE,
NSTATE, STIME, CTID, and
FMRI.
The column names used with the svcs(1) command are case
sensitive and are as follows:
CTID: The primary contract ID for the service instance, if
one exists.
DESC: A brief description of the service from its template
element. A service may not have a description available, in which case a
hyphen is used to denote an empty value.
FMRI: The FMRI of the service instance.
INST: The instance name of the service instance.
NSTA: The abbreviated next state of the service instance, as
given in the STA column description. A hyphen denotes that the
instance is not transitioning, otherwise it's the same as the
STA.
NSTATE: The next state of the service. A hyphen is used to
denote that the instance is not transitioning, otherwise it's the same as the
STATE.
SCOPE: The scope name of the service instance.
SVC: The service name of the service instance.
STA: The abbreviated state of the service instance.
STATE: The state of the service instance. An asterisk is
appended for instances in transition, unless the NSTA or
NSTATE column is also being displayed.
STIME: If the service instance entered the current state
within the last 24 hours, this column indicates the time that it did so.
Otherwise, this column indicates the date on which it did so, printed with
underscores in place of blanks.
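The STIME rule (a time within the last 24 hours, otherwise a date with underscores in place of blanks) can be sketched as follows. The exact output formats used here, HH:MM:SS for times and Mon_DD for dates, are assumptions made for the example and are not taken from the svcs(1) documentation. The function name format_stime is also hypothetical.

```python
from datetime import datetime, timedelta

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_stime(entered, now):
    """Format a state-entry timestamp following the STIME rule described above."""
    if now - entered < timedelta(hours=24):
        # Recent transition: show the time of day (assumed format).
        return entered.strftime("%H:%M:%S")
    # Older transition: show the date with an underscore instead of a blank.
    return "%s_%02d" % (MONTHS[entered.month - 1], entered.day)
```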
The svccfg(1M) command is used to import, export, and modify
the configurations of services in the repository. It can be invoked
interactively, by specifying subcommands, or by specifying a command file
containing a series of subcommands.
For a complete list of all of the available subcommands, please read the
svccfg(1M) man page.
The inetconv(1M) program converts inetd.conf entries into
smf(5) manifests, and imports them into the repository. There is
a one-to-one mapping between a service line in the specified input file and
the resulting configuration file generated. By default, the configuration
files are named using the following template:
<svcname>-<proto>.xml
The <svcname> token is replaced by the service's name
and the <proto> token by the service's protocol. Any
forward slash characters that exist in the source line for the service name or
protocol are replaced with underscores. Each resulting manifest includes the
service line as a comment. If a service line is found to be malformed or to
be for an internal inetd service during the conversion process,
no manifest is generated and that service line in the input file is skipped.
The inetconv(1M) program accepts the following command line
options:
-?: Display a usage message.
-e: Enable smf(5) services which are enabled in
the input file.
-f: If a service manifest of the same name as the one to be
generated is found in the destination directory, inetconv(1M) will
overwrite that manifest if this option is specified. Otherwise, an error
message is generated and the conversion of that service is not performed.
-i srcfile: Permits the specification of an alternate input
file srcfile. If this option is not specified, then the
inetd.conf(4) file is used as input.
-n: Turns off the auto-import of the manifests generated
during the conversion process. Later, if you want to import a generated
manifest into the smf(5) repository, you can do so through the
use of the svccfg(1M) utility.
-o destdir: Permits the specification of an alternate
destination directory destdir for the generated configuration
files. If this option is not specified, then the manifests are placed in
/var/svc/manifest/network/rpc if they are RPC services, otherwise
in /var/svc/manifest/network.
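The naming template and the default destination rule can be combined into a short Python sketch. The helper names are hypothetical, the sample service names and protocols in the usage note are illustrative inputs, and the is_rpc flag stands in for however inetconv(1M) actually recognizes an RPC service line.

```python
def manifest_filename(svcname, proto):
    """Apply the <svcname>-<proto>.xml template, replacing slashes with underscores."""
    return "%s-%s.xml" % (svcname.replace("/", "_"), proto.replace("/", "_"))


def default_destdir(is_rpc):
    """Default manifest directory when -o is not given."""
    if is_rpc:
        return "/var/svc/manifest/network/rpc"
    return "/var/svc/manifest/network"


def manifest_path(svcname, proto, is_rpc=False, destdir=None):
    """Full path of the generated manifest; destdir models the -o option."""
    directory = destdir if destdir is not None else default_destdir(is_rpc)
    return directory + "/" + manifest_filename(svcname, proto)
```

With these helpers, a service named telnet using tcp maps to /var/svc/manifest/network/telnet-tcp.xml, and an RPC entry written as 100235/1 with protocol rpc/ticotsord maps to /var/svc/manifest/network/rpc/100235_1-rpc_ticotsord.xml.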
The inetadm(1M) program views and configures inetd-controlled
services. The following options are supported:
-?: Display a usage message.
-l FMRI: List all properties for the specified service
instance in name=value pairs. In addition, if the property value
is inherited from the default value provided by inetd, the
name=value pairs are identified by the token (default). Property
inheritance occurs when properties do not have a specified service instance
default.
-e FMRI: Enable the specified service instance.
-d FMRI: Disable the specified service instance.
-p: Lists all default inet service property values provided
by inetd in the form of name=value pairs. If the value is of
boolean type, it is listed as TRUE or FALSE.
-m FMRI property_name=value [property_name=value...]: Change
the values of the specified properties of the identified service
instance. Properties are specified as whitespace-separated
name=value pairs. To remove an instance-specific value and
accept the default value for a property, simply specify the property without a
value.
-M property_name=value [property_name=value...]: Change the
values of the specified inetd default properties. Properties are specified as
whitespace-separated name=value pairs.
Examples of the Predictive Self-Healing Service Manager
If a service has a problem, use the service manager tools to help diagnose
it and review the suggested course of action to correct the issue. For
example, the svcs -x command lists information about every service
that isn't running, and why:
svcs -x
svc:/application/print/server:default (LP Print Service)
State: disabled since Tue Oct 05 22:27:55 2004
Reason: Disabled by an administrator.
See: http://sun.com/msg/SMF-8000-05
See: lpsched(1M)
Impact: 1 service is not running.
http://www.sun.com/msg/ provides additional information on the type of issue, and
suggests steps to acquire additional data and correct the problem.
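Output of this kind is regular enough to be collected by a script, for example to open a ticket for every service that is not running. The Python sketch below parses text laid out like the sample above into one dictionary per service. The function name parse_svcs_x is hypothetical, and the parsing rules are inferred from this single sample rather than from a documented format.

```python
def parse_svcs_x(text):
    """Parse svcs -x style text into a list of per-service dictionaries."""
    services = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("svc:/"):
            # Header line: "<FMRI> (<description>)".
            fmri, _, description = line.partition(" ")
            current = {"fmri": fmri,
                       "description": description.strip("()"),
                       "see": []}
            services.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip().lower(), value.strip()
            if key == "see":
                current["see"].append(value)   # "See:" may repeat
            else:
                current[key] = value
    return services
```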
Use svccfg(1M) to view the properties of the SMTP server and
determine its dependencies: