BigAdmin System Administration Portal

XPert Transcript: Solaris Fault Manager and Service Manager smf(5)
Cynthia McGuire, Liane Praza - Sun Microsystems, Inc.

Last Updated December 16, 2005
 
 
 
  1. What does Predictive Self-Healing do for me?
  2. What happens if my system experiences a hardware fault?
  3. What happens when an smf(5) service fails?
  4. Is my x86 system capable of leveraging Predictive Self-Healing and the fault manager?
  5. What does it mean for a daemon to participate in smf(5) and become self-healing?
  6. How do I get my output on STDOUT and/or how do I handle STDOUT when I convert it to a service?
  7. Is there a tool/script that I can use to build manifests for third-party applications?
  8. How does an admin easily restart the entire svc init train after a boot time failure without actually rebooting?
  9. Is there a Sun repository for smf where people can submit their smf files for others to use?
  10. The generic_open profile is used by svc.startd by default. What causes the other profiles such as platform_SUNW,Sun-Fire-15000.xml to be used when I boot up a Sun Fire 15K system?

Q: What does Predictive Self-Healing do for me?

A: The Predictive Self-Healing architecture provides a set of technologies designed to maximize availability in the face of software and hardware faults and facilitate a simpler and more effective end-to-end experience for system administrators.

The first set of technologies includes Solaris components that implement predictive self-healing for CPU, memory, and I/O bus nexus components. The architecture is used to facilitate a simplified administration model wherein traditional error messages intended for humans are replaced by binary telemetry events consumed by software components that automatically diagnose the underlying fault or defect. The results of the automated diagnosis are used to initiate self-healing activities such as administrator messaging, isolation or deactivation of faulty components, and guided repair.

The second set of technologies includes Solaris components that implement predictive self-healing for software services. The Service Management Facility (smf(5)) presents a simplified administrative model for software services. Each software service has an advertised state, and failures are diagnosed automatically by the system to point to the root cause of problems. Automated restart of failing services is performed whenever possible to reduce the time humans must spend repairing faulty software. If a service cannot be restarted automatically, smf(5) describes the cause of the fault so that time-to-repair is significantly shorter.


Q: What happens if my system experiences a hardware fault?

A: Upon diagnosis of a hardware fault, a self-healing system will direct an administrator to a knowledge article to learn more about the impact of the problem or repair procedures.

You can access the knowledge article corresponding to a self-healing message by taking the Sun Message Identifier (SUNW-MSG-ID) and appending it to the link http://www.sun.com/msg/ in your web browser. By the time you read the article, other agents participating in the self-healing system may have already offlined an affected component and taken other action to keep your system and services available.
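As a small illustration of the lookup described above, the following shell sketch builds the article link from a message identifier. The function name msg_url and the sample identifier are placeholders chosen for this example; they are not part of the Solaris tooling.

```shell
#!/bin/sh
# Build the knowledge article URL for a Sun Message Identifier.
# msg_url is a hypothetical helper name used only for this illustration.
msg_url() {
    if [ -z "$1" ]; then
        echo "usage: msg_url SUNW-MSG-ID" >&2
        return 1
    fi
    # The article link is the base URL with the identifier appended.
    echo "http://www.sun.com/msg/$1"
}

# Placeholder identifier; substitute the SUNW-MSG-ID from your own message.
msg_url "EXAMPLE-8000-XX"
```

The resulting link can then be pasted into a web browser.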


Q: What happens when an smf(5) service fails?

A: When a service participating in smf(5) fails due to a hardware error such as an uncorrectable memory fault, a software error such as a service core dump, or even an administrative error such as killing the wrong process, the fault is detected and the service is automatically restarted by smf(5). Any services that depend upon the failing service are also immediately stopped so that they can be restarted when the failing service recovers.

Service status is given for all services on the system through the svcs(1) command. If a service cannot be restarted by the system, it is put into the "maintenance" state. The svcs -x command gives as much diagnosis of the failure as possible, including the reason for failure, pointers to log files and man pages, and a summary of other services which are impacted by the failure. A pointer to a knowledge article with further information about possible causes and remedies for the fault is also given.
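To make the status check concrete, here is a minimal sketch that picks the services in the "maintenance" state out of svcs(1) output. It assumes the usual three-column listing (STATE, STIME, FMRI); the function name maintenance_services is our own invention for this example.

```shell
#!/bin/sh
# List the FMRIs of services in the maintenance state.
# Reads svcs(1)-style output (STATE STIME FMRI columns) on standard input.
# maintenance_services is a hypothetical helper name.
maintenance_services() {
    awk '$1 == "maintenance" { print $3 }'
}

# On a system with smf(5), feed it from svcs itself, then ask for the
# diagnosis of whatever is wrong. Skipped where svcs is not installed.
if type svcs >/dev/null 2>&1; then
    svcs -a | maintenance_services
    svcs -x
fi
```

For the details behind each entry, svcs -x remains the tool to use; the filter above only gives a quick list of names.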


Q: I know the Solaris 10 OS provides self-healing components for automated diagnosis and recovery for UltraSPARC III and UltraSPARC IV CPU and memory systems, along with enhanced resilience and telemetry for PCI-based I/O. What does this mean for x86 systems? Is my x86 system capable of leveraging Predictive Self-Healing and the fault manager?

A: We are feverishly working to provide the same level of Predictive Self-Healing capabilities for the CPU, memory and PCI (PCI Express) subsystems for our AMD Opteron and Athlon 64 based offerings as we have on UltraSPARC III and UltraSPARC IV based systems. This work is scheduled to be completed by the end of 2005 and available in an appropriately timed Solaris update release as well as via the Solaris Express program.


Q: What does it mean for a daemon to participate in smf(5) and become self-healing?

A: A Solaris service can participate in smf(5) as soon as anyone writes a short description of that service called a "manifest", and delivers it into /var/svc/manifest. This manifest describes the service, including important attributes such as its name, service dependencies, and how to stop and start the service.

Sun has created manifests for most of the services delivered with the Solaris OS today. ISVs may provide manifests for their services, or you can quickly write a manifest for any software critical to your environment.

Once a manifest is provided in /var/svc/manifest, it is automatically uploaded into the smf(5) repository, where the service can be viewed and managed, and will be visible using all smf(5) commands. All enabled services in the smf(5) repository are automatically started by the system once all their dependencies are satisfied.
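For illustration, a minimal manifest of the kind described above might look like the following sketch, written out here by a small shell script. The service name site/myapp, the /opt/myapp path, and the output file name are all hypothetical, and the elements shown should be validated against the service bundle DTD delivered with your system before use.

```shell
#!/bin/sh
# Write a minimal example manifest for a hypothetical service "site/myapp".
# All names and paths below are placeholders for illustration.
manifest=${1:-./myapp.xml}

cat > "$manifest" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='myapp'>
<service name='site/myapp' type='service' version='1'>
    <create_default_instance enabled='false' />
    <single_instance />
    <!-- Do not start until local file systems are mounted. -->
    <dependency name='fs-local' grouping='require_all'
        restart_on='none' type='service'>
        <service_fmri value='svc:/system/filesystem/local' />
    </dependency>
    <!-- How to start and stop the service. -->
    <exec_method type='method' name='start'
        exec='/opt/myapp/bin/myapp' timeout_seconds='60' />
    <exec_method type='method' name='stop'
        exec=':kill' timeout_seconds='60' />
    <template>
        <common_name>
            <loctext xml:lang='C'>Example application</loctext>
        </common_name>
    </template>
</service>
</service_bundle>
EOF

echo "wrote $manifest"
```

The three pieces named in the answer are all visible here: the service name, a dependency, and the start and stop methods. The instance is created disabled so that nothing starts until you enable it.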

While smf(5) takes over system startup for the Solaris OS, init.d scripts written for previous releases of Solaris will continue to work without modification. Services started by init.d scripts, however, cannot participate in the self-healing features for software services, and will not be automatically restarted when they fail.


Q: In the Solaris 8 OS I had an /etc/rc2.d/S98myStuff script which wrote to STDOUT, and now in the Solaris 10 OS it's writing to /var/svc/log/milestone-multi-user:default.log because of SMF. I have not yet converted S98myStuff into a service, so how do I get my output on STDOUT and/or how do I handle STDOUT when I convert it to a service? (i.e. How can I view the console messages from my startup script, even when booted without -m verbose?)

A: smf(5) explicitly places the startup messages from services more under the control of the administrator. We offer the ability to configure boot to either be quiet (the default), or give a single startup message for each service (via the -m verbose option to boot). In the future, we expect to offer other options for the amount of service messaging as well. By default, all output from the rc?.d scripts during boot, and from all service startup no matter when it happens, is logged to the service's log file in /var/svc/log.

However, that's all there to give you more control over the messages from services you didn't write, and allow analysis of problems even if the output to the console has been lost. If you have a custom /etc/rc?.d script and you want to make sure certain messages go to the console when it starts, the best way is to include smf_include.sh in your script:

   . /lib/svc/share/smf_include.sh

And then write your log messages using something like:

   echo message 2>&1 | smf_console

The benefit of using smf_console is that messages are still logged to /var/svc/log for postmortem analysis, as well as being displayed on the console. This will work in both an /etc/rc?.d script and an smf(5) service.


Q: Is there a tool/script that I can use to build manifests for third-party applications? Currently I am using the templates under /var/svc/manifest and it seems a very lengthy process.

A: If the application is specified by an inetd.conf line, you can use inetconv(1M) to generate a starting manifest for the service.

For applications started by /etc/rc?.d scripts, we provide a Service Developer Introduction which documents the information you need to gather about your service to write a complete manifest: http://www.sun.com/bigadmin/content/selfheal/sdev_intro.html

Usually, we find that while the first manifest takes a bit of time to write, subsequent manifests go quite quickly.

Some other tools might also be helpful:

/usr/demo/jds/bin/jedit is a text editor that includes XML syntax highlighting and understands how to validate against the smf(5) service bundle DTD.

xmllint(1) is helpful to quickly check for errors in the manifest file without having to svccfg import the manifest.
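A quick check along these lines could be wrapped in a small shell function, as sketched below. The function name check_manifest and its return codes are our own choices for this example; the check relies on the manifest naming its DTD in a DOCTYPE line.

```shell
#!/bin/sh
# Validate a manifest against the DTD named in its DOCTYPE declaration.
# check_manifest is a hypothetical helper name.
# Returns 2 if the file is unreadable, 3 if xmllint is not installed,
# otherwise the exit status of xmllint.
check_manifest() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "cannot read manifest: $file" >&2
        return 2
    fi
    if type xmllint >/dev/null 2>&1; then
        # --valid checks the document against its DTD;
        # --noout suppresses the echoed document so only errors appear.
        xmllint --valid --noout "$file"
    else
        echo "xmllint not found; skipping validation" >&2
        return 3
    fi
}

# Usage (the path is a placeholder):
#   check_manifest /var/svc/manifest/site/myapp.xml
```

A zero exit status means the file is well formed and valid against the DTD; it does not mean the service itself will start correctly.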

Webmin has an smf(5) module, which includes a service creation tool to lead you through the questions from the Service Developer Introduction. If you're only creating one service, it can be helpful. If you're looking to write a couple of manifests, editing the manifests directly is usually quicker.


Q: How does an admin easily restart the entire svc init train after a boot time failure without actually rebooting? For example, if a file system fails to mount, nearly all network services never get started. What's the simple one-line command to take another stab at getting SMF to restore or start services after such a condition is found and repaired?

A: The short answer: You just need to tell smf(5) that you've repaired the file system service. Just use svcadm clear for the file system service that was in the maintenance state, and all of the services waiting for the file system to be mounted will automatically start.

The longer answer:

If services aren't being started, you can ask the system what's wrong by running svcs -x. With no other arguments, svcs -x will tell you which services smf(5) considers to be in an unusual state: enabled but not running, or keeping another service from running.

That is, the svcs -x command attempts to diagnose service failures to their root cause, rather than just telling you everything that's broken. If you include the -v option, svcs -xv, you'll see the list of impacted services for each root cause.

In the specific case described, a file system fails to mount and the appropriate file system service will go into the maintenance state. If you run svcs -x, you'll see that many services aren't running because that file system service (e.g. svc:/system/filesystem/local) is in the maintenance state.

Services in the maintenance state are known by smf(5) to need administrator attention. So, once you've repaired the file system, you just need to let smf(5) know that you believe you've corrected the error and it should continue on with boot. You do this with svcadm(1M). If, for example, it was filesystem/local that was in maintenance, you'd run:

   # svcadm clear filesystem/local

Then smf(5) would make sure the file systems were OK and continue on with the boot process, starting up all the services that were blocked behind the service in maintenance. A full restart of all services isn't necessary, since smf(5) knows the precise dependency relationships among the services.


Q: Is there a Sun repository for smf where people can submit their smf files for others to use? Could be very useful.

A: We're collecting links to manifests and methods that people have created through the smf(5) OpenSolaris community. There's a fledgling list at http://opensolaris.org/os/community/smf/manifests/.

We're always happy to include more links, so please submit any others you know about via the instructions on that web page.


Q: I have read that the generic_open profile is used by svc.startd by default. What causes the other profiles such as platform_SUNW,Sun-Fire-15000.xml to be used when I boot up a Sun Fire 15K system?

A: During installation, we determine the platform and link platform.xml to the appropriate platform-specific file. So, on a Sun Fire 15K server, we link platform.xml to platform_SUNW,Sun-Fire-15000.xml in order to enable the appropriate platform-specific services.

We also choose to link generic.xml to generic_open.xml. But, if you're looking for a more secure system by default, you could replace that link during a JumpStart install with either a link to our generic_limited_net.xml profile, which provides a more locked-down networking configuration, or with a profile of your own.
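Replacing the link could be done with a few lines of shell, for example from a JumpStart finish script. The sketch below is only an illustration: the function name link_generic_profile is hypothetical, and the profile directory is passed in as an argument because a finish script would normally operate on the newly installed root (assumed here to be mounted on /a) rather than on /var/svc/profile directly.

```shell
#!/bin/sh
# Point generic.xml at a chosen profile inside a profile directory.
# link_generic_profile is a hypothetical helper name.
#   $1  profile directory (for example /a/var/svc/profile, assuming the
#       installed root is mounted on /a during a JumpStart finish script)
#   $2  profile file name within that directory
link_generic_profile() {
    profile_dir=$1
    profile=$2
    if [ ! -f "$profile_dir/$profile" ]; then
        echo "no such profile: $profile_dir/$profile" >&2
        return 1
    fi
    # Replace the existing generic.xml link with one to the chosen profile.
    rm -f "$profile_dir/generic.xml" &&
        ln -s "$profile" "$profile_dir/generic.xml"
}

# Usage (paths are assumptions, see above):
#   link_generic_profile /a/var/svc/profile generic_limited_net.xml
```

The link is created relative to the profile directory, so it stays valid once the installed system boots from its own root.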

The site.xml profile is designed to be owned completely by you for local deployment choices. Sun will never deliver a site.xml, so you can use it to customize your environment, again by creating a site.xml profile during a JumpStart install.

You can also configure your system with any profile after the initial installation by using svccfg apply <profile_file>. Profiles are a simple way to enable and disable groups of services. You can create your own profile easily by copying one of the existing profiles in /var/svc/profile and customizing it with your own service enable/disable choices.

Information about profile application is available in the smf_bootstrap(5) man page.
