The doctor is in (the kernel).
Sun's Predictive Self-Healing technology debuts in Solaris 10, dramatically reducing downtime and administrative complexity.
As with your health, the best medicine for your network is preventive medicine. That means keeping close tabs on your enterprise and nipping small problems in the bud before they cascade into major crises. Yet system monitoring and response capabilities have traditionally been available only with complex, expensive add-on software.
In the Solaris 10 Operating System (Solaris OS), Sun is putting a system "doctor" right in the kernel: Predictive Self-Healing (PSH) technology. These innovative features are the first components in Sun's broad PSH architecture.
The PSH features in the Solaris 10 OS reduce risk and increase availability right out of the box. PSH capabilities enable Sun systems to accurately predict component failure and mitigate problems before they wreak system-wide havoc--seamlessly pulling your systems from the jaws of downtime and fixing what's wrong.
With PSH capabilities, you can:
- Maximize the availability of systems and software in the face of faults
- Reduce the complexity of system repair
- Save time and money through reduced operating costs

The Solaris 10 OS, the upcoming release of the industry's leading UNIX platform, integrates powerful new capabilities to deliver extreme levels of performance, availability, and security. In addition to the Predictive Self-Healing capabilities reviewed in this article, Solaris 10 includes these revolutionary technologies:
- N1 Grid Containers technology delivers a breakthrough approach to system virtualization and resource utilization.
- DTrace is a comprehensive dynamic tracing framework that concisely answers arbitrary questions about system behavior.
- Solaris 10 also includes technology that enables running a range of Linux applications at near-native speeds.
PSH technology is scalable, extensible, and portable, and it will be incorporated across Sun's product line for a common service and administration experience.
"With Predictive Self-Healing, every application, subsystem, and piece of hardware can integrate with an architecture that can diagnose faults as they occur and do what's necessary to maintain availability," explains Sun's Mike Shapiro, senior staff engineer in the Solaris Kernel Development group and the lead architect of PSH technology.
Predictive Self-Healing Features in the Solaris 10 Operating System
Sun's company-wide initiative to deliver self-healing systems includes two key components in the Solaris 10 OS: the Solaris Fault Manager and the Solaris Service Manager software. The first release of PSH capabilities implements predictive self-healing for CPU, memory, and I/O bus nexus components, as well as automated restart for application services.
The benefits of the PSH features in the Solaris 10 OS are many:
- Predictive diagnosis and isolation of faulty components
- Automatic diagnosis and restart of hardware and software
- Simplified administration for managing services
- Fast and easy repair, plus links to knowledge-base articles
- On-the-fly updates with zero system downtime
In fact, the capabilities in PSH are so valuable that Sun used them extensively during the creation of the Solaris 10 OS itself, keeping development on track.
"PSH detected a bad CPU on the Solaris Kernel Development group's gate machine, where the master copy of the Solaris OS code lives," explains Shapiro. "PSH seamlessly pushed the CPU offline before it had the chance to fail and crash the server--and delay work on all the other Solaris 10 OS features."
Solaris Fault Manager Software
If a self-healing system senses a problem, it dynamically takes a CPU, I/O devices, and/or regions of memory offline before they can cause system failure. In the Solaris 10 OS, the Solaris Fault Manager software isolates and disables bad components, helping to ensure continuous service even if you're unaware of a potential problem.
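The idea of retiring a component before it fails outright can be illustrated with a small model. The sketch below is not the Solaris Fault Manager's diagnosis logic; it only assumes a simple rule (a hypothetical `FaultMonitor` class that takes a component offline once it reports a set number of correctable errors within a time window) to show how predictive isolation might work in principle.

```python
from collections import defaultdict, deque


class FaultMonitor:
    """Toy model: retire a component after too many correctable errors in a window.

    The threshold rule and all names here are illustrative assumptions,
    not the behavior of any real fault manager.
    """

    def __init__(self, threshold=3, window_seconds=60.0):
        self.threshold = threshold
        self.window = window_seconds
        self.errors = defaultdict(deque)  # component -> timestamps of recent errors
        self.offline = set()

    def report_error(self, component, timestamp):
        """Record one correctable error; return True if the component is offline."""
        if component in self.offline:
            return True
        recent = self.errors[component]
        recent.append(timestamp)
        # Drop errors that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            self.offline.add(component)  # stand-in for taking the component offline
        return component in self.offline


monitor = FaultMonitor(threshold=3, window_seconds=60.0)
for t in (0.0, 10.0, 20.0):
    monitor.report_error("cpu1", t)   # three errors in 20 seconds
monitor.report_error("cpu2", 0.0)
monitor.report_error("cpu2", 100.0)   # two errors, far apart
print(sorted(monitor.offline))        # ['cpu1']
```

In this model, `cpu1` is retired because its errors cluster in time, while `cpu2` stays in service because its errors are isolated.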
The Solaris Fault Manager software automatically diagnoses problems in just a few seconds, instead of the many days it can take even a crack IT staff. Business-critical applications and essential system services continue running uninterrupted even if software fails, a hardware component goes south, or software is misconfigured. And the entire system is open, permitting administrators and field personnel to observe the activities of the diagnostic system.
A PSH technology-enabled system issues easy-to-understand diagnostic messages that link to articles in Sun's knowledge base, which clearly guide administrators through tasks that require human intervention. As a result, the overall time from an automated diagnosis to an appropriate human intervention, if required, is greatly reduced.
Solaris Service Manager Software
The Solaris Service Manager software is the other half of the PSH implementation in the Solaris 10 OS. It turns application services into first-class objects that administrators can observe and manage in a uniform way, and it can restart and manage them automatically.
The Solaris Service Manager software can restart services if they are accidentally terminated by an administrator, if they are aborted as the result of a software programming error, or if they are innocently affected by an underlying hardware problem.
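Automated restart can be pictured as a supervisor loop around a service's start method. The following is a minimal sketch under stated assumptions: `ServiceSupervisor`, its restart limit, and the state labels are hypothetical names chosen for illustration, and the "service" is a plain Python callable rather than a real process.

```python
class ServiceSupervisor:
    """Toy restarter: re-run a service whenever it stops unexpectedly.

    The class, the restart limit, and the state labels are illustrative only.
    """

    def __init__(self, name, start, max_restarts=3):
        self.name = name
        self.start = start              # callable; raising an exception means it failed
        self.max_restarts = max_restarts
        self.restarts = 0
        self.state = "offline"

    def run(self):
        """Run the service, restarting on failure until a clean exit or the limit."""
        while True:
            try:
                self.state = "online"
                self.start()
                self.state = "offline"  # clean exit
                return self.state
            except Exception:
                if self.restarts >= self.max_restarts:
                    self.state = "maintenance"  # give up; wait for an administrator
                    return self.state
                self.restarts += 1


attempts = []


def flaky_service():
    """Hypothetical service that crashes twice, then runs to completion."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated crash")


supervisor = ServiceSupervisor("demo", flaky_service)
print(supervisor.run(), supervisor.restarts)  # offline 2
```

Capping the number of restarts keeps a persistently broken service from looping forever; in this sketch such a service is parked in a separate state for human attention.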
Additionally, the Solaris Service Manager software simplifies and secures common administrative tasks, such as disabling services and changing properties. The Solaris Service Manager software also speeds system boot by starting services in parallel according to their dependencies. The "undo" feature helps safeguard against human errors by permitting easy rollback of changes.
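Starting services in parallel according to their dependencies amounts to walking a dependency graph level by level. As a rough sketch (the service names are made up, and this uses Python's standard `graphlib` module, available in Python 3.9 and later, rather than anything from Solaris), services can be grouped into batches whose members have no unmet dependencies and could therefore start at the same time:

```python
from graphlib import TopologicalSorter


def startup_batches(dependencies):
    """Group services into batches; every service in a batch can start in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # services whose dependencies are all started
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical services: each key depends on the services in its set.
deps = {
    "network": set(),
    "filesystem": set(),
    "name-service": {"network"},
    "web-server": {"network", "filesystem"},
    "app": {"web-server", "name-service"},
}
for batch in startup_batches(deps):
    print(batch)
# ['filesystem', 'network']
# ['name-service', 'web-server']
# ['app']
```

A strictly sequential boot would start these five services one after another; here they start in three rounds.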
The Solaris Service Manager software provides observability and fault isolation for legacy Solaris OS services without requiring them to change. By adding a simple XML file to their software, developers can convert most existing applications to take advantage of the full suite of features.
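To give a feel for what such an XML description might contain, the sketch below generates a simplified one with Python's standard `xml.etree.ElementTree` module. The article does not show the file format, so the element and attribute names used here (`service_bundle`, `service`, `dependency`, `exec_method`, and so on) are assumptions modeled on the general shape of a service manifest; the service name, command path, and dependency are hypothetical, and the output is not validated against any DTD.

```python
import xml.etree.ElementTree as ET


def build_manifest(service_name, start_command, depends_on):
    """Build a simplified service description as an XML string.

    Element and attribute names are assumptions for illustration only.
    """
    bundle = ET.Element("service_bundle", {"type": "manifest", "name": service_name})
    service = ET.SubElement(
        bundle, "service", {"name": service_name, "type": "service", "version": "1"}
    )
    ET.SubElement(service, "create_default_instance", {"enabled": "false"})
    for index, target in enumerate(depends_on):
        dependency = ET.SubElement(
            service,
            "dependency",
            {
                "name": f"dep{index}",
                "grouping": "require_all",
                "restart_on": "none",
                "type": "service",
            },
        )
        ET.SubElement(dependency, "service_fmri", {"value": target})
    ET.SubElement(
        service,
        "exec_method",
        {"type": "method", "name": "start", "exec": start_command, "timeout_seconds": "60"},
    )
    return ET.tostring(bundle, encoding="unicode")


# Hypothetical application, start command, and dependency.
manifest = build_manifest("site/myapp", "/opt/myapp/bin/start", ["svc:/example/network"])
print(manifest)
```

The point of the sketch is that the description is declarative: it names the service, what it depends on, and how to start it, which is the information a service manager needs to observe, order, and restart it.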
Building a Better Feedback Loop
As customers' PSH technology-enabled networks gather information about system problems, a robust feedback loop rapidly grows between Sun and those customers, which fuels ongoing improvements.
"The big picture is that PSH features will help Sun offer customers a quantitative view of enterprise availability, which leads to more informed purchasing decisions," says Shapiro. "PSH technology allows us to be more proactive and predictive in the way we interact with customers and provide services to them."
What's Next for PSH Technology?
PSH technology brings new availability technology to every Solaris 10 OS system. ZFS, another major component in the Solaris 10 OS, also includes self-healing features. Visit sun.com on September 14 for a feature story about all the capabilities of ZFS, a vertically integrated storage system that provides end-to-end data integrity, immense (128-bit) capacity, and very simple administration.
To achieve even higher levels of availability for the enterprise, you can also add redundancy and cross-machine fail-over to services deployed on the Solaris OS systems using the Sun Cluster software. In much the same way that PSH capabilities within the Solaris 10 OS work in close harmony with monitored components, the Sun Cluster software improves availability through tight integration with hardware.
Try Predictive Self-Healing Technology Today
Put PSH capabilities to use now by downloading the Software Express for Solaris 10 OS release, which includes the Solaris Fault Manager software and CPU, memory, and I/O support for UltraSPARC processor systems. Sun plans to release the Solaris Service Manager software component in the next Software Express for Solaris OS release. You can also join the discussion about PSH technology, including the Solaris Fault Manager software.
The Solaris 10 OS is only the beginning for the PSH architecture. Future updates will include similar hardware diagnostic capabilities for AMD Opteron-based x86 systems, as well as self-healing features for other system components. The PSH architecture is designed for upgrading, allowing new diagnostic capabilities to be added without the need for system downtime.
"The predictive self-healing features in the Solaris 10 OS are just the first stop on our roadmap," says Shapiro. "We are building self-healing technology into our systems from the lowest levels of the hardware/software stack upward. The result is a scalable and effective architecture that rapidly diagnoses and adapts to issues as they occur, and then isolates the problem without downtime."