Fast Track to Solaris 10 Adoption: Predictive Self-Healing
Performance Issues
- What mechanism does PSH use to detect an application crash? Will it be able to automatically upload crash dump to Sun and initiate a service call to tech services?
- I see PSH-handled memory and CPU potential problems. Are there any plans for predicting disk failures, perhaps through the use of SMART technology built into many drives?
- Given that PSH supports systems as individuals, will it handle RAID levels with customizations?
- How will PSH handle fan failures or CPU overheats?
- Does self-describing data imply that there are XML schemas?
- Are PSH capabilities applied system wide, or is there a finer granularity (e.g., Solaris 10 OS Zones/grid containers)?
- Will PSH be able to interact across zones? For example, offlining one zone and reallocating a NIC card to replace one that failed in a different (more critical) zone?
- How does PSH interact with Containers to help prevent downtime? Can a hardware failure be transparent to a container?
- My server had a panic on CPU2 write back check error. Should I say that Solaris 10 OS PSH will save the server from this type of error?
- Can PSH send traps to a monitoring solution when there is a fault?
- Will the SUNW-MSG-ID be server-centric with updates?
- Will this provide the ability to dynamically reconfigure domains based on load? I have a batch process that runs each evening; I would like the domain to borrow resources when the other domains are idle and return them when needed.
- Does PSH work or react any differently on real (Intel) processors, compared to AMD processors?
- Will every event that used to result in a message to syslog now (also) result in a message to fmd?
- Will I be able to add new errors and responses to PSH?
- Is there a possibility for PSH to detect failures in services on other machines, e.g., a database failure on other server?
- Will PSH have a defined message set logging to /var/adm/messages, and will there be a string that can identify all PSH messages?
- If it auto-recovers, does it also auto-patch and upgrade?
- What errors that would panic a system today will be handled by PSH? What won't?
- How does PSH decide on corrective action to take? Is there a fault-action list published? Can the admin control the actions taken, especially something that affects performance of the system?
- Are the messages in the Software Express program shown before the problem occurs, starting from symptoms?
- What are the data sources that feed error reports? Are any of these new relative to the Solaris 9 OS?
- What happens when a fault occurs on a CPU where the kernel is running or where permanent memory resides?
- Does this include kernel parameter tuning, e.g., the number of handles for users and such (being auto-tuned)?
- Can PSH be used to monitor and correct problems in user apps? If so, can I use the framework to run custom scripts? Is there an API for the framework?
- What overhead, if any, does FMD introduce, or will this be similar in nature to the use of DTrace?
Q: What mechanism does PSH use to detect an application crash? Will it be able to automatically upload crash dump to Sun and initiate a service call to tech services?
A: Detection is handled by a daemon (fmd) that collects information on events inside the system and updates an event list that is "published" for use by the other agents in the system. Regarding the service call, this is ongoing work with Services and monitoring agents; our plan is to leverage customized agents to do so. Please follow up at http://www.sun.com/software/solaris/10/
Q: I see PSH-handled memory and CPU potential problems. Are there any plans for predicting disk failures, perhaps through the use of SMART technology built into many drives?
A: Yes. We are planning to harvest SMART data and create error telemetry that allows us to predict disk failures.
Q: Given that PSH supports systems as individuals, will it handle RAID levels with customizations?
A: Integration of PSH outside of a single system is planned. We are looking to bring PSH technology out to the network, RAID, and fabric-based storage.
Q: How will PSH handle fan failures or CPU overheats?
A: CPU Overheats: yes; Fan Failures: yes, depending on platform.
Q: Does self-describing data imply that there are XML schemas?
A: XML can be used to marshal our self-describing error and fault protocol event data.
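As a rough illustration of what marshalling self-describing data to XML can look like, the sketch below serializes a list of members, each carrying its own name and type, using Python's standard library. The element names, the event class string, and the payload members are hypothetical and are not the actual PSH event schema.

```python
import xml.etree.ElementTree as ET

def marshal_event(event_class, members):
    """Serialize a self-describing event (name, type, value triples) to XML."""
    root = ET.Element("event", {"class": event_class})
    for name, type_name, value in members:
        # Each member carries its own name and type, so a consumer needs no
        # out-of-band knowledge of the layout to decode it.
        item = ET.SubElement(root, "member", {"name": name, "type": type_name})
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical event class and payload, for illustration only.
xml_text = marshal_event(
    "ereport.example.cpu.error",
    [("cpuid", "uint32", 2), ("detector", "string", "example-detector")],
)
print(xml_text)
```

Because every member names and types itself, a reader can walk the XML generically without a fixed record layout.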
Q: Are PSH capabilities applied system wide, or is there a finer granularity (e.g., Solaris 10 OS Zones/grid containers)?
A: Actually, the granularity of retiring faulty resources is much finer than even zones: in the Solaris 10 OS, we can offline faulty CPUs, individual physical pages of memory, and I/O devices, and kill processes and restart the affected service. This works in local zones as well as the global zone.
Q: Will PSH be able to interact across zones? For example, offlining one zone and reallocating a NIC card to replace one that failed in a different (more critical) zone?
A: This is exactly the type of functionality we will be able to build by connecting PSH with Solaris's IPMP feature and the ability to export virtualized network interfaces into zones. We don't have this in the Solaris 10 OS yet, but as we convert our networking subsystems to PSH, we will be able to do that, and that is exactly the type of thing we want to be able to deliver to you.
Q: How does PSH interact with Containers to help prevent downtime? Can a hardware failure be transparent to a container?
A: PSH interacts with Containers in that we try to isolate errors to a user process if possible and restart its containing service. If we can't do that, but we can isolate the problem to a zone (container), then the zone can be restarted. Hardware failures are "transparent" to a container in that Containers typically depend on virtualized resources, such as a pool of CPUs or a filesystem. Depending on the failure mode of the underlying resource and how that manifests through the virtualized resource exported to the container, the problem may or may not be "visible." Finally, the diagnosis results and suggested repair actions communicated to syslog, for example, are always transparent to containers: they are logged only to the global zone for the system administrator and are not seen by users in the local zones.
Q: My server had a panic on CPU2 write back check error. Should I say that Solaris 10 OS PSH will save the server from this type of error?
A: I'd need the complete error message, with the context of where we detected the error, to tell you whether it would be recoverable or not, but yes, PSH would have automatically diagnosed this problem for you. Errors such as the one you describe now produce automated telemetry events to be diagnosed by PSH, and we've made a continuous effort across our Solaris 8 and 9 OS patches and in the Solaris 10 OS to harden the Solaris OS against all such errors to the degree permitted by the hardware.
Q: Can PSH send traps to a monitoring solution when there is a fault?
A: Our plan is to leverage custom agents to do it; please check our site often for new information: http://www.sun.com/msg
Q: Will the SUNW-MSG-ID be server-centric with updates?
A: As appropriate, the message ID will be platform specific. For example, a fault message that is specific to a Sun Fire 6900 system will contain a message ID that is unique to that platform. The message and its ID will direct the admin to platform-specific response and repair actions.
Q: Will this provide the ability to dynamically reconfigure domains based on load? I have a batch process that runs each evening; I would like the domain to borrow resources when the other domains are idle and return them when needed.
A: No. All PSH responses (DR included) are based on the diagnosis of a system fault.
Q: Does PSH work or react any differently on real (Intel) processors, compared to AMD processors?
A: No, it behaves the same.
Q: Will every event that used to result in a message to syslog now (also) result in a message to fmd?
A: No. The transition from old-style software that simply spews error messages to syslog to self-healing telemetry is a gradual one. We've focused on some of the key areas for RAS in the first release (e.g., CPU, Memory, I/O), and we'll be working on others in priority order. We also want our partners and ISVs to plug in.
Q: Will I be able to add new errors and responses to PSH?
A: In the first release of PSH, the way we will permit you to plug in to PSH is using the Service Management Facility (SMF) for user applications. Later, we will begin exposing APIs for other types of plug-ins to device driver developers and for other uses. We will also deliver modules that permit administrators to configure custom responses to diagnosis results such as e-mail messages, SNMP traps, and so on.
Q: Is there a possibility for PSH to detect failures in services on other machines, e.g., a database failure on other server?
A: As long as the system is running the Solaris 10 OS, yes; just remember that this is a per-system feature.
Q: Will PSH have a defined message set logging to /var/adm/messages, and will there be a string that can identify all PSH messages?
A: Yes. If you download the white paper from http://www.sun.com/msg/ you will see an example screenshot of the diagnosis message. The message always starts with "SUNW-MSG-ID" in the upper-left-hand corner. You can also configure the syslog-msgs PSH module to send the PSH diagnosis results to one of syslogd(1M)'s LOCAL0-7 facilities, and then set up syslog.conf to segregate that facility into a separate file (i.e., other than /var/adm/messages). Finally, in the Solaris 10 OS timeframe we will be delivering a module that will permit administrators to forward such messages to custom scripts (e.g., to e-mail them).
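Since every diagnosis message carries the "SUNW-MSG-ID" string, a script can pick PSH messages out of a mixed log by searching for it. The following is a minimal sketch of that idea; the exact layout after the marker (a colon followed by an identifier), the sample lines, and the ID value are assumptions made for illustration, not a copy of real output.

```python
import re

# Matches the identifier that follows the "SUNW-MSG-ID" marker.
# The "marker: ID" layout is an assumption for this sketch.
MSG_ID = re.compile(r"SUNW-MSG-ID:\s*([A-Za-z0-9-]+)")

def psh_message_ids(lines):
    """Return the message IDs found in the lines that contain SUNW-MSG-ID."""
    ids = []
    for line in lines:
        match = MSG_ID.search(line)
        if match:
            ids.append(match.group(1))
    return ids

# Made-up sample lines; the ID value is a placeholder.
sample = [
    "Jan  1 00:00:01 host sshd[100]: unrelated message",
    "Jan  1 00:00:02 host fmd: SUNW-MSG-ID: EXAMPLE-0000-XX, TYPE: Fault",
]
print(psh_message_ids(sample))
```

The same filter could be pointed at /var/adm/messages or at a separate file fed by one of the LOCAL0-7 facilities.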
Q: If it auto-recovers, does it also auto-patch and upgrade?
A: Not currently, but we are researching diagnosis of software defects and automated responses and self-healing of broken software packages.
Q: What errors that would panic a system today will be handled by PSH? What won't?
A: In both past releases and in the Solaris 10 OS, we've been actively working to make the system recover from as many types of errors as possible. One place we've made great progress is in I/O: the Solaris 10 OS will not panic from any PCI bus transaction where the hardware maintains system coherence; this was not true in previous releases. We'll also be bringing major improvements to the resilience of the Solaris x86 OS on AMD processors during the Solaris 10 OS timeframe. There are always cases where the kernel must panic to preserve the integrity of your user data, because the hardware error is so severe that the hardware cannot capture enough state or maintain coherence for the OS to recover. All CPUs and I/O hardware have cases like this. Our goal is to make the Solaris 10 OS able to survive all the others.
Q: How does PSH decide on corrective action to take? Is there a fault-action list published? Can the admin control the actions taken, especially something that affects performance of the system?
A: There is not (currently) a way to get a full "action" list. The admin may configure PSH agent activity via configuration options. The current options are coarse-grained: on or off. Future enhancements will give the admin more fine-grained control over agent actions.
Q: Are the messages in the Software Express program shown before the problem occurs, starting from symptoms?
A: Yes.
Q: What are the data sources that feed error reports? Are any of these new relative to the Solaris 9 OS?
A: All of PSH is new relative to the Solaris 9 OS. Prior to the Solaris 10 OS and PSH, all error information from the kernel was transmitted by means of cmn_err(9F), which just sends a text string for humans to syslog. In the Solaris 10 OS, PSH uses a structured event transport to send telemetry events for automated diagnosis from the kernel to fmd(1M). Also, we have enhanced the ability of many subsystems, such as our bus nexus drivers, to be able to capture error reports for automated diagnosis.
Q: What happens when a fault occurs on a CPU where the kernel is running or where permanent memory resides?
A: If a CPU fault occurs while a thread is executing in userland, then the user process will be terminated and the Service Manager will restart the containing service. If a CPU fault occurs while a thread is executing in the kernel, then the outcome depends on whether the thread is in a protected code region. One example of such a region is copying data in (or out) as part of system call processing, and there are other examples. In these cases, we can similarly contain the problem and continue.
If we are not in a protected region, the kernel will panic and reset, and the problem will be diagnosed on the way back up (or by a service processor). Memory faults have a similar set of cases: the Solaris VM system has to look at the state of the page, whether a read or write is being attempted, whether the page is clean or dirty, and so on, to determine the degree of isolation and recovery. We're actively working on improving every area of the Solaris OS to handle errors by isolating and retiring the bad resource to the degree possible, given the hardware platform, whether it be SPARC or x86/AMD technology.
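The CPU fault cases described in this answer can be summarized as a small decision function. This is only a sketch of the reasoning in the text, not kernel code; the function name, parameters, and result strings are invented for illustration.

```python
def cpu_fault_response(in_kernel, in_protected_region=False):
    """Map the context of a CPU fault to the response described above."""
    if not in_kernel:
        # Fault hit a thread running in userland.
        return "terminate user process; restart containing service"
    if in_protected_region:
        # For example, copying data in or out during a system call.
        return "contain the error and continue"
    # Unprotected kernel code: the system cannot safely continue.
    return "panic and reset; diagnose on the way back up"

for context in [(False, False), (True, True), (True, False)]:
    print(context, "->", cpu_fault_response(*context))
```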
Q: Does this include kernel parameter tuning, e.g., the number of handles for users and such (being auto-tuned)?
A: One of our major goals in the Solaris OS has always been to make the system self-tuning. In every release, we've taken away more and more tunables: for example, in the Solaris 9 OS, we made the number of PTYs scale automatically, and in the Solaris 10 OS we've taken away the need to tune IPC and shmem tunables, making them dynamically scaling resource controls. In PSH, we've designed our new features from the ground up not to require custom tunables.
Q: Can PSH be used to monitor and correct problems in user apps? If so, can I use the framework to run custom scripts? Is there an API for the framework?
A: PSH includes our new Service Management Facility (SMF). You can see some previews of this in Stephen Hahn's blog. SMF monitors all running services on the system and can automatically restart them. APIs are provided for writing custom monitoring scripts, including the ability to wait for a service to change state. SMF will appear in the next Solaris Express download.
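To give a feel for a monitoring script that waits for a service to change state, here is a minimal sketch that polls rather than using the SMF APIs mentioned above. It assumes the svcs command is on the PATH and that `svcs -H -o state FMRI` prints the service's current state; the service name in the usage example is hypothetical, and the demo substitutes a fake state source so it runs anywhere.

```python
import subprocess
import time

def svcs_state(fmri):
    """Read a service's current state by running `svcs -H -o state FMRI`."""
    result = subprocess.run(
        ["svcs", "-H", "-o", "state", fmri],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def wait_for_state(fmri, target, get_state=svcs_state, timeout=60.0, interval=1.0):
    """Poll until the service reaches `target`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if get_state(fmri) == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Demo with a fake state source instead of a live system.
states = iter(["offline", "offline", "online"])
reached = wait_for_state(
    "svc:/example/app:default", "online",
    get_state=lambda fmri: next(states), interval=0.0,
)
print(reached)
```

Passing the state reader in as a parameter keeps the waiting logic testable without a running service.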
Q: What overhead, if any, does FMD introduce, or will this be similar in nature to the use of DTrace?
A: DTrace is a facility for dynamic instrumentation, so it has no overhead when not in use, and an overhead proportional to the question you ask when you use it. FMD is a continuously running daemon, but it only does work when an error is detected on the system for which self-healing telemetry is present. So its cost depends on whether the system is experiencing a fault.