BigAdmin System Administration Portal
Feature Tech Tip

Interrupt Resource Management Feature in the OpenSolaris OS

Scott Carter, August 2009

Overview

This tech tip explains Solaris Interrupt Resource Management (IRM), which optimizes the use of interrupt vectors to improve IO performance. This feature is available in OpenSolaris build 107, and is scheduled to be in the Solaris 10 10/09 Operating System for SPARC platforms.

Interrupt vectors are how IO devices inform the system that important events have occurred and that the devices need to be serviced. Each interrupt vector maps to an interrupt service routine that responds to these events.

Modern IO devices can have multiple interactions occurring simultaneously with external hardware. Specifying the set of possible external events with finer granularity, and giving an IO device more interrupt vectors to signal those events, facilitates the use of smaller, more efficient interrupt service routines that can execute in parallel. The resulting gain in interrupt handling efficiency can improve overall IO performance, especially on multi-processor systems.

The new technology for specifying and signaling more independent events is Message Signaled Interrupts (MSI), and the Solaris feature to give IO devices more interrupt vectors is Interrupt Resource Management (IRM).

Technical Background

It used to be that PCI devices had to physically assert an interrupt pin to signal an interrupt. But MSI allows IO devices to instead signal interrupts by sending messages over the bus, the same way they send ordinary data.

An IO device enumerates the set of events for which it could signal interrupts, and these numbered events are known as MSI numbers. To signal an event, the IO device then just writes an MSI message onto the bus containing the MSI number of whichever event occurred.

MSI is a large improvement because a PCI bus has very few interrupt lines, whereas an MSI message has many more bits available, so a much wider range of events can be signaled. The MSI specification supports up to 32 MSI numbers per device, and an extended version (MSI-X) expands this to 2048. By contrast, a PCI bus provides only 4 interrupt lines to each IO device.

When an IO device sends an MSI message, the IO bus controller receives it and triggers an interrupt vector to the host system in response. Typically there is a translation table in the IO bus controller (programmed by the system) to map MSI numbers into interrupt vectors. But this leads to a new limitation affecting MSI and MSI-X: Instead of a limited number of interrupt lines, there is a limited number of interrupt vectors available in the system. Many SPARC based systems provide 256 interrupt vectors to each PCI bus. And x86 based systems have 256 interrupt vectors per processor, shared by all IO buses.
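The translation step described above can be sketched as a small Python model. This is only an illustration: the class name BusController, its methods, the device name "nic0", and the tiny pool of four vectors are all hypothetical, and a real bus controller performs this lookup in hardware.

```python
class BusController:
    """Toy model of an IO bus controller's MSI-to-vector translation table."""

    def __init__(self, num_vectors=256):
        self.num_vectors = num_vectors
        self.table = {}      # (device, msi_number) -> interrupt vector
        self.next_free = 0   # next unused vector on this controller

    def program(self, device, msi_number):
        """System software assigns a free vector to one MSI number."""
        if self.next_free >= self.num_vectors:
            raise RuntimeError("no interrupt vectors left")
        vector = self.next_free
        self.table[(device, msi_number)] = vector
        self.next_free += 1
        return vector

    def receive(self, device, msi_number):
        """An MSI message arrives; look up the vector to trigger."""
        return self.table[(device, msi_number)]


ctrl = BusController(num_vectors=4)   # hypothetical, deliberately small pool
for msi in range(4):
    ctrl.program("nic0", msi)
print(ctrl.receive("nic0", 2))        # vector programmed for MSI number 2
```

Once the four vectors in this toy pool are programmed, any further call to program() fails, which is the vector shortage the paragraph above describes.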

The capabilities of individual IO devices vary; they don't necessarily signal the full range of possible MSI numbers. But together, all the IO devices in a system could consume all the available interrupt vectors. Thus, the Solaris OS must manage how many interrupt vectors are given to each IO device.

The goal is to give each IO device as many interrupt vectors as possible. If an IO device signals more MSIs than it has interrupt vectors, its interrupt service routines will be less efficient. When multiple MSI numbers are mapped to the same vector, the interrupt service routine must first determine which one occurred before servicing it. This involves reading status registers and analyzing them, which wastes IO bus cycles and CPU cycles.

Mapping each MSI directly to its own interrupt vector avoids this overhead. Furthermore, if multiple MSIs share an interrupt vector, those events are undesirably serialized, because each vector can signal only one event at a time.
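The difference between a shared and a dedicated vector can be sketched as follows. The function names and the idea of counting status register reads are illustrative assumptions, not part of any Solaris interface; the point is only that the shared handler needs an extra device access before it knows what to service.

```python
def service_shared(status_register):
    """One vector shared by several MSIs: the handler must read and decode
    a status register to learn which events are pending."""
    register_reads = 1   # the extra IO bus access
    pending = [bit for bit in range(status_register.bit_length())
               if status_register & (1 << bit)]
    return pending, register_reads


def service_dedicated(msi_number):
    """One vector per MSI: the vector itself identifies the event."""
    return [msi_number], 0   # no status register read needed


print(service_shared(0b0101))   # ([0, 2], 1)
print(service_dedicated(2))     # ([2], 0)
```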

How IRM Works

To understand how IRM works, it is important to first understand when a device driver initializes its interrupt handling. Originally, this occurred only when a device got attached to the system. If any new devices were inserted, they could use only whatever remaining interrupt vectors were still unused. And if a device was removed, its interrupt vectors could only return to the pool of unused vectors instead of being redistributed to other devices.

This was fine when each IO device could signal only one or two interrupts. Systems could be designed with enough interrupt vectors to support all the IO devices that could ever possibly be attached. But with MSI and MSI-X this no longer always holds. Many more interrupts can now be signaled independently, and the total number of interrupt vectors in the system might not always be enough.

Before the Solaris IRM feature, a tradeoff was made to sacrifice some of the potential IO performance to ensure that there was always a supply of available interrupts. This ensured new IO devices could always be hotplugged into a running system. No matter how many MSIs an IO device could signal, Solaris always imposed the conservative limit of just two interrupt vectors per device.

With IRM, it is now possible to safely exceed this limit and give each device more interrupt vectors. This is because IRM introduces new interfaces to the Solaris Device Driver Interface (DDI) specifically to reconfigure a driver's interrupt handling at any time after the device is already attached.

With the new DDI interfaces, a device driver can register a callback function, which will then be used by IRM to notify the device driver whenever there are more or fewer interrupt vectors available. The device driver can then reconfigure itself in response to these notifications.
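The callback pattern described above can be modeled in a few lines of Python. This article does not name the DDI entry points, so every name here (Irm, Driver, register, notify, on_irm_event, and the "add"/"remove" actions) is hypothetical and stands in for whatever the real interfaces provide.

```python
class Irm:
    """Toy stand-in for IRM: tracks registered drivers and notifies them."""

    def __init__(self):
        self.callbacks = {}

    def register(self, driver_name, callback):
        self.callbacks[driver_name] = callback

    def notify(self, driver_name, action, count):
        # action is "add" (more vectors available) or "remove" (give some back)
        self.callbacks[driver_name](action, count)


class Driver:
    def __init__(self, name, vectors):
        self.name = name
        self.vectors = vectors

    def on_irm_event(self, action, count):
        """Reconfigure interrupt handling in response to a notification."""
        if action == "add":
            self.vectors += count
        elif action == "remove":
            # Keep at least one vector so the device can still interrupt.
            self.vectors = max(1, self.vectors - count)
        else:
            raise ValueError(f"unknown action: {action}")


irm = Irm()
nic = Driver("nic0", vectors=2)
irm.register(nic.name, nic.on_irm_event)
irm.notify("nic0", "add", 6)      # more vectors became available
irm.notify("nic0", "remove", 3)   # some must be given back
print(nic.vectors)                # 5
```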

This ability to dynamically reconfigure the interrupt handling of attached drivers now makes it safe for Solaris to fully utilize all available interrupt vectors. If a new IO device is inserted, IRM rebalances how many interrupt vectors are given to each IO device. Some drivers might be notified to release previously used interrupt vectors in order to accommodate the new IO device. Likewise, when an IO device is removed, its interrupt vectors can now be redistributed among the remaining IO devices.

The rebalancing algorithms consider the different needs of each IO device and compute an allocation for each that is fair in relation to the others. The results of a rebalancing are also consistent no matter how the final IO configuration was reached: it makes no difference whether an IO device was part of the initial configuration or was hotplugged later.
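The article does not describe the actual rebalancing algorithms, so the following is only a sketch of one simple policy that has the two stated properties, fairness and independence from attach order. The function name rebalance, the round-robin policy, the device names, and the pool size of 16 are all assumptions made for illustration.

```python
def rebalance(pool_size, requests, minimum=1):
    """Distribute pool_size vectors among devices.

    requests maps device name -> number of vectors requested.
    Every device first gets `minimum` (or its request, if smaller); the
    remainder is handed out one vector at a time, round-robin, to devices
    that still want more. Sorting by name makes the result independent of
    the order in which devices were attached.
    """
    names = sorted(requests)
    alloc = {n: min(minimum, requests[n]) for n in names}
    remaining = pool_size - sum(alloc.values())
    if remaining < 0:
        raise ValueError("pool too small for the minimum allocations")
    while remaining > 0:
        hungry = [n for n in names if alloc[n] < requests[n]]
        if not hungry:
            break   # every request is fully satisfied
        for n in hungry:
            if remaining == 0:
                break
            alloc[n] += 1
            remaining -= 1
    return alloc


boot_order = {"nic0": 8, "hba0": 4, "nic1": 8}
hotplug_order = {"nic1": 8, "nic0": 8, "hba0": 4}
print(rebalance(16, boot_order))
print(rebalance(16, boot_order) == rebalance(16, hotplug_order))   # True
```

In this sketch the device with the smaller request is satisfied in full, and the two larger requesters split what is left evenly.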

Larger numbers of interrupt vectors are given only to IO devices whose drivers provide the necessary callback mechanism. And for various technical reasons, the IRM feature is available only to IO devices that use MSI-X to signal their interrupts. The previous limit still applies in all other cases.
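The eligibility rule above can be restated as a short function. The function name max_vectors and its parameters are hypothetical; the limit of two comes from the pre-IRM behavior described earlier, and the result is only an upper bound, since the actual grant also depends on what the pool has available.

```python
LEGACY_LIMIT = 2   # pre-IRM per-device limit described in this article


def max_vectors(interrupt_type, has_callback, supported):
    """Upper bound on vectors for one device under the rules above."""
    if interrupt_type == "MSI-X" and has_callback:
        return supported                   # IRM may grant up to what the device supports
    return min(supported, LEGACY_LIMIT)    # the previous limit still applies


print(max_vectors("MSI-X", True, 32))    # 32
print(max_vectors("MSI-X", False, 32))   # 2
print(max_vectors("MSI", True, 8))       # 2
```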

Exploring IRM

IRM works mostly behind the scenes, automatically tuning the interrupt utilization of a running system. But how well the system can be tuned depends upon using the right IO devices with the right drivers.

Two new MDB macros are provided to investigate what tuning has occurred and what opportunities exist for improvement. The first is ::irmpools, which displays a high-level summary of the interrupt utilization. The second is ::irmreqs, which displays details about the IO devices associated with an individual IRM pool.

The definition of IRM pools and how IO devices are associated with them depends upon the system architecture. Currently, IRM pools are defined only on SPARC based systems to represent the interrupt vectors for each PCI bus controller. (The implementation for x86 systems is under review; it is possible that there will be global pools shared by all buses.) If no IRM pools are defined in the system, these MDB macros display nothing.

Here is an example on a SPARC based system (a Sun SPARC Enterprise M5000 Server):

The example shows there are four IRM pools, one for each PCI Express bus. Each pool has a total size of 256 interrupt vectors and supports IO devices using either MSI or MSI-X. The output shows the total number of interrupt vectors requested from each pool and the total number utilized from each pool.

Also shown are the capabilities and current interrupt utilization of each IO device that is attached to one of the PCI Express buses. The output shows how many MSI numbers each device can signal (NINTRS), how many its driver requested from the system (NREQ), and how many it was given (NAVAIL).

The output also indicates which drivers have a callback function for IRM notifications. The REQUESTED value of an IRM pool is the sum of the NREQ values of its associated IO devices, and the RESERVED value is the sum of the associated NAVAIL values.
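The relationship between the per-device columns and the pool totals can be checked with a small calculation. The driver names and numbers below are made up for illustration and are not taken from the M5000 output; only the column names (NINTRS, NREQ, NAVAIL, REQUESTED, RESERVED) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    driver: str
    nintrs: int   # MSI numbers the device can signal
    nreq: int     # vectors the driver requested
    navail: int   # vectors the driver was given


def pool_totals(requests):
    """Return (REQUESTED, RESERVED) for the devices in one pool."""
    requested = sum(r.nreq for r in requests)
    reserved = sum(r.navail for r in requests)
    return requested, reserved


# Hypothetical devices: drvA has no callback and is held to two vectors.
reqs = [
    Request("drvA", nintrs=8, nreq=8, navail=2),
    Request("drvB", nintrs=32, nreq=16, navail=16),
    Request("drvC", nintrs=1, nreq=1, navail=1),
]
print(pool_totals(reqs))   # (25, 19)
```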

This example clearly shows that interrupt utilization could be improved with updated device drivers.
