Solaris OS Networking -- The Magic Revealed
Sunay Tripathi (Sr. Staff Engineer, Solaris Core Technology Group), January 2006
Abstract: This paper discusses the networking advancements in the Solaris 10 OS, as well as the evolution of networking in previous releases. Topics include TCP, UDP, IP, the device driver framework, and tuning for performance.
1.0 Background
The networking stack of the Solaris 1.x Operating System was a BSD variant, very similar to the BSD Reno implementation. The BSD stack worked fine for low-end machines, but Sun wanted to satisfy the needs of low-end customers as well as enterprise customers, so the Solaris OS migrated to the AT&T SVR4 architecture, which became the Solaris 2.x platform.
With the Solaris 2.x OS, the networking stack went through a makeover and transitioned from a BSD-style stack to a STREAMS-based stack. The STREAMS framework provided an easy message-passing interface that gave one STREAMS module the flexibility to interact with another. Using the STREAMS inner and outer perimeters, the module writer could provide mutual exclusion without making the implementation complex. The cost of setting up a stream was high, but the number of connection setups per second was not an important criterion at the time: connections were usually long lived (NFS, FTP, and so on), so the cost of setting up a new stream was amortized over the life of the connection.
During the late 1990s, servers became heavily SMP-based, running large numbers of CPUs. The cost of switching processing from one CPU to another became high as mid- to high-end machines became more NUMA-centric. Since STREAMS by design did not have any CPU affinity, packets for a particular connection moved around to different CPUs. It was apparent that the Solaris product needed to move away from the STREAMS architecture.
The late 1990s also saw the explosion of the World Wide Web. An increase in processing power meant a large number of short-lived connections, making connection setup time equally important.
With the Solaris 10 platform, the networking stack went through one more transition where the core pieces (for example, socket layer, TCP, UDP, IP, and device driver) used an IP Classifier and serialization queue to improve the connection setup time, scalability, and packet processing cost. STREAMS architecture is still used to provide the flexibility that ISVs need to implement additional functionality.
2.0 The Stack of the Solaris 10 OS
Let's have a look at how the new framework and its key components function. Before the Solaris 10 OS, the stack used STREAMS perimeters and kernel adaptive mutexes for multithreading. TCP used a STREAMS QPAIR perimeter, UDP used a STREAMS QPAIR with PUTSHARED, and IP used a PERMOD perimeter with PUTSHARED, with various TCP, UDP, and IP global data structures protected by mutexes. The stack was executed by user-land threads executing various system calls, by the network device driver read-side interrupt or device driver worker thread, and by STREAMS framework worker threads. That perimeter is a per-module, per-protocol stack layer, or horizontal, perimeter. The STREAMS-based stack of Solaris 9 and earlier provides a horizontal perimeter around each individual protocol layer, which often leads to a packet being processed on more than one CPU and by more than one thread, with queuing between protocol layers. This leads to excessive context switching and poor data locality for connection-specific data structures.
The "FireEngine" approach is to merge all protocol layers into one STREAMS module that is fully multithreaded. Inside the merged module, instead of per-data structure locks, a per-CPU synchronization mechanism called "vertical perimeter" is used. The "vertical perimeter" is implemented using a serialization queue abstraction called "squeue."
Each squeue is bound to a CPU, and each connection is in turn bound to a squeue that provides any synchronization and mutual exclusion needed for the connection-specific data structures. The connection (or context) lookup for inbound packets is done outside the perimeter, using an IP connection classifier, as soon as the packet reaches IP. Based on the classification, the connection structure is identified. Since the lookup happens outside the perimeter, we can bind a connection to an instance of the vertical perimeter or "squeue" when the connection is initialized and process all packets for that connection on the squeue it is bound to, maintaining better cache locality. More details about the vertical perimeter and classifier are given in later sections. The classifier also becomes the database for storing a sequence of function calls necessary for all inbound and outbound packets. This allows the Solaris networking stacks to be changed from the current message-passing interface to a BSD-style function call interface. The string of functions created on the fly (event-list) for processing a packet for a connection is the basis for an eventual new framework where other modules and third-party high-performance modules can participate in this framework.
Squeue guarantees that only a single thread can process a given connection at any given time, thus serializing access to the TCP connection structure by multiple threads (from both the read and write side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR perimeter but instead of just protecting a module instance, it protects the whole connection state from IP to sockfs.
Vertical perimeters, or squeues, by themselves just provide packet serialization and mutual exclusion for the data structures, but by creating a per-CPU perimeter and binding a connection to the instance attached to the CPU processing interrupts, we can offer much better data locality.
We could have chosen between creating a per-connection perimeter or a per-CPU perimeter -- that is, an instance per connection or per CPU. The overheads involved with a per-connection perimeter, together with thread contention, give lower performance, which made us choose a per-CPU instance. For a per-CPU instance, we had the choice of queuing a connection structure for processing or instead just queuing the packet itself and storing the connection structure pointer in the packet. The former approach leads to some interesting starvation scenarios where packets for a connection keep arriving, and the overhead of preventing such situations lowered performance. Queuing the packet preserves ordering and is much simpler, and this is the approach we have taken for FireEngine.
As mentioned before, each connection instance is assigned to a single squeue and is thus only processed within the vertical perimeter. As a squeue is processed by a single thread at a time, all data structures used to process a given connection from within the perimeter can be accessed without additional locking. This improves both the CPU and thread context data locality of access to the connection metadata, the packet metadata, and the packet payload data.
In addition this will allow the removal of per-device driver worker-thread schemes, which are problematic in solving a system-wide resource issue and allow additional strategic algorithms to be implemented to best handle a given network interface based on throughput of the network interface and the system throughput (for example, fanning out per-connection packet processing to a group of CPUs). The thread entering squeue may either process the packet right away or queue it for later processing by another thread or worker thread. The choice depends on the squeue entry point and on the state of the squeue. The immediate processing is only possible when no other thread has entered the same squeue. The squeue is represented by the following abstraction:
typedef struct squeue_s {
int_t sq_flag; /* Flags tells squeue status */
kmutex_t sq_lock; /* Lock to protect the flag etc */
mblk_t *sq_first; /* First Packet */
mblk_t *sq_last; /* Last Packet */
thread_t sq_worker; /* the worker thread for squeue */
} squeue_t;
It is important to note that squeues are created on a per-H/W-execution-pipeline basis -- that is, per core, hyper thread, and so on. The stack processing of the serialization queue (and the H/W execution pipeline) is limited to one thread at a time, but this actually improves performance because the new stack ensures that there are no waits for any resources, such as memory or locks, inside the vertical perimeter. Also, allowing more than one kernel thread to time-share the H/W execution pipelines has more overhead than allowing only one thread to run uninterrupted.
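The "process right away or queue for later" decision described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the names `sq_t`, `sq_enter`, and `sq_record` are hypothetical and are not the Solaris kernel interfaces, the model is single-threaded (so the per-squeue lock is omitted), and a second thread arriving while the squeue is busy is simulated by a re-entrant call from the packet handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; names are illustrative, not Solaris APIs. */
typedef struct pkt {
    struct pkt *next;
    int id;
} pkt_t;

struct sq;
typedef void (*sq_proc_t)(struct sq *, pkt_t *);

typedef struct sq {
    int busy;            /* nonzero while a caller is inside the perimeter */
    pkt_t *first;        /* head of the pending-packet queue */
    pkt_t *last;         /* tail of the pending-packet queue */
    sq_proc_t proc;      /* per-packet processing function */
    int order[8];        /* packet ids in the order they were processed */
    int norder;
} sq_t;

static void sq_init(sq_t *sq, sq_proc_t proc)
{
    sq->busy = 0;
    sq->first = sq->last = NULL;
    sq->proc = proc;
    sq->norder = 0;
}

/*
 * Enter the perimeter with one packet. A real squeue would hold its lock
 * around the flag and queue updates; this model is single-threaded, so
 * "another thread" is simulated by a re-entrant call made from proc().
 */
static void sq_enter(sq_t *sq, pkt_t *p)
{
    p->next = NULL;
    if (sq->busy) {                 /* someone is inside: queue only */
        if (sq->last != NULL)
            sq->last->next = p;
        else
            sq->first = p;
        sq->last = p;
        return;
    }
    sq->busy = 1;
    sq->proc(sq, p);                /* process our own packet right away */
    while (sq->first != NULL) {     /* then drain what was queued meanwhile */
        pkt_t *q = sq->first;
        sq->first = q->next;
        if (sq->first == NULL)
            sq->last = NULL;
        sq->proc(sq, q);
    }
    sq->busy = 0;
}

static pkt_t sq_extra = { NULL, 2 };

/* Demo handler: packet 1 injects packet 2 before it finishes. */
static void sq_record(sq_t *sq, pkt_t *p)
{
    if (p->id == 1)
        sq_enter(sq, &sq_extra);    /* queued, not processed in place */
    sq->order[sq->norder++] = p->id;
}
```

In this model, entering with packet 1 finishes packet 1 before packet 2 is touched, even though packet 2 was submitted while packet 1 was still being handled; that ordering is the serialization property the perimeter is meant to provide.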
The worker thread is always allowed to drain the entire queue. Choosing the correct Drain model is quite complicated. Choices are among the following:
These options can be independently applied to the read thread and the write thread. Typically, the draining by an interrupt thread should always be time-bounded "process and drain," while the write thread can choose between "process your own" and time-bounded "process and drain." For the Solaris 10 release, the write thread behavior is a tunable with the default being "process your own," while the read side is fixed to "time-bounded process and drain."
The signaling of the worker thread is another option worth exploring. If the packet arrival rate is low and a thread is forced to queue its packet, then the worker thread should be allowed to run as soon as the entering thread has finished processing the squeue, if there is work to be done. On the other hand, if the packet arrival rate is high, it may be desirable to delay waking up the worker thread and hope for an interrupt to arrive shortly after to complete the drain. Waking up the worker thread immediately when the packet arrival rate is high creates unnecessary contention between the worker and interrupt threads. The default for the Solaris 10 OS is a delayed wakeup of the worker thread. Initial experiments on available servers showed that the best results are obtained by waking up the worker thread after a 10 ms delay.
Placing a request on the squeue requires a per-squeue lock to protect the state of the queue, but this doesn't introduce scalability problems because the lock is distributed between CPUs and is only held for a short period of time. We also use optimizations that avoid context switches while still preserving the single-threaded semantics of squeue processing. We create an instance of a squeue per CPU in the system and bind the worker thread to that CPU. Each connection is then bound to a specific squeue and thus to a specific CPU as well.
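A time-bounded drain of the kind mentioned above might look roughly like the following. This is a minimal sketch, not the Solaris implementation: `dq_t`, `job_t`, and `dq_drain_bounded` are hypothetical names, time is a simulated tick counter advanced by each job's assumed cost so the behavior is deterministic, and the "wake the worker" decision is reduced to a flag rather than a delayed wakeup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queued unit of work; 'cost' is its processing time in ticks. */
typedef struct job {
    struct job *next;
    long cost;
} job_t;

typedef struct {
    job_t *first;
    job_t *last;
    long now;           /* simulated clock, in ticks */
    int wake_worker;    /* set when a bounded drain leaves work behind */
} dq_t;

static void dq_put(dq_t *q, job_t *j)
{
    j->next = NULL;
    if (q->last != NULL)
        q->last->next = j;
    else
        q->first = j;
    q->last = j;
}

/*
 * Process queued jobs until the queue is empty or 'budget' ticks have
 * elapsed since the drain started. Returns the number of jobs processed.
 * Leftover work is handed to the worker by setting wake_worker.
 */
static int dq_drain_bounded(dq_t *q, long budget)
{
    long deadline = q->now + budget;
    int done = 0;

    while (q->first != NULL && q->now < deadline) {
        job_t *j = q->first;
        q->first = j->next;
        if (q->first == NULL)
            q->last = NULL;
        q->now += j->cost;      /* "processing" advances the clock */
        done++;
    }
    q->wake_worker = (q->first != NULL);
    return done;
}
```

An interrupt-side caller would pass a small budget so it returns quickly, while a worker thread, which is allowed to drain the entire queue, would pass an effectively unlimited one.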
The binding of a squeue to a CPU can be changed, but the binding of a connection to a squeue never changes because of the squeue protection semantics. In the merged TCP/IP case, the vertical perimeter protects the TCP state for each connection. The squeue instance used by each connection is chosen either at the "open," "bind," or "connect" time for outbound connections or at "eager connection creation time" for inbound ones. The choice of the squeue instance depends on the relative speeds of the CPUs and the NICs in the system. There are two cases:
For the Solaris 10 OS, the determination of NIC being faster or slower than CPU is done by the system administrator in the form of tuning the global variable
squeue_t *squeue_create(squeue_t *, uint32_t, processorid_t, void (*)(),
void *, clock_t, pri_t);
void squeue_bind(squeue_t *, processorid_t);
void squeue_unbind(squeue_t *);
void squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);
void squeue_fill(squeue_t *, mblk_t *, void (*)(), void *);
The IP connection fanout mechanism consists of three hash tables:
As part of the lookup, a connection structure (a superset of all connection information) is returned. This connection structure is called conn_t:
typedef struct conn_s {
kmutex_t conn_lock; /* Lock for conn_ref */
uint32_t conn_ref; /* Reference counter */
uint32_t conn_flags; /* Flags */
struct ill_s *conn_ill; /* The ill packets are coming on */
struct ire_s *conn_ire; /* ire cache for outbound packets */
tcp_t *conn_tcp; /* Pointer to tcp struct */
void *conn_ulp; /* Pointer for upper layer */
edesc_pf conn_send; /* Function to call on write side */
edesc_pf conn_recv; /* Function to call on read side */
squeue_t *conn_sqp; /* Squeue for processing */
/* Address and Ports */
struct {
in6_addr_t connua_laddr; /* Local address */
in6_addr_t connua_faddr; /* Remote address. */
} connua_v6addr;
#define conn_src V4_PART_OF_V6(connua_v6addr.connua_laddr)
#define conn_rem V4_PART_OF_V6(connua_v6addr.connua_faddr)
#define conn_srcv6 connua_v6addr.connua_laddr
#define conn_remv6 connua_v6addr.connua_faddr
union {
/* Used for classifier match performance */
uint32_t conn_ports2;
struct {
in_port_t tcpu_fport; /* Remote port */
in_port_t tcpu_lport; /* Local port */
} tcpu_ports;
} u_port;
#define conn_fport u_port.tcpu_ports.tcpu_fport
#define conn_lport u_port.tcpu_ports.tcpu_lport
#define conn_ports u_port.conn_ports2
uint8_t conn_protocol; /* protocol type */
kcondvar_t conn_cv;
} conn_t;
The interesting member to note is the pointer to the squeue, or vertical perimeter. The lookup is done outside the perimeter, and the packet is processed or queued on the squeue to which the connection is attached.
Also, the connection fanout mechanism has provisions for supporting wildcard listeners. The IP Classifier APIs look like this:
conn_t *ipcl_conn_create(uint32_t type, int sleep);
void ipcl_conn_destroy(conn_t *connp);
int ipcl_proto_insert(conn_t *connp, uint8_t protocol);
int ipcl_proto_insert_v6(conn_t *connp, uint8_t protocol);
conn_t *ipcl_proto_classify(uint8_t protocol);
int *ipcl_bind_insert(conn_t *connp, uint8_t protocol, ipaddr_t src,
uint16_t lport);
int *ipcl_bind_insert_v6(conn_t *connp, uint8_t protocol,
const in6_addr_t * src, uint16_t lport);
int *ipcl_conn_insert(conn_t *connp, uint8_t protocol, ipaddr_t src,
ipaddr_t dst, uint32_t ports);
int *ipcl_conn_insert_v6(conn_t *connp, uint8_t protocol,
in6_addr_t *src, in6_addr_t *dst, uint32_t ports);
void ipcl_hash_remove(conn_t *connp);
conn_t *ipcl_classify_v4(mblk_t *mp);
conn_t *ipcl_classify_v6(mblk_t *mp);
conn_t *ipcl_classify(mblk_t *mp);
The names of the functions are fairly self-explanatory.
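To make the lookup concrete, here is a simplified sketch of a connected-table classification for IPv4. It is an illustration only, under several assumptions: `demo_conn_t`, `demo_insert`, `demo_classify`, the hash function, and the table size are all hypothetical, and only the idea of comparing both ports with a single 32-bit compare is taken from the `conn_ports` union shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FANOUT_SIZE 64   /* hypothetical bucket count */

typedef union {
    uint32_t ports2;                       /* both ports as one word */
    struct { uint16_t fport, lport; } p;   /* remote port, local port */
} ports_u;

typedef struct demo_conn {
    struct demo_conn *next;   /* hash chain */
    uint32_t laddr, faddr;    /* local and remote IPv4 address */
    ports_u ports;
    int sq_id;                /* stand-in for the squeue pointer */
} demo_conn_t;

static demo_conn_t *fanout[FANOUT_SIZE];

static uint32_t pack_ports(uint16_t fport, uint16_t lport)
{
    ports_u u;
    u.p.fport = fport;
    u.p.lport = lport;
    return u.ports2;
}

/* Illustrative hash over remote address and port pair. */
static unsigned conn_hash(uint32_t faddr, uint32_t ports)
{
    uint32_t h = faddr ^ ports;
    h ^= h >> 16;
    h ^= h >> 8;
    return h % FANOUT_SIZE;
}

static void demo_insert(demo_conn_t *c)
{
    unsigned b = conn_hash(c->faddr, c->ports.ports2);
    c->next = fanout[b];
    fanout[b] = c;
}

/* Return the connection matching the 4-tuple, or NULL if none exists. */
static demo_conn_t *demo_classify(uint32_t laddr, uint32_t faddr,
    uint16_t fport, uint16_t lport)
{
    uint32_t ports = pack_ports(fport, lport);
    demo_conn_t *c;

    for (c = fanout[conn_hash(faddr, ports)]; c != NULL; c = c->next) {
        /* One 32-bit compare covers both the remote and local port. */
        if (c->faddr == faddr && c->laddr == laddr &&
            c->ports.ports2 == ports)
            return c;
    }
    return NULL;
}
```

Once a connection is found, the caller would hand the packet to the squeue recorded in it (here just `sq_id`). A fuller model would also need the bind and protocol tables and a fallback for wildcard listeners, which this sketch leaves out.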
Since the stack is fully multithreaded (barring the per-CPU serialization enforced by the vertical perimeter), it uses a reference-based scheme to ensure that the connection instance is available when needed. The reference count is implemented by the conn_ref member of conn_t and is protected by conn_lock. For an established TCP connection, three references are guaranteed to be on it: each protocol layer has a reference on the instance (one each for TCP and IP), and the classifier itself has a reference since it is an established connection. Each time a packet arrives for the connection and the classifier looks up the connection instance, an extra reference is placed, which is dropped when the protocol layer finishes processing that packet. Similarly, any timers running on the connection instance hold a reference to ensure that the instance is around whenever the timer fires. The memory associated with the connection instance is freed once the last reference is dropped.
3.0 TCP
The Solaris 10 OS provides the same view for TCP as previous releases -- that is, TCP appears as a clone device but it is actually a composite, with the TCP and IP code merged into a single D_MP STREAMS module. The merged TCP/IP module's STREAMS entry points for open and close are the same as the IP's entry points viz.
The operational part of TCP is fully protected by the vertical perimeter that entered through the
Figure 1 TCP entry points for use by vertical perimeter:
tcp_input - All inbound data packets and control messages
tcp_output - All outbound data packets and control messages
tcp_close_output - On user close
tcp_timewait_output - timewait expiry
tcp_rsrv_input - Flow control relief on read side
tcp_timer - All tcp timers
3.1 The Interface Between TCP and IP
FireEngine changes the interface between TCP and IP from the existing STREAMS-based message-passing interface to a function call-based interface, both in the control and data paths. On the outbound side TCP passes a fully prepared packet directly to IP by calling
Similarly, control messages are also passed directly as function arguments. The basic protocol processing code was unchanged. Let's have a look at common socket calls and see how they interact with the framework.
A socket open of TCP or open of
tcp_connect are similar to tcp_bind. The full bind() request is prepared as a TPI message and passed as a function argument to ip_bind_v{4, 6}. IP calls into the classifier and inserts the connection in the connected hash table. The conn_ hash table in TCP is no longer used.
This path is part of tcp_bind. The tcp_bind prepares a local bind TPI message and passes it as a function argument to ip_bind_v{4, 6}. IP calls the classifier and inserts the connection in the bind hash table. The listen hash table of TCP does not exist anymore.
The accept implementation in pre-Solaris 10 releases did the bulk of the connection setup processing in the listener context. The three-way handshake was completed in the listener's perimeter, and the connection indication was sent up the listener's STREAM. The messages necessary to perform the accept were sent down on the listener STREAM and the listener was single-threaded from the point of sending the T_CONN_RES message to TCP until sockfs received the acknowledgment. In releases before Solaris 10, if the connection arrival rate was high, the ability of the stack to accept new connections deteriorated significantly.
Furthermore, some additional TCP overhead was involved, which contributed to a slower accept rate. When
The FireEngine model establishes an "eager" connection (an incoming connection is called "eager" until accept completes) in its own perimeter as soon as a SYN packet arrives, thus making sure that packets always land on the correct connection. As a result, it is possible to completely eliminate the TCP global queues. The connection indication is still sent to the listener on the listener's STREAM but the accept happens on the newly created acceptor STREAM (thus, there is no need to allocate data structures for this STREAM) and the acknowledgment can be sent on the acceptor STREAM. As a result, The new model was carefully implemented because the new incoming connection (eager) exists only because there is a listener for it, and both eager and listener can disappear at any time during accept processing as a result of eager receiving a reset or listener closing.
The eager connection starts out by placing a reference on the listener so that the eager reference to the listener is always valid even though the listener might close. When a connection indication needs to be sent after the three-way handshake is completed, the eager places a reference on itself so that it can close on receiving a reset but any reference to it is still valid. The eager sends a pointer to itself as part of the connection indication message, which is sent by way of the listener's STREAM after checking that the listener has not closed. When the Close processing in TCP no longer has to wait until the reference count drops to zero since references to the closing queue and references to the TCP are now decoupled. Close can return as soon as all references to the closing queue are gone. The TCP data structures themselves may continue to stay around as a detached TCP in most cases. The release of the last reference to the TCP frees up the TCP data structure.
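The hold-and-release discipline that the stack relies on, where the structure is freed only when the last reference goes away, can be sketched as follows. This is an assumption-laden model rather than the Solaris code: `rconn_t` and its helper functions are hypothetical, the counter is manipulated without a lock because the example is single-threaded, and `live_conns` exists only so that freeing can be observed.

```c
#include <assert.h>
#include <stdlib.h>

static int live_conns;   /* instances allocated and not yet freed */

typedef struct rconn {
    unsigned ref;         /* reference count; a real stack guards it with a lock */
} rconn_t;

/* Create an instance holding one reference for its creator. */
static rconn_t *rconn_create(void)
{
    rconn_t *c = malloc(sizeof (*c));
    if (c == NULL)
        return NULL;
    c->ref = 1;
    live_conns++;
    return c;
}

static void rconn_hold(rconn_t *c)
{
    c->ref++;
}

/* Drop one reference; the last release frees the instance. */
static void rconn_rele(rconn_t *c)
{
    assert(c->ref > 0);
    if (--c->ref == 0) {
        live_conns--;
        free(c);
    }
}

/*
 * Simulate one inbound packet: the lookup takes a reference, processing
 * runs, and the reference is dropped afterwards. Returns the count seen
 * while the packet was being processed.
 */
static unsigned rconn_packet(rconn_t *c)
{
    unsigned during;

    rconn_hold(c);
    during = c->ref;     /* protocol processing would happen here */
    rconn_rele(c);
    return during;
}
```

The same pattern covers the accept path: an eager that holds a reference on its listener can keep using the listener pointer safely even if the listener starts closing, because the memory cannot go away until that reference is released.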
A user-initiated close only closes the stream. The underlying TCP structures may continue to stay around. The TCP then goes through the TCP does not even need to call IP to transmit the outbound packet in the most common case, if it can access the IRE. With a merged TCP/IP we have the advantage of being able to access the cached IRE for a connection, and TCP can putnext the data directly to the link-layer driver based on the information in the IRE. FireEngine does exactly this.
TCP Fusion is a protocol-less data path for loopback TCP connections in the Solaris 10 OS. The fusion of two local TCP endpoints occurs at connection establishment time. By default, all loopback TCP connections are fused. This behavior may be changed by setting the system-wide tunable
If fusion fails, we fall back to the regular TCP data path; if it succeeds, both endpoints proceed to use
The latter path is taken if synchronous STREAMS is enabled. It is automatically disabled if
Locking in TCP Fusion is handled by squeue and the mutex The rate limit for small writes flow control for TCP Fusion in synchronous stream mode is achieved by checking the size of receive buffer and the number of data blocks, both set to different limits. This is different from regular STREAMS flow control where cumulative size check dominates data block count check (the STREAMS queue high water mark typically represents bytes). Each enqueue triggers notifications sent to the receiving process; a build-up of data blocks indicates a slow receiver, and the sender should be blocked or informed at the earliest moment instead of further wasting system resources. In effect, this is equivalent to limiting the number of outstanding segments in flight.
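The two-limit check described above, where either the byte count or the number of enqueued data blocks can trigger flow control, can be sketched like this. The structure and function names (`fuse_rcvq_t`, `fuse_enqueue`) are hypothetical and the limits used in any caller are illustrative; the only point is that the block-count limit can trip long before the byte limit when writes are small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive-queue accounting for a fused endpoint. */
typedef struct {
    size_t bytes;         /* bytes currently enqueued */
    unsigned blocks;      /* data blocks currently enqueued */
    size_t max_bytes;     /* receive buffer limit */
    unsigned max_blocks;  /* limit on enqueued data blocks */
} fuse_rcvq_t;

/* Flow control applies when either limit is reached. */
static int fuse_flow_controlled(const fuse_rcvq_t *q)
{
    return q->bytes >= q->max_bytes || q->blocks >= q->max_blocks;
}

/*
 * Account for one enqueued data block of 'len' bytes.
 * Returns nonzero if the sender should now be blocked.
 */
static int fuse_enqueue(fuse_rcvq_t *q, size_t len)
{
    q->bytes += len;
    q->blocks++;
    return fuse_flow_controlled(q);
}
```

With a 64 KB byte limit and a limit of 8 blocks, eight one-byte writes are enough to block the sender, whereas a check based on bytes alone would let thousands of tiny blocks pile up in front of a slow receiver.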
The minimum number of allowable enqueued data blocks defaults to 8 and is changeable by way of the system-wide tunable
4.0 UDP
Apart from the framework improvements, the Solaris 10 OS has made additional changes in the way UDP packets move through the stack. The internal code name for the project was "Yosemite." Before the Solaris 10 release, the UDP processing cost was evenly divided between per-packet processing cost and per-byte processing cost. The per-packet processing cost was generally due to STREAMS, the stream head processing, and packet drops in the stack and driver. The per-byte processing cost was due to lack of H/W checksum and unoptimized code branches throughout the network stack.
4.1 UDP Packet Drop Within the Stack
Although UDP is supposed to be unreliable, local area networks have become pretty reliable, and applications tend to assume that there will be no packet loss in a LAN environment. This assumption was largely true, but in versions preceding Solaris 10, the stack was not very effective in dealing with UDP overload and tended to drop packets within the stack itself.
On the inbound side, packets were dropped at more than one layer throughout the receive path. For UDP, the most common and obvious place is at the IP layer, due to the lack of the resources needed to queue the packets. Another important yet less apparent place of packet drops is at the network adapter layer. This type of drop commonly occurs when the machine is dealing with a high rate of incoming packets.
The UDP
This is a fully multithreaded UDP module running under the same protection domain as IP. The module allows for a tighter integration of the transport (UDP) with the layers above and below it. This allows
UDP needs exclusive operation on a per-endpoint basis, when executing functions that modify the endpoint state. The Solaris 10 model uses an internal, STREAMS-independent perimeter to achieve the previous synchronization and is described as follows:
Entering UDP from the top or from the bottom must be done using
To support this, the new UDP model employs two modes of operation, namely UDP MT HOT mode and UDP SQUEUE mode. In the UDP MT HOT mode, multiple threads may enter a UDP endpoint concurrently. This is used for sending or receiving normal data and is similar to the putshared STREAMS entry points. Control operations and other special cases call
While in stable modes, UDP keeps track of the number of threads operating on the endpoint. The
Though UDP and IP are running in the same protection domain, they are still separate STREAMS modules. Therefore, STREAMS plumbing is kept unchanged and a UDP module instance is always pushed above IP. Although this causes an extra open and close for every UDP endpoint, it provides backwards compatibility for some applications that rely on such plumbing geometry to do certain things, such as issuing I_POP on the stream to obtain direct access to IP. The actual UDP processing is done within the IP instance. The UDP module instance does not possess any state about the endpoint and merely acts as a dummy module, whose presence keeps the STREAMS plumbing appearance unchanged. The Solaris 10 platform allows for the following plumbing modes:
These modes imply that we do not support any intermediate module between IP and UDP; in fact, Solaris technology has never supported such a scenario in the past, as the interlayer communication semantics between IP and transport modules are private.
4.3 UDP and Socket Interaction
A significant event that takes place during the
For transport modules, being directly beneath Synchronous STREAMS is an extension to the traditional STREAMS interface for message passing and processing. It was originally added as part of the combined copy and checksum effort. It offers a way for the entry point of the module or driver to be called in synchronous manner with respect to a user I/O request. In traditional STREAMS, the stream head is the synchronous barrier for such a request. Synchronous STREAMS provides a mechanism to move this barrier from the stream head down to a module below.
The TCP implementation of synchronous STREAMS was complicated in releases prior to Solaris 10, due to several factors. A major factor was the combined checksum and
In contrast, in the Solaris 10 OS, TCP isn't dependent on checksum during
Each time data arrives, the transport module schedules for the application to retrieve it. If the application is currently blocked (sleeping) during a read operation, it will be unblocked to allow it to resume execution. This is achieved by calling
An application may also be blocked in As part of the read operation, the transport module delivers data to the application by returning it from its read-side synchronous STREAMS entry point. In the case of loopback TCP, the synchronous STREAM read entry point returns the entire content (byte stream) of its receive queue to the stream head; any remaining data will be re-enqueued at the stream head awaiting the next read. For UDP, the read entry point returns only one message (datagram) at a time.
By default, direct transmission and read-side synchronous STREAMS optimizations are enabled for all UDP and loopback TCP sockets when
(Note that I_INSERT or I_REMOVE
If a fallback is required,
5.0 IP
As mentioned before, all the transport layers have been merged into an IP module that is fully multithreaded and acts as a pseudo device driver as well as a STREAMS module. The key change in IP was the removal of IP client functionality and the multiplexing of the inbound packet stream. The new IP Classifier (which is still part of the IP module) is responsible for classifying the inbound packets to the correct connection instance. The IP module is still responsible for network layer protocol processing and for plumbing and managing the network interfaces. Let's have a quick look at how the plumbing of network interfaces, multipathing, and multicast works in the new stack.
Plumbing is a long sequence of operations involving message exchanges between IP, ARP, and device drivers. Most set IOCTLs are typically involved in plumbing operations. A natural model is to serialize these IOCTLs one per ILL (IP lower level). For example, plumbing of
One other possibility is to use an even more fine-grained approach and serialize operations per IPIF, rather than per ILL. This will be beneficial only if many IPIFs are hosted on an ILL, and if the operations on different IPIFs don't have any mutual interference. Another possibility is to completely multithread all IOCTLs using standard Solaris MT techniques. But this is needlessly complex and does not have much added value. It is hard to hold locks across the entire plumbing sequence, which involves waits and message exchanges with drivers or other modules. Not much is gained in performance or functionality by simultaneously allowing multiple set IOCTLs on an IPIF, since these are purely non-repetitive control operations. Broadcast IREs are created on a per-ILL basis rather than on a per-IPIF basis. Hence, trying to bring up more than one IPIF simultaneously on an ILL involves extra complexity in the broadcast IRE creation logic. On the other hand, serializing plumbing operations per ILL lends itself easily to the existing IP code base.
During the course of plumbing, IP exchanges messages with the device driver and ARP. The messages received from the underlying device driver are also handled exclusively in IP. This is convenient since we can't hold standard mutex locks across the putnext in trying to provide mutual exclusion between the write-side and read-side activities. Instead of the all-exclusive PERMOD syncq, this effect can be easily achieved by using a per-ILL serialization queue.
5.2 IP Network Multipathing (IPMP)
IPMP operations are all driven around the notion of an IPMP group. Failover and failback operations function between two ILLs, usually as part of the same IPMP group. The IPIFs and ILMs are moved between the ILLs. This involves bringing down the source ILL and could involve bringing up the destination ILL. Bringing down or bringing up ILLs affects broadcast IREs. Broadcast IREs need to be grouped per IPMP group to suppress duplicate broadcast packets that are received. Thus broadcast IRE manipulation affects all members of the IPMP group. Setting
Multicast joins operate on both the ILG and ILM structures. Multiple threads operating on an IPC (socket) trying to do multicast joins need to synchronize when operating on the ILG. Multiple threads potentially operating on different IPCs (socket endpoints) trying to do multicast joins could eventually end up trying to manipulate the ILM simultaneously and need to synchronize on the access to the ILM. Both are amenable to standard Solaris MT techniques. Considering all the above -- that is, plumbing, IPMP, and multicast -- the common denominator is to serialize all the exclusive operations on a per-IPMP group basis. If IPMP is not enabled, then a
6.0 Solaris 10 Device Driver Framework
Let's have a quick look at how network device drivers were implemented before the Solaris 10 OS, and why they need to change with the new Solaris 10 stack.
6.1 GLDv2 and Monolithic DLPI Drivers (in Solaris 9 and Earlier Releases)
In releases before Solaris 10, a network stack relies on DLPI providers, which are normally implemented in one of two ways. The following illustration (Figure 2) shows a stack based on a so-called monolithic Data Link Provider Interface (DLPI) driver and a stack based on a driver utilizing the Generic LAN Driver (GLDv2) module.
Figure 2
The GLDv2 module essentially behaves as a library. The client still talks to the driver instance that is bound to the device, but the DLPI protocol processing is handled by calling into the GLDv2 module, which will then call back into the driver to access the hardware. Use of the GLD module has a clear advantage in that the driver writer need not reimplement large amounts of mostly generic DLPI protocol processing. Layer two (Data-Link) features, such as 802.1q Virtual LANs (VLANs), can also be implemented centrally in the GLD module, allowing them to be leveraged by all drivers. The architecture still poses a problem, though, when it comes to implementing a feature such as 802.3ad link aggregation (a.k.a. trunking), where the one-to-one correspondence between network interface and device is broken.
Both GLDv2 and monolithic drivers depend on DLPI messages and communicate with upper layers by way of the STREAMS framework. This mechanism was not very effective for link aggregation or 10Gb NICs. With the new stack, a better mechanism was needed that could ensure data locality and allow the stack to control the device drivers at a much finer granularity to deal with interrupts.
6.2 GLDv3 -- A New Architecture
Solaris 10 software introduces a new device driver framework called GLDv3 (internal name "project Nemo"), along with the new stack. Most of the major device drivers were ported to this framework, and all future and 10Gb device drivers will be based on this framework. This framework also provides a STREAMS-based DLPI layer for backward compatibility (to allow external, non-IP modules to continue to work).
The GLDv3 architecture virtualizes layer two of the network stack. A one-to-one correspondence between network interfaces and devices no longer exists. The following illustration (see Figure 3) shows multiple devices registered with a MAC Services Module (MAC).
It also shows two clients: A traditional client that communicates by way of a DLPI to a Data-Link Driver (DLD) and one that is kernel-based and simply makes direct function calls into the Data-Link Services Module (DLS).
Figure 3
6.2.1 GLDv3 Drivers
GLDv3 drivers are similar to GLD drivers. The driver must be linked
with a dependency on misc/mac and misc/dld. It must call
typedef struct mac {
const char *m_ident;
mac_ext_t *m_extp;
struct mac_impl *m_impl;
void *m_driver;
dev_info_t *m_dip;
uint_t m_port;
mac_info_t m_info;
mac_stat_t m_stat;
mac_start_t m_start;
mac_stop_t m_stop;
mac_promisc_t m_promisc;
mac_multicst_t m_multicst;
mac_unicst_t m_unicst;
mac_resources_t m_resources;
mac_ioctl_t m_ioctl;
mac_tx_t m_tx;
} mac_t;
This structure must persist for the lifetime of the registration; that is, it cannot be de-allocated until after
The important members of this
typedef uint64_t (*mac_stat_t)(void *, mac_stat_t);
This entry point is called to retrieve a value for one of the statistics defined in the
typedef int (*mac_start_t)(void *);
This entry point is called to bring the device out of the reset/quiesced state that it was in when the interface was registered. No packets will be submitted by the MAC module for transmission and no packets should be submitted by the driver for reception before this call is made. If this function succeeds, then zero should be returned. If it fails, then an appropriate
typedef void (*mac_stop_t)(void *);
This entry point should stop the device and put it in a reset/quiesced state, such that the interface can be unregistered. No packets will be submitted by the MAC for transmission once this call has been made, and no packets should be submitted by the driver for reception once it has completed.
typedef int (*mac_promisc_t)(void *, boolean_t);
This entry point is used to set the promiscuity of the device. If the
second argument is
typedef int (*mac_multicst_t)(void *, boolean_t, const uint8_t *);
This entry point is used to add and remove addresses to and from the
set of multicast addresses for which the device will receive packets.
If the second argument is
typedef int (*mac_unicst_t)(void *, const uint8_t *);
This entry point is used to set a new device unicast address. Once this call is made, only packets with the new address and the media broadcast address should be received unless the device is in promiscuous mode.
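The receive rule implied by the promiscuity and unicast entry points can be modelled in user space. The following is only a sketch under stated assumptions: `toy_filter_t`, `toy_unicst`, `toy_promisc`, and `toy_accept` are hypothetical names that are not part of GLDv3, `bool` stands in for `boolean_t`, addresses are assumed to be 6 bytes long, and multicast filtering is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_ADDR_LEN 6  /* assumption: Ethernet-sized addresses */

/* Hypothetical receive-filter state; not a GLDv3 structure. */
typedef struct toy_filter {
    uint8_t unicst[TOY_ADDR_LEN];  /* current unicast address */
    uint8_t brdcst[TOY_ADDR_LEN];  /* media broadcast address */
    bool    promisc;               /* true while promiscuous mode is on */
} toy_filter_t;

/* Shaped like mac_unicst_t: install a new unicast address, return 0. */
static int toy_unicst(void *arg, const uint8_t *addr)
{
    toy_filter_t *f = arg;

    memcpy(f->unicst, addr, TOY_ADDR_LEN);
    return 0;
}

/* Shaped like mac_promisc_t: assumed to enable the mode when on is true. */
static int toy_promisc(void *arg, bool on)
{
    toy_filter_t *f = arg;

    f->promisc = on;
    return 0;
}

/* Decide whether a frame addressed to dst should be passed up. */
static bool toy_accept(const toy_filter_t *f, const uint8_t *dst)
{
    if (f->promisc)
        return true;
    return memcmp(dst, f->unicst, TOY_ADDR_LEN) == 0 ||
        memcmp(dst, f->brdcst, TOY_ADDR_LEN) == 0;
}
```

After `toy_unicst()` installs a new address, frames sent to the old address are rejected by `toy_accept()` unless promiscuous mode has been switched on, which mirrors the behavior the text asks of a real device.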
typedef void (*mac_resources_t)(void *, boolean_t);
This entry point is called to request that the driver register its individual receive resources or Rx rings.
typedef mblk_t *(*mac_tx_t)(void *, mblk_t *);
This entry point is used to submit packets for transmission by the device. The second argument points to one or more packets contained in
typedef struct mac_info {
uint_t mi_media;
uint_t mi_sdu_min;
uint_t mi_sdu_max;
uint32_t mi_cksum;
uint32_t mi_poll;
boolean_t mi_stat[MAC_NSTAT];
uint_t mi_addr_length;
uint8_t mi_unicst_addr[MAXADDRLEN];
uint8_t mi_brdcst_addr[MAXADDRLEN];
} mac_info_t;
typedef enum {
MAC_STAT_IFSPEED = 0,
MAC_STAT_MULTIRCV,
MAC_STAT_BRDCSTRCV,
MAC_STAT_MULTIXMT,
MAC_STAT_BRDCSTXMT,
MAC_STAT_NORCVBUF,
MAC_STAT_IERRORS,
MAC_STAT_UNKNOWNS,
MAC_STAT_NOXMTBUF,
MAC_STAT_OERRORS,
MAC_STAT_COLLISIONS,
MAC_STAT_RBYTES,
MAC_STAT_IPACKETS,
MAC_STAT_OBYTES,
MAC_STAT_OPACKETS,
MAC_STAT_ALIGN_ERRORS,
MAC_STAT_FCS_ERRORS,
MAC_STAT_FIRST_COLLISIONS,
MAC_STAT_MULTI_COLLISIONS,
MAC_STAT_SQE_ERRORS,
MAC_STAT_DEFER_XMTS,
MAC_STAT_TX_LATE_COLLISIONS,
MAC_STAT_EX_COLLISIONS,
MAC_STAT_MACXMT_ERRORS,
MAC_STAT_CARRIER_ERRORS,
MAC_STAT_TOOLONG_ERRORS,
MAC_STAT_MACRCV_ERRORS,
MAC_STAT_XCVR_ADDR,
MAC_STAT_XCVR_ID,
MAC_STAT_XCVR_INUSE,
MAC_STAT_CAP_1000FDX,
MAC_STAT_CAP_1000HDX,
MAC_STAT_CAP_100FDX,
MAC_STAT_CAP_100HDX,
MAC_STAT_CAP_10FDX,
MAC_STAT_CAP_10HDX,
MAC_STAT_CAP_ASMPAUSE,
MAC_STAT_CAP_PAUSE,
MAC_STAT_CAP_AUTONEG,
MAC_STAT_ADV_CAP_1000FDX,
MAC_STAT_ADV_CAP_1000HDX,
MAC_STAT_ADV_CAP_100FDX,
MAC_STAT_ADV_CAP_100HDX,
MAC_STAT_ADV_CAP_10FDX,
MAC_STAT_ADV_CAP_10HDX,
MAC_STAT_ADV_CAP_ASMPAUSE,
MAC_STAT_ADV_CAP_PAUSE,
MAC_STAT_ADV_CAP_AUTONEG,
MAC_STAT_LP_CAP_1000FDX,
MAC_STAT_LP_CAP_1000HDX,
MAC_STAT_LP_CAP_100FDX,
MAC_STAT_LP_CAP_100HDX,
MAC_STAT_LP_CAP_10FDX,
MAC_STAT_LP_CAP_10HDX,
MAC_STAT_LP_CAP_ASMPAUSE,
MAC_STAT_LP_CAP_PAUSE,
MAC_STAT_LP_CAP_AUTONEG,
MAC_STAT_LINK_ASMPAUSE,
MAC_STAT_LINK_PAUSE,
MAC_STAT_LINK_AUTONEG,
MAC_STAT_LINK_DUPLEX,
MAC_STAT_LINK_STATE,
MAC_NSTAT /* must be the last entry */
} mac_stat_t;
The macros
6.2.2 MAC Services (MAC) Module
Some key driver support functions include:
extern mac_resource_handle_t mac_resource_add(mac_t *, mac_resource_t *);
Various members are defined as follows:
typedef void (*mac_blank_t)(void *, time_t, uint_t);
typedef mblk_t *(*mac_poll_t)(void *, uint_t);
typedef enum {
MAC_RX_FIFO = 1
} mac_resource_type_t;
typedef struct mac_rx_fifo_s {
mac_resource_type_t mrf_type; /* MAC_RX_FIFO */
mac_blank_t mrf_blank;
mac_poll_t mrf_poll;
void *mrf_arg;
time_t mrf_normal_blank_time;
uint_t mrf_normal_pkt_cnt;
} mac_rx_fifo_t;
typedef union mac_resource_u {
mac_resource_type_t mr_type;
mac_rx_fifo_t mr_fifo;
} mac_resource_t;
This function should be called from the
This
The other fields
The interrupt rate is controlled by the upper layer by calling
The
extern void mac_resource_update(mac_t *);
Invoked by the driver when the available resources have changed.
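The blanking and polling callbacks carried in mac_rx_fifo_t can be pictured with a small user-space model. This is a sketch only: `toy_pkt_t` stands in for `mblk_t`, `toy_ring_t`, `toy_blank`, and `toy_poll` are hypothetical names, `unsigned` replaces `uint_t`, and the meaning given to the arguments (a blanking interval plus a packet count, and a maximum number of packets to return) is an assumption rather than something taken from the GLDv3 headers.

```c
#include <stddef.h>
#include <time.h>

/* Stand-in for mblk_t: packets are chained through a next pointer. */
typedef struct toy_pkt {
    struct toy_pkt *next;
    int id;
} toy_pkt_t;

/* Hypothetical receive ring state owned by the driver. */
typedef struct toy_ring {
    toy_pkt_t *head;        /* packets waiting in the ring */
    time_t     blank_time;  /* current interrupt blanking interval */
    unsigned   blank_pkts;  /* packets to accumulate per interrupt */
} toy_ring_t;

/* Shaped like mac_blank_t: the upper layer adjusts the interrupt rate. */
static void toy_blank(void *arg, time_t ticks, unsigned pkt_cnt)
{
    toy_ring_t *r = arg;

    r->blank_time = ticks;
    r->blank_pkts = pkt_cnt;
}

/* Shaped like mac_poll_t: hand back a chain of at most max packets. */
static toy_pkt_t *toy_poll(void *arg, unsigned max)
{
    toy_ring_t *r = arg;
    toy_pkt_t *chain = r->head;
    toy_pkt_t *last = NULL;
    unsigned n = 0;

    while (r->head != NULL && n < max) {
        last = r->head;
        r->head = r->head->next;
        n++;
    }
    if (last == NULL)
        return NULL;        /* ring empty or max was zero */
    last->next = NULL;      /* terminate the chain handed to the caller */
    return chain;
}
```

In this model the driver would fill a mac_rx_fifo_t with `toy_blank`, `toy_poll`, and a pointer to the ring as the callback argument, so that the upper layer can either slow down interrupts or pull packet chains itself.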
extern void mac_rx(mac_t *, mac_resource_handle_t, mblk_t *);
This function should be called to deliver a chain of packets, contained in
6.2.3 Data-Link Services (DLS) Module
The DLS module provides a Data-Link Services interface analogous to DLPI. The DLS interface is a kernel-level functional interface as opposed to the STREAMS message-based interface specified by DLPI. This module provides the interfaces necessary for the upper layer to create and destroy a data link service; it also provides the interfaces needed to plumb and unplumb the NIC. The plumbing and unplumbing of a NIC for GLDv3-based device drivers is unchanged from the older GLDv2 or monolithic DLPI device drivers. The major changes are in the data paths, which allow direct calls, packet chains, and much finer-grained control over the NIC.
6.2.4 Data-Link Driver (DLD)
The Data-Link Driver supplies a DLPI using the interfaces provided by the DLS and MAC modules. The driver is configured using IOCTLs passed to a control node. These IOCTLs create and destroy separate DLPI provider nodes. This module deals with the DLPI messages necessary to plumb/unplumb the NIC and provides backward compatibility for the data path through STREAMS for non-GLDv3-aware clients.
6.3 GLDv3 Link Aggregation Architecture
The GLDv3 framework provides support for link aggregation as defined by IEEE 802.3ad. The key design principles for designing this facility were:
GLDv3 link aggregation is implemented by means of a pseudo driver called
Figure 4
The GLDv3
The Solaris 10 platform improved the H/W checksum offload capability further to improve overall performance for most applications. A 16-bit one's complement checksum offload framework has existed in Solaris software for some time. It was originally added as a requirement for Zero Copy TCP/IP in the Solaris 2.6 OS but was not extended until recently to handle other protocols. Solaris technology defines two classes of checksum offload as follows:
Adding support for non-fragmented IPv4 cases (unicast or multicast) is trivial for both transmit and receive, as most modern network adapters support either class of checksum offload with minor differences in the interface. The IPv6 cases are not as straightforward, because very few full-checksum network adapters are capable of handling checksum calculation for TCP/UDP packets over IPv4/IPv6.
The fragmented IP cases have similar constraints. On transmit, checksumming applies to the unfragmented datagram. In order for an adapter to support checksum offload, it must be able to buffer all of the IP fragments (or perform the fragmentation in hardware) before finally calculating the checksum and sending the fragments over the wire; until then, checksum offloading for outbound IP fragments cannot be done. On the other hand, the receive fragment reassembly case is more flexible, since most full-checksum (and all partial-checksum) network adapters are able to compute and provide the checksum value to the network stack. During the fragment reassembly stage, the network stack can derive the checksum status of the unfragmented datagram by combining the values together.
Things were simplified by not offloading the checksum when IP options are present. For partial-checksum offload, certain adapters limit the start offset to a width sufficient for simple IP packets. When the length of the protocol headers exceeds such a limit (due to the presence of options), the start offset will wrap around, causing an incorrect calculation. For full-checksum offload, none of the capable adapters is able to correctly handle the IPv4 source-routing option.
When transmit checksum offload takes place, the network stack will associate eligible packets with ancillary information needed by the driver to offload the checksum computation to hardware. In the inbound case, the driver has full control over the packets that get associated with hardware-calculated checksum values.
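The receive-side remark about combining per-fragment values relies on a property of the 16-bit one's complement sum: the sum over a whole buffer equals the one's complement sum of the sums of its pieces, provided every piece except the last has an even length (true for IP fragments, whose offsets are multiples of 8 bytes). The following is a minimal sketch of that arithmetic only; `toy_fold`, `toy_sum`, and `toy_combine` are hypothetical helper names, not Solaris functions, and pseudo-header handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t toy_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's complement sum of a buffer read as big-endian 16-bit words. */
static uint16_t toy_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum = toy_fold(sum + (((uint32_t)buf[i] << 8) | buf[i + 1]));
    if (i < len)                          /* odd trailing byte, zero padded */
        sum = toy_fold(sum + ((uint32_t)buf[i] << 8));
    return (uint16_t)sum;
}

/* Combine the sums of two adjacent pieces into the sum of the whole. */
static uint16_t toy_combine(uint16_t a, uint16_t b)
{
    return toy_fold((uint32_t)a + b);
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 (the worked example in RFC 1071), `toy_sum()` yields 0xddf2 whether it runs over the whole buffer or over two 4-byte pieces whose results are passed to `toy_combine()`. The checksum field itself carries the bitwise complement of such a sum.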
Once a driver advertises its capability through DL_CAPAB_HCKSUM, the network stack will accept full and/or partial-checksum information for IPv4 and IPv6 packets. This process happens for both non-fragmented and fragmented payloads. Fragmented packets will first need to go through the reassembly process because checksum validation happens for fully reassembled datagrams. During reassembly, the network stack combines the hardware-calculated checksum value of each fragment.
6.4.1
Over time,
The key of an aggregation must be an integer value between 1 and 65535. Some devices do not support configurable data-links or aggregations. The fixed data-links provided by such devices can be viewed using
The GLDv3 framework allows users to select the outbound load-balancing policy across the various members of an aggregation while configuring the aggregation. The policy specifies which dev object is used to send packets. A policy consists of a list of one or more layer specifiers separated by commas. A layer specifier is one of the following:
For example, to use upper-layer protocol information, the following policy can be used:
-P L4
To use the source and destination MAC addresses, as well as the source and destination IP addresses, the following policy can be used:
-P L2,L3
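One way to picture how such a policy string could drive port selection is sketched below. This is not the aggregation driver's code: all `toy_` names are hypothetical, the assumption that L4 means the transport port numbers is ours, and FNV-1a is used purely as a stand-in hash because the text does not describe the real one. The sketch parses a policy such as L2,L3 into a bit mask, hashes only the header fields the mask selects, and reduces the hash to a port index.

```c
#include <stddef.h>
#include <stdint.h>

enum { TOY_L2 = 1, TOY_L3 = 2, TOY_L4 = 4 };

/* Hypothetical flow description; addresses and ports of one packet. */
typedef struct toy_flow {
    uint8_t  src_mac[6], dst_mac[6];
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
} toy_flow_t;

/* Parse "L2", "L3", "L4" or a comma-separated list; 0 means bad input. */
static unsigned toy_parse_policy(const char *s)
{
    unsigned mask = 0;

    while (*s != '\0') {
        if (s[0] != 'L' || s[1] < '2' || s[1] > '4')
            return 0;
        mask |= 1u << (s[1] - '2');
        s += 2;
        if (*s == ',') {
            s++;
            if (*s == '\0')
                return 0;           /* trailing comma */
        } else if (*s != '\0') {
            return 0;               /* junk after a specifier */
        }
    }
    return mask;
}

/* FNV-1a step over n bytes; a stand-in for the real hash. */
static uint32_t toy_mix(uint32_t h, const void *p, size_t n)
{
    const uint8_t *b = p;
    size_t i;

    for (i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u;
    }
    return h;
}

/* Pick a port index in [0, nports); nports must be greater than zero. */
static unsigned toy_select_port(unsigned mask, const toy_flow_t *f,
    unsigned nports)
{
    uint32_t h = 2166136261u;

    if (mask & TOY_L2) {
        h = toy_mix(h, f->src_mac, sizeof f->src_mac);
        h = toy_mix(h, f->dst_mac, sizeof f->dst_mac);
    }
    if (mask & TOY_L3) {
        h = toy_mix(h, &f->src_ip, sizeof f->src_ip);
        h = toy_mix(h, &f->dst_ip, sizeof f->dst_ip);
    }
    if (mask & TOY_L4) {
        h = toy_mix(h, &f->src_port, sizeof f->src_port);
        h = toy_mix(h, &f->dst_port, sizeof f->dst_port);
    }
    return h % nports;
}
```

Because only the selected fields enter the hash, two connections between the same pair of hosts always leave through the same port under L2,L3, while an L4 policy can spread them across ports.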
The framework also supports Link Aggregation Control Protocol (LACP) for GLDv3-based aggregations which can be controlled by
When a new device is inserted into a system, a default non-VLAN data-link will be created for the device during reconfiguration boot, or DR. The configuration of all objects will persist across reboot.
In the future, we plan to use
7.0 Tuning for Performance
The tuning of the Solaris 10 stack is designed to give stellar, out-of-the-box performance, irrespective of the hardware used. The secret lies in using techniques such as dynamically switching between interrupt and polling mode, which gives very good latencies when load is manageable by allowing the NIC to interrupt per packet, and switching to polling mode for better throughput and well-bounded latencies when load is very high. The defaults are also carefully picked based on hardware configuration. For instance, the
In spite of these features, sometimes it is necessary to tweak some tunables to deal with extreme cases or specific workloads. In the following section, we discuss some tunables that control the stack behavior. Care should be taken to understand the impact; otherwise, the system might become unstable. It is important to note that for the bulk of the applications and workloads, the defaults will give the best results.
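The interrupt-versus-polling switch mentioned at the start of this section can be pictured with a toy state machine. This is a sketch under stated assumptions, not the Solaris implementation: the `toy_` names, the single high-water mark, and the rule of returning to interrupt mode only once the backlog is empty are all illustrative choices.

```c
typedef enum { TOY_INTR, TOY_POLL } toy_mode_t;

/* Hypothetical per-queue state. */
typedef struct toy_sq {
    toy_mode_t mode;        /* how packets currently reach the stack */
    unsigned   backlog;     /* packets queued and not yet processed */
    unsigned   high_water;  /* backlog above which polling starts */
} toy_sq_t;

/* n packets arrive; switch to polling if the backlog grows too large. */
static void toy_arrive(toy_sq_t *q, unsigned n)
{
    q->backlog += n;
    if (q->mode == TOY_INTR && q->backlog > q->high_water)
        q->mode = TOY_POLL;
}

/* Process up to budget packets; go back to interrupts once caught up. */
static unsigned toy_drain(toy_sq_t *q, unsigned budget)
{
    unsigned n = q->backlog < budget ? q->backlog : budget;

    q->backlog -= n;
    if (q->backlog == 0)
        q->mode = TOY_INTR;
    return n;
}
```

Under light load the backlog never crosses the mark and every packet is delivered by interrupt; under heavy load the queue stays in polling mode and is drained in batches until it empties.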
set ip:ip_squeue_fanout=1
set ip:ip_squeue_bind=0
The default value is 2 and can be changed by way of
set ip:tcp_squeue_wput=1
This value should be set to 1 when the number of CPUs is far more than the number of active NICs, and the platform has inherently higher memory latencies where the chances of an application thread doing squeue drain and getting pinned are high.
ip:ip_squeue_wait=0
In addition, some protocol-level tuning (such as changing the
8.0 Future
It is expected that the Solaris networking stack will continue to build on better vertical integration between layers, which will improve locality and performance further. With the advent of chip multithreading and multicore CPUs, it is expected that the number of parallel execution pipelines will continue to increase even on low-end systems. A typical two-CPU machine today is dual-core, providing four execution pipelines, and will likely have hyperthreading as well. The NICs are also becoming more advanced, offering multiple interrupts by way of MSI-X, small classification capabilities, multiple DMA channels, and various stateless offloads such as large segment offload. Future work is expected to continue to leverage these hardware trends, including support for TCP offload engines, Remote Direct Memory Access (RDMA), and iSCSI. Other specific areas that are being worked on are as follows:
9.0 Acknowledgments
Many thanks to Thirumalai Srinivasan, Adi Masputra, Nicolas Droux, and Eric Cheng for contributing parts of this text. Also, thanks are due to all the members of the Solaris networking community for their help.