BigAdmin System Administration Portal
Feature Article

Meet the Architects: Solaris 10

Last Updated June 2005

Through the Sun Software Express for Solaris program, you can catch a glimpse into the future of the Solaris Operating System. Now you can also get a glimpse into some of the exciting new features and functionality of the Solaris 10 platform and meet the architects behind them.

Through BigAdmin XPerts, you will also be able to ask these architects questions about their areas of expertise. Ask a question today on BigAdmin XPerts.

Meet the Architects:



Mike Shapiro -- Predictive Self-Healing

Mike Shapiro, Senior Staff Engineer, Solaris Kernel Development

As part of Sun's effort to keep introducing revolutionary technology that reduces complexity and costs for administrators and increases availability, we've developed a set of new features that deliver Predictive Self-Healing to our systems. The first set of Predictive Self-Healing features appears in the June Solaris Express release. These features include:

  • The new Solaris Fault Manager
  • Standardized messaging that is linked to a new customer web site: Predictive Self-Healing Knowledge Article Web
  • Automatic self-healing for CPUs and memory on UltraSPARC-III and UltraSPARC-IV processor-based systems

We've also improved the ability of the Solaris OS to handle, isolate, and recover from PCI I/O faults. Later on, we plan to add self-healing support for many other kinds of system resources, including hardware running on AMD Opteron processors and even system software components, as we build a complete self-healing system.

The idea behind Predictive Self-Healing is to fundamentally change the interaction between the system and the administrators and users who have to deal with problems that occur, be they hardware faults, software defects, or configuration errors. Historically, most system and application software has been designed to react to problems by spewing error messages to syslogd(1M), leaving users and administrators to sift through dozens or even hundreds of bizarre and non-standard messages in the /var/adm/messages file in order to diagnose problems and determine what to do. This process is time-consuming, error-prone, and incredibly frustrating. And if you don't get the right answer fast, your system or application may soon be affected.

A self-healing system is one where those error messages, or problem symptoms, are replaced by a continuous flow of telemetry information that is processed by a Fault Manager. The Fault Manager dispatches the telemetry to an ecosystem of software components we call diagnosis engines, which know how to automatically diagnose the underlying problem from a stream of telemetry. Once a problem is diagnosed, the Fault Manager notifies agents, which automatically offline the affected component (for example, an affected CPU or region of memory) and describe to the administrator the problem, the impact, the automated reaction we implemented for them, and the corrective action we think they should apply. These messages appear in a new, standardized format that is linked to a continuously updated web site of knowledge articles on sun.com.
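The flow described above (telemetry in, diagnosis engines in the middle, agents reacting at the end) can be pictured with a small toy model. This is only a sketch under stated assumptions, not the Solaris Fault Manager: the class names, the report fields, and the "three reports means a fault" rule are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    """One telemetry event. The field names are hypothetical."""
    resource: str   # for example "cpu3"
    kind: str       # for example "correctable-error"


class ThresholdEngine:
    """Toy diagnosis engine: diagnoses a fault once `limit` reports
    of one kind have been seen for the same resource (assumed rule)."""

    def __init__(self, kind, limit):
        self.kind = kind
        self.limit = limit
        self.counts = defaultdict(int)

    def consume(self, report):
        if report.kind != self.kind:
            return None
        self.counts[report.resource] += 1
        if self.counts[report.resource] == self.limit:
            return report.resource   # diagnosis: this resource is faulty
        return None


class FaultManager:
    """Dispatches telemetry to engines and notifies agents of each diagnosis."""

    def __init__(self, engines, agents):
        self.engines = engines
        self.agents = agents
        self.faulted = set()

    def post(self, report):
        for engine in self.engines:
            resource = engine.consume(report)
            if resource is not None and resource not in self.faulted:
                self.faulted.add(resource)
                for agent in self.agents:
                    agent(resource)


offlined = []   # stands in for actually taking a component offline
messages = []   # stands in for the single message shown to the administrator


def offline_agent(resource):
    offlined.append(resource)


def notify_agent(resource):
    messages.append(f"{resource} diagnosed faulty and taken offline; schedule a replacement")


fm = FaultManager([ThresholdEngine("correctable-error", limit=3)],
                  [offline_agent, notify_agent])
for _ in range(3):
    fm.post(ErrorReport("cpu3", "correctable-error"))
fm.post(ErrorReport("cpu1", "correctable-error"))   # one report only: no diagnosis
```

The point of the sketch is the shape of the pipeline: the administrator sees one message per diagnosed problem rather than one message per symptom.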

The most rewarding part of our work on building self-healing systems has been to watch internal Sun users react with amazement when a system actually has a problem and responds before they even know about it! And instead of getting a stream of undecipherable error messages, a single message tells them exactly what the problem was and what, if anything, they need to repair or replace. It's exciting to change and raise expectations about what a computer system should do for you, and it's going to get even better as we convert more and more of the software and hardware subsystems to implement self-healing, all with the same simple administrative model. Stay tuned to the Solaris Express program, sun.com, and SunNetwork conferences as we roll out more information throughout the year.




Alain Durand -- Solaris Networking Architect

Ready to Deploy IPv6

Several years ago, I created a T-shirt at IETF with a picture of the earth rising above the moon, with the legend "IPng: save the Internet for the future generations." Today, the new generation has arrived (I now have a two-and-a-half-year-old girl and a five-month-old boy), and IPng (Internet Protocol next generation) has become IPv6. It is now a reality whose adoption has started all over the world.

Back in 2000, Sun shipped the Solaris 8 Operating System. It was the first commercially supported operating system with IPv6. The kernel knew how to deal with IPv6 packets, and we provided the fundamental API. It already came with an impressive list of applications supporting IPv6, such as telnet, NFS, Sendmail, NIS and NIS+, and so on. The Solaris 8 OS was mainly targeted at application developers to enable them to integrate IPv6 awareness in their next application releases.

Since then, we have made a lot of progress. IPv6 phase 2, as we called it internally, has been a major project that took several years and is the product of a very dedicated team of the brightest engineers I've worked with. Together, we bring you the results of this project with the Solaris 10 OS. It offers not only new code (like a DNS that is fully IPv6-aware), updated APIs, security and privacy extensions, transition mechanisms, and more and more applications, but also the expertise of four years of internal deployment. We use IPv6 every day, not on a test network but in our production environment. Our major systems used to develop and test the Solaris 10 OS in real life are running IPv6. We worked with our IT department to deploy it not only on our main campus, but also in most of our network engineering sites around the world. Using the appropriate transition mechanisms among the many developed at IETF, it now takes us less than 10 minutes to connect a new site or campus to our IPv6 internal network!

What comes with the Solaris 10 OS is more than the operating system bits. You will find Java technology in the JDK 1.5 release, which is IPv6 aware. This means that your applications written in the Java programming language can work transparently with either IPv4 or IPv6. The Solaris 10 OS also comes with the Sun Java Enterprise System, and you will find major components like the Web Server, Application Server, and Directory Server supporting IPv6.

The Solaris 10 OS is the IPv6 deployment-ready release!

Note: IPv6 addresses are 128 bits long. For comparison, IPv4 addresses are 32 bits long. With IPv4, you can grow the Internet to about a couple of billion nodes, which is not enough to give one to every human being on this planet. With IPv6, the address space is virtually infinite. The real number, 2^128, is so big that I have given up on printing it in decimal form.
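For readers who want to check the arithmetic in the note, here is a minimal sketch. It only counts raw bit patterns; it makes no assumption about how many addresses are usable in practice, which is lower for both protocols.

```python
IPV4_BITS = 32
IPV6_BITS = 128

# Number of distinct bit patterns for each address length.
ipv4_addresses = 2 ** IPV4_BITS
ipv6_addresses = 2 ** IPV6_BITS

# How many decimal digits it takes to write each count.
ipv4_digits = len(str(ipv4_addresses))
ipv6_digits = len(str(ipv6_addresses))

print(f"IPv4: {ipv4_addresses:,} raw addresses ({ipv4_digits} digits)")
print(f"IPv6: a {ipv6_digits}-digit number of raw addresses")
print(f"IPv6 is IPv4 to the power {IPV6_BITS // IPV4_BITS}")
```

In keeping with the note, the sketch reports only how many digits 2^128 has rather than printing the number itself.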




Adam H. Leventhal -- Solaris Kernel Engineer

DTrace for Application Tracing: "Now DTrace is even more flexible for applications and can provide unique insight into their behaviors."

DTrace is the new dynamic tracing facility that brings to the Solaris Operating System a level of system-wide observability that has previously been unavailable in industrial-strength operating systems. Even a moderately sophisticated Solaris user should find that DTrace can vastly simplify some tasks, and most users will find that DTrace allows them to answer questions that had previously been unanswerable.

My colleagues and I in Solaris Kernel Development have found DTrace to be indispensable for answering our own questions about kernel behavior -- questions that either would have taken a huge amount of time and effort to answer or were unanswerable as a practical matter. Since our day-to-day work mostly centers around kernel behavior and performance, our existing observability tools so far have actually been better for the kernel than for user-land, even though the latter is a much more constrained environment. DTrace is different! As much as it's a great facility for understanding the kernel, DTrace is even more flexible for applications and can provide unique insight into their behaviors.

The DTrace team (Bryan Cantrill, Mike Shapiro, and I) designed DTrace to support complete system observability from kernel function calls through the system call table to user-land functions and instructions, and into Java applications. DTrace lets users trace at the granularity of the entry and return of any application function call or the execution of any application instruction -- whatever is needed to solve the problem at hand.

My largest contribution to DTrace was the pid provider that supplies this user-land tracing. Traditional debuggers and tracing frameworks have either had to stop all threads in an application or accept the possibility of missing some events of interest -- in DTrace, neither was acceptable or possible, given the requirements of the framework. We came up with a completely new technique for application tracing, and while the implementation was tricky (involving excruciating battles with the minutiae of both the SPARC and x86 instruction sets), the results were rewarding to me both as a designer and as a DTrace user.

Meanwhile in the kernel, we've been adding stable probes that offer an even richer view to application developers and system integrators. Ever wondered why CPU utilization is less than 100 percent, even though your application has work to do, or why a Java thread is yielding the CPU? Tying DTrace's stable probes from the sched and proc providers to its application tracing can provide quick and concise answers to those questions.

Our future plans for DTrace are to bring to user-land the notion of stable probes that expose stable abstractions. In an upcoming Solaris Express release, you will be able to track user-land lock contention with plockstat(1M), built on those stable abstractions. We hope that someday all applications and libraries will, like the kernel, expose their stable abstractions so that, for example, stable kernel probes and stable probes from a database application could be used together to understand the system and tune its behavior based on real data.

To get started, look at our BigAdmin DTrace site. Don't be daunted by the size of the Solaris Dynamic Tracing Guide (written with loving care by the DTrace team); once you start looking at the examples and playing with DTrace on your own systems, you'll discover that time invested in learning DTrace quickly pays for itself.




Andrew Tucker -- Solaris Zones Architect

Solaris Zones (a component of the N1 Grid Container functionality) is a new feature for maximizing the use of your Solaris systems, and getting "better bang for the buck." Zones allow unrelated applications to be run on the same system in a way that isolates each application from the rest, avoiding the security and configuration problems that can occur when running applications together. Each zone is an application environment that includes a set of processes, a part of the file system hierarchy, and one or more network interfaces. To an application or user in a zone, it looks like they have a full Solaris system to themselves -- when in fact they may be sharing it with a number of other zones on the same system. Zones also allow delegated administration: Each zone can have a different root password, and the root user in one zone isn't able to affect anything outside his or her zone.

The original idea for zones started a number of years ago when we were talking with customers about server consolidation. At the time, we had added a number of resource management features to the Solaris OS, allowing an administrator to control how CPUs were allocated to different applications. Customers were interested in improving the utilization of their servers, but were unable to "stack" or consolidate multiple applications on the same box. Some of the reasons for this were related to resource allocation, but many were due to the need to isolate applications in terms of configuration, security, and administration.

We developed zones as a way to address these problems. Now, multiple applications running on the same system (but in different zones) can be completely isolated -- even if someone gains super-user access in one zone due to a security hole, they won't have access to the rest of the zones in the system. And we can do this in a way that is lightweight and flexible. There's still only one operating system instance to patch, back up, monitor, and so on. And you can use zones on anything from a single-CPU 1U server to a 72-CPU Sun Fire 15K server.




Mike Shapiro, Bryan Cantrill, Adam Leventhal -- DTrace Architects

DTrace: "The Best Thing Since Sliced Bread, Twist-off Bottle Caps, Cell Phones and Caffeine."

DTrace Team

(left to right) Mike Shapiro, Bryan Cantrill, Adam Leventhal

If you have ever wanted to understand the behavior of your system, DTrace is the tool for you. DTrace is a new comprehensive dynamic tracing facility we've designed and implemented for the next release of Solaris that gives users, administrators, and developers a level of observability that we consider unprecedented in the history of operating systems.

DTrace helps you understand your system by permitting you to dynamically instrument the operating system kernel and user processes to record data that you specify at locations of interest to you, called probes. Probes are like little programmable sensors scattered all over your Solaris system in interesting places. We've provided more than 30,000 probes for you to use right out of the box! Each probe can be associated with custom programs written in our D programming language, which allows you to access system data using ANSI-C types and expressions and easily capture stack traces, record timestamps, build histograms, and more.
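The idea of named probes that record data only when a program is attached to them can be pictured with a toy model. This is not D code and says nothing about how DTrace is implemented: the registry class, the probe names, and the `execname` field are hypothetical, and unlike the real facility this toy still pays for a dictionary lookup when a probe is not enabled.

```python
from collections import Counter


class ProbeRegistry:
    """Toy model: named probes that do nothing until an action is enabled."""

    def __init__(self, names):
        self.actions = {name: [] for name in names}

    def enable(self, name, action):
        self.actions[name].append(action)

    def disable(self, name):
        self.actions[name].clear()

    def fire(self, name, **context):
        # A probe with no enabled actions records nothing.
        for action in self.actions[name]:
            action(context)


# Hypothetical probe names, made up for this example.
probes = ProbeRegistry(["read-entry", "write-entry"])
calls_by_program = Counter()


def count_by_program(context):
    """Action attached to a probe: count firings per program name."""
    calls_by_program[context["execname"]] += 1


probes.enable("read-entry", count_by_program)

probes.fire("read-entry", execname="cat")
probes.fire("read-entry", execname="cat")
probes.fire("write-entry", execname="cat")    # not enabled: nothing recorded
probes.fire("read-entry", execname="sshd")

print(dict(calls_by_program))
```

Only the enabled probe contributes to the counts, which mirrors the description of paying only for the probes and actions you turn on.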

All of DTrace's instrumentation is dynamic and available for use on your production system. So when DTrace is off, you pay no performance cost, and the performance impact of any tracing is limited to only those probes and actions that you enable. Best of all, DTrace is safe: you can't damage the running system because DTrace has security, safety, and error checking at the core of its design. These features will let you use DTrace with confidence on your running system whenever you need it to help investigate a problem.

DTrace was an incredibly exciting product for us to develop, and hopefully our attention to detail and commitment to set a new bar for system tracing frameworks will be evident to you. In our years at Sun, we have spent an enormous amount of time trying to understand the behavior of complicated, optimized, production systems. In building DTrace, we built the tool that we have always wanted for ourselves -- and have fulfilled a dream that we have had for nearly a decade.

It took us nearly two years to design and build DTrace, and in that time we've watched it become literally indispensable to developers here at Sun, allowing us to develop, debug, and tune more effectively than ever before. It was a singular pleasure to hear the feedback of the first external customer to use DTrace, who claimed it was "the best thing since sliced bread, twist-off bottle caps, cell phones and caffeine." We're hoping that you'll have the same experience: We believe that once you use DTrace, you won't want to accept anything less. We've provided a new in-depth answerbook, the Solaris Dynamic Tracing Guide, to help you learn DTrace. It includes a complete feature reference and is packed with examples to help you get started. Once you do get started, join us on the DTrace forum on BigAdmin and let us know what you think.




Casper Dik -- Process Rights Management

How Many Privileges Are Enough for Eternity?

Casper Dik

People often ask me how I can appear to be online all the time and still have a family life. The picture should shed some light on that. Honestly, it's not posed, and judging by the age of my son Roemer (pronounced roughly as rumor) in the picture, it must have been taken somewhere in the last six months of the project. He would sleep and I would work, just as in this photograph.

When we started down the path of implementing fine-grained privileges for the Solaris OS, the shortcut of adopting the Trusted Solaris privilege implementation looked very attractive. But we quickly stumbled when trying to answer the question: "How many privileges are enough for eternity?" We just can't ask our customers to recompile when we find out that we were wrong: application binary guarantee and all that. The interfaces we came up with hide all these details from the programmer and allow us to add privileges, grow privilege sets, and even increase the number of privilege sets if we are so inclined. We've even figured out a way to make currently unprivileged operations privileged in the future without requiring alterations to programs that want to restrict which privileges they run with.
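One way to picture an interface that survives the addition of new privileges is a set that programs manipulate only by name, never by size or bit position. The sketch below is a toy model under that assumption, not the Solaris implementation; the class, the helper functions, and the privilege names are hypothetical.

```python
class PrivilegeSet:
    """Opaque set of privileges addressed by name, never by bit position."""

    def __init__(self, names=()):
        self._names = set(names)

    def add(self, name):
        self._names.add(name)

    def remove(self, name):
        self._names.discard(name)

    def has(self, name):
        return name in self._names

    def is_subset_of(self, other):
        return self._names <= other._names


# Privileges known to the running system (hypothetical names).
KNOWN_PRIVILEGES = {"file_read", "net_bind_low"}


def full_set():
    """Everything the current system release knows about."""
    return PrivilegeSet(KNOWN_PRIVILEGES)


def restricted_program_set():
    """A program that restricts itself by dropping one privilege by name."""
    privileges = full_set()
    privileges.remove("net_bind_low")
    return privileges


before = restricted_program_set()

# A later release turns a formerly unprivileged operation into a privilege.
KNOWN_PRIVILEGES.add("clock_set")

# The same, unmodified program logic now yields a set that includes it.
after = restricted_program_set()
```

Because the program says what it gives up rather than enumerating what it keeps, the newly introduced privilege stays available to it without a rebuild in this toy model.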

Privileges are always there and they are always enabled; there's no knob to turn. And still, it all works in a compatible way so you won't notice it unless you look closely. We're even giving you a way to see:

  • Which privileges are available at the other end of a connection: getpeerucred(3C)
  • Which privileges are available to the caller of a door: door_ucred(3DOOR)
  • Which privileges are available to the sender of a datagram: socket.h(3HEAD)

Well, I'm off to find the remaining pesky uid != 0 checks people seem to be fond of hiding in their applications.




Sunay Tripathi -- Solaris High Performance Networking

Got Performance? -- "We have given you a Ferrari and an empty road..."

For the next release of the Solaris OS, we have turbo-charged the networking stack to deliver extremely high performance while improving the scalability across all platforms (SPARC and x86). Of course, the changes didn't happen overnight -- it took us (me and my partner Bruce Curtis) two years to do the background research for vertically partitioning the workload using an IP classifier-based lock-less design. This new architecture reduces the overheads of synchronization and cross communication between CPUs (a necessary ingredient for scaling across a very large number of CPUs). In simple terms, it means that we have improved the networking performance across small CPU configurations while maintaining the high scalability across large CPU configurations that the Solaris OS is well known for.

For the uninitiated, it should be mentioned here that networking performance is not just about speeding up individual pieces of code but about improving the flows so that there are no bottlenecks. The new architecture has a single queue per CPU, and connections are bound to specific CPUs to provide better data locality and vertical separation. Packets for a particular connection are processed only on the CPU the connection is bound to, and once a packet gets picked up for processing, it cuts across all protocol layers without any further queuing. Our changes are transparent and allow network-intensive applications to benefit without doing anything extra.
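The binding of connections to CPUs can be sketched in a few lines. This is a toy model, not the Solaris stack: the article does not say how a connection is mapped to a CPU, so the CRC-based hash of the address and port tuple, the CPU count, and the function names are all assumptions made for illustration.

```python
import zlib
from collections import deque

NUM_CPUS = 4   # hypothetical machine size

# One queue per CPU, as in the description above.
queues = [deque() for _ in range(NUM_CPUS)]


def cpu_for(conn):
    """Map a connection (src addr, src port, dst addr, dst port) to a CPU.
    Hashing is an assumed classifier, chosen only because it is stable."""
    key = "|".join(str(part) for part in conn).encode()
    return zlib.crc32(key) % NUM_CPUS


def enqueue(conn, payload):
    """Place a packet on the queue of the CPU its connection is bound to."""
    cpu = cpu_for(conn)
    queues[cpu].append((conn, payload))
    return cpu


def deliver(conn, payloads):
    """Enqueue several packets and report which CPUs were used."""
    used = set()
    for payload in payloads:
        used.add(enqueue(conn, payload))
    return used


conn_a = ("10.0.0.1", 40000, "10.0.0.2", 80)
conn_b = ("10.0.0.3", 40001, "10.0.0.2", 80)

cpus_a = deliver(conn_a, [f"a{i}" for i in range(5)])
cpus_b = deliver(conn_b, [f"b{i}" for i in range(5)])
```

Every packet of a given connection lands on the same queue and stays in arrival order, so in this model no lock is needed to keep one connection's packets consistent across CPUs.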

The implementation was not easy either -- you would think that after two years of research, all the details would be figured out, and all the wrinkles ironed out. We wish! It still took a large team of very dedicated engineers almost a whole year to implement and test this whole new stack.

The project also seamlessly integrates 10-Gbit Ethernet, various kinds of protocol offload, dynamic switching between interrupts and polling, and other emerging network technologies.

Anyhow, we are thrilled with the results, and now it's yours to play with. We have given you a Ferrari and an empty road -- let's see how fast you can go!




Darren J Moffat -- Solaris Cryptographic Framework

Cryptography and Performance -- You Can Have Both!

The relaxing of U.S. export regulations on cryptography in the last few years allowed us to embark on a new era for security for Solaris systems. The Solaris OS now has a framework for cryptographic algorithms that supports hardware accelerators and optimized software implementations (on UltraSPARC and x86 processor-based platforms).

This has been two years in development. The high-level architecture is the work of Paul Sangster, Solaris Security Architect. Kais Belgaied and I led the design and implementation. The high-level goals we had were "make it pluggable," "make it perform," and "use open standards."

Process-wise this was a very interesting project, due to the requirements placed upon us by U.S. export restrictions and the import restrictions of certain countries that have some large Solaris customers. It resulted in us building a system that allows us to cryptographically sign an ELF object (libraries and kernel modules are both ELF objects). This capability is what allows Solaris to control what functions are accessible to certain hardware accelerators or to certain routines such as IPsec/IKE (IP Security Protocol/Internet Key Exchange).

After several months of research we settled on PKCS#11 as the API for userland applications. This was an existing, well-adopted, and mature standard that met most of our needs for a cryptographic library interface. It was already in use by Sun middleware products such as the Web Server and Directory Server. We also chose it as the SPI (service provider interface) for userland. These interfaces are all now public (see libpkcs11(3crypto) for an introduction).

We also have a corresponding set of APIs and SPIs in the kernel. They are similar in nature to PKCS#11 but don't require sessions; they are perhaps closer to the EVP_ interfaces of OpenSSL.

We had a team of eight core developers, two QA staff, and four technical writers, as well as a supporting cast of many other helpers.

What About the Performance?

As part of the project we had to convert the private copies of cryptographic algorithms in at least one existing subsystem. We chose to convert Kerberos, since it has both a userland and kernel component (for NFS authentication and transport security) with a lot of shared code between the two. We achieved performance gains of up to 45 percent in some usage cases with ftp and NFS. In parallel we had a team converting IPsec and IKE to use the kernel and userland crypto frameworks; they have similar performance improvements, and in one case we saw as much as 80 percent improvement in IPsec throughput.

We can change the implementation of the cryptographic algorithms to improve performance, and applications benefit from this transparently.
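The "pluggable" goal can be pictured as a registry that applications call by mechanism name while providers are swapped underneath them. The sketch below is a toy model, not PKCS#11 and not the Solaris SPI: the `DigestFramework` class, the priority numbers, and the "accelerated" provider (which simply reuses Python's hashlib as a stand-in for hardware) are all hypothetical.

```python
import hashlib


class DigestFramework:
    """Toy framework: applications ask for a mechanism by name, and the
    highest-priority registered provider does the work."""

    def __init__(self):
        self._providers = {}   # mechanism name -> list of (priority, function)

    def register(self, mechanism, priority, func):
        self._providers.setdefault(mechanism, []).append((priority, func))

    def digest(self, mechanism, data):
        _, func = max(self._providers[mechanism], key=lambda entry: entry[0])
        return func(data)


def software_sha256(data):
    """Plain software provider."""
    return hashlib.sha256(data).digest()


accelerated_calls = []


def accelerated_sha256(data):
    """Stand-in for a faster provider; records that it was used."""
    accelerated_calls.append(len(data))
    return hashlib.sha256(data).digest()


framework = DigestFramework()
framework.register("sha256", 10, software_sha256)


def application():
    """Application code: knows the mechanism name, not the provider."""
    return framework.digest("sha256", b"hello")


first = application()

# A better provider is plugged in later; the application is unchanged.
framework.register("sha256", 50, accelerated_sha256)
second = application()
```

The application function is identical before and after the second provider is registered, yet the second call is served by the new provider and returns the same digest.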

There is still much more to come in this area -- this is just the start.





Disclaimer: Please note that these performance results are anecdotal and may vary. Sun does not promise or guarantee that the same results will be achieved by others.

 

