BigAdmin System Administration Portal
DTrace Blog: The Texas Ranger

The Texas Ranger: 1 Riot, 1 Ranger

Jarod Jenson, September 12, 2006

For quite some time, I have been asked to join the masses and start a blog. With the hassles and time crunch that come with being a consultant who travels almost 100% of the time, it has been difficult - to say the least - to get one started. I have recently caught up a bit and decided (based on what I have seen recently) that now is as good a time as any to start one. So here goes....

First off, a little bit about me. I am Chief Systems Architect for a small consulting company called Aeysis. (Since we pick our own titles, I wanted a fancy one.) What I really do is focus on business critical systems - those that generate revenue for large companies. The vast majority of that work is in the form of performance analysis of either home-grown applications or the big third-party applications like Oracle, Sybase, WebLogic, etc. It is not uncommon for me to visit 3 or more customers a week in different cities across the US and occasionally internationally. What this means is that in the more than three years I have been in this role, I have visited hundreds of distinct companies. In those travels, I have met really interesting people, looked at some very cool systems, and had a chance to monitor certain trends and beliefs (my favorite is the view everyone has that their company is the most wacko when it comes to policies and procedures - all large companies have some weird policies).

One of the big trends as of late is the buzz and infatuation with Solaris 10. I can say from personal experience that Solaris 10 has completely changed the way I do business. When I first started focusing on performance problems, engagements were measured in weeks. One to two weeks was not uncommon.

The lengthy times were a direct reflection of the tools we had at our disposal to do performance analysis. Much of the time was spent trying to understand the application, creating custom monitoring code, or using tools that had such a high probe effect on the application that you could barely see the forest for the trees. DTrace in Solaris 10 changed all of that. With the advanced observability of DTrace, most of my engagements are now measured by the day. A single day is not uncommon (half days are very popular) and anything more than a couple of days is now extremely rare. My frequent flier miles have gone up exponentially in the last year as a result of this. I have to visit even more customers now (which I love) to fill the same amount of time.

In the early days of Solaris 10, we would build a Solaris 10 system off to the side to run the customer's application and do the analysis. Suggestions were given and changes made to the app, which was then run on its original system (usually running Solaris 8 or 9) to ensure the gains transferred. This worked extremely well, but obviously required extra work in the set-up phase (but thanks to binary compatibility every app just worked on the Solaris 10 box). Many of the customers I now visit have Solaris 10 in production, so we get to do the analysis there - all without worry of causing systemic failure.

With many customers that had successes, we even took this to the next level and would bring their Linux application over to Solaris (some required porting, but Java apps just worked) and do the analysis. In each and every case, we were able to find significant performance gains.

Well, just as it struck me, it appears to have struck many of the customers that taking that optimized app back to Linux was just plain odd. Why, if Solaris runs on the same hardware as RHEL3 (for example), do we use Solaris to improve our app and then return it to its original OS? Why not just run the application on Solaris where we can use DTrace in production (and get all of the other great Solaris 10 features as well)?

Linux has brought a tremendous amount of good to the world - that goes without saying. However, it still has a bit of baggage for these business critical systems. One comment I hear fairly often is that customers are concerned that they're now finding themselves in the "OS business". Many have very customized RHEL3 installs (i.e. site-specific code patches) to work around some bug or issue that plagued their particular install base. This means that it is not straightforward for them to apply patches or upgrade to a more current release without having to find new ways around the original problem. This clearly is a support and maintenance nightmare. Each new release has to go through significant qualification to be deemed "production ready", and this takes countless man-hours.

With Solaris 10 supporting SPARC, x86 and x64 platforms, many sites have been aggressively looking at and deploying Solaris 10 for new projects. In addition, they are re-evaluating the OS of choice for existing projects. Now, I am admittedly biased in my view. I have the sole goal of making applications just absolutely rip from a performance perspective and with Solaris 10, I get DTrace. This means less effort and more success than doing a similar engagement without the benefits of DTrace. Besides, out of the box Solaris 10 can improve the performance of a whole slew of applications (thanks to projects like FireEngine, ZFS, libumem(3LIB), etc.).

Another great thing about Solaris 10 is the fact that it is open source. Now when DTrace takes me in a kernel direction (or libc or other libraries), it is easy to find the code in question for a quick review. Sometimes it is as simple as seeing the implementation and realizing that it was the usage of a particular interface that was the problem. Then it is an easy fix. And to be honest, this is one of the big reasons we all want source. Not so that we can get in the OS support business, but so we have an avenue to quickly (and correctly) answer questions with a source (pun intended) far more authoritative than regular documentation. Building and maintaining a custom distribution is not high on the "wants" list at almost all of these companies.

With the date of maintenance-only support for RHEL3 fast approaching, now is a good time for companies to take a look at Solaris 10. I have seen the benefits of increased observability time and time again at company after company. And not just for performance. Any aberrant system behavior can be identified with DTrace. I can't really imagine why anyone wouldn't want that power at their fingertips. And until the Linux community (and Red Hat) embrace DTrace, only Solaris 10 has that observability. Some may say that SystemTap will fill that gap, but in a business where you have to carry seriously high errors and omissions insurance, I just can't use anything that doesn't have the safety constraints that are the core of DTrace. I have no fear of using DTrace on even the most critical of systems and don't carry my insurance hotline number with me.

So, as an ending to each blog - I think I'll finish with a DTrace one-liner. For this inaugural blog, I'll use my go-to guy. Many people have seen this, but I think they underestimate its power. This is a one-liner that uses the profile provider to sample application stack traces at a high rate. You can use it to determine where applications are spending the majority of their CPU time if you find you have a CPU-hungry application. Ignore the absolute values of the counts, and focus only on the relative frequency at which each of the stacks is seen.

# dtrace -n profile-301'/arg1/{@[pid, tid, ustack()] = count()}tick-10s{exit(0)}END{trunc(@, 25)}'

(I said "one-liner", not "one-clause" ;)
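
To make the "relative frequency" point concrete, here is a minimal Python sketch (not part of the original one-liner) that turns per-stack sample counts into shares of the total. The function name `relative_shares` and the stack labels and counts are hypothetical, chosen only for illustration; they are not real DTrace output. Since the one-liner truncates the aggregation to the top 25 entries, any shares computed this way are relative to the stacks that were kept, not to every sample taken.

```python
def relative_shares(stack_counts):
    """Convert raw sample counts per stack into fractions of all samples."""
    total = sum(stack_counts.values())
    if total == 0:
        return {}
    return {stack: n / total for stack, n in stack_counts.items()}


# Hypothetical counts keyed by a label for each stack (illustrative only).
samples = {"stack_a": 600, "stack_b": 300, "stack_c": 100}

shares = relative_shares(samples)

# Print the stacks from hottest to coolest as a percentage of samples.
for stack, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stack}: {share:.0%}")
```

The absolute counts depend on the sampling rate and run length, so only the ordering and proportions carry over from one run to the next.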
