Using DTrace to Solve a Problem With Defunct Processes
Derek Crudgington, May 2006
Introduction
Dynamic Tracing (DTrace) is a feature introduced in the Solaris 10 OS for observing the behavior of the system and the programs running on it. It lets you see what your system is doing and locate the causes of poor behavior. You drive it with the "D programming language," which doesn't take much programming experience to understand. In this example, I don't even have to write any DTrace scripts, because a collection of them already exists: the DTraceToolkit (by Brendan Gregg), which contains over 80 scripts.
The Problem
I was trying to get the mod_owa.so module to load with Apache without flipping the machine out. This module lets Apache connect to an Oracle database and retrieve information. Whenever I loaded the module with Apache, the machine showed high activity (for just Apache running) and <defunct> processes. (Note: Defunct processes, also called zombies, are processes that have exited but whose parent has not yet collected their exit status with wait(), so they still occupy an entry in the process table.)
{callacct:/u01/app/oracle} /usr/apache/bin/apachectl start
/usr/apache/bin/apachectl start: httpd started
{callacct:/u01/app/oracle} vmstat 1
kthr      memory                page                disk          faults       cpu
r b w     swap   free  re   mf pi  po  fr de sr  m1  m1  m1 m2   in   sy  cs us sy id
0 0 0 17075280 447352  89  105 14   6   6  0  2   1   1   1  0  432 2566 671  1  1 98
0 0 0 16958384 157736 326 1242 23  31  31  0  0  40  43  43  0 2319  670 409  1  7 92
0 0 0 16945808 154256  93 1962 24   0   0  0  0 110 113 111  0 4168 1224 736  2 14 83
0 0 0 16945720 149888   1  139  8   0   0  0  0 247 256 261  0 5067   64 716  0 10 90
0 0 0 16948864 151448   3  601 55 879 879  0  0 186 186 174  0 4885   77 637  0  9 91
^C
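The idle figure is the last column (id) of each vmstat row. As a rough aid for spotting dips like the ones in this capture, a small filter can report the lowest idle value seen. This is only a sketch: min_idle is a hypothetical helper name, not part of Solaris or the DTraceToolkit, and it assumes id is the last field on every data row.

```sh
# min_idle: read vmstat output on stdin and print the lowest value seen
# in the last column (id, percent idle). Hypothetical helper for illustration.
min_idle() {
    awk '
        # Header lines end in "cpu" or "id", so only rows whose last field
        # is a plain number are treated as data.
        $NF ~ /^[0-9]+$/ {
            if (min == "" || $NF + 0 < min) min = $NF + 0
        }
        END { if (min != "") print min }
    '
}
```

It could be used as "vmstat 1 5 | min_idle". Fed the five rows shown here, it prints 83.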
Before Apache was started, the machine was 100% idle (the id column). Something is wrong if starting Apache alone makes idle dip into the low 90s and 80s. Here's the ps -ef output:
You can see the defunct processes above, and the load average has jumped from zero to 0.22:
{callacct:/u01/app/oracle} w
9:13am up 4 day(s), 21:49, 3 users, load average: 0.22, 0.12, 0.05
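On Solaris, ps -ef marks zombie entries with <defunct> in the command column, so they can be counted with a simple filter. A minimal sketch follows; count_defunct is a hypothetical helper name chosen for this illustration.

```sh
# count_defunct: read "ps -ef" output on stdin and print how many lines
# are marked <defunct>. Hypothetical helper for illustration.
count_defunct() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at success.
    grep -c '<defunct>' || true
}
```

It could be run as "ps -ef | count_defunct" before and after a change to see whether the number of zombies moves.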
So it's clear that when Apache started with this module, the machine was flipping out, and something was going wrong. The system was also acting laggy and taking forever to respond to simple commands.
Looking for a Solution
I started looking into getting the source for the mod_owa.so module and compiling it (the one I had before was a binary), thinking that might be the problem. But I didn't have much luck with this and decided that it might not even fix the problem anyway. So I decided to look at the system and see what was actually happening.
Here I used execsnoop from the DTraceToolkit to see if anything obvious was showing up. (Note: execsnoop is a DTrace script that reports new process executions as they occur on the system.)
{callacct:/root/DTraceToolkit-0.83} ./execsnoop -a
TIME STRTIME ZONE PROJ UID PID PPID ARGS
[wait 10-15 seconds]
^C
Nothing showed up at all, so Apache was not spawning new programs. Next I tried opensnoop. (Note: opensnoop is a DTrace script that reports file opens as they occur on the system.)
We can see the httpd process opening the same files over and over again. What we're interested in is where the FD (file descriptor) column shows -1, which means the open failed, for the /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb file. I checked to see if that file existed:
{callacct:/u01/app/oracle} ls /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb
/u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb: No such file or directory
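When opensnoop output is long, the failed opens can be pulled out and tallied instead of read by eye. The sketch below assumes the default opensnoop column layout (UID PID COMM FD PATH); other options change the columns, so the field numbers would need adjusting. failed_opens is a hypothetical helper name.

```sh
# failed_opens: read default-format opensnoop output on stdin and print a
# count of failed opens (FD of -1) per command and path, most frequent first.
# Assumes the columns are: UID PID COMM FD PATH. Hypothetical helper.
failed_opens() {
    awk '$4 == -1 { print $3, $5 }' | sort | uniq -c | sort -rn
}
```

Because opensnoop runs until interrupted, one way to use this is to save a capture to a file first (for example "./opensnoop > /tmp/opens.txt", then Ctrl-C) and then run "failed_opens < /tmp/opens.txt". The file name here is only an example.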
The problem was that the program was looking for /u01/app/oracle/ocommon when it should have been looking for /u01/app/oracle/OraHome1/ocommon.
Solving the Problem
The solution here should be simple: create a symlink so that the path the program expects points at the directory where the files really are:
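A sketch of such a link, assuming the real directory is /u01/app/oracle/OraHome1/ocommon as described above. link_into_base is a hypothetical helper name; it leaves alone anything that already exists at the expected path and refuses to create a link to a directory that is not there.

```sh
# link_into_base BASE HOME NAME
# Create BASE/NAME as a symlink to BASE/HOME/NAME when BASE/NAME is absent.
# Hypothetical helper for illustration.
link_into_base() {
    base=$1 home=$2 name=$3
    if [ -e "$base/$name" ] || [ -L "$base/$name" ]; then
        echo "$base/$name already exists, leaving it alone" >&2
        return 0
    fi
    if [ ! -d "$base/$home/$name" ]; then
        echo "$base/$home/$name not found" >&2
        return 1
    fi
    ln -s "$base/$home/$name" "$base/$name"
}
```

For the layout in this article the call would be "link_into_base /u01/app/oracle OraHome1 ocommon", which amounts to "ln -s /u01/app/oracle/OraHome1/ocommon /u01/app/oracle/ocommon" with two safety checks in front of it.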
Now that the link is made, we run opensnoop again to see if there's still a problem:
There's no problem with our lx1boot.nlb file now, but there's an issue with the files in the /u01/app/oracle/ldap and /u01/app/oracle/oracore directories.
Like ocommon before it, oracore and ldap don't exist directly under /u01/app/oracle, so we probably need to symlink those, too:
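Before creating more links, it can help to print the commands for review rather than run them. The dry-run sketch below assumes ldap and oracore live under /u01/app/oracle/OraHome1 in the same way ocommon does, which the article implies but does not show. suggest_links is a hypothetical helper name.

```sh
# suggest_links BASE HOME NAME...
# For each NAME that is missing under BASE but present under BASE/HOME,
# print the ln -s command that would create the link. Nothing is changed.
# Hypothetical helper for illustration.
suggest_links() {
    base=$1 home=$2
    shift 2
    for name in "$@"; do
        if [ ! -e "$base/$name" ] && [ -d "$base/$home/$name" ]; then
            echo "ln -s $base/$home/$name $base/$name"
        fi
    done
}
```

A call such as "suggest_links /u01/app/oracle OraHome1 ldap oracore" prints one ln -s line per directory that needs a link. Those lines can be checked and then run by hand.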
Running opensnoop again, we see no more errors showing up, and vmstat shows CPU idle back up at 100:
Obviously, load averages go back down as well:
{callacct:/u01/app/oracle} w
9:57am up 4 day(s), 22:33, 3 users, load average: 0.02, 0.10, 0.12
And there are no more defunct processes.
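To compare the before and after figures without reading them by eye, the 1-minute load average can be extracted from the header line that w (or uptime) prints. This is a sketch only: load1 is a hypothetical helper name, and it assumes the "load average: a, b, c" wording shown in the output above.

```sh
# load1: read the header line of w or uptime on stdin and print the first
# (1-minute) load average. Hypothetical helper for illustration.
load1() {
    sed -n 's/.*load average: *\([0-9.][0-9.]*\),.*/\1/p'
}
```

Run as "w | load1", this prints 0.22 for the first capture in this article and 0.02 for the second.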
About the Author
Derek Crudgington can be reached at dacrud@gmail.com, or see his web site at:
http://hell.jedicoder.net.