Case Study: Using DTrace and truss in the Solaris 10 OS
Wynne Wang, April 2008
This case study focuses on leveraging DTrace and truss to troubleshoot the Solaris 10 Operating System.
DTrace is a debugging feature of the Solaris 10 OS.
Recently, a customer called me about an unusual issue with the Solaris 10 OS. I have been supporting this customer's Web 2.0 online
product for the last six months.
The customer found that he couldn't add a new user. He tried to add a user using a command similar to
useradd -m -d /home/test test to indicate the user's home directory as /home/. The following error message was displayed:
# useradd -m -g test -d /home/test -c "test user." \
-m -s /bin/bash test
UX: useradd: ERROR: Unable to change owner of files home \
directory: No such file or directory.
So, the customer called me for help. It seemed like a simple Solaris configuration issue. We know that Solaris automounts the
/home directory. So, if we want to use the /home directory, we must make sure autofs
is not running, and then we can create contents inside /home.
All the steps are easy and well documented. We opened the /etc/auto_master file and it looked like this:
We commented out the last line as follows, which turned off the autofs mount for the
home directory in the configuration:
#/home auto_home -nobrowse
Then we restarted the autofs service, and tried the useradd command again.
It should have worked well now, right? But what surprised us is that we still got the following response:
UX: useradd: ERROR: Unable to change owner of files home \
directory: No such file or directory.
What could be wrong?
Searching the knowledge database provided nothing useful. Several similar questions appear there, but the solution didn't help.
So, I tried using truss to dig into the command. What on earth was the command doing? It printed out every
syscall the application made and the parameters. By examining
the detailed log info, I found a hint.
#truss -f -o /tmp/useradd.out useradd -m -d /home/test test
I used -f to follow the children in case they forked a new process. And I opened
the /tmp/useradd.out file, which holds the log info while the
useradd command runs:
After digging for hours, I found a clue. The useradd process called fork() to generate
several children, and one of the children, 1755, died abnormally.
Received signal #18, SIGCLD [caught]
1749: siginfo: SIGCLD CLD_EXITED pid=1755 status=0x0001
When I compared the output to the output of a normal useradd command, I found that SIGCLD didn't appear
in the normal output. After that signal, the children all exited and caused the main process error. Then the useradd
command ran into an error. I wondered if this was the root cause of the problem. Who sent the signal to the innocent child?
Fortunately, Sun has a powerful debugging tool, DTrace, which could help us to dig into the Solaris kernel to find out what
it was doing. So I wrote a small DTrace probe to log who was sending the signal.
The probe looked like this:
#!/usr/sbin/dtrace -qs
proc:::signal-send
/args[2] == SIGCHLD/
{
printf("SIGCLD was sent by %s pid=%d \n", \
args[1]->pr_fname,args[1]->pr_pid);
}
This probe would trigger when something sends a SIGCLD signal. The pr_fname is the process name, and the
pid is the process ID (PID).
I started the probe and made it wait in the background, and then I ran the useradd command again.
It worked. DTrace honestly reported the process name and PID that sent the SIGCLD signal. SIGCLD was sent by find
and the PID was 1499.
I couldn't believe it. The find command is the most used UNIX command. Why did it close my child?
I used which find to get its exact location, and learned that it is the /usr/bin/find program.
So I needed to go to that location and determine why.
I went to the /usr/bin directory. The poor find was there, and it seemed quiet and innocent.
So I used file /usr/bin/find to see what on earth this file really is. The output told me it was a script file. Normally,
find should be a binary executable. So I used vi /usr/bin/find to see the content of the file.
Inside the find file, I found that the script looked like this:
It seemed like this find script checks whether /usr/sbin/i86/ps exists, and if it does, it calls
the real find program to do the work and filter the output with keywords such as EWG, glftpd, and so on.
I knew that find is very a frequently used program, and it should not hide any info. So, why does this script filter the info?
Did someone modify find to try to cover up something?
I ran ls -tral /usr/bin to see the last time find was modified. The output looked like this:
-rwxrwxrwx 1 root other 177 7 30 17:33 w
-rwxrwxrwx 1 root other 181 7 30 17:34 who
-rwxrwxrwx 1 root other 87 7 30 17:36 uptime
-rwxrwxrwx 1 root other 239 7 30 17:37 ptree
-rwxrwxrwx 1 root other 311 7 30 17:41 netstat
-rwxrwxrwx 1 root other 198 7 30 17:43 last
-rwxrwxrwx 1 root other 360 7 30 18:14 ls
-rwxrwxrwx 1 root other 314 7 30 18:17 cat
-rwxrwxrwx 1 root other 219 7 30 18:18 du
-rwxrwxrwx 1 root other 222 7 30 18:19 ps
-rwsr-xr-x 1 root root 6696 8 20 12:30 EWG
-rwxr-xr-x 1 root root 300 1 23 17:29 find
-r-xr-xr-x 1 root root 18720 1 23 17:49 ls
-rwsr-xr-x 1 root root 6696 1 23 17:51 EWG
rwxr-xr-x 4 root bin 17920 1 24 11:28 .
It seems find, who, netstat, du, and
cat were all modified recently. So, I checked whether these commands were dirty or clean.
The command file /usr/bin/who told me that who was a script instead of a binary. The contents
looked like this: