Java SE - Native Memory Leaks and libumem

From the BigAdmin JavaSE Hub

Kevin Walls

Part of the JDK test suite is failing, either crashing or reporting out-of-memory errors.

The core files show the failing processes consuming an awful lot of native C heap, and so far this has only been seen with amd64 (x64) processes. They keep using up native heap until the system is exhausted...

The great news is that the problem then turns out to reproduce with a trivial test program:

public class ThreadKiller {
    public static void main(String[] args) throws Throwable {
        // Create, start and join short-lived threads forever.
        while (true) {
            Thread t = new Thread();
            t.start();
            t.join();
        }
    }
}

I give that a try and can't see the problem, but I am using sparcv9. On an AMD-based machine the problem is obvious: Solaris' pmap shows immediate and persistent growth.

Good machine:

The bad machine:

...and loads of patches installed (we are not saying that is a bad thing, quite the opposite...).

Having good and bad machines can be very useful - we can assess the differences for ideas of where the problem might lie. Is the architecture the problem? Are the additional patches the problem? We need more information on the specifics of the leak.

Native memory leaks - sounds like a job for libumem, the alternate memory allocator with debug features that's been with us since some late-ish version of Solaris 8.

Great article on how libumem works here: http://developers.sun.com/solaris/articles/libumem_library.html

First we need to run the test case using the umem library. We can force libumem.so into our process by setting LD_PRELOAD on the command line. There is also an environment variable, UMEM_DEBUG, to tell the library how we want it to behave; in this case the default settings are OK:

LD_PRELOAD=libumem.so UMEM_DEBUG=default java -d64 \
                    -showversion -verbose:gc ThreadKiller

That gets the program running with the umem library, which now interposes on malloc and free calls, and we wait for it to demonstrate the leak. Verify with pmap that the leak is still happening: yes, it is.
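"Persistent growth" can be pinned down a little more precisely. The sketch below is only an illustration: it assumes you have written down the process's total size (in Kb) from several successive pmap runs, and the class name, method names, and sample numbers are all made up for this example rather than taken from the machines in this article.

```java
public class GrowthCheck {
    // True when every sample is strictly larger than the one before it.
    static boolean growsPersistently(long[] totalsKb) {
        if (totalsKb.length < 2) {
            return false; // a single sample cannot show a trend
        }
        for (int i = 1; i < totalsKb.length; i++) {
            if (totalsKb[i] <= totalsKb[i - 1]) {
                return false;
            }
        }
        return true;
    }

    // Average change between consecutive samples, in Kb.
    static double averageGrowthKb(long[] totalsKb) {
        if (totalsKb.length < 2) {
            return 0.0;
        }
        long change = totalsKb[totalsKb.length - 1] - totalsKb[0];
        return (double) change / (totalsKb.length - 1);
    }

    public static void main(String[] args) {
        // Hypothetical totals, NOT measurements from the article.
        long[] suspect = {100_000, 104_000, 108_500, 113_000};
        long[] steady = {100_000, 100_200, 100_100, 100_200};
        System.out.println("suspect grows: " + growsPersistently(suspect));
        System.out.println("steady grows:  " + growsPersistently(steady));
        System.out.println("suspect Kb per sample: " + averageGrowthKb(suspect));
    }
}
```

A process that settles after start-up will fail the strict check sooner or later; one that is leaking keeps passing it for as long as you keep sampling.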

Take a core file with gcore ("gcore pid"), and look at it in mdb ("mdb programname corefilename").

There are extensions to mdb when using libumem, e.g. "::umem_status" to check it was in place, and for memory corruption issues we might start with "::umem_verify". In our case, there's one obvious command: "::findleaks":

Leaks are not always easy to assess. Is a malloc without a free a leak? But what if the application is truly intending to perform the free later? At what point is something an allocation, and at what point is it a leak?

For that reason, we probably don't want to focus on an allocation which has only one or two instances, but we should look at the largest problem first...

umem is a "slab allocator". Memory is arranged into caches, which contain slabs, which contain buffers. The buffer is where the data actually goes.

A "bufctl_audit" structure in libumem contains information for a buffer including pointers to its slab and cache, but crucially for tracking memory leaks it stores a stacktrace recorded at allocation.
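To see why a call site recorded at allocation time is so useful, here is a toy model in Java. It is not how libumem is implemented: the class, its methods, and the fake addresses are invented for illustration, and only the name of the calling method is stored rather than a full stack trace. The idea it sketches is the same, though: remember who allocated each buffer, forget the record on free, and then count what is still outstanding per call site.

```java
import java.util.HashMap;
import java.util.Map;

public class AuditedAllocatorSketch {
    // Fake address -> name of the method that allocated it (a toy audit record).
    private final Map<Long, String> outstanding = new HashMap<>();
    private long nextAddress = 0x1000;

    long allocate(int size) {
        long address = nextAddress;
        nextAddress += size;
        // Frame 0 is allocate() itself; frame 1 is whoever called it.
        StackTraceElement caller = new Throwable().getStackTrace()[1];
        outstanding.put(address, caller.getMethodName());
        return address;
    }

    void free(long address) {
        outstanding.remove(address);
    }

    // Number of still-allocated buffers per allocating method.
    Map<String, Integer> outstandingBySite() {
        Map<String, Integer> counts = new HashMap<>();
        for (String site : outstanding.values()) {
            counts.merge(site, 1, Integer::sum);
        }
        return counts;
    }

    // The call site with the most outstanding buffers: the one to look at first.
    String worstSite() {
        String worst = null;
        int max = 0;
        for (Map.Entry<String, Integer> e : outstandingBySite().entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                worst = e.getKey();
            }
        }
        return worst;
    }

    static void leakyPath(AuditedAllocatorSketch a) {
        a.allocate(64); // never freed
    }

    static void tidyPath(AuditedAllocatorSketch a) {
        long p = a.allocate(64);
        a.free(p);
    }

    static AuditedAllocatorSketch runDemo() {
        AuditedAllocatorSketch a = new AuditedAllocatorSketch();
        for (int i = 0; i < 5; i++) {
            leakyPath(a);
            tidyPath(a);
        }
        a.allocate(128); // a one-off allocation that may simply be freed later
        return a;
    }

    public static void main(String[] args) {
        AuditedAllocatorSketch a = runDemo();
        System.out.println("outstanding by site: " + a.outstandingBySite());
        System.out.println("look first at: " + a.worstSite());
    }
}
```

The single allocation made directly in runDemo is the kind of entry we would not chase first, while the site that keeps accumulating buffers stands out.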

So which of the screenful of numbers above do we need to pay attention to?

Let's focus on the most interesting entry (the one with a high number in the "leaked" column): pull out the BUFCTL address and get more information using an mdb macro:

Pipe that through c++filt, a tool that ships with the compilers and demangles C++ symbols from the standard input or command line (although I just read about a $G command that enables this within mdb):

The C++ Runtime library, libCrun, is making the allocation. The thread is in the process of exiting, which happens a lot in this test program, and the "ex" functions relate to exception handling (C++ exceptions, not Java).

The problem system was using C++ runtime library patch 119964-06. The current revision is -07, and it is always good to know how the latest update affects a problem.

We can quickly test that without installing the actual patch:

LD_PRELOAD=./lib.07/amd64/libCrun.so.1 \
            /usr/java/bin/amd64/java -d64 \
            -showversion -verbose:gc ThreadKiller

(That probably doesn't need the -d64 in this case, as we are running the 64-bit java directly.)

Using libCrun.so.1 from 119964-07, memory usage is stable.

Now we notice the readme for the patch includes this bug fix:
6391358 C++ code calling pthread_exit() on amd64 leads to a memory leak

So we were hitting an existing bug that, thankfully, has an existing fix!
