Part of the JDK test suite is failing, either crashing or reporting out-of-memory errors.
The core files show the processes using an awful lot of native C heap, and so far this has
only been seen with amd64 (x64) processes: they use up native heap until the
system is exhausted...
The great news is that it then reproduces with a trivial test program:
    public class ThreadKiller {
        public static void main(String[] args) throws Throwable {
            while (true) {
                Thread t = new Thread();
                t.start();
                t.join();
            }
        }
    }
I give that a try and can't see the problem - but I am using sparcv9. On
an amd64 machine the problem is obvious: Solaris' pmap shows immediate
and persistent growth.
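To compare the good and bad machines over the same amount of work, a bounded variant of the reproducer can be handy. This is only a sketch: the class name BoundedThreadKiller, the runCycles helper and the default batch sizes are illustrative and not part of the original test.

```java
public class BoundedThreadKiller {
    // Start and join `cycles` empty threads, one at a time, as ThreadKiller does.
    // Returns the number of threads that were started and joined.
    static int runCycles(int cycles) throws InterruptedException {
        int completed = 0;
        for (int i = 0; i < cycles; i++) {
            Thread t = new Thread();
            t.start();
            t.join();
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        // Both arguments are optional; the defaults are arbitrary.
        int batches = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        int perBatch = args.length > 1 ? Integer.parseInt(args[1]) : 1000;
        long total = 0;
        for (int b = 1; b <= batches; b++) {
            total += runCycles(perBatch);
            System.out.println("batch " + b + ": " + total + " threads started and joined");
        }
    }
}
```

With large enough batch counts, the progress lines give fixed points at which to sample the process with pmap on each machine.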
Good machine:
The bad machine:
...and loads of patches installed (we are not saying that is a bad thing,
quite the opposite...).
Having good and bad machines can be very useful - we can assess the differences
for ideas of where the problem might lie. Is the architecture the problem?
Are the additional patches the problem? We need more information on the specifics
of the leak.
Native memory leaks - sounds like a job for libumem, the alternate memory allocator
with debug features that's been with us since some late-ish version of Solaris 8.
First we need to run the testcase using the umem library - we can force libumem.so
into our process by setting LD_PRELOAD on the command line. There is also an
environment variable, UMEM_DEBUG, to tell the library how we want it to behave - in this case
the default settings are OK:
That gets the program running with the umem library, which interposes on
malloc and free calls, and we wait for it to demonstrate the leak. Verify with pmap
that the leak is still happening: yes, it is.
Take a gcore ("gcore pid"), and look at it in mdb ("mdb programname corefilename").
There are extensions to mdb when using libumem, e.g. "::umem_status" to check
it was in place, and for memory corruption issues we might start with "::umem_verify".
In our case, there's one obvious command: "::findleaks":
Leaks are not always easy to assess. Is a malloc without a free a leak? But
what if the application is truly intending to perform the free later? At what
point is something an allocation, and at what point is it a leak?
For that reason, we probably don't want to focus on an allocation which has only
one or two instances, but we should look at the largest problem first...
umem is a "slab allocator". Memory is arranged into caches, which contain
slabs, which contain buffers. The buffer is where the data actually goes.
A "bufctl_audit" structure in libumem contains information for a buffer including
pointers to its slab and cache, but crucially for tracking memory leaks it stores
a stacktrace recorded at allocation.
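The value of recording a stack trace at allocation can be shown with a toy model. The sketch below is not libumem's implementation and its names (AuditSketch, Record, leaky, tidy) are made up for illustration: it keeps an audit record per outstanding buffer and groups never-freed buffers by the code that allocated them. The real ::findleaks does more than this, since it also checks whether anything still refers to a buffer. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AuditSketch {
    // Toy audit record: a buffer id plus the stack captured when it was allocated.
    static final class Record {
        final long bufferId;
        final StackTraceElement[] stack;

        Record(long bufferId, StackTraceElement[] stack) {
            this.bufferId = bufferId;
            this.stack = stack;
        }

        // stack[0] is alloc() itself, so stack[1] is the code that asked for memory.
        String site() {
            return stack.length > 1 ? stack[1].getMethodName() : "unknown";
        }
    }

    private final Map<Long, Record> outstanding = new HashMap<>();
    private long nextId = 1;

    long alloc() {
        long id = nextId++;
        outstanding.put(id, new Record(id, new Throwable().getStackTrace()));
        return id;
    }

    void free(long id) {
        outstanding.remove(id);
    }

    // Count never-freed buffers per allocation site.
    Map<String, Integer> outstandingBySite() {
        Map<String, Integer> counts = new HashMap<>();
        for (Record r : outstanding.values()) {
            counts.merge(r.site(), 1, Integer::sum);
        }
        return counts;
    }

    // Allocation sites ordered by descending count, biggest suspect first.
    List<Map.Entry<String, Integer>> worstFirst() {
        List<Map.Entry<String, Integer>> sites =
                new ArrayList<>(outstandingBySite().entrySet());
        sites.sort((a, b) -> b.getValue() - a.getValue());
        return sites;
    }

    // Two hypothetical callers: one forgets to free, one does not.
    static void leaky(AuditSketch heap) { heap.alloc(); }
    static void tidy(AuditSketch heap) { heap.free(heap.alloc()); }

    public static void main(String[] args) {
        AuditSketch heap = new AuditSketch();
        for (int i = 0; i < 3; i++) leaky(heap);
        for (int i = 0; i < 2; i++) tidy(heap);
        for (Map.Entry<String, Integer> e : heap.worstFirst()) {
            System.out.println(e.getValue() + " outstanding from " + e.getKey());
        }
    }
}
```

Only the leaky caller shows up in the summary, with a count of 3. Sorting by count mirrors the approach of looking at the largest group of leaked buffers first.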
So which of the screenful of numbers above do we need to pay attention to?
Let's focus on the most interesting (high number in the "leaked" column): pull out
the BUFCTL address and get more information using an mdb macro:
Pipe that through c++filt, a tool that ships with the compilers and demangles C++ symbols
from the standard input or command line (although I just read about a $G command that
enables this within mdb):
The C++ Runtime library, libCrun, is making the allocation. The thread is in the
process of exiting, which happens a lot in this test program, and the "ex" functions
relate to exception handling (C++ exceptions, not Java).
The problem system was using C++ runtime library patch 119964-06. The current revision is -07, and it is always good to know how the latest update affects a problem.
We can quickly test that without installing the actual patch: