Learn the latest on developing scalable apps, architecting scalable infrastructures, and optimizing across CPUs, memory, and middleware. Explore the pros and cons of horizontal vs. vertical scalability, and size up grid computing solutions.
(Q): If you had to pick the top 10 things that you must monitor on any server to look for
performance and/or scalability issues, what would they be?
Richard McDougall (A): Off the top of my head, in no particular order:
- CPU: Check idle time and run queue length.
- If there's a CPU bottleneck, check whether it's an application or kernel CPU utilization
issue with mpstat: a high usr percentage indicates an application issue. High sys may
point to high network load or lock contention.
- Memory: Check mdb's ::memstat to ensure there is sufficient free memory.
- Network: Check that networks are not overloaded by comparing the bytes transferred
against the available bandwidth per link.
- CPU for network: check if any CPUs are 100% busy servicing network interrupts. CPUs
at 100% in mpstat, or intrstat are possible candidates.
- File system latency: check the application visible latency with DTrace at the system call
level (perhaps fsstat, iosnoop, or an aggregation around system calls).
- Storage latency: check disk latency with iostat
- Application-level lock contention: application-level locks are now visible with
plockstat.
- Kernel level locks: Check for hot locks with lockstat.
- Check MMU activity on SPARC using trapstat. Sometimes an application may be
reported as running 100% in user mode, but may actually be spending a significant
amount of time in kernel mode servicing TLB misses. trapstat will show the % of time
spent servicing TLB misses. If a significant amount of time (>10%) is evident, then large
MMU pages may help.
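The network check in the list above (bytes transferred against the available bandwidth per link) comes down to simple arithmetic. Here is a minimal Python sketch, assuming you have already sampled a link's byte counter twice; the function name and the sample figures are hypothetical and not taken from any Solaris tool.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_bits_per_s):
    """Return the fraction of link bandwidth used between two counter samples."""
    if interval_s <= 0 or link_bits_per_s <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_sent = (bytes_end - bytes_start) * 8  # counters are in bytes, links in bits
    return bits_sent / (interval_s * link_bits_per_s)


# Hypothetical sample: 600 MB moved in 10 seconds on a 1 Gbit/s link.
utilization = link_utilization(0, 600_000_000, 10, 1_000_000_000)
print(f"{utilization:.0%} of link bandwidth in use")
```

A link that sits near 100% for sustained periods is a candidate for the "overloaded network" case described above.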
(Q): How does Solaris compare with other platforms in terms of handling large numbers of
sockets and threads? What are the important configuration settings relating to this?
Pallab Bhattacharya (A): Socket handling is done by the application that accepted the
connection, such as a web server. If a single process handles all the connections, then
increasing the file descriptor ulimit is necessary; the default maximum in Solaris 10 is
64K per process. The default can be changed by adding the following to /etc/system:

set rlim_fd_max=NNNNN

This value determines the maximum number of file descriptors, including open sockets,
that a process can have. The ulimit/limit command line interface or the setrlimit(2) API
can then be used to adjust the per-process limit up to rlim_fd_max.

If the arrival rate of connections is also very high, then the listen backlog parameter for
the endpoint must also be bumped up (the app calls listen() with a backlog argument); the
backlog must be less than tcp_conn_req_max_q and tcp_conn_req_max_q0. These values can
be changed using ndd -set /dev/tcp tcp_conn_req_max_q and ndd -set /dev/tcp
tcp_conn_req_max_q0.

The number of threads that a process can have depends on the total available memory and
the address space available to the process, since each thread has its own stack. There is
no tunable to control the number of threads per process. The default stack size for a
thread is 1MB; it can be changed at thread creation. Bottom line: if designed properly
with thread pools and a workQ, a single process can handle a very, very large number of
connections while keeping throughput reasonably constant and response times sub-second.
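The per-process adjustment via setrlimit(2) mentioned above can be sketched as follows. This is a minimal illustration using Python's standard resource module, which wraps getrlimit/setrlimit; the helper name raise_fd_soft_limit is hypothetical, and the hard limit it respects is whatever the system (rlim_fd_max on Solaris) allows.

```python
import resource


def raise_fd_soft_limit(wanted):
    """Raise this process's soft file-descriptor limit toward `wanted`.

    The soft limit can never exceed the hard limit, so the request is
    clamped to it. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or wanted <= soft:
        return soft  # already high enough; leave the limit alone
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

A server that expects many concurrent sockets would call something like this once at startup, before it begins accepting connections.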
(Q): How many locks does ZFS add to the OS, and how many locks does UFS use?
Dan Price (A): 7, and 15, respectively. Just kidding. But seriously: The number of locks a
given subsystem uses isn't really an indicator of much of anything. ZFS is designed to be
highly threaded, and to scale to thousands of spindles and vast amounts of I/O flowing
through the system. UFS does OK on this count, but there are some well-known points of
contention, particularly the single writer lock: you can only have one thread writing to a
given file at a time. ZFS does not have this problem, due to better threading design.
As a side note, in a threaded system, the number of locks can actually scale with the
amount of data being used. It's not uncommon to allocate data structures which themselves
contain locks.
(Q): Where can I find information on lock free algorithms and how can they improve
scalability?
Richard McDougall (A): Check out Phil Harman's blog for an example:
http://blogs.sun.com/pgdh?entry=caring_for_the_environment_making.
(Q): How do zones scale compared with Xen and VMware?
Dan Price (A): Zones scale extremely well; we think that you'll be able to cram more
zones onto the system than you can with other virtualization technologies. Additionally, zones
make good use of sharing inside the OS, further reducing the resource cost of deploying
them. And there is no "context switch" penalty. For more on how zones really work, I'll
plug my own paper: http://blogs.sun.com/roller/page/dp/20041120#lisa_04_solaris_zones_operating.
(Q): How can Sun Microsystems' products (including software and hardware) help me
design better software, with more quality?
Uday Shetty (A): Depending on the nature of the product (native or Java), Sun offers a
number of software products. Sun Studio Tools or Java Enterprise Studio help
design better software. You can use DTrace or collect/Analyzer to identify performance
bottlenecks. These tools can be run on non-debug code without affecting the overall
performance. You can check opensolaris.org/os/community/DTrace where you can find
information as well as pre-built DTrace scripts. Check out developer.sun.com for more
information on Analyzer.
(Q): It seems like chips are getting faster in terms of MHz, but my applications aren't
necessarily getting faster. What is Sun doing to improve this?
Hugo Rivero (A): Clock rate is not the only indicator of a system's performance. While
processor transistor density (and clock rate) has followed Moore's law, memory is not
getting faster at the same rate. Processors get faster but stall more often, waiting for
memory references. Sun is taking a different approach with throughput computing: instead
of ever increasing clock speed, focus on doing more things at the same time. We will soon
release a processor with Chip multi-threading: multiple cores per chip, multiple threads per
core. This will allow multi-threaded or multi-process apps to take better advantage of the
chip.
(Q): What was the reason to merge the libthread library into libc in Solaris 10? Does this
simplify and reduce complexity when building applications?
Richard McDougall (A): Rolling libthread into libc makes it much easier to add
threading to your application. Previously, there were two models, non-threaded and
threaded; choosing to link with -lthread would introduce some small overheads
in some areas, and warranted testing the whole app as a multi-threaded app. Here's an
example we had recently; a software vendor wanted to use threads in only one of their
background processes, but their application is shipped with the majority of the code in a
shared library. By using the new s10 libc model, they are now able to call thread_create in
this daemon alone, without needing to retest the rest of the app as multi-threaded.
(Q): We see tlb/tsb misses taking a large (>10%) amount of time under load. Any
comments you can make about the project to enable large pages on memory mapped files,
text pages, etc?
Adam Leventhal (A): Large pages are now supported in the latest Solaris 10 patches and
in OpenSolaris for text and data pages. Work is ongoing for mapped files.
(Q): How many CPUs does Solaris scale to? Is the much touted scalability what makes
Solaris slower than Linux for small servers?
Dan Price (A): CPU scalability depends on the specific platform. In this case, I'm taking
CPU to mean "CPU core". So on a fully loaded E25K, you can have 72 US-IV+ chips, each
of which presents two cores. That means that you're at 144 cores. We've done a ton of work
in Solaris 10 and beyond to close the performance gap against other OS's on small systems
(we even have an active "small systems" performance team). We have frequently
benchmarked ahead of other OS's on 64-bit kernels on AMD64. Come hang out on
perf-discuss@opensolaris.org for more.
(Q): Re: the single write lock Dan mentioned in response to an earlier question... Is it per
file or per filesystem (as I always heard way back when)?
Dan Price (A): It's per file. We have a reader/writer lock per file; so you can have many
readers at a time, but all readers get locked out when the single writer is writing. In UFS,
this locking (which is actually a POSIX mandated behavior) is implemented per-file. This
means that with applications which have hundreds or thousands of threads and few files,
you can have scalability issues. In ZFS, the locking is per-block, which should help things
significantly.
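To illustrate why finer-grained locking helps when many threads share few files, here is a toy Python sketch of per-block locking. It is only an illustration of the idea under stated assumptions, not how UFS or ZFS implement their locks: the class name PerBlockLockedFile is hypothetical, the "file" is an in-memory buffer, and plain mutexes stand in for reader/writer locks.

```python
import threading


class PerBlockLockedFile:
    """Toy in-memory 'file' guarded by one lock per fixed-size block."""

    def __init__(self, size, block_size=4096):
        self.block_size = block_size
        self.data = bytearray(size)
        n_blocks = (size + block_size - 1) // block_size
        self.locks = [threading.Lock() for _ in range(n_blocks)]

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside file bounds")
        if not payload:
            return
        first = offset // self.block_size
        last = (offset + len(payload) - 1) // self.block_size
        held = self.locks[first:last + 1]
        for lock in held:  # always acquire in ascending order to avoid deadlock
            lock.acquire()
        try:
            self.data[offset:offset + len(payload)] = payload
        finally:
            for lock in reversed(held):
                lock.release()
```

With a single per-file lock, two writers touching different regions of the same file would serialize; here they only contend when their writes overlap the same block.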
(Q): Re: thread local storage (TLS). We currently use quite a bit of
pthread_get/set_thread_specific operations. Can you comment as to whether TLS offers a
performance improvement? Or is it simply syntax?
Adam Leventhal (A): It's not just syntax, and TLS should provide a significant
improvement over TSD in terms of performance. The underlying mechanisms are quite
different and the compiler can use some specific optimizations for TLS.
(Q): We're going to have all these very parallel systems, and then we get faced with classic
choke points, like concurrency checking for licenses and logins, or shared data pools. Is that
all pushed into messaging service, ESBs and the like, or are there other alternatives to
allowing the scalability while still having active coordination points?
Uday Shetty (A): I think you are better off using an LDAP server for
authentication/authorization, and a licensing server. The directory server in the Java
Enterprise Suite scales extremely well for large enterprise. You can use messaging service
for shared data pools.
(Q): Do you think the SPECint2000_rate benchmark is useful to compare SPARC systems
against AMD, IBM, HP, Intel etc when used for RDBMS?
Shane Sigler (A): SPECint2000_rate is not a good option for comparing the systems for
DB performance. The TPC benchmarks are more applicable as well as benchmarks like the
SAP benchmarks which more closely mirror what customers are going to see. For example
TPC-C does 6 times as much I/O per transaction as most typical customer transactions.
(Q): Re: umem, we have a model in which thread x allocates and thread y frees. Any
guidance as to whether or when a global object cache would give better or worse
performance as compared to a single per allocating thread cache in this kind of case?
Adam Leventhal (A): Hm... interesting question. It can really depend on the specifics of
your system and your application. If structures are being allocated and freed frequently,
you could potentially incur lock contention or cache contention with a global cache. My
guess is that the per-thread cache is going to be a better bet, but in these cases gathering
data on the possible solutions is essential. Thanks for the question.
(Q): What is the best high performance shared/cluster file system that can run on Solaris
servers and other OSs?
Richard McDougall (A): I hope I can answer this one with a typical engineering response:
it depends :-) Depending on your bandwidth and metadata throughput requirements, there
are multiple solutions in this space. I'll try to answer based on a few key options; feel free
to follow up with more questions...

On Solaris, we have QFS, Lustre and NFS as good examples. QFS allows multi-writer
shared access for nodes in a cluster. For bandwidth, QFS is our highest bandwidth product.
I've personally seen QFS scaling at over 18 gigabytes per second for a large multi-stream
workload on Starcat-based systems.

So the bigger question is what type of shared cluster? NFS is designed as a shared file
system, and is certainly the most common clustered file system in use today. Wait, NFS as
a high bandwidth solution? Ethernet bandwidth has been quietly doubling, at a faster pace
than Fibre Channel transports. With Solaris 10's FireEngine networking stack, we are now
seeing several hundred megabytes per second on a single NFS connection. This means it's
now possible to build out a high performance storage grid using NFS, if your requirements
are on the order of a gigabyte per second per node.

On a final note, if your application is metadata-intensive, i.e., its workload profile is
dominated by open, create, and attribute operations, NFS is the clear winner.

In summary, QFS is by far our highest bandwidth option today. I wrote a blog about this
earlier this year, which might help expand on the emergence of IP storage in this area.
(Q): Hugo: thanks for the reply. I'm very interested in using your new Niagara system for
my computation-intensive app, but I heard that the chip doesn't do so well with
floating-point. What's the deal?
Hugo Rivero (A): Our first implementation of the Niagara processor will focus on
integer-intensive apps. Future versions will ease this restriction. If your application is
running on UltraSPARC processors today, you can gauge how floating-point-intensive it is
by looking at the cpustat counters:
(FA_pipe_completion + FM_pipe_completion) / Instr_cnt.
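The counter ratio above can be worked through as a small calculation. This is a sketch only: the function name is hypothetical, the sample counter values are invented for illustration, and collecting the real counts with cpustat is assumed to happen separately.

```python
def fp_instruction_fraction(fa_pipe_completion, fm_pipe_completion, instr_cnt):
    """Return (FA_pipe_completion + FM_pipe_completion) / Instr_cnt."""
    if instr_cnt <= 0:
        raise ValueError("Instr_cnt must be positive")
    return (fa_pipe_completion + fm_pipe_completion) / instr_cnt


# Invented sample counts, for illustration only.
fraction = fp_instruction_fraction(
    fa_pipe_completion=30_000,
    fm_pipe_completion=20_000,
    instr_cnt=1_000_000,
)
print(f"floating-point share of instructions: {fraction:.1%}")
```

A low ratio suggests the workload is mostly integer work; a high ratio suggests floating-point performance deserves a closer look before choosing a platform.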
(Q): Can you relate any information about estimated performance of ZFS as a filesystem
over some of your competitors, specifically VxFS?
Dan Price (A): We can't at this time. But I think we'd like to put ZFS in everyone's hands
as soon as possible; see also http://blogs.sun.com/bonwick.
(Q): We're having some issues to make our strategy more competitive, and reading the
PDF and listening to the presentation Scale my Apps, I was wondering how to increase the
software development process.
Uday Shetty (A): I think it's important to make sure the software design process includes
requirements such as scalability in the product design doc. Also, have a workload in mind
that could traverse all the code paths to make sure the application could perform and scale
well based on the customer environment. I don't have a specific tool in mind that could help
with the development process, but you can try using commercial UML tools.
(Q): Will AMD Opterons ever be as scalable as Sun's UltraSparc systems?
Shane Sigler (A): It depends on how you want to define scalability. As CPUs get more
cores and threads per chip, the scalability picture changes. We do work closely with AMD
on the best way to build larger systems and you will see more of this in the future.
(Q): By experiment, I observe that binding to psrsets improves throughput, but it seems
difficult to find a way to integrate the use (or avoidance) of this across multiple system
types which have arbitrary numbers of CPUs and interrupt distributions. Any guidance?
Dan Price (A): Yes! In Solaris 9, we introduced the pools facility and in Solaris 10 refined
it further. Pools allows one to create persistent pools of CPUs (i.e. persistent across reboot).
You can also configure the pools of CPUs with rules like: 2-5 CPUs. As for interrupt
distribution, in our development builds of Solaris (and probably in an S10 update), we have
a new daemon called 'intrd' which automatically redistributes interrupts on the system. If
you want to see how it works, you can give Solaris Express a whirl at
http://www.sun.com/solaris-express
(Q): How do you decide when libumem will help with performance?
Adam Leventhal (A): If your application allocates a lot of memory from many different
threads, libumem will probably give you a performance improvement. You can also use the
new plockstat(1M) command to look for lock contention in your application. If you see that
malloc_lock is heavily contended, libumem is probably going to help a lot. Trying libumem
is a snap: LD_PRELOAD=libumem.so.1
(Q): How much impact, on average, does DTrace impose on a running system while
performing an analysis?
Adam Leventhal (A): More important is the overhead when DTrace isn't in use at all, and
that's zero: there's no tax for having the ability to examine your system. When you're using
DTrace, you pay as you go: the more probes you enable and the more they're hit and the
more work you do, the more overhead the instrumentation will incur. We haven't taken the
time to measure enabled overhead since that's of less concern to us than the disabled probe
effect.
(Q): We have an application that takes an indefinite time between the request and the
response. So, to make it scale we need to handle the request and response on separate
threads, or even better to handle a queue of service objects (request/response pairs) with
our own thread pool. This does not appear to be possible with the current servlet API. Is
there a way to implement this in an application-server-portable fashion?
Pallab Bhattacharya (A): It is true that there is no such API as of now. The queue-pair is
a feature of the servlet container. I think the approach described in the question is a good
approach. If the time spent in each request is mostly uniform, then having a workQ and a
single thread pool is good enough.
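The workQ plus single thread pool pattern described above can be sketched in a few lines. This is a minimal Python illustration of the pattern itself, not of any servlet container API; the names run_pool and handler are hypothetical, and error handling inside workers is left out for brevity.

```python
import queue
import threading


def run_pool(handler, requests, workers=4):
    """Process requests on a fixed pool of threads fed by one work queue."""
    requests = list(requests)
    work_q = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work_q.get()
            if item is None:  # sentinel: no more work for this thread
                break
            req_id, payload = item
            response = handler(payload)
            with results_lock:
                results[req_id] = response

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in enumerate(requests):  # (request id, payload) pairs
        work_q.put(item)
    for _ in threads:  # one sentinel per worker
        work_q.put(None)
    for t in threads:
        t.join()
    return [results[i] for i in range(len(requests))]
```

For example, `run_pool(lambda x: x * 2, [1, 2, 3])` hands three requests to the pool and collects the responses in request order. If request times vary widely, a single queue can leave slow requests blocking fast ones, which is why the answer qualifies the advice with "mostly uniform".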
(Q): Hello guys. Many times many customers are asking how Solaris, especially Solaris
10, compares with Linux based systems in terms of scalability. Would you share two or
three important points on this topic? Thanks.
Richard McDougall (A): Scalability is one of the big bets that was the design center focus
for Solaris 2.0 (the rewrite of Sun's original BSD-based SunOS). At that time we focused
the design on scalability, betting that SMP would be a longer-term way of scaling up
performance (as opposed to faster CPUs). Solaris 2.x was a ground-up redesign based on
threads. In contrast to adapting the process-based kernel model, this provided kernel-level
thread scalability in key areas including scheduling, networking and I/O. This was an
important design point for large SMP systems. Today, with the latest Solaris kernel, we are
seeing scaling to 208 cores on the large Starcat systems.

On low-end systems, scalability wasn't such an important requirement. Interestingly, due
to chip-level multi-threading, scalability has recently become a key requirement. Even
low-end one-chip systems like Niagara have 32 virtual CPUs, making scaling to the order
of 32 critical...

In Solaris 10, there are several areas I've seen that show significant scalability differences.
For example, here are a few:
- Systems with many CPUs or large number of virtual CPUs
- Highly threaded applications: the new 1:1 thread model is now integrated into libc and
enabled by default. We've seen a single multi-threaded app scale to over 100 CPUs.
- Memory allocators: the libumem allocator provides a userland equivalent of the slab
allocator.
- File systems: UFS seems to be on par with other file systems on the low end, but
differentiates itself as we scale up the number of disks. There's a white paper showing
some of the scaling curves. I'll be publishing some data showing file system
performance scaling in the near future. The scaling differentiator is typically
visible at 20 disks and up.
- ZFS also represents a significant step forward in scalability; its highly scalable design
overcomes many of the single writer lock and per-directory scaling issues seen in other
file systems.
- The final note on scalability is about observability. The single most important enabler to
improve your application scalability is being able to identify key scaling limitations and
remove them. This is where Solaris 10's DTrace is a key asset; as it makes it possible to
identify and overcome key scaling issues in as little as minutes.
(Q): We've played with DTrace and think it's fab. Are there any sample scripts available to
monitor disk I/O bottlenecks?
Dan Price (A): Yes! There is a script included in /usr/demo/dtrace called 'iosnoop.d'. That
is a good place to start. There's also a good collection of scripts here:
http://users.tpg.com.au/adsln4yb/DTrace.html Rich is yelling at me across the table that
we'll soon have 'fsstat' based on DTrace. Keep your eyes on
dtrace-discuss@opensolaris.org for the announcement of that.
(Q): Are there any general DTrace scripts available that will help me determine if my app
will scale?
Adam Leventhal (A): Yes. When we're talking about making an app scale we're often
looking for lock contention in which case plockstat(1M) is a great place to start. That's not a
DTrace script, but a specialized DTrace consumer (however you can see the underlying D
script by doing plockstat -V). Other scripts you might want to look at are the examples in
the sched provider chapter in the Solaris Dynamic Tracing Guide.
(Q): What is the one most critical performance issue faced by our customers, especially
with Solaris 10? Can you please highlight any bottlenecks engineers need to be aware of?
Adam Leventhal (A): That's a very broad question and there's really no _one_ scalability
or performance issue that impacts all customers uniformly. In light of that, we've taken the
approach of providing better systemic observability through DTrace so that those unique
bottlenecks can be identified and removed.
(Q): Everybody knows that XML parsing imposes a big overhead, but we adopt it in order
to maintain compatibility and scalability. How can we assess the effect of removing XML
message exchange in order to get better performance?
Dan Price (A): It sounds like this is a measurement problem. Where is XML consuming
too many processing cycles? Are you using the right XML processing technology (DOM or
SAX) for your application? You may want to experiment with using DTrace
( http://www.opensolaris.org/os/community/DTrace ) to get some sense of the amount of
time and/or CPU cycles that are being consumed; you can also add static probes to your
application using DTrace, which could give you more abstract semantics like
"message-parse-start" and "message-parse-end".
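The idea of bracketing message parsing with start and end markers can be illustrated without DTrace. The Python sketch below uses ordinary application-level timers as a stand-in for static probes; the names timed, totals and handle_message, and the message format, are hypothetical. It shows only how to separate time spent parsing from time spent on the rest of the work.

```python
import time
import xml.etree.ElementTree as ET
from contextlib import contextmanager

totals = {}  # accumulated seconds per named phase


@contextmanager
def timed(name):
    """Accumulate the wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - start


def handle_message(raw_xml):
    """Parse a message, then process it, timing the two phases separately."""
    with timed("message-parse"):
        root = ET.fromstring(raw_xml)
    with timed("message-process"):
        return sum(int(item.text) for item in root.iter("item"))
```

After running a representative workload, comparing `totals["message-parse"]` with `totals["message-process"]` gives a first estimate of how much could be gained by changing the message format.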
(Q): How can plockstat be used to detect scalability issues within one application? Any
blueprint related with plockstat and its friends?
Adam Leventhal (A): plockstat(1M), new in Solaris 10, traces synchronization activity
in a process. It will show you lock hold times as well as lock contention. Long hold times
and lock contention will inhibit scalability so hot spots identified by plockstat are good
places to improve.