Expert Exchange Transcript

Unlocking the Secrets to the Scalability Puzzle

Wednesday, November 2, 2005

Learn the latest on developing scalable apps, architecting scalable infrastructures, and optimizing across CPUs, memory, and middleware. Explore the pros and cons of horizontal vs. vertical scalability, and size up grid computing solutions.

Panelists:
  • Adam Leventhal, Solaris Kernel Engineer
  • Dan Price, Solaris Kernel Engineer
  • Richard McDougall
  • Hugo Rivero, Distinguished Engineer, Performance and Availability Engineering
  • Uday Shetty, Senior Staff Engineer, Market Development Engineering
  • Shane Sigler, Product Definition and Strategy, Network Systems Group

Transcript


 

(Q): If you have to pick top 10 things that you must monitor on any server to look for performance and/or scalability issues...what would they be?
Richard McDougall (A): Off the top of my head, in no particular order:

  1. CPU: Check idle time and run queue length.
  2. If there's a CPU bottleneck, use mpstat to check whether it's application or kernel CPU utilization: a high usr percentage indicates an application issue, while high sys may point to high network load or lock contention.
  3. Memory: check mdb's ::memstat to ensure there is sufficient free memory.
  4. Network: check that networks are not overloaded by comparing the bytes transferred against the available bandwidth per link.
  5. CPU for network: check whether any CPUs are 100% busy servicing network interrupts. CPUs showing 100% in mpstat or intrstat are possible candidates.
  6. File system latency: check the application-visible latency with DTrace at the system call level (perhaps fsstat, iosnoop, or an aggregation around system calls).
  7. Storage latency: check disk latency with iostat.
  8. Application-level lock contention: application-level locks are now visible with plockstat.
  9. Kernel level locks: Check for hot locks with lockstat.
  10. Check MMU activity on SPARC using trapstat. Sometimes an application may be reported as running 100% in user mode, but may actually be spending a significant amount of time in kernel mode servicing TLB misses. Trapstat will show the percentage of time spent servicing TLB misses. If a significant amount of time (>10%) is evident, then large MMU pages may help.
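
As a rough illustration of item 4 in the list above, the sketch below turns two byte-counter samples into a link utilization figure. The function name and the sample numbers are made up for the example; how you obtain the byte counters (for instance from your interface statistics) is left as an assumption.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_bits_per_s):
    """Return the fraction of link bandwidth used over the sampling interval."""
    if interval_s <= 0 or link_bits_per_s <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_moved = (bytes_end - bytes_start) * 8  # counters are in bytes, links in bits
    return bits_moved / (interval_s * link_bits_per_s)


# Made-up sample: 600 MB moved in 10 seconds on a 1 Gbit/s link.
util = link_utilization(0, 600_000_000, 10, 1_000_000_000)
print(f"link utilization: {util:.0%}")
```

A link that sits near 100% by this measure for sustained periods is a candidate for the overload the answer describes.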

(Q): How does Solaris compare with other platforms in terms of handling large numbers of sockets and threads? What are the important configuration settings relating to this?
Pallab Bhattacharya (A): Socket handling is done by the application that accepted the connection, such as a web server. If a single process handles all the connections, you need to raise its file descriptor limit; the default max in Solaris 10 is 64K per process. The default can be changed by adding the following to /etc/system: set rlim_fd_max=NNNNN. This value determines the maximum number of file descriptors, including open sockets, that a process can have. You can then use the ulimit/limit command-line interface or the setrlimit(2) API to adjust the per-process limit up to rlim_fd_max. If the arrival rate of connections is also very high, the listen backlog for the endpoint must also be bumped up (the app calls listen() with a backlog value); the backlog must be less than tcp_conn_req_max_q and tcp_conn_req_max_q0. These values can be changed with ndd -set /dev/tcp tcp_conn_req_max_q <value> and ndd -set /dev/tcp tcp_conn_req_max_q0 <value>. The number of threads a process can have depends on the total available memory and the address space available to the process, since each thread has its own stack. There is no tunable to control the number of threads per process. The default stack size for a thread is 1MB, and it can be changed at thread creation. Bottom line: if designed properly with thread pools and a work queue, a single process can handle a very, very large number of connections while keeping throughput reasonably constant and response times sub-second.
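
The application-side half of this advice can be sketched with the POSIX calls the answer names, here reached through Python's standard resource and socket modules. This is a minimal sketch, not the panelist's code: the helper names raise_fd_limit and open_listener are hypothetical, and the system-wide ceilings (rlim_fd_max, tcp_conn_req_max_q) still have to be set by an administrator as described above.

```python
import resource
import socket


def raise_fd_limit(wanted):
    """Raise this process's soft descriptor limit toward `wanted`.

    The soft limit can only be raised as far as the hard limit, which the
    answer above says is bounded by rlim_fd_max on Solaris.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # some platforms apply a further cap; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def open_listener(port, backlog):
    """Open a TCP listening socket with the requested accept backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("", port))
    sock.listen(backlog)  # the effective value is bounded by the TCP tunables
    return sock
```

A server would typically call raise_fd_limit once at startup, before it begins accepting connections, and pass a backlog sized for its expected connection arrival rate.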

(Q): How many locks does ZFS add to the OS, and how many locks does UFS use?
Dan Price (A): 7, and 15, respectively. Just kidding. But seriously: The number of locks a given subsystem uses isn't really an indicator of much of anything. ZFS is designed to be highly threaded, and to scale to thousands of spindles and vast amounts of I/O flowing through the system. UFS does OK on this count, but there are some well known points of contention — particularly the single writer lock. You can only have one thread writing to a given file at a time. ZFS does not have this problem, due to better threading design. As a side note, in a threaded system, the number of locks can actually scale with the amount of data being used. It's not uncommon to allocate data structures which themselves contain locks.

(Q): Where can I find information on lock free algorithms and how can they improve scalability?
Richard McDougall (A): Check out Phil Harman's blog for an example: http://blogs.sun.com/pgdh?entry=caring_for_the_environment_making.

(Q): How do zones scale compared with Xen and VMware?
Dan Price (A): Zones scale extremely well — we think that you'll be able to cram more zones onto the system than you will other virtualization technologies. Additionally, zones make good use of sharing inside the OS, further reducing the resource cost of deploying them. And there is no "context switch" penalty. For more on how zones really work, I'll plug my own paper: http://blogs.sun.com/roller/page/dp/20041120#lisa_04_solaris_zones_operating.

(Q): How can Sun Microsystems' products (including software and hardware) help me design better software, with more quality?
Uday Shetty (A): Sun has a number of software products that can help, depending on whether your product is native or Java. Sun Studio Tools or Java Enterprise Studio help design better software. You can use DTrace or collect/Analyzer to identify performance bottlenecks. These tools can be run on non-debug code without affecting the overall performance. You can check opensolaris.org/os/community/DTrace, where you can find information as well as pre-built DTrace scripts. Check out developer.sun.com for more information on Analyzer.

(Q): It seems like chips are getting faster in terms of MHz, but my applications aren't necessarily getting faster. What is Sun doing to improve this?
Hugo Rivero (A): Clock rate is not the only indicator of a system's performance. While processor transistor density (and clock rate) has followed Moore's law, memory is not getting faster at the same rate. Processors get faster but stall more often, waiting for memory references. Sun is taking a different approach with throughput computing: instead of ever-increasing clock speed, focus on doing more things at the same time. We will soon release a processor with chip multi-threading: multiple cores per chip, multiple threads per core. This will allow multi-threaded or multi-process apps to take better advantage of the chip.

(Q): What was the reason to merge the libthread library into libc in Solaris 10? Does this simplify and reduce complexity when building applications?
Richard McDougall (A): Rolling libthread into libc makes it much easier to add threading to your application. Previously, there were two models, not threaded or threaded; choosing to link with -lthread would introduce some small overheads in some areas, and warranted testing the whole app as a multi-threaded app. Here's an example we had recently: a software vendor wanted to use threads in only one of their background processes, but their application is shipped with the majority of the code in a shared library. By using the new S10 libc model, they are now able to call thread_create in this daemon alone, without needing to retest the rest of the app as multi-threaded.

(Q): We see tlb/tsb misses taking a large (>10%) amount of time under load. Any comments you can make about the project to enable large pages on memory mapped files, text pages, etc?
Adam Leventhal (A): Large pages are now supported in the latest Solaris 10 patches and in OpenSolaris for text and data pages. Work is ongoing for mapped files.

(Q): How many CPUs does Solaris scale to? Is the much touted scalability what makes Solaris slower than Linux for small servers?
Dan Price (A): CPU scalability depends on the specific platform. In this case, I'm taking CPU to mean "CPU core". So on a fully loaded E25K, you can have 72 US-IV+ chips, each of which presents two cores. That means that you're at 144 cores. We've done a ton of work in Solaris 10 and beyond to close the performance gap against other OSes on small systems (we even have an active "small systems" performance team). We have frequently benchmarked ahead of other OSes on 64-bit kernels on AMD64. Come hang out on perf-discuss@opensolaris.org for more.

(Q): Re: the single write lock Dan mentioned in response to an earlier question... Is it per file or per filesystem (as I always heard way back when)?
Dan Price (A): It's per file. We have a reader/writer lock per file; so you can have many readers at a time, but all readers get locked out when the single writer is writing. In UFS, this locking (which is actually a POSIX mandated behavior) is implemented per-file. This means that with applications which have hundreds or thousands of threads and few files, you can have scalability issues. In ZFS, the locking is per-block, which should help things significantly.
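
To make the granularity point concrete, here is a toy model of reader/writer locks keyed per file versus per block. It only illustrates why finer-grained locks admit more concurrent writers to one file; it is not how UFS or ZFS are implemented, and BLOCK_SIZE, RWLock, and the key functions are hypothetical names for this sketch.

```python
import threading
from collections import defaultdict

BLOCK_SIZE = 8192  # hypothetical block size, for illustration only


class RWLock:
    """Non-blocking reader/writer lock: many readers or one writer."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0
        self._writer = False

    def try_read(self):
        with self._mutex:
            if self._writer:
                return False  # readers are locked out while a writer holds it
            self._readers += 1
            return True

    def try_write(self):
        with self._mutex:
            if self._writer or self._readers:
                return False  # only one writer, and only with no readers
            self._writer = True
            return True

    def release_read(self):
        with self._mutex:
            self._readers -= 1

    def release_write(self):
        with self._mutex:
            self._writer = False


def per_file(path, offset):
    return path  # one lock for the whole file


def per_block(path, offset):
    return (path, offset // BLOCK_SIZE)  # one lock per block of the file


def writers_admitted(key_fn, writes):
    """Count how many (path, offset) writes can hold a write lock at once."""
    locks = defaultdict(RWLock)
    return sum(1 for path, offset in writes if locks[key_fn(path, offset)].try_write())


# Four threads' worth of writes to different blocks of the same file.
writes = [("data.db", i * BLOCK_SIZE) for i in range(4)]
print(writers_admitted(per_file, writes))   # 1
print(writers_admitted(per_block, writes))  # 4
```

With one lock per file, only the first writer gets in; with one lock per block, all four proceed because they touch different blocks.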

(Q): Re: thread local storage (TLS). We currently use quite a bit of pthread_get/set_thread_specific operations. Can you comment as to whether TLS offers a performance improvement? Or is it simply syntax?
Adam Leventhal (A): It's not just syntax and TLS should provide a significant improvement over TSD in terms of performance. The underlying mechanisms are quite different and the compiler can use some specific optimizations for TLS.

(Q): We're going to have all these very parallel systems, and then we get faced with classic choke points, like concurrency checking for licenses and logins, or shared data pools. Is that all pushed into messaging service, ESBs and the like, or are there other alternatives to allowing the scalability while still having active coordination points?
Uday Shetty (A): I think you are better off using an LDAP server for authentication/authorization, and a licensing server. The directory server in the Java Enterprise Suite scales extremely well for large enterprises. You can use a messaging service for shared data pools.

(Q): Do you think the SPECint2000_rate benchmark is useful to compare SPARC systems against AMD, IBM, HP, Intel etc when used for RDBMS?
Shane Sigler (A): SPECint2000_rate is not a good option for comparing the systems for DB performance. The TPC benchmarks are more applicable as well as benchmarks like the SAP benchmarks which more closely mirror what customers are going to see. For example TPC-C does 6 times as much I/O per transaction as most typical customer transactions.

(Q): Re: umem, we have a model in which thread x allocates and thread y frees. Any guidance as to whether or when a global object cache would give better or worse performance as compared to a single per allocating thread cache in this kind of case?
Adam Leventhal (A): Hm... interesting question. It can really depend on the specifics of your system and your application. If structures are being allocated and freed frequently, you could potentially incur lock contention or cache contention with a global cache. My guess is that the per-thread cache is going to be a better bet, but in these cases gathering data on the possible solutions is essential. Thanks for the question.

(Q): What is the best high performance shared/cluster file system that can run on Solaris servers and other OSs?
Richard McDougall (A): I hope I can answer this one with a typical engineering response: it depends :-) Depending on your bandwidth and metadata throughput requirements, there are multiple solutions in this space. I'll try to answer based on a few key options; feel free to follow up with more questions. On Solaris, we have QFS, Lustre and NFS as good examples. QFS allows multi-writer shared access for nodes in a cluster. For bandwidth, QFS is our highest bandwidth product. I've personally seen QFS scaling at over 18 gigabytes per second for a large multi-stream workload on Starcat-based systems. So the bigger question is what type of shared cluster? NFS is designed as a shared file system, and is certainly the most common clustered file system in use today. Wait, NFS as a high bandwidth solution? Ethernet bandwidth has been quietly doubling, at a faster pace than fibre-channel transports. With Solaris 10's fire-engine networking stack, we are now seeing several hundred megabytes per second on a single NFS connection. This means it's now possible to build out a high performance storage grid using NFS, if your requirements are on the order of a gigabyte per second per node. On a final note, if your application is metadata intensive, i.e., its workload profile is dominated by open, create, and attribute operations, NFS is the clear winner. In summary, QFS is by far our highest bandwidth option today. I wrote a blog about this earlier this year, which might help expand on the emergence of IP storage in this area.

(Q): Hugo: thanks for the reply. I'm very interested in using your new Niagara system for my computation-intensive app, but I heard that the chip doesn't do so well with floating point. What's the deal?
Hugo Rivero (A): Our first implementation of the Niagara processor will focus on integer-intensive apps. Future versions will ease this restriction. If your application is running on UltraSparc processors today, you can gauge how floating-point heavy it is from the cpustat counters: (FA_pipe_completion + FM_pipe_completion) / Instr_cnt.
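
The counter ratio quoted above is simple arithmetic once the three counts have been collected. The sketch below assumes you have already read them with cpustat; the function name and the sample values are invented for illustration, and reading the result as the floating-point share of the instruction stream is an interpretation of the formula rather than something stated in the transcript.

```python
def fp_fraction(fa_pipe_completion, fm_pipe_completion, instr_cnt):
    """Return (FA_pipe_completion + FM_pipe_completion) / Instr_cnt."""
    if instr_cnt <= 0:
        raise ValueError("instruction count must be positive")
    return (fa_pipe_completion + fm_pipe_completion) / instr_cnt


# Invented counter values, not measurements from any real system.
sample = fp_fraction(150_000, 50_000, 10_000_000)
print(f"floating-point share of instructions: {sample:.1%}")
```

A small ratio suggests the workload is mostly integer work, which is the profile the answer says the first Niagara implementation targets.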

(Q): Can you relate any information about estimated performance of ZFS as a filesystem over some of your competitors, specifically VxFS?
Dan Price (A): We can't at this time. But I think we'd like to put ZFS in everyone's hands as soon as possible; see also http://blogs.sun.com/bonwick.

(Q): We're having some issues to make our strategy more competitive, and reading the PDF and listening to the presentation Scale my Apps, I was wondering how to increase the software development process.
Uday Shetty (A): I think it's important to make sure the software design process includes requirements such as scalability in the product design doc. Also, have a workload in mind that exercises all the code paths, to make sure the application performs and scales well in the customer environment. I don't have a specific tool in mind that could help with the development process, but you can try using commercial UML tools.

(Q): Will AMD Opterons ever be as scalable as Sun's UltraSparc systems?
Shane Sigler (A): It depends on how you want to define scalability. As CPUs get more cores and threads per chip, the scalability picture changes. We do work closely with AMD on the best way to build larger systems and you will see more of this in the future.

(Q): By experiment, I observe that binding to psrsets improves throughput, but it seems difficult to find a way to integrate the use (or avoidance) of this across multiple system types which have arbitrary numbers of CPUs and interrupt distributions. Any guidance?
Dan Price (A): Yes! In Solaris 9, we introduced the pools facility and in Solaris 10 refined it further. Pools allows one to create persistent pools of CPUs (i.e. persistent across reboot). You can also configure the pools of CPUs with rules like: 2-5 CPUs. As for interrupt distribution, in our development builds of Solaris (and probably in an S10 update), we have a new daemon called 'intrd' which automatically redistributes interrupts on the system. If you want to see how it works, you can give Solaris Express a whirl at http://www.sun.com/solaris-express

(Q): How do you decide when libumem will help with performance?
Adam Leventhal (A): If your application allocates a lot of memory from many different threads, libumem will probably give you a performance improvement. You can also use the new plockstat(1M) command to look for lock contention in your application. If you see that malloc_lock is heavily contended, libumem is probably going to help a lot. Trying libumem is a snap: LD_PRELOAD=libumem.so.1

(Q): How much impact, on average, does DTrace impose on a running system while performing an analysis?
Adam Leventhal (A): More important is the overhead when DTrace isn't in use at all, and that's zero: there's no tax for having the ability to examine your system. When you're using DTrace, you pay as you go: the more probes you enable and the more they're hit and the more work you do, the more overhead the instrumentation will incur. We haven't taken the time to measure enabled overhead since that's of less concern to us than the disabled probe effect.

(Q): We have an application that takes an indefinite time between the request and the response. So, to make it scale we need to handle the request and response on separate threads, or even better to handle a queue of service objects (request/response pairs) with our own thread pool. This does not appear to be possible with the current servlet API. Is there a way to implement this in an application-server-portable fashion?
Pallab Bhattacharya (A): It is true that there is no such API as of now. The queue-pair is a feature of the servlet container. I think the approach described in the question is a good approach. If the time spent in each request is mostly uniform, then having a workQ and a single thread pool is good enough.
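
The work queue plus single thread pool arrangement discussed above might look roughly like the following. It is a language-neutral sketch written in Python rather than against the servlet API, and every name in it (ServiceObject, worker, start_pool, stop_pool) is hypothetical.

```python
import queue
import threading


class ServiceObject:
    """A request/response pair that travels through the work queue."""

    def __init__(self, request):
        self.request = request
        self.response = None
        self.error = None
        self.done = threading.Event()


def worker(work_q, handler):
    """Pull service objects off the queue until a None sentinel arrives."""
    while True:
        item = work_q.get()
        if item is None:  # sentinel: shut this worker down
            break
        try:
            item.response = handler(item.request)
        except Exception as exc:  # keep the worker alive; report via the item
            item.error = exc
        finally:
            item.done.set()


def start_pool(work_q, handler, n_threads):
    threads = [
        threading.Thread(target=worker, args=(work_q, handler), daemon=True)
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    return threads


def stop_pool(work_q, threads):
    for _ in threads:
        work_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()


work_q = queue.Queue()
pool = start_pool(work_q, lambda req: req.upper(), n_threads=4)
items = [ServiceObject(r) for r in ("a", "b", "c")]
for item in items:
    work_q.put(item)
for item in items:
    item.done.wait()
print([item.response for item in items])  # ['A', 'B', 'C']
stop_pool(work_q, pool)
```

The thread that accepts a request only enqueues it, so a slow handler ties up a pool worker rather than the accepting thread.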

(Q): Hello guys. Many times many customers are asking how Solaris, especially Solaris 10, compares with Linux based systems in terms of scalability. Would you share two or three important points on this topic? Thanks.
Richard McDougall (A): Scalability was one of the big bets at the design center of Solaris 2.0 (the rewrite of Sun's original BSD-based SunOS). At that time we focused the design on scalability, betting that SMP would be a longer-term way of scaling up performance (as opposed to faster CPUs). Solaris 2.x was a ground-up redesign based on threads. In contrast to adapting the process-based kernel model, this provided kernel-level thread scalability in key areas including scheduling, networking and I/O. This was an important design point for large SMP systems. Today, with the latest Solaris kernel we are seeing scaling to 208 cores on the large Starcat systems. On low-end systems, scalability wasn't such an important requirement. Interestingly, due to chip-level multi-threading, scalability has recently become a key requirement. Even low-end one-chip systems like Niagara have 32 virtual CPUs, making scaling to the order of 32 critical. In Solaris 10, there are several areas I've seen that show significant scalability differences. For example, here are a few:

  • Systems with many CPUs or large number of virtual CPUs
  • Highly threaded applications: the new 1:1 thread model is now integrated into libc and enabled by default. We've seen a single multi-threaded app scale to over 100 CPUs.
  • Memory allocators: the libumem allocator provides a userland equivalent of the slab allocator.
  • File systems: UFS seems to be on par with other file systems on the low end, but differentiates itself as we scale up the number of disks. There's a white paper showing some of the scaling curves. I'll be publishing some data showing file system performance scaling in the near future. The scaling differentiator is typically visible at 20 disks and up.
  • ZFS also represents a significant step forward in scalability; its highly scalable design overcomes many of the single writer lock and per-directory scaling issues seen in other file systems.
  • The final note on scalability is about observability. The single most important enabler to improve your application scalability is being able to identify key scaling limitations and remove them. This is where Solaris 10's DTrace is a key asset; as it makes it possible to identify and overcome key scaling issues in as little as minutes.

(Q): We've played with DTrace and think it's fab. Are there any sample scripts available to monitor disk I/O bottlenecks?
Dan Price (A): Yes! There is a script included in /usr/demo/dtrace called 'iosnoop.d'. That is a good place to start. There's also a good collection of scripts here: http://users.tpg.com.au/adsln4yb/DTrace.html. Rich is yelling at me across the table that we'll soon have 'fsstat' based on DTrace. Keep your eyes on dtrace-discuss@opensolaris.org for the announcement of that.

(Q): Are there any general DTrace scripts available that will help me determine if my app will scale?
Adam Leventhal (A): Yes. When we're talking about making an app scale we're often looking for lock contention in which case plockstat(1M) is a great place to start. That's not a DTrace script, but a specialized DTrace consumer (however you can see the underlying D script by doing plockstat -V). Other scripts you might want to look at are the examples in the sched provider chapter in the Solaris Dynamic Tracing Guide.

(Q): What is the one most critical performance issue faced by our customers, especially with Solaris 10? Can you please highlight any bottlenecks engineers need to be aware of?
Adam Leventhal (A): That's a very broad question and there's really no _one_ scalability or performance issue that impacts all customers uniformly. In light of that, we've taken the approach of providing better systemic observability through DTrace so that those unique bottlenecks can be identified and removed.

(Q): Everybody knows that XML parsing imposes a big overhead, but we adopt it in order to maintain compatibility and scalability. How can we assess the effect of removing XML message exchange in order to get better performance?
Dan Price (A): It sounds like this is a measurement problem. Where is XML consuming too many processing cycles? Are you using the right XML processing technology (DOM or SAX) for your application? You may want to experiment with using DTrace ( http://www.opensolaris.org/os/community/DTrace ) to get some sense of the amount of time and/or CPU cycles that are being consumed; you can also add static probes to your application using DTrace, which could give you more abstract semantics like "message-parse-start" and "message-parse-end".
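
Before reaching for probes, a first-pass measurement can be taken in the application itself by bracketing the parse with a timer, in the spirit of the start and end markers mentioned above. The sketch below is an assumption-laden illustration using Python's standard DOM and SAX parsers on a synthetic payload; it is not a DTrace script, and the message format and helper names are invented.

```python
import time
import xml.dom.minidom
import xml.sax


class CountMessages(xml.sax.ContentHandler):
    """SAX handler that counts <msg> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "msg":
            self.count += 1


def count_with_dom(payload):
    # DOM builds the whole tree in memory before we can query it.
    doc = xml.dom.minidom.parseString(payload)
    return len(doc.getElementsByTagName("msg"))


def count_with_sax(payload):
    # SAX delivers events during parsing and keeps no tree.
    handler = CountMessages()
    xml.sax.parseString(payload, handler)
    return handler.count


def timed(fn, *args):
    """Return (result, elapsed seconds): a stand-in for start/end probes."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Synthetic payload, invented for this example.
payload = ("<batch>" + "<msg id='1'>hello</msg>" * 1000 + "</batch>").encode("utf-8")
dom_count, dom_seconds = timed(count_with_dom, payload)
sax_count, sax_seconds = timed(count_with_sax, payload)
print(f"DOM: {dom_count} messages in {dom_seconds:.4f}s")
print(f"SAX: {sax_count} messages in {sax_seconds:.4f}s")
```

Timings will vary by machine and payload, so the useful output is the comparison on your own messages, not any particular number.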

(Q): How can plockstat be used to detect scalability issues within one application? Any blueprint related with plockstat and its friends?
Adam Leventhal (A): Plockstat(1M), new in Solaris 10, traces synchronization activity in a process. It will show you lock hold times as well as lock contention. Long hold times and lock contention will inhibit scalability, so hot spots identified by plockstat are good places to improve.

 
 
 
 
Copyright 1994-2006 Sun Microsystems, Inc.