HPC Consortium Boosts Memory and Throughput, Minimizes Datacenter Costs with Sun SPARC Enterprise M9000 and T5140 Servers
Founded in 1998, High Performance Computing Virtual Laboratory (HPCVL) is Canada’s premier high-performance computing (HPC) center. With its main datacenter at Queen’s University in Kingston, Ontario, HPCVL also maintains computing resources at six other Canadian universities and colleges. More than 800 researchers and students from across Canada rely on HPCVL to conduct innovative research in a variety of disciplines including science, engineering, medicine, and economics.
Challenges
Give researchers access to emerging server technologies
Reduce datacenter footprint
Minimize IT costs including power and cooling
Solution
With the help of Sun Professional Services and third-party vendor Stantive Technologies Group, HPCVL built a symmetric multiprocessing (SMP) cluster and a chip multithreading (CMT) cluster to provide researchers with a choice of server architectures to better support varied application requirements. The SMP cluster delivers up to 16 TB of RAM, and the CMT cluster provides nearly 10,000 processing threads.
Business Results
Minimized server and datacenter costs
Reduced job turnaround time
Increased available processing threads by 900%
Boosted available memory by 129%
Increased capacity for more simultaneous users
Expanded researchers’ capabilities and provided for greater data precision
Every day, Canadian researchers rely on the High Performance Computing Virtual Laboratory (HPCVL) to help transform our lives by studying topics such as the flow of global capital, weather patterns, new drug therapies, and aircraft design. In 2008, HPCVL faced challenges surrounding system performance, user capacity, energy consumption, and datacenter costs. The laboratory’s main server cluster at Kingston was not delivering enough memory or throughput for some applications to run. Built on seven Sun Fire E25K symmetric multiprocessing (SMP) servers, the cluster delivered up to 4 TB of RAM and 1,000 processing threads. Researchers needed substantially more memory and throughput to stay competitive. A growing user base also meant that HPCVL needed to expand its capacity so that researchers would not have to wait weeks or months to run jobs. In addition, researchers lacked access to servers with chip multithreading (CMT) processors. Not only was CMT an emerging technology that the laboratory wanted to offer for educational purposes, but CMT servers also delivered throughput advantages for some types of applications. They occupied less space and used less energy than SMP systems as well, which could help offset budget restrictions and safeguard the environment.
Designing and Deploying Two New Clusters
As a Sun Center of Excellence that uses Sun technologies to further research, HPCVL worked with its Sun representatives to determine what mix of SMP and CMT servers could best meet its requirements without exceeding the budget. After extensive evaluations, HPCVL decided to replace its existing cluster with an SMP cluster built with eight Sun SPARC Enterprise M9000 servers. The M9000 was chosen for its industry-leading performance: a SPARC Enterprise M9000 server with 64 quad-core SPARC64 VII processors reached 2.023 TFLOPS on the Linpack HPC benchmark test, the highest score ever achieved by a standalone commercial server. In addition, the laboratory designed a CMT cluster built with 78 Sun SPARC Enterprise T5140 servers. Like the existing Sun systems, all of the new servers run the Solaris 10 Operating System. “The Solaris Operating System is thread safe, so it is an ideal platform for developing and running scalable applications that can take advantage of multiple threads,” explains Dr. Ken Edgecombe, executive director at HPCVL. “The Solaris 10 OS also provides many features like Dynamic Tracing, which some of our researchers like to use to troubleshoot issues.”
“The capabilities of our new Sun SMP and CMT clusters give researchers a real competitive advantage over what I call the Intel-trained way of doing things, which is, ‘Wait until we give you a faster CPU.’ Sun allows us to operate in the throughput world, where you can get a lot of work done very quickly, simply because you’re multithreading over a large number of CPU cores.”
— Dr. Ken Edgecombe, Executive Director, High Performance Computing Virtual Laboratory (HPCVL)
Each core on a SPARC Enterprise M9000 is 3.9 times faster than a single core on a Sun Fire E25K system. In addition, the new SMP cluster, protected by SunSpectrum Gold Support, delivers more than twice the memory and 300% more threads compared with the old cluster. “We’ve essentially gone from 4 TB and 1,000 threads on our E25K server cluster to 16 TB and 4,100 threads on the M9000 cluster,” notes Edgecombe. “This has lowered the power-per-thread requirements roughly by a factor of two.” The CMT cluster, protected by SunSpectrum Silver Support, offers about ten times the throughput of the old cluster, with nearly 10,000 threads. In addition, each T5140 server occupies only one rack unit — compared with the old servers, which each occupied an entire rack. “I can pick up one of the T5140s and hold it in my arms,” notes Edgecombe. “I can’t do that with a Sun Fire E25K.”
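The thread figures quoted above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes 128 hardware threads per T5140 (the figure Edgecombe cites elsewhere in this article) and takes the 1,000- and 4,100-thread totals from his quote. The function and constant names are made up for the example and do not come from HPCVL.

```python
# Illustrative arithmetic only; all names here are hypothetical.
T5140_SERVERS = 78            # servers in the CMT cluster (from the article)
THREADS_PER_T5140 = 128       # assumed per-server thread count (quoted by Edgecombe)
OLD_CLUSTER_THREADS = 1_000   # Sun Fire E25K cluster (from the article)
SMP_CLUSTER_THREADS = 4_100   # M9000 cluster (from the article)


def percent_increase(old: float, new: float) -> float:
    """Return the increase from old to new as a percentage of old."""
    return (new - old) / old * 100


# Total hardware threads in the CMT cluster: "nearly 10,000".
cmt_threads = T5140_SERVERS * THREADS_PER_T5140

print(cmt_threads)
print(round(percent_increase(OLD_CLUSTER_THREADS, cmt_threads)))
print(round(percent_increase(OLD_CLUSTER_THREADS, SMP_CLUSTER_THREADS)))
```

Under these assumptions the CMT cluster comes to 9,984 threads, an increase of roughly 900% over the old cluster, while the M9000 cluster's 4,100 threads are about 310% more than the old 1,000.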
In March 2008, HPCVL engaged Sun partner Stantive Technologies Group to help deploy the 78 SPARC Enterprise T5140 servers in a cluster with Sun Solaris Cluster and Sun Grid Engine software. The process was quick and easy. “We had existing power and air conditioning all set up, so it was just a matter of bringing in the servers and racks and plugging them in,” explains Edgecombe. “Most of this cluster uses 10 Gigabit Ethernet and the remainder uses 1 Gigabit Ethernet interconnects.”
Meanwhile, Sun Professional Services helped HPCVL design a cluster with the SPARC Enterprise M9000 servers that uses a 10 Gigabit Ethernet switch-network topology as well as Sun Cluster and Sun Grid Engine software. In June 2008, Sun delivered eight M9000 servers to HPCVL’s main datacenter. HPCVL installed the servers, which were available to researchers by July 2008.
How the Grid Works
Researchers can access HPCVL’s clusters from any Internet-enabled location using any desktop client through a portal that runs the Sun Java System Portal Server and Sun Secure Global Desktop software. Because security is critical, HPCVL works with third-party security provider Entrust to issue users roaming digital certificates that verify their identity and provide for a fully encrypted session.
Once connected to the portal, users can schedule a job for processing. Sun Grid Engine software manages all job scheduling and allocation of datacenter resources. Because some research is conducted over many days, weeks, or even months, HPCVL gives researchers several options for storing data. Sun StorageTek Storage Archive Manager (SAM) software, which runs on Sun Fire V890 servers, manages the laboratory’s short- and long-term hardware storage solutions. A storage area network built on 46 Sun StorageTek 3510 arrays with 160 TB of available capacity provides for short-term storage. More than 500 TB of long-term storage is supported by a Sun StorageTek L1400 tape library. To provide for disaster recovery, all data is regularly backed up to a StorageTek SL8500 modular library system, which HPCVL shares with Queen’s University.
Why SMP and CMT?
Researchers use a variety of applications to conduct research. Some are third-party applications provided by HPCVL, such as Amber from the University of California, which simulates biomolecular systems, or Adina's ADINA System, which analyzes building structures. Many other applications are created by researchers who use development tools provided by the laboratory, including Sun Cluster Tools and Sun Studio 12 software.
Depending on the type of application, researchers can choose to use either symmetric multiprocessing (SMP) systems or chip multithreading (CMT) systems. SMP architectures are preferable for applications that require large amounts of memory or that use floating-point algorithms. For example, computational-fluid-dynamics applications such as FLUENT and Abaqus perform better on SMP systems. Computational chemistry applications, such as Gaussian, might also run better on the larger-memory SMP systems. CMT architectures are ideal for applications that need high throughput and less memory, or applications with large instruction or data sets that can take advantage of multiple threads to run numerous processes simultaneously. Such applications might otherwise experience performance delays as a result of a single process waiting for data from memory. For example, bioinformatics applications, which conduct a lot of pattern searching and matching, perform very well on the CMT servers. To help determine which platform best supports their application, researchers can use the open-source CoolThreads Selection Tool software.
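The selection criteria in the paragraph above can be sketched as a simple decision rule. This is only a rough illustration of the reasoning, not how the CoolThreads Selection Tool or HPCVL actually makes the choice. The `AppProfile` fields, the `suggest_platform` function, and the fallback answer are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Hypothetical description of an application's resource needs."""
    needs_large_memory: bool    # e.g. large working sets held in RAM
    floating_point_heavy: bool  # dominated by floating-point algorithms
    highly_threaded: bool       # many independent threads or processes


def suggest_platform(profile: AppProfile) -> str:
    """Apply the article's guidance as a first-pass heuristic."""
    # Large memory or floating-point work points to the SMP cluster.
    if profile.needs_large_memory or profile.floating_point_heavy:
        return "SMP"
    # High-throughput, lower-memory, multithreaded work points to CMT.
    if profile.highly_threaded:
        return "CMT"
    # Assumption: nothing in the article covers this case, so measure.
    return "benchmark both"


# A pattern-matching workload with modest memory needs.
print(suggest_platform(AppProfile(False, False, True)))
```

A real decision would rest on profiling the application rather than on three yes/no flags, which is the gap a selection tool is meant to fill.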
When requesting cluster resources for a job, researchers can specify how many threads their application requires. Determining the ideal number of threads for an application is important because additional threads can boost performance only to a point. After that threshold is met, an application’s throughput does not increase, but the cost of running the job does. “If you’re running an application on one machine, you may not want to take advantage of all 128 threads on a T5140 because your performance may decrease as you increase the number of threads,” explains Edgecombe. “So our rule of thumb is to tell people that if you’ve been running an application on an SMP box, and it’s running well with a certain number of threads, try four times the number of threads on one of these CMT boxes.”
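Edgecombe's rule of thumb translates into a one-line calculation. The sketch below assumes the 128-thread ceiling of a single T5140 mentioned in the quote and caps the suggestion there. The cap, the function name, and the parameter names are illustrative assumptions, and the result is a starting point for tuning, not a guaranteed optimum.

```python
T5140_MAX_THREADS = 128  # hardware threads on one T5140, per the quote above


def suggested_cmt_threads(smp_threads: int,
                          multiplier: int = 4,
                          max_threads: int = T5140_MAX_THREADS) -> int:
    """Starting thread count on a CMT server for a job tuned on SMP.

    Applies the "four times the number of threads" rule of thumb and,
    as an assumption of this sketch, never exceeds one server's threads.
    """
    if smp_threads < 1:
        raise ValueError("smp_threads must be at least 1")
    return min(smp_threads * multiplier, max_threads)


print(suggested_cmt_threads(8))   # a job that ran well with 8 SMP threads
print(suggested_cmt_threads(64))  # capped at one server's thread count
```

With these assumptions, a job that ran well on 8 SMP threads would start at 32 threads on a T5140, and anything at or above 32 SMP threads would hit the 128-thread cap. Because throughput stops improving past some threshold, the suggested value should still be confirmed by timing a few runs.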
Benefits
As a result of deploying its new SMP and CMT clusters, HPCVL has implemented a more cost-effective server infrastructure that balances the varied and growing demands of researchers with a limited operating budget and datacenter capacity. “Unlike other systems, threads on a CMT system are virtually free,” explains Edgecombe. “If you can increase an application from eight threads to 64 threads, the cost advantage is just tremendous because you’re not spending any extra money on hardware, but you are realizing improved application throughput.”
In addition, the new systems deliver more memory, CPUs, and threads, which equates to greater data precision. This means that researchers can make finer-grained grids with more elements, and they can pinpoint actual values rather than using approximations. The new clusters also support more simultaneous users and speed job turnaround times so that the world does not have to wait as long for new scientific or mathematical breakthroughs. “We had some researchers who were waiting for two months for their code to execute at another HPC site,” explains Edgecombe. “We made some T5140s available to them, and they were able to complete their job in just two weeks.”
The SPARC Enterprise T5140 servers are smaller, lighter, and more economical than HPCVL’s previous servers. “Compared with one Sun Fire E25K, four T5140s cost much less, consume far less power, and occupy a fraction of the space,” notes Edgecombe. “People who use low-memory applications or applications that don’t scale well on large SMP boxes often find that these applications scale quite well on the T5140s, and they can get a tremendous amount of work done.”
To learn more about how it can take advantage of Sun technologies to support future goals, HPCVL regularly sends its employees to training through Sun Learning Services. For example, HPCVL is evaluating the use of Solaris Containers, a built-in software virtualization tool in the Solaris 10 OS, to help meet researchers’ varied requirements. The center is also analyzing how it can modify its architecture so that it complies with the 21 CFR Part 11 regulation, which governs digital data submissions to the FDA for drug approval. HPCVL is also working with Sun Professional Services on architecting a new flash-storage solution built on the Sun Storage 7000 Unified Storage System to reduce the time required to store and access data, and to reduce energy consumption.