THROUGHPUT COMPUTING FOR WEB SERVICES
The UltraSPARC T1 processor delivers massive throughput for transaction-heavy Web services workloads

To handle ever-increasing demands of customer and partner interactions over the Web, many enterprises have pressed into service an array of diverse products and technologies. But while the products vary, most of today's Web services deployments rely on a common multi-tiered and clustered architecture. The typical three-tiered architecture involves tens, hundreds, or even thousands of Web servers at the edge, a smaller number of application servers in the middle, and a small number of powerful database and transaction processing servers in the back end. While this approach allows companies to scale up to serve an increasing number of concurrent users, it also comes at a cost.

Today, the edge and middle tiers of Web services promote horizontal scaling: adding more and more servers to meet growing demands. Horizontal scaling allows sites like Google, Amazon, MLB.com, and eBay to add capacity, but this type of scaling brings several challenges: rising power bills, rising cabling and software patching expenses, and increased management overhead. Small, medium, and large enterprises are running into a similar challenge: the proliferation of discrete multi-tier Web services deployments that require management, security, and scalability. On top of this, data center administrators must balance the risk of over-provisioning, which leaves assets under-utilized, against the risk of under-provisioning, which compromises the customer experience.

"Web services architectures need to be very dynamic and flexible," explains Sreeram Duvur, engineering lead for the Web services grid initiative at Sun Microsystems. "Enterprises must manage spikes in traffic, while maintaining a certain level of response time, optimizing the utilization of IT assets, and lowering set-up and operation costs. Today, when administrators see service levels deteriorate, they tend to buy and deploy more servers.

"But, in the long run, this makes a Web services infrastructure costly to operate and control," Duvur says. "Because Web and application servers access secure data stored in the back end, network topology, firewalls, and encryption also add to the operating and administration costs. In addition, there might be several dozen Web services deployments in a large enterprise, so the problems tend to be replicated many times over."

"Memory Wall" and Chip Multi-threading

One of the key metrics for performance of Web applications is the number of transactions that can be processed in a given span of time. To ratchet up server performance, chip designers have traditionally focused on faster processor speeds. But higher clock speeds yield only a small, incremental increase in transaction throughput, because memory speed is not growing at the same rate as processor pipeline speed. Processor architects call this phenomenon "hitting the memory wall."
The relatively slim performance gains of higher clock speed servers force enterprises to procure and deploy ever-increasing numbers of seemingly inexpensive servers at the edge of Web services architectures. To further complicate matters, because the processors consume a lot of power and generate a lot of heat, they cannot be put in close proximity to each other. Thus the gains heralded by the clock speed are not realized; they are offset by increased real estate and cooling costs.

"If you look at current microprocessor designs, CPU designers are unable to crank up the speed of the processor much beyond 2 to 3 GHz. Plus, today there are at most two cores per chip, and there are typically just one or two hardware threads per core," explains Duvur. "At the same time, Java and Web technologies have made threads pervasive. Application and Web servers have lots of threads, but they are constrained by the lack of thread parallelism in the hardware."

Ultimately, the lack of hardware threading inhibits scalability. "If an application server or a Web server has to scale to meet the needs of thousands of concurrent users, the processor architecture needs to enable several software threads to make progress simultaneously," Duvur continues. "Ideally, if one thread goes to memory, the other threads should continue processing. In fact, massive thread-level concurrency and memory access concurrency is the only way to break through the 'memory wall' performance barriers. Sun calls this type of processing chip multi-threading, and it's perfect for delivering massive throughput for Web services applications."

While commodity servers have been hailed as a cheap way to scale Web services architectures, the costs add up over time. The servers run hot and are expensive to cool. They take up a lot of data center real estate. A large, heterogeneous server environment is difficult to manage. Many parts and vendors need to be coordinated and integrated.
In the end, burgeoning customer demand has led to growing energy costs, increasing real estate demands, escalating complexity, and lagging levels of utilization.

Powerful Processors Sit Idle

The chief culprit of hidden costs in Web services architectures is that today's processors were not designed for the kind of throughput that Web services demand. Traditional processor design has emphasized single-threaded hardware performance using scientific benchmarks, and chip designers have focused almost exclusively on processor speed improvements.

Aside from the fact that the focus on clock speeds results in power-hungry chips, memory latency (the time it takes a chip to pull information from memory) makes the clock speed gains largely irrelevant. In transaction-heavy environments, processors spend most of the time idle, waiting to receive information from memory. And in a multi-processor system, the board interconnects have become very complex and fault-prone. As a result, systems development costs are high for vendors, and system acquisition and repair costs are steep for customers.

The fixation on gigahertz has resulted in a situation where clock speeds double every two years while memory access speeds double every six years. As a result, the ultra-fast processors in today's commodity servers spend nearly 85 percent of their power-consuming hours idle.

Ultimately, for a transaction-heavy Web services architecture, GHz is a largely irrelevant measure of performance. A more meaningful metric is the transaction throughput of commercial applications and, increasingly, performance per watt, given the growing power and heat dissipation demands of the data center.

UltraSPARC T1 Processor Delivers Throughput Computing

The objective of the Sun Throughput Computing initiative was to design a chip architecture that would maximize the throughput of commercial workloads, instead of trying to ratchet up the speed of single thread execution.
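
To see why a faster clock alone does little for such workloads, consider a rough model in Python. It is only a sketch: the function name, the workload figures (cycles per transaction, number of memory accesses, latency), and the two clock speeds are assumptions chosen for illustration, not measurements. Only the two doubling periods come from the article.

```python
def transactions_per_second(clock_ghz, compute_cycles, memory_accesses, memory_latency_ns):
    """Throughput of one hardware thread that sits idle during every memory access."""
    compute_ns = compute_cycles / clock_ghz          # 1 GHz = 1 cycle per nanosecond
    stall_ns = memory_accesses * memory_latency_ns   # does not shrink with clock speed
    return 1e9 / (compute_ns + stall_ns)


# Hypothetical transaction: 20,000 compute cycles plus 500 memory accesses of 100 ns each.
WORKLOAD = dict(compute_cycles=20_000, memory_accesses=500, memory_latency_ns=100.0)

base = transactions_per_second(clock_ghz=1.5, **WORKLOAD)
fast = transactions_per_second(clock_ghz=3.0, **WORKLOAD)
speedup = fast / base

# Growth rates quoted in the article: clock speed doubles every two years,
# memory access speed doubles every six years.
years = 6
clock_growth = 2 ** (years / 2)
memory_growth = 2 ** (years / 6)
gap_growth = clock_growth / memory_growth

print(f"throughput gain from doubling the clock: {speedup:.2f}x")
print(f"processor-to-memory speed gap after {years} years: {gap_growth:.0f}x wider")
```

Under these assumed numbers, doubling the clock raises throughput by only about 12 percent, because the time spent waiting on memory is unchanged. At the quoted growth rates, the gap between processor and memory speed widens fourfold every six years.
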
The result of this endeavor is the UltraSPARC T1 processor. The UltraSPARC T1 processor features eight cores, and each core supports four distinct processing threads. As a result, the UltraSPARC T1 processor is a 32-way chip, meaning it can coordinate 32 concurrent processing threads. In other words, the UltraSPARC T1 processor can simultaneously execute 32 threads from the same process, as many threads from a combination of processes, or even the threads from 32 distinct processes.

"The UltraSPARC T1 chip is built in such a way that the performance can be exploited by any application that has a throughput performance goal," explains Duvur. "The sweet spot is for applications that have a modest to large number of threads or simply rely on multiple processes for scaling.

"All popular Web and application servers and databases fall in this sweet spot, making the new UltraSPARC T1-based systems perfect for many Web infrastructure applications," Duvur notes. "Large transactional databases and business integration platforms can also shine on these systems. The designers were able to put eight cores on a chip and yet run it cool enough that a lot of these servers can be put very close together."

The fast crossbar switch between the cores and the shared L2 cache speeds up thread synchronization and locks by a large factor. As usual, the Solaris 10 OS scheduler manages the threads, while each of the eight cores pushes four threads through its pipeline in a round-robin fashion. If a thread stalls due to memory latency, the other threads can use that slot, or a new thread can be scheduled by the OS, so the processing pipeline remains active. In other words, the UltraSPARC T1 processor mitigates memory latency and allows for near-constant throughput.

With the ability to scale to handle a lot of transactions, the UltraSPARC T1 processor also enables Web and application tier consolidation.
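
The round-robin thread interleaving described above can be sketched as a toy simulation in Python. This is a simplified model under stated assumptions, not a description of the actual pipeline: `core_utilization`, the single-issue core, and the figures of 10 compute cycles followed by a 30-cycle memory stall are all hypothetical. Only the core and thread counts come from the article.

```python
def core_utilization(num_threads, compute_cycles, stall_cycles, total_cycles=100_000):
    """Fraction of cycles in which a single-issue core issues an instruction."""
    ready_at = [0] * num_threads                # cycle at which each thread may issue again
    remaining = [compute_cycles] * num_threads  # instructions left before the next memory stall
    issued = 0
    next_thread = 0
    for cycle in range(total_cycles):
        # Round-robin: starting after the last issuer, pick the first thread that is ready.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # The thread goes to memory; other threads may use its slots.
                    ready_at[t] = cycle + 1 + stall_cycles
                    remaining[t] = compute_cycles
                next_thread = (t + 1) % num_threads
                break
    return issued / total_cycles


CORES = 8              # from the article
THREADS_PER_CORE = 4   # from the article
hardware_threads = CORES * THREADS_PER_CORE

one_thread = core_utilization(num_threads=1, compute_cycles=10, stall_cycles=30)
four_threads = core_utilization(num_threads=THREADS_PER_CORE, compute_cycles=10, stall_cycles=30)

print(f"hardware threads per chip: {hardware_threads}")
print(f"core utilization with 1 thread:  {one_thread:.2f}")
print(f"core utilization with 4 threads: {four_threads:.2f}")
```

With these assumed figures, a core running a single thread is busy a quarter of the time, and four interleaved threads more than double that. The identical threads in this toy model all stall at nearly the same moment, so some idle cycles remain; real workloads are less regular.
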
If a Web server architecture is using multiple small Web servers in a cluster, the same service quality and reliability can be achieved with a single UltraSPARC T1-based server. In some cases, it may be feasible to merge different tiers in the same server. For example, database and application server tiers can be run from the same box. This reduces hardware procurement and operation costs, lowers cooling expenses, shrinks data center real estate requirements, and eases server management.

Ultimately, the UltraSPARC T1 processor offers the ability to replace numerous power-hungry servers with fewer, more efficient, and easier-to-manage servers. These benefits can be achieved in enterprises that run several smaller Web services, as well as organizations that run large applications.

"In today's enterprises, there are a lot of small departmental applications running on islands in the Web infrastructure. These grids proliferate and can be hard to catalog, secure, and manage," says Duvur.

"As it stands today, an IT administrator would have to go set up multiple mini-grids to host various applications," he adds. "But with built-in hardware support for virtualization, the UltraSPARC T1 processor gives enterprises the opportunity to consolidate multiple Web services grids in one place, without compromising the individualized needs of the various departments. Even if organizations aren't hosting large-scale Web applications, they can consolidate and gain efficiencies from UltraSPARC T1-based servers."

On-chip Data Security

In addition to server consolidation, the UltraSPARC T1 processor is designed for other requirements of Web services architectures. For example, it is no secret that security and encryption are vital to Internet commerce. Typically, Web transactions are encrypted using SSL. This approach places a significant overhead on CPUs, because a request needs to be encrypted, decrypted, load balanced, and re-encrypted before it can be sent back to the customer.
This requires a lot of processing, and it is not uncommon to see more than 50 percent of the CPU capacity used for security processing. The common strategy for dealing with encryption and decryption is to buy and integrate third-party SSL acceleration hardware into Web and application servers.

UltraSPARC T1-based servers take a different approach. Each of the eight cores in an UltraSPARC T1-based server contains an integrated Modulo Arithmetic Unit (MAU) that works in conjunction with the Solaris Encryption Framework (SEF) to provide dedicated, on-chip SSL acceleration. Because of the multi-threaded nature of the UltraSPARC T1 processor, when the single thread dedicated to an SSL process stalls, the other three threads in the core can continue to make progress. What's more, with one MAU in each core, there are eight MAUs per UltraSPARC T1 processor. The result is extremely high throughput for RSA operations, with SSL processing handled entirely on the chip.

In internal testing, the UltraSPARC T1 processor delivered significantly more RSA operations per second than current-generation x86 chips and a competitive dual-processor RISC system. In addition, SSL overhead has been observed to drop from 50 percent to 6 percent when compared to current-generation SPARC processors. And for deployments where SSL must be terminated earlier, the Sun Secure Application Switch may be used to combine SSL acceleration with load balancing functions.

The Total Package

By taking into account the commercial workload of Web services, the UltraSPARC T1 processor is the first processor designed specifically for transaction throughput. As opposed to purchasing high-horsepower processors that end up sitting idle, the UltraSPARC T1 processor allows enterprises to maximize investment in and utilization of server resources.
Along the way, it promises to help enterprises consolidate Web services architectures, lower data center real estate requirements, decrease energy costs, and simplify server management while witnessing a whole new level of throughput and performance. Simply put, the UltraSPARC T1 processor makes transactions happen.