Q: What is the FireEngine project all about?
A: FireEngine is the internal code name for the rearchitecture of the networking stack to improve single-CPU performance and scalability across a large number of CPUs. The new architecture will support 10-Gbit Ethernet, dynamic switching between interrupts and polling, and various kinds of offload technologies. For more information, see the article in The Register, "Sun gives glimpse of revised Solaris TCP/IP stack."

Q: What kind of workloads does this redesign improve?
A: All kinds. The biggest improvements are for the short-lived connections typical of web workloads.

Q: What does the application need to do to benefit?
A: Nothing. The change is transparent to the application, and almost all applications that use networking can benefit.

Q: What's so special about this redesign?
A: It's the next evolution in multithreading. Instead of a large number of threads contending for resources across all CPUs, FireEngine imposes some order on the chaos by partitioning workloads across CPUs and, within a CPU, letting a thread finish a reasonable amount of work before it relinquishes control. Think of the old stack as freeways with no lane discipline. The new design introduces the concept of lanes so that traffic flows more smoothly. In a multi-connection environment, performance is all about flows and removing bottlenecks.

Q: Will the new FireEngine code support session/connection trunking on a single network connection? I have been able to increase WAN network throughput by more than 10x by establishing multiple connections simultaneously to move data. It would be nice to have the OS do this automatically and make it transparent to applications.
A: Are you talking about multiple TCP connections, or is it some layer 6/7 service that will do this for you? FireEngine is a new architecture that speeds up TCP/IP processing and connection setup and teardown, but it is restricted to the network and transport layers for the time being.

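To make the "lanes" analogy above more concrete, here is a minimal sketch of the general idea of pinning each connection to one CPU. It is not FireEngine code: the CRC32 hash, the `lane_for` and `dispatch` names, the packet tuple layout, and the CPU count of 4 are all assumptions chosen for illustration. The only property it tries to show is that every packet of a given connection lands in the same per-CPU queue, in arrival order.

```python
import zlib
from collections import defaultdict

NUM_CPUS = 4  # hypothetical number of "lanes" (CPUs) for this illustration


def lane_for(src_ip, src_port, dst_ip, dst_port, num_cpus=NUM_CPUS):
    """Map a connection 4-tuple to a CPU index.

    Assumption: a CRC32 hash modulo the CPU count stands in for whatever
    classification the real stack performs.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_cpus


def dispatch(packets, num_cpus=NUM_CPUS):
    """Group packets into per-CPU queues, keeping arrival order in each queue.

    Each packet is a tuple (src_ip, src_port, dst_ip, dst_port, payload).
    """
    queues = defaultdict(list)
    for pkt in packets:
        queues[lane_for(*pkt[:4], num_cpus)].append(pkt)
    return queues


if __name__ == "__main__":
    web = ("10.0.0.1", 40000, "10.0.0.2", 80)
    other = ("10.0.0.3", 40001, "10.0.0.2", 80)
    packets = [web + (0,), other + (0,), web + (1,), web + (2,)]
    for cpu, queue in sorted(dispatch(packets).items()):
        print(cpu, queue)
```

Because the lane depends only on the connection's addresses and ports, two CPUs never need to contend over the state of the same connection, which is the property the freeway analogy describes.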
Q: Do you have any benchmarks and comparisons to Linux on the same hardware?
A: We are starting to get numbers for Solaris and Linux on the same hardware (V20z). On web-like workloads, we are seeing Solaris outperform Linux by ~20% using the Sun ONE web server and ~30% using Apache. It's important to note that we used a real web server and not TUX (the Red Hat Content Accelerator). We also noticed that when we pushed Linux to 100%, we started getting connection hangs and similar problems. We are trying to work out these issues on Linux, but our feeling is that Solaris 10 will continue to outperform Linux because we have more performance improvement work in the pipeline, which will be delivered in Solaris 10 or one of its updates soon. We are also having a hard time finding any Linux numbers for established benchmarks like SPEC99 that don't use TUX. Our customers tell us that they don't use TUX and want to see performance on a real web server with real workloads. We are also in the process of running benchmarks on file serving and other workloads.

Q: The Sun Blueprint 'Understanding Gigabit Ethernet Performance on Sun Fire Servers' (PDF) suggests that a minimum of 3-4 CPUs is required to fully utilize a single gigabit interface. How much can we expect FireEngine to reduce the CPU utilization needed to fully utilize a single gigabit interface, and how many CPUs do you feel will be required?
A: With UltraSPARC III+ CPUs, a single-CPU sender can saturate a 1 Gb Ethernet link (Cassini). We still need 1.1 CPUs on the receiver side to handle the incoming data. Along with FireEngine, we have something called Multidata in Solaris 10, which does wonders for single-connection throughput and works in conjunction with the Cassini driver. We made this measurement with 'ttcp' using large writes (1Mb). The single-CPU sender showed utilization close to 41%, and the two-CPU receiver showed a utilization of 110%.

Q: Do you have any tests for Sun Webserver, Apache, ftp server, etc. showing improvement over Solaris 9?
A: Yes, we tested the Sun web server and Apache with the new architecture and saw performance improvements in the range of 45% on SPARC and 30% on x86. The Sun marketing and benchmarking groups are deciding whether to publish results on some standard web benchmarks. They are concerned that the lack of non-TUX-based publications might lead customers to misinterpret our results, because FireEngine improves general TCP/IP performance but has no features like an in-kernel web server and caches to compare against Linux with TUX.

Q: What equipment changes (switches, Catalysts, etc.) or network standards will need to be in place to take advantage of this new architecture?
A: FireEngine is a new architecture for processing TCP/IP connections. It improves performance through an IP-based classifier, vertical perimeters that provide better locality, and general improvements in the TCP/IP implementation. As such, it still conforms to existing RFCs and doesn't require any changes in switches or standards. The changes are transparent to applications and network administrators, and all applications using networks will benefit.

Q: What kind of improvements can we expect to see on something like a V210, with 2 CPUs and 4 GigE interfaces? How much CPU power will be required to saturate a 10 GigE interface?
A: It depends on the workload. For a web-type workload, you should see something around a 45% improvement over Solaris 9. For file serving or SSL, expect an improvement of 10% or so. Single-connection throughput can increase anywhere from 20-40% depending on the write size. Given the current UltraSPARC IV (dual-core) CPUs, we should be able to saturate a 10Gb NIC using 2-3 CPUs on the transmitter side. The receiver side will require somewhere between 3 and 4 CPUs. This is using ttcp.

Q: Can you explain how FireEngine supports other new Sun technologies - specifically CMT and Grid Containers?
A: FireEngine has been designed specifically with CMT in mind. With the future SPARC processors, which will have a very high degree of CMT support and the NIC close to the CPU (or on the same die), we will be running networking code on dedicated threads/cores to achieve a different kind of highly scalable, high-performance protocol offload engine. For Grid Containers, maintaining connection affinity to a CPU is more important, and the global IP classifier takes the container ID into account when classifying packets and deciding which CPU to schedule the connection on.

Q: The interrupts from a 10Gb NIC will overwhelm a CPU. How do you spread the load?
A: We use the multiple Rx/Tx ring buffers on the NIC and bind them to various vertical perimeters attached to the available CPUs, so that connection locality is maintained (packets for the same connection will always be processed on the same vertical perimeter/CPU). The vertical perimeter controls the NIC by dynamically switching between interrupts and polling. This allows the load to spread out very gracefully.

Q: How much will my UDP-based applications benefit from FireEngine?
A: UDP performance is being worked on as we speak. By the time Solaris 10 ships, UDP applications will see improvements similar to those for TCP applications. Again, the changes will be transparent to the application, i.e. the application or system administrator doesn't need to do anything to take advantage of these improvements.

Q: Will NFS client/server performance be improved by this new architecture? In previous releases of Solaris OE it was a problem if an NFS client was disconnected from the network - the machine could not handle other tasks at all.
A: Yes, NFS client/server performance also improves significantly with the new architecture. We have even more improvements coming soon in that area (they should be in by the time Solaris 10 ships). The second part seems more like a bug than a performance problem.
I will check with the NFS folks to make sure this is addressed in S10.

Q: Our application uses UDP to transport isochronous data across a network. Typical packet periods are 10ms, 20ms, or 30ms with payloads of 80 B, 160 B and 240 B respectively. This is a Voice over IP application (RTP). Our application typically runs hundreds of these independent streams concurrently. How will FireEngine help?
A: In Solaris 10, we have already improved flow control for UDP, so if you were seeing UDP packet drops due to high load, that will be alleviated. We are also working on fully integrating UDP into the new architecture, which will improve UDP performance even further. Receive-side chaining support will help UDP applications where the remote end does large writes that cause UDP fragmentation.

Q: Is support for Jumbo packets available? If so, how large can they be for 10GigE? (Some of us need this for high speed TCP flows.)
A: Yes, Solaris 10 will support Jumbo packets, which can be 64k for IPv4. If you have a NIC that is GLD based, this will come automatically. Apart from Jumbo frames, the 10GigE NICs supported on Solaris 10 will have a very fancy load-spreading mechanism that supports packet chaining and dynamic polling for each Tx/Rx ring, so that data locality is preserved and you get the best possible throughput out of your 10GigE NIC.

Q: We are using Apache 1.3.26. As users use the web portal we see them loading the same files (e.g. *.gif, etc.) over and over again. Can NCA cache the files that are requested all the time in order to improve performance?
A: Yes, for static content, NCA can work wonders for your performance. The improvement is directly proportional to the amount of memory the system has, because more memory lets NCA cache more content in the kernel. Even if you don't have static requests but mostly serve files via dynamic requests, use the sendfile() and sendfilev() system calls via your plugins.
NCA also has support for caching fds, so even though the request goes all the way to the web server, the file is still cached in the kernel and you will get a good performance increase.

Q: I have 6 new Blade 1000 1.15GHz machines. Is there a patch for the onboard network cards? I ran snoop and get a lot of UDP IP fragments.
A: Whenever UDP does a large write, the datagram gets fragmented into multiple MTU-sized IP packets, and that's why you see UDP fragments. You can change your application to do smaller writes. Also, in Solaris 10, we will have inbound packet chaining for UDP as well. This will allow us to pick up all the fragments at the same time. We are seeing very impressive (over 120%) performance improvements with ttcp doing a UDP stream with an 8k buffer size.

Q: When will Sun give us a kernel-level sub-millisecond network interface failover mechanism like multipathing?
A: Solaris 10 will have 802.3ad link aggregation support as part of the GLD driver, which will allow you to trunk any homogeneous set of GLD drivers. As part of this support, you can choose to make a trunk of two NICs with one active and one standby. As soon as the link-down notification for the active link is received, we can immediately switch to the standby link. The switch time after receiving the link-down message should be on the order of milliseconds. Of course, we also have IPMP support in Solaris, which allows you to do IP-level load balancing and failover, although the failover time is longer than the millisecond range.

Q: Can you allocate how much memory the NCA uses?
A: Yes, you can configure how much memory NCA uses via /etc/system and 'ndd' tunables (nca_ppmax and nca_vpmax, which control the physical and virtual memory). Look for NCA tunables at http://docs.sun.com.
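As a rough illustration of the UDP fragmentation answer above, the following sketch estimates how many IP packets a single UDP write turns into. It rests on assumptions that are not stated in the answers: IPv4 with a 20-byte header and no options, a standard 1500-byte Ethernet MTU, and the hypothetical helper name `udp_fragments`. The 8k write size matches the ttcp buffer size mentioned above.

```python
IP_HEADER = 20   # bytes; IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def udp_fragments(write_size, mtu=1500):
    """Estimate the number of IPv4 packets needed for one UDP write.

    Assumes a 1500-byte Ethernet MTU unless told otherwise.
    """
    # Every fragment except the last carries a multiple of 8 payload bytes,
    # because the IPv4 fragment offset is counted in 8-byte units.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    # The UDP header is part of the IP payload being fragmented.
    datagram = write_size + UDP_HEADER
    # Integer ceiling division; even an empty write needs one packet.
    return max(1, -(-datagram // per_fragment))


if __name__ == "__main__":
    for size in (240, 1472, 1473, 8192):
        print(size, "byte write ->", udp_fragments(size), "IP packet(s)")
```

Under these assumptions, writes of up to 1472 bytes fit in one packet, which is why doing smaller writes makes the fragments disappear from the snoop output, while an 8k write is split into several fragments that the receiver must collect and reassemble.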