BigAdmin System Administration Portal
Feature Article
Print-friendly VersionPrint-friendly Version

Migration From Platform Computing's Load Sharing Facility to Sun N1 Grid Engine Software

Kirk Patton (Transmeta) and Omar Hassaine (Sun Microsystems), June, 2005


Abstract

This paper presents a case study covering Transmeta's migration process from Platform Computing's Load Sharing Facility (LSF 4.0.1) to Sun N1 Grid Engine 6 software (N1GE6). Transmeta initially started experimenting with the open source version of Grid Engine (SGE5.3.x) for a period of time and subsequently decided to upgrade to the latest Sun N1GE6 product.

This study covers the following stages of this significant project:

  • Compare the feature set of the two competing products
  • Retool the general job submission and work flow
  • Re-engineer the queue configuration
  • Configure the scheduling policies
  • Implement additional customizations such as:
    • License proxy
    • Job status script
    • Queue customizations

1. Introduction (by Kirk Patton)

I joined Transmeta in the first half of 2002. As part of my responsibilities, I inherited the day-to-day administration and maintenance of the LSF cluster and soon began a project to streamline and simplify the job workflow.

I worked with Omar Hassaine, Grid engineer at Sun, to replace LSF with a Sun N1GE6 system.

The cluster environment consisted of approximately 72 users and 524 processors running Red Hat Linux. The jobs submitted to the cluster were a diverse mixture of in-house tools and third-party applications such as FinSim, HDLScore, Arcadia, Simplex, and so on.

The cluster had grown over the years with an ever-increasing number of queues. Queues were added as new projects were started and to satisfy user requests for a simple means to submit jobs to a Distributed Resource Manager (DRM). The number of queues reached more than 80 by the time the migration to SGE had begun.

The majority of the queues were set up because not all users fully understood which submission options were appropriate to route their jobs. Many users simply did not want to learn the job submission semantics.

Queues were added with the appropriate job options needed to route jobs, hard-wired into the queue definition.

Since nearly all of the queue-level job routing options could be specified at job submission time, it was apparent that the many batch queues could be eliminated by wrapping the native submission program with a script.

The script would parse the job options from an external file and build the correct submission command on the fly.

During the development of the script, it was quickly realized that the batch system could be abstracted from the users entirely and replaced with very little disruption.

Given licensing costs, replacing the batch system was not only possible, it made economic sense.

Please note: The LSF product we used was LSF4.0.1, an older version. Also, this article describes a specific migration project; results may vary.


2. Grid Cluster Description

The N1GE6 master scheduler is running on a dual AMD Opteron 248 processor-based system with 2 Gbyte of memory from the vendor ASL[1]. The operating system is Red Hat Linux 9. The Linux grid nodes were all installed using a customized version of Red Hat Kickstart.

A shadow master scheduler was configured and tested under SGE 5.3p6. The shadow master was later retasked to serve as the new N1GE6 master scheduler. Currently, no shadow master host is in use, but one is planned for installation in the near future.

For managing and monitoring the day-to-day activities of the cluster, several tools are used.

The syslog message files from each machine on the grid are aggregated to a central syslog server. Messages are parsed by the tool Swatch[2], and alerts are automatically dispatched for common errors.

The history of grid system maintenance is tracked using a MySQL[3] database which stores the system hostname, its current status and a log entry. The database is updated using a simple command-line script any time a system is worked on.

The monitoring of the grid's health is managed by a tool called Mon[4]. A small Perl script is run periodically that looks for queues in the alarm state. When the number of queues in the alarm state exceeds a predetermined threshold, Mon sends an alert at administrator-defined periods.

The number of jobs in the cluster per queue over time is tracked by the graphic package Cricket[5], which collects and displays data.


3. Feature Comparison

Before we could implement N1GE6 as described in a Sun white paper[6], we had to evaluate it to see if it had a similar feature set as compared to LSF[7].

The major requirement was that N1GE6 had to be a virtual drop-in replacement with as little disruption to the users as possible.

The areas identified for comparison were:

  • Resource Selection
  • Job Dependencies
  • Array Jobs
  • Queues
  • Pre/Post-Execution Scripts
  • Load Sensors
  • Interactive Jobs
  • Administration Differences

Resource Selection

LSF has a very flexible resource-selection mechanism. Resources can be selected by using Boolean operators to specify resources either at the queue level or at the time of job submission.

In N1GE6, the administrator creates a table of the Boolean relations to define resources. The user can then select these resources and specify the needed quantity.

Because of the flexible resource-selection syntax of LSF, users can specify complex resource requirements that could be difficult to specify in N1GE6. Specifically, a user submitting to LSF can exclude some hosts from consideration by inverting a requested resource with the "!" operator. This exclusion could be in conjunction with many other Boolean operators that select the correct set of resources. Selection of resources could be further clarified by using parentheses to control the order in which the resources are evaluated.

LSF has a rich set of Boolean operators that may be combined in any number of ways by the user at job submission time or by the administrator when defining queues.

While N1GE6 does not offer all of the Boolean resource operators that LSF offers, it does offer users the ability to do regular expression matching on complex resource attributes that are defined using the RESTRING type.

Job Dependencies

Both LSF and N1GE6 support job dependencies. LSF supports complex job dependencies utilizing Boolean operators, and N1GE6 supports simple job dependencies.

Array Jobs

Array jobs are supported by both systems, and we have found more or less the same ease of use.

Queues

The queue concept in N1GE6 and its predecessors is simply described as follows:

  • Each execution host in the grid cluster is assigned a queue, so the queue implementation is of a distributed nature.
  • Jobs are not actually queued inside of the per-host queues, but rather they are ordered for dispatch by the scheduler and begin executing once they are sent to a host-level queue.

LSF uses global queues. Jobs are ordered and dispatched from within each of the global queues.

Note: The latest version of N1GE6 improved in this area by introducing the concept of cluster queues, which was thoroughly described in a white paper[6].

Pre/Post-Execution Scripts

These scripts are supported by both systems.

LSF allows the user to specify a pre-execution script at job submission time. Additionally, the administrator may specify pre/post-execution scripts at the queue level.

N1GE6 allows for pre/post scripts only at the queue level. Neither pre- nor post-execution scripts can be specified at job submission time without some modification to the default way N1GE6 implements this functionality.

Load Sensors

LSF and N1GE6 both have the ability to run external programs that return values to the scheduler. These return values can be license counts, host information or just about anything else that can be monitored. N1GE6 does not come with as many built-in load sensors as does LSF. It was necessary to build a few scripts to report additional information about the execution hosts.

Specifically, sensor scripts were written to report the following:

  • CPU Speed
  • OS Vendor
  • Host Architecture
  • Register Width

Interactive Jobs

LSF has excellent seamless interactive job support. Submitting interactive jobs is accomplished using the same LSF submission tool that is used when submitting batch jobs. Users need only specify that a job is to be interactive by including a command-line switch in their job submission. Interactive jobs are queued and scheduled in the same manner as batch jobs. The entire process is very seamless to the end user.

N1GE6 uses a modified version of rsh for submitting interactive jobs. Optionally, ssh can be used. There are several different submission tools a user must choose from to begin an interactive session. Some offer terminal support but require the user to provide a login, while others provide un-authenticated login, but no terminal support. Additionally, the output from interactive sessions cannot be logged to a file.

Some applications require a pseudo-terminal to operate properly. A good example is Vi. Without a proper pseudo terminal, the cursor does not properly update. Getting N1GE6 to work with these types of applications can be tricky.

Using ssh as the job transport provides the benefit of secured end-to-end communication with the job, but has its own administrative difficulties with key management. Additionally using ssh means that job statistics are not collected for interactive jobs.

Administrative Differences

The differences in administration of LSF and N1GE6 are significant, in that LSF uses ASCII files that can be edited directly and saved using a version control tool such as RCS. Changes are then pushed out to the clients using a command.

N1GE6 uses a command (qconf) that invokes an editor on a copy of a configuration file that is fetched from the server. The files are never edited directly. Most of the changes take place as soon as the edits are saved except for the global execd_spool_dir where rebooting of all SGE daemons is needed. A local execd_spool_dir needs the reboot of the sge_execd only.

N1GE6 has a convenient means to backup and restore configuration that can be used as a substitute. N1GE6 allows for a lot of flexibility by moving the queue to the individual hosts. It is possible to attach complexes to queues or hosts, and/or globally. Because of this flexibility, there are more configuration files than with LSF, but they are all accessed in the same manner.

While N1GE6 is not as mature in some areas as LSF, they both have very close feature parity. The exception is the seamless interactive session support that LSF provides.


4. Job Submission and Work Flow

To prepare users for the migration to N1GE6, it was necessary to create a wrapper script[8] that mimicked the job submission behavior of LSF.

The wrapper script was key to the migration from LSF to N1GE6. Without it, the migration would not have been practical as the users would have had to significantly modify their work flow.

The wrapper script did the following:

  • Allowed users to submit jobs to either LSF or N1GE6.
  • Enabled users to submit both binaries and scripts by transparently wrapping their job submissions in a shell script.
  • Made complex job submission easy by moving the job submission specifics to a configuration file.
  • Allowed users to submit to both batch systems and test N1GE6 without repeated modifications to the users' own job scripts.

5. Re-engineering Queues

The LSF installation at Transmeta had grown complex since its initial installation. Over the years, administrators gave in to the repeated demands of users for new queues to support additional projects.

At one point, as was mentioned earlier, there were over 80 LSF queues. Because N1GE6 uses a distributed model for queue configuration, it was not desirable to port so many queues. Therefore, it was decided to deprecate the existing queues in favor of a new priority-based queue structure.

With the priority-based queue structure, the number of queues needed went down from 80 to three. The queues were made as generic as possible so that any job type could be submitted to them. With SGE5.3, we created three queues per execution host, and with N1GE6, we created three priority-based cluster queues for the whole grid.


6. Scheduling Policies

Both LSF and N1GE6 can be configured to control job scheduling priorities on the cluster. In our migration project, N1GE6 has been strong in this important category.

LSF uses queue-based priorities as the primary means to control the users' access to the resources of the grid.

Additionally, a fair-share policy adjusts priorities of users based on their past usage of resources. We had used the host partition model with the fair-share policy.

N1GE6 has four policies that can be mixed and matched to fit almost any scenario. The policy that was the best fit at Transmeta was the Share Tree policy. This policy is described in the N1GE6 Administration guide[9]. The Share Tree policy allows the administrator to configure priorities by project group. Within each project group users can have different priorities. Using the Share Tree policy, it is possible for users to submit to the same queue, but under different policy restrictions. With the addition of the other three policies, the administrator is allowed very fine-grained control over the scheduler.


7. Customizations

License Proxy

Migrating from one batch system to another meant that both would have to co-exist until the migration was complete.

The problem of sharing FlexLM licenses between the two clusters had to be effectively dealt with.

Both LSF and N1GE6 could be configured to keep a static count of licenses allocated to each system, but that would leave the potential that licenses could go unused on an idle cluster while being unavailable to the cluster which was busy.

Additionally, LSF and N1GE6 could both track licenses using dynamic counters to directly poll the license server, but this too was discounted because of potential race conditions.

The solution was to use a license proxy. The proxy would be consulted using pre/post-execution scripts. To keep the number of queues to a minimum, both LSF and N1GE6 had to be configured to allow user-specified scripts.

In a standard N1GE6 configuration, it is assumed that scripts would be specified at the queue level. To get around this limitation in both systems, a generic shell script was installed for each queue that queried the user's environment for the actual script to run. By communicating with all job submission systems and the FlexLM[10] server, the license proxy became the authoritative source for license information.

Job Status Script

Users had become accustomed to the job status output of LSF. To lessen the number of questions to the HelpDesk, the job status tools from both LSF and N1GE6 were wrapped by a single script. This script was given the same name as the original LSF job status command to keep a user interface that was similar to the one used before.

The information from both systems was aggregated by the script and displayed in a format that was consistent with the original LSF job status reports.

Queue and Scheduler Customizations

A number of customizations have been implemented and they are listed as follows:

1. Job Preemption

The primary customizations to the queues had to do with job preemption, machine selection and the license proxy hooks.

Job preemption in N1GE6 was straightforward to implement by using the queue properties' subordinate_list parameter. It was necessary to adjust the number of jobs allowed per queue and per host, in order to duplicate the policy of one job per CPU, which had been used in LSF.

2. Load sensor script

To configure N1GE6 to use the fastest available host first instead of the default sort order that was simply based on system load, the load_thresholds parameter was changed to include a custom complex that reflected the relative speed of the host. The relative speed was calculated by timing a computationally expensive task on a host before it was added to the cluster. This was standardized by using a script to automate the process and update the host configuration in N1GE6.

3. Job termination script

This was the last customization we needed for N1GE6 to behave similarly to LSF.

N1GE6 is very flexible in allowing the administrator to customize default behaviors with user-created scripts. The terminate_method is one such parameter. The terminate_method was changed to reference an external script that duplicates the job terminate functionality provided by LSF.


8. Stability and Performance

LSF and N1GE6 use two different approaches to fail-over of the master scheduler.

In the event of a scheduler crash under LSF, the responsibilities of the master scheduler falls to the next server system listed in the main cluster configuration file.

With N1GE6, the administrator must specifically list one or more servers to take over this responsibility. These hosts are referred to as shadow master hosts and they are added to a separate configuration file. The shadow master hosts need to run a separate daemon process that monitors the status of the current master scheduler, and handles the startup of the master daemons in the event of the original master's failure.

Both systems work in the event of a failure of the master host. LSF gets points for relieving the administrator of specifically needing to configure fail-over.

Since switching to N1GE6 there have been several noted enhancements in stability and performance and no difference in job throughput. Both LSF and N1GE6 perform well in keeping systems busy processing jobs. As long as the jobs submitted are not overly restrictive in their requested resources, there is no reason a grid cannot achieve near 100 percent utilization using either system.

In terms of performance, the most important aspect of this migration project was the responsiveness of N1GE6 batch systems when accepting newly submitted jobs.

When LSF was under a sizable load, a submission delay was noted, which often resulted in the message batch daemon not responding.

The next most notable result of the migration project was the amount of resources consumed by the master scheduler. Under heavy load, LSF pushed the machine to its limits. The CPU would be pegged most of the time. Since switching to N1GE6, the machine the scheduler has been running on has not been under as high a load even when several thousand jobs were in the system. The load imposed on the master scheduler was so low in fact that we were able to run the master daemons of both LSF and N1GE6 on the same host system simultaneously.

N1GE6 accepts and removes jobs from the system without delay. Commands to check job status do become less responsive under very heavy loads, but they have never timed out, and there have been no complaints about the system responsiveness under N1GE6 from any of the users.

Stability is another area of importance in the Transmeta migration project where N1GE6 has performed well.

After having upgraded our Linux infrastructure to the 2.4 series kernel, we saw frequent crashes of the LSF master scheduler. We eventually traced the cause of the crashes back to its source. Jobs submitted with the rerunnable flag caused a corruption in the master scheduler's job log.

The crashes encountered were painful to recover from. Under heavy load LSF could take up to a half hour or more to parse through the job log and get back to scheduling jobs. These crashes played a significant part in speeding the adoption of N1GE6 at Transmeta.

While N1GE6 has been quite stable, the scheduler has crashed on a couple of occasions. In each instance the crash occurred when a large number of jobs was removed from the system, and while running under SGE5.3. The restart of the scheduler did not suffer from the long delays that we observed with LSF. The system recovered gracefully in each case and took only a few moments to become fully functional.

In the event of a host crash, the job is lost, but both systems have the capability to restart the failed job. Both systems are capable of using check pointing to save job state and resume in the event of a system crash.

N1GE6 comes with the Accounting and Reporting Console (ARCo)[11]. This tool dumps the job logs collected by the scheduler to an SQL database. This database is very useful for locating trends within the cluster. With some simple queries, it can be quickly determined which machines are processing the greatest number of jobs, which projects are using the greatest amount of resources, and so on. The database can also assist in locating bad machines by analyzing the number of failed jobs vs. the total processed for a given period of time.

During the evaluation of the ARCo tool, one host was found to have had a job failure rate of 90 percent. Upon closer inspection of this host, it was determined that the system had encountered a kernel issue and all jobs that were processed by the host failed immediately.

LSF has the capability to analyze job history, but it is a product that must be purchased separately, and cost must be considered.


9. Conclusion

Both LSF and N1GE6 are excellent Distributed Resource Management systems that greatly increase job throughput. N1GE6's comparable feature set, flexible policy management and open development led to its adoption at Transmeta. For sites looking to implement a DRM tool for the first time or migrate from another product, we believe that N1GE6 is a worthy contender.


Versions Used
  • Platform Computing's Load Sharing Facility (LSF 4.0.1)
  • Sun Grid Engine (SGE) version 5.3p6
  • Sun N1 Grid Engine 6 software (N1GE6)

References
  1. ASL - Monarch 1800
  2. Swatch: The Simple Watcher of Logfiles
  3. MySQL
  4. mon - Service Monitoring Daemon
  5. Cricket Home
  6. Sun N1 Grid Engine - White Papers
  7. Platform Computing
  8. gridengine: Documents and files
  9. N1GE6 Administration Guide, Chapter 5
  10. Macrovision
  11. Sun N1 Grid Engine 6

  12. Acknowledgments

    The authors would like to thank the reviewers and managers for their help and support in making this paper a reality.


    About the Authors

    Kirk Patton (kpatton@transmeta.com) is the architect and UNIX systems administrator in charge of grid computing at Transmeta Inc. where he is in charge of day-to-day monitoring of the engineering data center. Previous employment includes seven years at Advanced Micro Devices, where he supported the company's data center and LSF cluster.

    Omar Hassaine (omar.hassaine@sun.com) is an HPC/Grid engineer working in the Product Technical Support Engineering group at Sun Microsystems. Omar has developed and taught HPC/Grid-related courses and developed technical content and presentations in the area of HPC/Grid system administration. He was invited to the latest Global Grid Forum as a speaker on the "Steps to Grid Adoption" panel.


    Trademarks

    Sun, N1 and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

    LSF is a registered trademark of Platform Computing in the United States, the European Union and other countries.

    MySQL is a registered trademark of MySQL AB in the United States, the European Union and other countries.

    Opteron is a trademark of Advanced Micro Devices, Inc. in the United States and other countries.

    Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries.

    FLEXlm is a registered trademark of Globetrotter Software, Inc. in the United States and other countries.

    Cricket is distributed under the GNU General Public License.

    Swatch is distributed under the GNU General Public License.


    Unless otherwise licensed, code in all technical manuals herein (including articles, FAQs, samples) is provided under this License.


BigAdmin