Using Configuration Management to Reduce Risk
Peter McNab, Sun Microsystems, Inc., February 2007
Introduction

This paper examines configuration management and demonstrates specific benefits of applying good configuration management techniques to a build engineering process. Proper application of these techniques leads to a more robust and scalable system, which is critical to the success of software development efforts. It can also improve the relationship between build engineering and development teams by increasing trust, and it allows for greater efficiency within the build engineering team.

The guidelines presented here have been distilled from years of designing reliable build systems for global development teams on a wide range of operating systems and environments. While much of my specific experience is with Java technology-based products, the principles here can be applied to any build system.

The primary benefit of configuration management is to maintain control of a system while minimizing risks. The most expensive system failures are intermittent ones, as they often require extensive analysis and repeated build cycles to solve. The risk of intermittent failures can be greatly reduced when the build system has been inventoried. All components of a build system should be identified, tracked, and accessible. Increased control provides a more reliable process that benefits both the build engineering team and its internal customers.

After defining configuration management and identifying the goals of build engineering, this paper describes the benefits and provides specific examples of how to make improvements to a system.

What Is Configuration Management?

Configuration management is the process of identifying system components, tracking relationships between components, and communicating the status of, and changes in, those components. Typically, this consists of storing detailed information about versions and characteristics for each component, as well as keeping the records current and complete.
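
As a rough illustration of "storing detailed information about versions and characteristics for each component," the sketch below models one inventory entry and serializes a set of entries in a stable order. The field names, the `Component` class, and the helper functions are assumptions made for this example; the paper does not prescribe a record format.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class Component:
    """One tracked item in the build system inventory (hypothetical schema)."""
    name: str
    kind: str      # e.g. "source", "third-party binary", "build tool"
    version: str
    location: str  # where the managed copy lives
    sha256: str    # content fingerprint, so silent changes are detectable


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a component's contents."""
    return hashlib.sha256(data).hexdigest()


def to_record(components: list[Component]) -> str:
    """Serialize the inventory in a stable order so two records can be diffed."""
    ordered = sorted(components, key=lambda c: (c.kind, c.name))
    return json.dumps([asdict(c) for c in ordered], indent=2)
```

Because the output order does not depend on the order in which components were registered, two records taken at different times can be compared with an ordinary text diff.
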
Ideally, all changes to any component are also tracked, so that a full history of the environment at any point can be reconstructed. Reporting the state of components is another important aspect, as any interested parties can then review the information.

The first step is to define the components that will be tracked. Traditionally, configuration management was often applied only to tracking source code revisions, but in recent years it has expanded to cover the complete set of components that go into a system. This includes the source code, third-party binaries, loose files, build tools, user environments, machine configurations, and any other items that are consumed by the system.

Once the types of components are identified, a baseline should be established that details the components as of that moment. While there is no perfect rule on what specific data should be recorded, there should be sufficient information that if there were a catastrophic loss, the system could be reconstructed from scratch. As a result, it is common to have bootstrapping procedures documented and included in the record.

As time passes and the system evolves, it is critical to keep the records up to date. While it is possible to do this by updating the documentation as necessary, it is better to have a change control process so that any modifications are clearly identified. Instead of having to remember to update system documentation when, for example, a new JDK is added, the process of updating the JDK should trigger an explicit documentation review. This can be done by having a component upgrade procedure that is followed whenever a component is modified, added, or removed. Such a procedure is particularly important when resources are short and changes are urgently needed, which is exactly when it is easiest to forget to update documentation.

Periodic audits should be performed to ensure the record is accurate and complete.
This is less necessary when using a mature change control process, but can still be beneficial to assess new risks and put new components under proper management.

The final step in configuration management is proper communication of the system components to interested groups. The primary use of this is for the maintainers of the system to be able to restore or duplicate environments. However, internal customers such as development and quality assurance can also benefit, as they are able to see the complete system and verify that they have compatible environments.

Meeting the Goals of Build Engineering

The primary goal of build engineering is to assemble a set of delivered components into a consumer-ready product. In its simplest form, this means compiling the source code and packaging the built binaries. To provide the best service to internal customers, the build process should be robust, reliable, and repeatable. It should operate in a well-defined environment, be deterministic, and return consistent results when run with the same components. Failed builds are very expensive, both for build engineering to debug and for downstream customers like quality assurance (who have to wait for a good build) and development (who can't successfully complete test builds of current code). It's impossible to eliminate broken builds, but it is possible to reduce their likelihood.

Customers of build engineering include most contributors to a product, including development engineering, quality assurance, program management, marketing, sustaining, documentation, training, and support groups. Those groups expect reliable service from build engineering to meet their internal deadlines and minimize wasted effort.

One of the benefits that a good build engineering organization can provide is to act as a gatekeeper and protect its customers from the chaos of a globally distributed development team.
For example, if there is a critical defect in the product source code, a good build system will identify the problem and report it back to the developer who committed the defect, rather than passing it on to downstream organizations like QA or documentation to discover. It's always cheapest to catch a problem as close to its source as possible.

The best way to have a reliable and robust build process is to identify areas at risk and reduce the risks. A common source of risk is external dependencies, which are defined as components that are consumed by the build process but are not controlled by the build engineering team. These items are much more likely to be moved, altered, or deleted without the build engineering team being aware. Placing these items into a build engineering-controlled area, and tracking and communicating changes to them, reduces the risk of breaking product builds. It becomes possible to identify every change that has gone into the system since the last good build. Buildmasters can then focus on the changed areas in locating the source of the problem and requesting a fix.

For example, when a legal file needs to be included in a product build, it is common for the build to point to a file under a user's home directory. To properly identify, track, and maintain control over the file, it may be necessary to add it to a build engineering-controlled repository. This removes a dependency of the system on an external source, and makes the build process more reliable. Another benefit of this approach is that it can raise the visibility of non-source code changes to the system.

In short, configuration management provides the structure necessary to identify risks and mitigate them. Other specific examples of risk reduction follow later in this document.

Build engineering also should provide a defined infrastructure to improve customer efficiency.
This is best demonstrated by providing tools and environment specifications to developers so that they are able to perform test builds that are consistent with the build engineering builds. Developers can then test their code changes and commit them with confidence that the production build will succeed.

Configuration management can also increase the efficiency of build engineering and thus provide a better return from fewer resources. For example, a system that is clearly identified and managed is much easier to automate. In addition, system and environment documentation simplifies installation of a new build host. Since all components must be tracked by build engineering, it also becomes possible to pre-process incoming files to ensure they have the correct contents or format before they are ever used by the build, which again reduces the chance of build failures.

While there is clearly an increased cost to identifying and tracking build components, the long-term benefits can be substantial. Configuration management provides better scalability of services within the build engineering team, helps reduce the number of broken builds, and focuses debugging efforts on changed components. Also, unlike some systems, there is an incremental benefit from each component that is brought under configuration management. It is not required to overhaul an entire system to see improvements in reliability and support costs.

What Are the Benefits to a Software Development Organization?

In addition to helping meet direct build engineering goals, proper configuration management can also lead to many advantages for the rest of the software development organization, including developers, quality assurance, documentation, and support.

Developers can configure their own build environments to match the well-documented build engineering environment, and use the same tools. This reduces the chance of bad builds due to the environment or tools in the developer's local path.
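
One way a developer might check a local environment against the documented build engineering environment is sketched below. It assumes, purely for illustration, that both environments can be summarized as a mapping from tool name to version string; the function name and the version values in the usage example are hypothetical.

```python
def environment_mismatches(reference: dict[str, str],
                           local: dict[str, str]) -> list[str]:
    """List the tools whose local version differs from the reference environment."""
    problems = []
    for tool, wanted in sorted(reference.items()):
        found = local.get(tool)
        if found is None:
            problems.append(f"{tool}: missing (reference uses {wanted})")
        elif found != wanted:
            problems.append(f"{tool}: local {found}, reference {wanted}")
    return problems


# Hypothetical versions, for illustration only.
reference = {"jdk": "1.5.0_10", "ant": "1.6.5"}
local = {"jdk": "1.5.0_10", "ant": "1.7.0"}
print(environment_mismatches(reference, local))
# prints ['ant: local 1.7.0, reference 1.6.5']
```

An empty result means the local environment matches the reference for every tool the reference lists; tools that exist only locally are deliberately ignored in this sketch.
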
Properly insulated build environments make it easy for developers to support multiple codelines. Since every changed component going into the build system is tracked, developers will only be contacted when they have committed into a newly broken build. Rather than sending a widespread email, buildmasters can contact the specific developers who made modifications, so developers who didn't commit into that build aren't distracted from their work.

Change control also benefits developers when the changes are communicated to the team. Developers should be able to review the builds and see exactly what source changes made it into each build, so that they can see that the build containing their changes was successful.

Furthermore, a well-designed set of build tools and appropriate infrastructure can allow people to work in locations independent of network domains, geographical locations, and machine architectures. If the behavior of the build tools is uniform across platforms, it becomes cheaper to develop and support the system, and to allow it to scale as the team grows.

For quality assurance, having a more reliable build process means they can get access to the latest source changes as soon as possible, instead of spending effort testing an older build. They also benefit from seeing what source changes went into a build, as they can focus their testing on those areas, instead of having to start a full new round of testing. This allows QA to work more efficiently. Because the QA team can trust the build system to be repeatable, they are able to drastically reduce testing time and focus on specific areas of the product that have been modified since the previous build.

Documentation teams can see similar benefits from being able to track changes and more easily identify what documents need to be updated. Support teams can track bugs and see when they are fixed, and then access a specific build to verify the fix.
Each of these teams will benefit directly when the build engineering team applies configuration management techniques to improve their system.

Specific Improvements to Reduce the Risk
While there are many theoretical benefits from applying configuration management techniques, it's also worth looking at specific examples of risks that can be identified and mitigated using the process. Each of the examples below has been seen in a live production build system, and includes a suggested improvement.

In Figure 1, we see a risky system, where the build system picks up loose files and binaries from unknown sources. Each of them is outside the control of the build engineering team and could easily change without notice. Similarly, the build host is not clearly identified and would be hard to replace if there were a system failure.

Source Code

Source code is the most common component that configuration management is applied to. However, some teams have done development work in local source trees, and share code among developers by manually copying files. Not only does this provide no mechanism to track changes, but it also makes it much more likely that one developer's changes will be overwritten by another's and break the build.

All product source code should be stored in a centralized source code repository. At a bare minimum, the repository must track revisions. Ideally, the system would also be able to produce a change log of all modifications since a certain time, support multiple branches, and be accessible to users in multiple physical locations working on multiple operating systems. For example, if the system relies on NFS to access the source, such as when using a TeamWare workspace, users on non-UNIX systems will need special configurations to access the source code. (However, if NFS access is a central theme of a build system, then it can be managed as a primary requirement, and the risk of using it is reduced.) If the source control system has a multi-platform client/server architecture, users can work with the source on any operating system.
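
The idea of a change log of all modifications since a certain time can be sketched as follows. How the log entries are obtained depends entirely on the source control system in use, so this example assumes they are already available as an in-memory list; the `Change` class and `changes_since` function are hypothetical names, not part of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """One committed modification (hypothetical shape of a log entry)."""
    author: str
    path: str
    committed_at: datetime


def changes_since(log: list[Change],
                  last_good_build: datetime) -> dict[str, list[str]]:
    """Group the paths changed after the last good build by who changed them."""
    report = defaultdict(list)
    for change in sorted(log, key=lambda c: c.committed_at):
        if change.committed_at > last_good_build:
            report[change.author].append(change.path)
    return dict(report)
```

Grouping by author means a broken build yields a short list of people to contact, each with the files they touched, rather than a message to the whole team.
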
Tracking changes is particularly important, as the most common failures of well-designed build systems are due to source code changes. It is imperative that build engineers be able to see what source code changes have been put into the system since the last good build in order to focus their troubleshooting efforts on the appropriate developer. Each build should report the list of components changed since the previous build and identify the person who made each change. This allows the build engineering team to contact the developers who committed troublesome source code, and also allows developers to verify that their source code changes are complete and were picked up by a specific build.

Non-Source File Dependencies

A common threat to a reliable build system is the use of files in the build that are not considered part of the product source, yet are important to a successful overall build. Common examples of these are documents that are handed off manually, or license files that are taken from another user's home directory. Files that are stored in locations outside the control of build engineering are much more likely to be moved, deleted, or renamed and cause a build failure.

Any file that is consumed by the build in any way should be stored on a build engineering-owned machine. Ideally those files would be stored along with the product source code in a source code repository. Regardless of the exact location of the files, they should be under the control of build engineering. Changes to the files should be managed similarly to product source code, where all changes are communicated, and it's possible to access older revisions of the files.

Tools (Binaries and Automation)

Care should be taken to avoid using build tools that are not under the control of build engineering. For example, relying on executables that are made accessible on a wide scale within the company means that if those executables are changed or moved, the build could fail.
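
A simple guard against this risk is to check where each build tool actually resolves before the build starts. The sketch below assumes POSIX-style paths and a single build engineering-controlled root directory; both function names and the directory layout in the usage note are hypothetical. The check is purely lexical, so it does not follow symbolic links or normalize `..` segments.

```python
from pathlib import PurePosixPath


def is_controlled(tool_path: str, controlled_root: str) -> bool:
    """True if tool_path lies inside controlled_root (lexical check only)."""
    tool = PurePosixPath(tool_path)
    root = PurePosixPath(controlled_root)
    return root == tool or root in tool.parents


def uncontrolled_tools(resolved: dict[str, str],
                       controlled_root: str) -> list[str]:
    """Names of tools whose resolved path is outside the controlled root."""
    return sorted(name for name, path in resolved.items()
                  if not is_controlled(path, controlled_root))
```

In practice the `resolved` mapping could be filled in with the standard library's `shutil.which`, for example `{"ant": shutil.which("ant")}`, after confirming that the tool was found at all.
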
Since finding a specific version of a tool can be difficult, once a good version is found, it should be stored under the control of the build engineering team and distributed to the development teams. Like other components that go into the build, the tools should be identified and tracked.

Ideally, a build engineering group can provide a distribution of tools that are produced from source code or stored binaries in a repeatable fashion. Reliance on system versions of tools like “ls” or “ps” can cause compatibility problems if a different shell environment or operating system type is used. The tools distribution should also contain “third party” binaries like the Java SDK and other build-related binaries like Ant, make, or compilers.

If the build engineering team provides a set of build tools, then both the build engineering team and the developers can use the same set of tools. Uniform behavior across different machine operating systems means that test builds are more reliable. It's also cheaper for build engineering to maintain complex build scripts, as they can rely on a consistent base of low-level binaries as the foundation of their tools.

Machine Configuration

It's fairly common for build systems to run on top of machines that are inadequately documented. The missing documentation covers both the current state of the system and the procedure for configuring the machine. Common machine elements that can affect the build include the operating system version, patch level, and hardware configuration. When a machine's state is unknown, there could be changes to the system that would cause a problem for a build. If there is a catastrophic data loss and the build machine needs to be replaced, the lack of a machine inventory and configuration documentation makes it much harder for the build engineer to set up a replacement machine in a timely manner.
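
Part of a machine's state can be captured automatically. The sketch below uses Python's standard `platform` module to take a small snapshot of a host and to compare a stored snapshot against the current one. The set of fields and the function names are assumptions for this example; patch level, hardware specifications, and installed software are not covered here and would need operating system-specific tooling.

```python
import platform


def machine_snapshot() -> dict[str, str]:
    """Capture a few basic facts about the current host."""
    return {
        "hostname": platform.node(),
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "architecture": platform.machine(),
    }


def drift(recorded: dict[str, str], current: dict[str, str]) -> dict[str, tuple]:
    """Map each differing field to its (recorded, current) pair of values."""
    keys = sorted(set(recorded) | set(current))
    return {key: (recorded.get(key), current.get(key))
            for key in keys
            if recorded.get(key) != current.get(key)}
```

Storing the snapshot alongside the other tracked components, and running `drift` before a build, turns an unknown machine state into a reported difference.
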
To reduce the risk of problems due to the machine environment, any host that is used by build engineering must be clearly described. Easily accessible documentation should contain a list of the hardware specifications, operating system version, patch level, and additional software that has been installed. It should also include a configuration procedure sufficient to rebuild the machine in the case of a catastrophic failure. The best method for testing the documentation is to have someone unfamiliar with the system follow the procedure and then clean up any confusing or incorrect sections.

In many cases these procedures can be made available to downstream customers such as development to be used in setting up their own systems. This makes it easier for developers to work in an environment that is consistent with the build engineering environment, which then reduces the likelihood of compatibility issues.

In Figure 2, we see a safer system. System components are checked into managed repositories for sources and binaries. The build system gets the components from those repositories and builds on a well-defined system. The repositories allow revision history and change notification.

Conclusion

Development of a mature configuration management process is a critical investment in a scalable and reliable system that meets build engineering goals. The main steps of configuration management are:

1. Identify the components that will be tracked.
2. Establish a baseline that records the components in detail.
3. Keep the records current through a change control process.
4. Audit the records periodically.
5. Communicate the state of the components to interested groups.
These steps can be applied incrementally to specific subsections of a system and still reap benefits. It is not necessary to overhaul an entire system to see improvements.

The application of proper configuration management techniques can improve the stability of a build system and provide direct benefits to the service level and efficiency of the build engineering team. Build engineering teams are strongly encouraged to adopt these processes to improve their efficiency and reliability.

I welcome comments and questions at peter.mcnab [at] sun.com.