The Challenges of Reproducibility and Control

Mark Kampe

1. Introduction

Based on your experience thus far, it is easy to assume that the major problems in software development are figuring out what changes need to be made, making them correctly, and testing that they work. It might, therefore, surprise you to learn how much time and effort is spent (in larger projects) on processes that are intended to restrict changes. The discipline of managing change in software projects is commonly called Software Configuration Management (SCM). The capabilities it provides are generally summed up under the two terms reproducibility and control.

The primary activities and tools of Software Configuration Management fall into the categories of:
  1. version control
  2. build automation
  3. build environment
  4. change control

The following sections introduce the motivations, capabilities, tools and techniques in each of these categories. Other readings will expand on these subjects, providing more in-depth discussions of issues and practices.

2. Version Control

Imagine that you are the developer responsible for a popular utility. The latest released version is 3.1, and you are already working on features that will go into version 3.2. You just got a call from support that a key customer has encountered a serious bug ... in a three-year-old version (2.5).

This (typical) story is just one illustration of why software projects need version control tools. There are a myriad of situations that require us to be able to go back to previous versions of the code, whether to
  1. investigate problems in old versions.
  2. study the detailed history of the code in order to understand how it got to be the way it is.
  3. back-out changes that no longer make sense.

2.1 Capabilities of a Version Control System

A version control system is a sort of database. It may be as simple as an archival system for named object versions or a tool for tracking changes to text files, or it may actually be implemented on top of a relational database. It may be capable of handling only textual data (e.g. SCCS, CVS) or it may be capable of handling arbitrary files (e.g. svn). There are, however, a few basic capabilities that almost all version control systems have in common:

Any Version can be Recovered at Any Time

Given a change transaction ID, it is possible to recover, at any time, the affected file, as it was at the time that change was made. This is the primary capability of any version control system.
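As a minimal sketch of this capability (not how any real tool is implemented), the following toy repository keeps a full copy of every committed version in memory and hands any of them back when given its change ID. The class and method names are hypothetical, and real systems typically store deltas on disk rather than full copies.

```python
import itertools


class ToyRepository:
    """In-memory store that keeps every committed version of every file."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._changes = {}  # change ID -> (path, full file contents)

    def commit(self, path, contents):
        """Record a new version of `path` and return its change ID."""
        change_id = next(self._next_id)
        self._changes[change_id] = (path, contents)
        return change_id

    def recover(self, change_id):
        """Return (path, contents) exactly as they were at that change."""
        return self._changes[change_id]


repo = ToyRepository()
first = repo.commit("util.c", "int main(void) { return 0; }\n")
second = repo.commit("util.c", "int main(void) { return 1; }\n")

# The older version is still available, even after later changes.
old_path, old_contents = repo.recover(first)
```

The important property is that a later commit never overwrites an earlier one: each change ID remains a permanent handle on one exact version.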

All Changes are Tracked

Whenever any change is made to any object, a record is made of:

This
  1. gives us a handle on every change.
  2. helps us understand (after the fact) why each change was made.
  3. tells us who to talk to if we need more information about the change.
  4. enables us to prepare reports of all of the changes that have been made or incorporated into a product.
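The benefits listed above suggest what a change record might contain. The sketch below is an assumption rather than the format of any particular tool: the field names (`change_id`, `author`, `when`, `comment`, `path`) and the report layout are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    change_id: int   # a handle on the change
    author: str      # who to ask for more information
    when: datetime   # when the change was made
    comment: str     # why the change was made
    path: str        # which object was changed


def change_report(records, path):
    """Return one summary line per recorded change to `path`, oldest first."""
    selected = sorted((r for r in records if r.path == path), key=lambda r: r.when)
    return [f"{r.change_id} {r.when:%Y-%m-%d} {r.author}: {r.comment}" for r in selected]


log = [
    ChangeRecord(2, "bob", datetime(2024, 3, 9, tzinfo=timezone.utc), "fix overflow", "util.c"),
    ChangeRecord(1, "alice", datetime(2024, 3, 1, tzinfo=timezone.utc), "initial import", "util.c"),
    ChangeRecord(3, "alice", datetime(2024, 3, 10, tzinfo=timezone.utc), "update manual", "README"),
]
report = change_report(log, "util.c")
```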

Extracted Objects can be Versioned

If we have the ability to reconstruct any file as it was at any point in time, how can we tell what version of a file we are looking at now? Just as some copiers and fax machines have the ability to automatically label every output page with information about when it was printed, most version control systems have a means of automatically putting identification information into every extracted file.

Many of these mechanisms involve some sort of macro expansion capability, where:

In this way, the creator of a file can ensure that every extracted instance of the file will be appropriately labeled. This capability is so important that I have known several executives who will not read any proposal or specification that does not contain such version labels:
  1. because these labels mean the document is under version control.
  2. because these labels make it possible to confirm (when talking to someone else) that we are indeed talking about the same version.
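A rough illustration of keyword expansion, loosely modeled on the `$Revision$` style of keyword used by RCS and CVS: the file's author writes a bare keyword, and the extraction step replaces it with a labeled value. The function name, the set of keywords handled, and the sample values are assumptions made for this sketch, not the behavior of any specific tool.

```python
import re

# Matches either the bare keyword "$Revision$" or an already expanded
# "$Revision: 1.3 $", so re-stamping a file replaces the old value.
KEYWORD = re.compile(r"\$(Revision|Author|Date)(?::[^$\n]*)?\$")


def expand_keywords(text, values):
    """Replace each known keyword with "$Name: value $"."""
    def substitute(match):
        name = match.group(1)
        return f"${name}: {values[name]} $"
    return KEYWORD.sub(substitute, text)


source = "/* $Revision$ by $Author$ */\n"
stamped = expand_keywords(
    source, {"Revision": "1.4", "Author": "kampe", "Date": "2024-03-01"}
)
restamped = expand_keywords(
    stamped, {"Revision": "1.5", "Author": "kampe", "Date": "2024-03-02"}
)
```

Because the pattern also recognizes an already expanded keyword, checking a stamped file back in and extracting it again updates the label instead of nesting it.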

It has, however, been argued that "version number stamps" are an over-simplified notion, increasingly difficult to apply to real products:

Thus, it has become increasingly common to associate version numbers with builds rather than with components, and to create (for each build) a detailed manifest of all of the individual component versions that went in to it.
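A build manifest of this kind might look something like the sketch below. The manifest layout, the build identifier, and the component names are all hypothetical; the content hash is included here as one plausible way to confirm later that a component really is the version the manifest claims.

```python
import hashlib
import json


def build_manifest(build_id, components):
    """components maps a component name to (version, content bytes)."""
    entries = {
        name: {"version": version, "sha256": hashlib.sha256(data).hexdigest()}
        for name, (version, data) in sorted(components.items())
    }
    return {"build": build_id, "components": entries}


manifest = build_manifest(
    "3.2-build-17",
    {
        "parser": ("1.9", b"parser object code"),
        "ui": ("4.2", b"ui object code"),
    },
)
# A stable, human-readable form that can be archived alongside the build.
manifest_text = json.dumps(manifest, indent=2, sort_keys=True)
```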

Versioned Objects Exist in Multiple Branches

It is tempting to think that a file experiences a pure linear sequence of updates, each new change an improvement on the last. As the example at the beginning of this section illustrated, that is an overly simplistic model. There are a few reasons for files to exist in parallel branches:

Branches are merely parallel threads of development, and can be created for any purpose. There are, however, two common branching strategies:

These models may not, however, be quite as different as they seem. In the promotion model, each subsequent release obviously benefits from work done in the previous releases. Omitted from the (over-simplified) mainline diagram is the ongoing feedback of enhancements from each release and project back into the main branch.

They may support distinct, hierarchically related, work-spaces

With a very large product, there may be dozens or even hundreds of different sets of developers working on parts of the same system. If everyone were working in a single work space (where everyone saw everyone else's changes) there might be many situations where a mistake made by one group could block all of the other groups. For this reason it is desirable to be able to create temporary copies of a version control database, where I am free to make all of the changes that I want (and have them tracked), but other people will not see those changes until I am done.

This is typically done by creating a child or clone of the original workspace.

Systems of parent and child workspaces can exist in very complex hierarchies. There might be an official release version, which has a child for the current release candidate, which has children for each development organization, each of which has children for each distinct group of developers.
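The parent and child relationship can be sketched as a chain of overlays: each workspace holds only its own unpromoted changes, reads fall through to the parent, and promoting a change makes it visible one level up. This is an illustrative model under those assumptions, with hypothetical class and method names, and not a description of how any particular system stores workspaces.

```python
class Workspace:
    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}  # path -> contents changed here but not yet promoted

    def write(self, path, contents):
        self.local[path] = contents

    def read(self, path):
        """Return the nearest version: local first, then up the hierarchy."""
        workspace = self
        while workspace is not None:
            if path in workspace.local:
                return workspace.local[path]
            workspace = workspace.parent
        raise KeyError(path)

    def promote(self):
        """Push all local changes up to the parent, making them visible there."""
        if self.parent is None:
            raise ValueError("the root workspace has no parent")
        self.parent.local.update(self.local)
        self.local.clear()


release = Workspace()
release.write("util.c", "v1")
team = Workspace(parent=release)
developer = Workspace(parent=team)

developer.write("util.c", "v2")
seen_by_team_before = team.read("util.c")   # the team still sees the old version
developer.promote()
seen_by_team_after = team.read("util.c")    # now the team sees the change
seen_by_release = release.read("util.c")    # the release is still untouched
```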

2.2 Use of a Version Control System

A version control system is much like a database, and may actually be built on top of one. Its contents are organized into repositories (projects, directories, etc). The basic usage cycle is pretty simple:

  1. It may be necessary to register yourself with a particular repository (which may verify your ability to read or write these files).
  2. When documents are created (or first imported into the version control system) a command is run to add each file to the repository.

    If your version control mechanism uses ID macros, it is a good idea to ensure that each file contains the appropriate version ID/macro/strings to ensure that every copy extracted from the repository will contain appropriate version identification information.

  3. When you want to look at (or work on) files, there is usually a simple command to extract current (or any specified versions) of all (or any subset) of the files in the repository. These become your working copies.
  4. You can view, edit, and delete these using any tools you choose.
  5. There are usually commands to audit your collection of files against the repository, and tell you (a) what files you have changed and (b) what other changes have been made to the repository in the interim.
  6. There are commands to help you identify and reconcile potentially conflicting updates.
  7. There is a command that checks your changes in to the repository.

    It may ask you for a list of files to be checked in, or it may detect them automatically.

    It will probably ask you to provide a summary of these changes (why they have been made and what they accomplish).

  8. There are commands to view the change history for selected files.
  9. There is also usually some way of backing out a change that proves to be undesirable.
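The audit in step 5 can be pictured as a three-way comparison between the versions you originally extracted, your working copies, and the current state of the repository. The sketch below represents each snapshot as a plain dictionary of path to contents; the function name and the status strings are hypothetical, and real tools usually compare hashes or timestamps rather than whole files.

```python
def audit(checked_out, working_copy, repository_head):
    """Classify each file by comparing three snapshots (path -> contents).

    checked_out:     the versions originally extracted
    working_copy:    the files as they are now
    repository_head: the current versions in the repository
    """
    status = {}
    all_paths = set(checked_out) | set(working_copy) | set(repository_head)
    for path in sorted(all_paths):
        base = checked_out.get(path)
        local_changed = working_copy.get(path) != base
        remote_changed = repository_head.get(path) != base
        if local_changed and remote_changed:
            status[path] = "possible conflict"
        elif local_changed:
            status[path] = "changed locally"
        elif remote_changed:
            status[path] = "changed in repository"
        else:
            status[path] = "unchanged"
    return status


checked_out = {"a.c": "1", "b.c": "1", "c.c": "1", "d.c": "1"}
working_copy = {"a.c": "1", "b.c": "2", "c.c": "1", "d.c": "2"}
repository_head = {"a.c": "1", "b.c": "1", "c.c": "3", "d.c": "4"}
status = audit(checked_out, working_copy, repository_head)
```

Files flagged as a possible conflict are the ones that step 6 must reconcile before the check-in in step 7 can proceed.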

Version control systems are normally used to keep track of created documents (like requirements, specifications, plans, software, test cases, user manuals, etc).

Newer version control systems are capable of handling non-textual data, and may also be used to archive copies of generated objects (program binaries, databases, trace logs, test plan results, etc).

2.3 Distributed Version Control

The preceding discussion has been biased towards centralized version control, where there is an official master repository and a definitive lineage for any branch. This is the most common model for most commercial and personal software projects. It is not the only model.

In a centralized version control system, workspaces exist in a relatively stable tree-like hierarchy. Changes are submitted up from child to parent (with the parent's permission), and propagated back down from parent to child (when the children are ready). In a distributed version control regime some workspaces may exist in tree-like hierarchies, but (in principle) changes can propagate from any workspace to any workspace (with cycles allowed).

Consider an open source product under active development by many people to support a wide range of needs ... Linux for example. Numerous people are working on their own versions, and regularly fix bugs and add features. Not all valuable changes are (or should be) accepted into Linus' main branch, and it may take many months for a particular change to find its way into the main branch. People who need an update are not willing to wait until that change (perhaps) appears in the main branch. Rather, these changes (patches) are continuously and freely passed around the Linux community (much like restaurant recommendations).

In a centralized version control system one might:

If I wanted these changes, I would check out that (1.3.3) version.

In a distributed version control system one might start with the same version, make the same changes, and commit them back with the same comments. The difference is how I would get these changes into my own source tree.

The trick is that middle step. If my code was identical to the creator's version 1.3.2, applying the same changes would be trivial. If my code has greatly diverged from the creator's version, that process might be quite complex ... but that is the price I pay for:
  1. wanting to have my own variant version of the product.
  2. wanting to take advantage of other peoples' ongoing work.
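To make the trivial and complex cases concrete, here is a deliberately simplified three-way merge. It assumes every change edits a line in place, so that the creator's starting version (`base`), my version (`mine`), and the creator's changed version (`theirs`) all have the same number of lines. Real merge tools also handle inserted and deleted lines; the function name and the policy of keeping my line on a conflict are choices made only for this sketch.

```python
def merge_lines(base, mine, theirs):
    """Three-way merge of equal-length line lists.

    Returns (merged, conflicts), where conflicts lists the indices of
    lines that both sides changed in different ways.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for index, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t or t == b:
            merged.append(m)        # same result, or only I changed this line
        elif m == b:
            merged.append(t)        # only they changed this line
        else:
            merged.append(m)        # both changed it differently: keep mine, flag it
            conflicts.append(index)
    return merged, conflicts


base = ["a", "b", "c"]
theirs = ["a", "B", "c"]

# My code is identical to the creator's starting point: the merge is trivial.
trivial = merge_lines(base, base, theirs)

# My code has diverged: one line merges cleanly, one needs a human decision.
diverged = merge_lines(base, ["A", "b!", "c"], theirs)
```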

In centralized version control, the sacred operation is commit, and most of the complexity surrounds the creation and maintenance of branches. In distributed version control, commits are a dime a dozen. The interesting operations are merge or rebase (consider this change to be relative to a different starting point). To support these operations, distributed version control systems tend to have sophisticated tools for assisting developers with the process of merging patches into divergent branches:

Note, however, that the problem of code merging is not at all unique to distributed version control:

Although distributed version control arose to support different development paradigms, many people feel that distributed version control tools are simply more general/powerful. One can use (more powerful) distributed version control tools, but still choose to manage change propagation in a centrally managed model with hierarchically structured workspaces.

3. Build Automation

As programmers, we naturally think of programs as things that human beings create in text editors ... but this is only the first step in their creation. Our source files will be run through several passes of compilers, library builders, linkage editors, database builders, packagers, and other such tools before our software is ready to run. It is important that this build process be automated. There are several reasons for this:

There is also a reproducibility goal associated with build automation:

Build automation might involve anything from shell or perl scripts, to macro-processors, to make files, to auto-makefiles, to integrated development environments. Their capabilities might be any combination of:

The more complex the software to be built, the more valuable the extra capabilities become. A good build automation mechanism is one that is:
  1. completely automated
  2. easy to use
  3. robust in the face of errors
  4. highly predictable and dependable in its results
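One capability that tools in the make family provide is dependency-driven rebuilding: a target is rebuilt only if it is missing, is older than one of its prerequisites, or depends on something that was itself just rebuilt. The sketch below imitates that decision using a dictionary of made-up modification times instead of the file system. The rule format and function names are assumptions for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def out_of_date(target, sources, mtimes):
    """A target needs rebuilding if it is missing or older than any source."""
    if target not in mtimes:
        return True
    return any(mtimes[s] > mtimes[target] for s in sources)


def build_plan(rules, mtimes):
    """rules maps each target to its list of prerequisites.

    Returns the targets to rebuild, ordered so prerequisites come first.
    A target is also rebuilt when any of its prerequisites is rebuilt.
    """
    rebuilt = []
    for node in TopologicalSorter(rules).static_order():
        if node not in rules:
            continue  # a plain source file, nothing to build
        deps = rules[node]
        if any(d in rebuilt for d in deps) or out_of_date(node, deps, mtimes):
            rebuilt.append(node)
    return rebuilt


rules = {"app": ["main.o", "util.o"], "main.o": ["main.c"], "util.o": ["util.c"]}
# util.c was edited (time 30) after util.o was last built (time 20).
mtimes = {"main.c": 10, "util.c": 30, "main.o": 20, "util.o": 20, "app": 25}
plan = build_plan(rules, mtimes)
```

Here only `util.o` and then `app` are rebuilt; `main.o` is left alone because it is newer than `main.c`.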

Two of the most common build automation tools are make and Apache Ant. For large projects, even building a makefile can be a daunting task. A good example of a tool to assist developers with this problem is Automake.

4. Build Environment

Consider the reproducibility goal we stated for build procedures:

This may not be strictly achievable ... because the bits created by the compilation process are not uniquely determined by the source code and the compilation options:

A version control system may be able to ensure that we can reconstruct the original versions of our source modules. If we want to be able to reliably reconstruct the same binaries, we may also need to be able to reassemble the same versions of all of the build tools that were originally used to build the software. This is not merely an issue for legacy system support. Imagine what would happen if different developers and the release engineering group were all using different versions of the compilers and libraries. If there was a bug in a library module, two different developers could compile the same source and get different behavior.

To avoid such problems, many software projects settle on a standard build environment: a set of basic tools that will always be used to build a particular release of the product. What if there is a bug in one of the tools in the standard build environment? It can be fixed, but then a new version of the standard build environment has to be defined, and everybody must upgrade to the new version.

Many organizations are (quite rightly) paranoid about the possibility of different people in the development organization using different versions of the build environment. It is common for organizations that are concerned about build correctness and reproducibility to adopt some very conservative practices:

So how does one get a copy of the standard build environment?

In situations where developers might reasonably be expected to build many different versions of the same software, it is common to install each version of the build environment into a different directory, and to use environment variables to tell the standard build tools where they can find the correct tools.
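A minimal sketch of that convention follows. The variable name `BUILD_ENV_ROOT`, the `/opt/buildenv/...` directory layout, and the `tool_path` helper are all hypothetical; the point is only that the build scripts ask the environment where the tools live instead of hard-coding a location.

```python
import os
import posixpath


def tool_path(tool, environ=os.environ, default_root="/opt/buildenv/current"):
    """Locate `tool` inside the build environment selected by BUILD_ENV_ROOT.

    BUILD_ENV_ROOT and the /opt/buildenv layout are hypothetical names.
    """
    root = environ.get("BUILD_ENV_ROOT", default_root)
    return posixpath.join(root, "bin", tool)


# Building the old 2.5 release: point the variable at that release's tools.
old_compiler = tool_path("cc", {"BUILD_ENV_ROOT": "/opt/buildenv/2.5"})

# No variable set: fall back to the default environment.
default_compiler = tool_path("cc", {})
```

Switching between releases then means changing one variable, with no edits to the build scripts themselves.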

5. Change Control

There are many possible reasons to limit people's ability to make changes:

5.1 Capabilities of Change Control Systems

It is possible to implement change control disciplines with purely administrative procedures (written rules about who is allowed to do what when). Many version control systems, however, include mechanisms to ensure the enforcement of change control policies:
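As a hedged illustration of what automated enforcement could look like (not taken from any particular tool), the sketch below checks a proposed check-in against a per-branch policy before accepting it. The policy structure, the branch names, and the `may_commit` function are hypothetical.

```python
def may_commit(user, branch, policy):
    """Return (allowed, reason) for a proposed check-in.

    policy maps a branch name to {"frozen": bool, "committers": set of names}.
    """
    rule = policy.get(branch)
    if rule is None:
        return False, f"no policy defined for branch {branch!r}"
    if rule["frozen"]:
        return False, f"branch {branch!r} is frozen"
    if user not in rule["committers"]:
        return False, f"{user} may not commit to {branch!r}"
    return True, "ok"


policy = {
    "main": {"frozen": False, "committers": {"alice", "bob"}},
    "release-3.1": {"frozen": True, "committers": {"alice"}},
}
allowed = may_commit("bob", "main", policy)
frozen = may_commit("alice", "release-3.1", policy)
outsider = may_commit("carol", "main", policy)
```

Encoding the rules this way turns "who is allowed to do what when" from a written procedure into a check that runs on every check-in.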

6. Conclusions

Software Configuration Management procedures make it possible for us to know precisely what our product is (all of the components that went into it, and the exact means by which they were processed to yield the final product). They also make it possible for us to exactly reproduce any version of the product that has ever existed. These capabilities are often critical for major software products. But what of less formal development situations (e.g. a little utility I write for myself)? The requirement for total reproducibility would seem to be gross overkill.
This is true, but ...

Larger projects may impose more formal constraints on the development process, and more completely specify the ways in which various tasks should be performed, but the basic techniques of Software Configuration Management are applicable to even the simplest of software projects. Most modern SCM tools are flexible enough to support the full spectrum of software development projects. An understanding of these issues, tools and techniques will prepare you to create a Software Configuration Management regime to meet the needs of each project you work on.