With the spread of revision control systems (CVS, PRCS, Subversion, etc.) and the benefits of using the autotools for project management and portability, there has been some debate about whether automatically generated files belong in project repositories.

Personally, I've had this argument several times, with different project leaders (most notably with the BBDB maintainer at the time I upgraded BBDB to autoconf 2.5*). In this article, I'll try to explain why (certain) generated files should go in the repository.

If you're wondering whether this or that generated file should go in your repository, there is a simple rule of thumb:

"what goes in the distribution goes in the repository"


Consequently, this applies to files generated by the autotools, such as the configure script, but not only to them. In the discussion below I will mostly speak of configure in particular, but keep in mind that my points apply to many other files.
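The rule of thumb can be checked mechanically. Here is a minimal sketch in Python, assuming you already have the two file lists at hand (the function name and the sample file names are only illustrations, not part of any tool):

```python
def missing_from_repository(distribution_files, repository_files):
    """Return the distribution files that the repository does not track."""
    return sorted(set(distribution_files) - set(repository_files))


# Hypothetical file lists: what the tarball ships vs. what is checked in.
dist = ["configure", "configure.ac", "Makefile.in", "src/main.c"]
repo = ["configure.ac", "src/main.c"]

# prints ['Makefile.in', 'configure']
print(missing_from_repository(dist, repo))
```

Anything this reports is a file that tarball users get for free but repository users have to regenerate themselves.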

Some people think that no generated file should go in repositories. They are wrong. If you're not convinced (by my simple rule of thumb), read on; I hope to change your mind. I'll give some personal arguments below, and I'll also show how irrelevant the arguments that have been thrown in my face are.

Personal arguments


Different kinds of "generated files"


Makefiles are generated at build time, and depend on the user's environment. They're very likely to differ from build to build, so it is obvious that they should be excluded from the repository. configure, on the contrary, is the same file for everybody, so it makes a lot of sense to have it in the repository. Put another way, why would you require every single person using the repository to rebuild the same file locally?

Building configure requires autoconf


And this might be a problem in several ways. There is an undeniable fact these days: more and more people give up tarballs and use repositories directly, either to stay on the bleeding edge, or just to simplify the process of getting the latest patchlevel for their stable version of the software. This is easy to see: just look at SourceForge and other similar hosting services to see how common and popular public repository access has become. And this is normal: if you upgrade frequently, you save a lot of bandwidth by downloading only the parts that have changed.

So the current trend is to use repositories not only as development facilities, but as a means of distribution as well. In other words, not only developers but also end-users make use of repositories. Given this fact, it becomes obvious that you need to put in the repository exactly what goes in the distribution. The whole point of configure is precisely that, once generated, it becomes independent of autoconf. So you don't want to require a working installation of autoconf from each of your users.

Building configure requires the right version of autoconf


The situation is even worse than that: having a working autoconf installation is not sufficient in general. You should preferably use the same version as the one configure.[in|ac] was written for. For example, upgrading from 2.14 to the 2.5* series is a major, backward-incompatible step. Of course, autoconf attempts to maintain compatibility on the surface. But remember that it's just a macro package. This means that, almost by nature (the same applies to TeX or LaTeX development, by the way), it is nearly impossible to use just the surface of autoconf without needing to hack something on top of the internals, thus bypassing the standard API. And the internals evolve constantly. That's why, in practice, you often need to track the version number very closely.

Not distributing configure in the repository thus means that you actually force your users to upgrade their installation of autoconf before upgrading the software they're interested in. And what if they also follow another piece of software that relies on a different version? How many versions will they be obliged to maintain at the same time?
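To give an idea of the check every user would have to get right, here is a rough sketch in Python. It reads the minimum version declared with autoconf's AC_PREREQ macro and compares it with an installed version. The helper names and the sample input are hypothetical, and note that AC_PREREQ only states a lower bound: it says nothing about a newer autoconf breaking macros that reach into the internals.

```python
import re


def required_autoconf(configure_ac_text):
    """Return the version named in AC_PREREQ, or None if there is none."""
    match = re.search(r"AC_PREREQ\(\s*\[?\s*([0-9][0-9.a-z]*)\s*\]?\s*\)",
                      configure_ac_text)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn a version string such as '2.59' into (2, 59) for comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def is_new_enough(installed, required):
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


# Hypothetical configure.ac fragment and installed version.
sample = "AC_PREREQ([2.59])\nAC_INIT([demo], [1.0])\n"
needed = required_autoconf(sample)
print(needed, is_new_enough("2.13", needed))
```

With configure checked in, none of this concerns the user at all.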


What I've heard in reply


"configure is huge. It's silly to keep files that big in a repository when they can be regenerated."


OK. The biggest configure script I know of is probably XEmacs's. It's about 500 KB. Most other scripts will probably be smaller. Now, given that we're speaking of a machine large enough to run a revision control service such as CVS or Subversion, what exactly is the point of trying to save 500 KB?

"configure is huge. It's silly to download files that big at each checkout when they can be regenerated."


First of all, configure scripts do not change very often compared with, say, the project's sources. And the point of downloading from a revision control system is precisely to download only the things that have changed. So you don't actually download configure at each checkout.

Second, remember that running the complete autotools suite takes time. Most of the time, autoconf is not enough: you also need to run automake, aclocal, libtool or whatever through a bootstrap script. I wouldn't be surprised if, on most boxes, downloading the already generated files actually took less time than regenerating them.


"I don't want to see those generated files appear in diff outputs"


Abso-fraggin-lutely. And this is not very difficult to achieve. First, it only happens when you run a plain diff without naming the files involved. Beyond that, there are plenty of wrappers around diff commands that filter out unwanted files (plus other bells and whistles), and many revision control systems even have options for ignoring files. So this is not a problem in practice.
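Such a wrapper is not hard to write. Here is a minimal sketch in Python, assuming the diff output marks each file with an "Index: path" header line, as CVS and Subversion do. The function name and the default list of generated files are my own choices for the example:

```python
def drop_generated(diff_text, generated=("configure", "Makefile.in", "aclocal.m4")):
    """Remove the sections of a diff that concern generated files."""
    kept = []
    skipping = False
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("Index: "):
            # A new file section starts: decide whether to keep it.
            path = line[len("Index: "):].strip()
            skipping = path.split("/")[-1] in generated
        if not skipping:
            kept.append(line)
    return "".join(kept)


# Hypothetical diff touching both the source file and the generated one.
sample_diff = (
    "Index: configure.ac\n"
    "--- configure.ac\t(revision 1)\n"
    "+++ configure.ac\t(working copy)\n"
    "@@ -1 +1 @@\n"
    "-AC_PREREQ(2.13)\n"
    "+AC_PREREQ(2.59)\n"
    "Index: configure\n"
    "--- configure\t(revision 1)\n"
    "+++ configure\t(working copy)\n"
    "@@ -1 +1 @@\n"
    "-# old generated line\n"
    "+# new generated line\n"
)

# Only the configure.ac section is printed.
print(drop_generated(sample_diff), end="")
```

Piping the output of a plain diff through something like this leaves only the changes a human actually wrote.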

"Having configure in the repository generates a lot of conflicts."


That's the most frequent argument that I've heard, and it is as irrelevant as the others. If you're just an end-user, you won't ever have to touch configure. Hence, an update of your working copy will just download the delta and apply it. No conflict.

Still, if for some obscure reason you're getting conflicts, these are not conflicts you want to edit by hand anyway, since the file is generated! So, assuming the developers have done their job correctly (meaning: checking in the new configure script as well as configure.[in|ac]), actually "fixing" the conflict is as painful as typing rm configure && cvs update. Not much of a burden, is it?

Now, if you are a developer, you might actually get conflicts because of parallel development. But the point is that you don't care, because this file is generated. What you're interested in is conflicts on configure.[in|ac]. And if you do your job correctly, every change to the source file should be followed by a regeneration of configure and friends. So the previous paragraph applies here too.



Well, I think that's about it. I hope that you're convinced now. And remember: what goes in the distribution goes in the repository. It is as simple as that.