Saturday, December 15, 2012

Recursive Maven considered harmful

I have been strongly influenced by Peter Miller's article Recursive Make Considered Harmful. Peter showed that if you used the language of make carefully, you could achieve two very useful properties:
  • You never need to run make clean.
  • You can pick any make target, in your entire tree of code, and confidently tell make just to build that one target.

Most people don't use make that way, but they should. More troubling, they're making the same mistakes with newer build tools.

What most people do with Maven, to name one example, is to add a build file for each component of their software. To build the whole code base, you go through each component, build it, and put the resulting artifacts into what is called an artifact repository. Subsequent builds pull their inputs from the artifact repository. My understanding of Ant and Ivy, and of SBT and Ivy, is that those build-tool combinations are typically used in the same way.

This arrangement is just like recursive make, and it leads to just the same problems. Developers rebuild more than they need to, because they can't trust the tool to build just the right stuff, so they waste time waiting on builds they didn't really need to run. Worse, these defensive rebuilds get checked into the build scripts, so as to "help" other programmers, making build times bad for everyone. On top of it all, even for all this defensiveness, developers will sometimes fail to rebuild something they needed to, in which case they'll end up debugging with stale software and wondering what's going on.

On top of the other problems, these manually sequenced builds are impractical to parallelize. You can't run certain parts of the build until certain other parts are finished, but the tool doesn't know what the dependencies are. Thus the tool can't parallelize it for you, not on your local machine, not using a build farm. Using a standard Maven or Ivy build, the best, most expensive development machine will peg just one CPU while the others sit idle.
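To make the point concrete, here is a minimal Python sketch of what a tool can do once it does know the dependencies. The component names and the `parallel_waves` helper are hypothetical, made up for illustration; the only real API used is `graphlib.TopologicalSorter` from the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter

# Hypothetical components: each maps to the set of components it depends on.
DEPS = {
    "app": {"core", "ui"},
    "ui": {"core"},
    "core": set(),
    "docs": set(),
}

def parallel_waves(graph):
    """Group components into waves; everything in one wave can build at once."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every component returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

for number, wave in enumerate(parallel_waves(DEPS), start=1):
    print(f"wave {number}: {wave}")
```

With this graph, `core` and `docs` land in the same wave and could run on separate CPUs. A manually sequenced build gives the tool no graph to make that decision from.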

Fixing the problem

Build tools should use a build cache, emphasis on the cache, for propagating results from one component to another. A cache is an abstraction that allows computing a function more quickly based on partial results computed in the past. The function, in this case, is for turning source code into a binary.

A cache does nothing except speed things up. You could remove a cache entirely and the surrounding system would work the same, just more slowly. A cache has no side effects, either. No matter what you've done with a cache in the past, a given query will return the same value in the future.

The Maven experience is very different from what I describe! Maven repositories are used like caches, but without having the properties of caches. When you ask for something from a Maven repository, it very much matters what you have done in the past. It returns the most recent thing you put into it. It can even fail, if you ask for something before you put it in.

What you want is a build cache. Whereas a Maven repository is keyed by component name and version number (and maybe a few more things), a build cache is keyed by a hash code over the input files and a command line. If you rebuild the same source code but with a slightly different command, you'll get a different hash code even though the component name and version are the same.
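As a rough sketch of such a key, assuming SHA-256 as the hash and an in-memory dictionary of file contents in place of a real source tree, the computation might look like this in Python. The `cache_key` function and the file names are hypothetical and are not taken from any real build tool.

```python
import hashlib

def cache_key(input_files, command):
    """Hash the input files and the command line into one cache key.

    input_files: dict mapping a path (str) to the file's contents (bytes).
    command: the command line as a list of strings.
    """
    h = hashlib.sha256()
    # Record how many files there are, then hash them in a stable order
    # so that dictionary ordering cannot change the key.
    h.update(len(input_files).to_bytes(8, "big"))
    for path in sorted(input_files):
        data = input_files[path]
        h.update(path.encode("utf-8") + b"\0")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    # The command line is part of the key, argument by argument.
    for arg in command:
        h.update(arg.encode("utf-8") + b"\0")
    return h.hexdigest()

sources = {"Hello.java": b"class Hello {}"}
plain = cache_key(sources, ["javac", "Hello.java"])
debug = cache_key(sources, ["javac", "-g", "Hello.java"])
print(plain == debug)  # same sources, different command: the keys differ
```

Note that neither a component name nor a version number appears in the key. Two builds share a cache entry only if their inputs and their command are byte-for-byte the same.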

To make use of such a cache, the build tool needs to be able to deal sensibly with cache misses. To do that, it needs a way to see through the cache and run recursive build commands for things that aren't already present in the cache. There are a variety of ways to implement such a build tool. A simple approach, as a motivating example, is to insist that the organization put all source code into one large repository. This approach easily scales to a few dozen developers. For larger groups, you likely want some form of two-layer scheme, where a local check-out of part of the code is virtually overlaid over a remote repository.
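One way to picture "seeing through the cache" is the toy Python sketch below. Everything in it is an assumption made for illustration: the `RULES` table, the `target_key` and `build` functions, and the fake compile step that just produces a string. The idea it tries to show is that a target's key covers its command, its sources, and the keys of its dependencies, so a miss on one target triggers recursive builds only for the dependencies that also miss.

```python
import hashlib

# Hypothetical rule table: target -> (command, source files, dependencies).
RULES = {
    "core": ("javac core", ["Core.java"], []),
    "app": ("javac app", ["App.java"], ["core"]),
}

def target_key(target, sources):
    """Key over the command, the sources, and the keys of all dependencies."""
    command, srcs, deps = RULES[target]
    h = hashlib.sha256(command.encode("utf-8"))
    for src in srcs:
        h.update(b"\0" + src.encode("utf-8") + b"\0" + sources[src])
    for dep in deps:
        h.update(b"\0" + target_key(dep, sources).encode("ascii"))
    return h.hexdigest()

def build(target, sources, cache, ran):
    """Return the output for target, running only the actions that miss."""
    key = target_key(target, sources)
    if key in cache:
        return cache[key]
    _command, _srcs, deps = RULES[target]
    for dep in deps:
        build(dep, sources, cache, ran)  # recurse through the cache
    ran.append(target)
    # Stand-in for a real compile step: a deterministic function of the key.
    cache[key] = f"output of {target} @ {key[:12]}".encode("utf-8")
    return cache[key]

sources = {"Core.java": b"class Core {}", "App.java": b"class App {}"}
cache = {}

first = []
build("app", sources, cache, first)    # cold cache: both actions run
second = []
build("app", sources, cache, second)   # nothing changed: nothing runs
sources["App.java"] = b"class App { int x; }"
third = []
build("app", sources, cache, third)    # only app misses; core is a hit
print(first, second, third)
```

Because the key of `app` includes the key of `core`, editing `Core.java` would change both keys and rebuild both targets, with no one having to remember to ask for it.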

Hall of fame

While the most popular build tools do not have a proper build cache, there are a couple of lesser known ones that do. One such is the internal Google Build System. Google uses a couple of tricks to get the approach working well for themselves. First, they use Perforce, which allows having all code in one repository without all developers having to check the whole thing out. Second, they use a FUSE filesystem that allows quickly computing hash codes over large numbers of input files.

Another build tool that gets this right is the Nix build system. Nix is a fledgling build tool built as a Ph.D. project at Utrecht University. It's available open source, so you can play with it right now. My impression is that it has a good core but that it is not very widely used, and thus that you might well run into sharp corners.

How we got here

Worth pondering is how decades of build tool development have left us all using such impoverished tools. OS kernels have gotten better. C compilers have gotten better. Editors have gotten better. IDEs have gotten worlds better. Build tools? Build tools are still poor.

When I raise that question, a common response I get is that build tools are simply a fundamentally miserable problem. I disagree. I've worked with build tools that don't have these problems, and they simply haven't caught on.

My best guess is that, in large part, developers don't know what they are missing. There's no equivalent, for build tools, of using Linux in college and then spreading the word once you graduate. Since developers don't even know what a build tool can be like, they instead work on adding features. Thus you see build tool authors advertising that they support Java and Scala and JUnit and Jenkins and on and on and on with a very long feature list.

Who really cares about features in a build tool, though? What I want in a build tool is what I wrote to begin with, what was described over a decade ago by Peter Miller: never run a clean build, and reliably build any sub-target you like. These are properties, not features, and you don't get properties by accumulating more code.

9 comments:

Thomas Broyer said...

I think you over-simplify Maven's build process, particularly in a multi-module build, which is what "recursive Maven" is about, and using it as a strawman. The situation is not as bad as you're implying it is.

In C, as in Peter Miller's examples, "modules" have several outputs: *.o and *.h files, and those are used by different modules depending on the use (compiling vs. linking).

Mapping it to Java, *.c and *.h are akin to *.java, and *.o are like *.class or *.jar, but Java is easier than C to build: there's no need for *.h files, there's no compilation vs. linking, so a module's only output is a *.jar (I think it's no different in Google Build System, dependencies are at the jar level).

A multi-module Maven build looks very much like the modularized top-level Makefile proposed by Peter Miller. One big difference is that you can run Maven from within a module, but that doesn't mean you should. In most cases, you'll run the aggregator project without ever installing the produced artifacts in your local repository, they'll be resolved from the so-called reactor build.

Maven is not perfect, you're absolutely right about this: its compiler plugin is as dumb as Ant's javac task or a generic .java.class rule in a Makefile. It doesn't take the classpath (dependencies on other modules, or external dependencies) or the other classes from the module into account, so it tends to rebuild too little. Most IDEs, however, rebuild all classes as soon as a dependency is changed, so it's not really an issue in practice, as most developers use an IDE. Apparently, Gradle's JavaCompile task doesn't suffer from this issue for module dependencies, and SBT seems to try to rebuild classes that haven't changed themselves but that depend on classes whose public API has changed.

Maven's compiler plugin also won't delete *.class files from a previous run if you deleted or renamed a source file, and the same is true for most other plugins (resources come to mind); this is the reason you have to do clean builds regularly.

Finally, there's also the issue with the tests: Maven's surefire plugin doesn't do any dependency tracking either and runs all tests no matter what. This is probably what slows down most builds.

AFAICT, all the other flaws of “recursive make” are taken care of by tools like Maven.

And the good news is that the above issues are not in the build tools themselves but in their plugins, so they can be fixed or enhanced incrementally, without the need for an entirely new build system. The plugins would need to track class dependencies and use them to avoid building too little when compiling and running too much when testing.

You could probably cite one flaw in Maven though: while a Maven project is seen as a whole with its sources and tests, when running a reactor build Maven will build each module up to the given goal; a module is thus built only partially even though it's used as a dependency later in the build. Depending on the given goal, this can be an issue (e.g. if you run “mvn test”, you're skipping the “prepare-package” and “package” phases, so you're not packaging your modules, and they might thus be different than with a “mvn package” when used as dependencies later in the build). This is something to be aware of: always use “mvn package” or “mvn verify” (if you also want to run integration tests).

Lex Spoon said...

I can easily believe that Maven is flexible enough to be used in a way that it can see the whole dependency tree, if you are careful about how you use it. In particular, using modules looks like a good way to go.

I wrote the article the way I did to keep it simple; it's a subtle point and a long article already. I think my depiction of Maven's default pattern of usage is accurate.

Thomas Broyer said...

It's not that Maven is flexible enough, it's that it leaves the responsibility of behaving incrementally to its plugins.

I spawned an interesting discussion over at the Maven Users list: http://users.markmail.org/thread/lc7z2q34bl44xxc5 (there's an Atom feed if you want to follow the thread).

I learned that the Maven team has started tackling the issue, and was reminded that there are things that would still need to be tracked at the Maven Core level (such as which profiles are in use, that possibly influences which modules are built and which plugins are run).

Thanks for that article though, food for thought. It led me to look more closely at Gradle, and I'm starting to really like the way it handles task dependencies (compared to the linear build lifecycle of Maven) and how the "build cache" you're mentioning seems to be anchored in its foundations (though not perfect/complete AFAICT). Still, Maven has the largest install base and probably better IDE integration, so I'll stick to Maven for now, while keeping an eye on Gradle and possibly starting to use it on toy projects.

Thomas Broyer said...

So, I carefully re-read the Google EngTools blog about Google's build system and quickly re-read Mike Bland's blog too, and it seems like Google's blaze is more like Gradle (the build tool decides whether to trigger actions) than Maven (actions themselves decide whether they have something to do). In the comments, a Google engineer says that dependencies are not inferred from sources (as Peter Miller proposed in his article) but much more like Maven et al. dependencies. This implies the actions are actually dumber than those from Maven or even Ant: given those source files and dependencies, that I determined to have changed, compile them all and give me the output files. This is simpler, but fixes the issues of Ant, Maven et al. (yes, Ant's javac task is no better).

As already said, Gradle seems to handle this the same (once the compile action is triggered though, I don't know how it behaves; it seems like earlier versions of Gradle relied on Ant's task). SCons is what looks closest to Blaze, but it will fail to detect those files generated by annotation processors as it uses heuristics.

That's for actions. Then there are package dependencies, and that's what you were comparing to Maven/Ivy and their repositories. Your issue with them probably stems from the fact that you're accustomed to Google's unique repo and build everything from trunk rule.

There are two kinds of dependencies in Maven and all similar tools: other modules from the project (multi-module build), and external dependencies. As long as you don't use SNAPSHOT external dependencies and always run full builds or always combine -pl with -am, you shouldn't have any problem (apart from the other issues at the action level). Maven and most (if not all) similar tools are designed around releases: your external dependencies should always be released versions (not SNAPSHOTs) or you'll effectively be tracking a moving target and risking a build failure without changing anything in your code. You'd better pull the dependency's source into your tree (git submodule, svn externals, etc.) and build them as modules in your multi-module project.

Stephen Connolly summarized it like that: if your build depends on installing artifacts on your repository, then you're doing it wrong. Don't blame Maven.

That doesn't mean you're wrong in your article about Maven though.

David Fox said...

Loved this post. Particularly the section "How did we get here?"

I agree with most of what you wrote. And I also think that there may be arguments in addition to the ones you've given that support your position. Some of these may be very fundamental.

One comment I have is that a good build system is like a good API. A good API doesn't just allow you to use it correctly; it is designed to make it hard to use it incorrectly. A great build system is the same way. Build systems such as Google's internal build system or Buck have that quality of being hard to use incorrectly -- the requirement that developers express the dependency graph in an acyclic way helps give them that quality.

Another comment I have is that there is an interesting relationship between the strategy used to manage source code and the build system. As an example, imagine if you were delivering a large Web application and you attempted to compose that application from JAR files whose source all resided in separate source repositories. Could you still make sense of your build system in that environment?

CG said...

Thanks for posting this... even if it took me over a year to find it.

I loved Peter Miller's Cook, and even wrote a whole system around it, but it never really found traction, sadly.

The whole Java ecosystem just had to reinvent that wheel, and only now with Buck and perhaps Gradle has it regained the ground occupied by Cook over 20 years ago.

So sad.

Tim Boudreau said...

I found this article via a comment posted on my recent controversial blog about Maven.

Build caches are nice, but the Java language is hardwired to make them fail: Constants are inlined. Perhaps that's tolerable since such failures will be rare - but they'll also be hell when they do fail (I run into similar things sometimes using ccache on Gentoo Linux).

Thomas Broyer said...

Why would it fail? Build caches work by computing a key of the inputs and storing the output under that key. Dependencies are part of the inputs. So if you change a constant, you're changing the input of one task, therefore changing the inputs of all dependents.

Constant inlining makes it hard to do incremental compilation where you try to recompile as few classes as possible, because tracking dependencies between classes is hard (some projects have done it for years though, such as JMake, used in Pants; others have tried again more recently: the Takari Lifecycle plugin for Maven, or the experimental support in Gradle). Ant's javac task and the maven-compiler-plugin of a couple of years back are/were broken, as they only recompile files that have changed; this breaks not only with constants, but also with any structural change that would break subclasses or clients of the class API. This is why recent versions of the maven-compiler-plugin, for instance, have reverted to passing the whole list of sources to javac as soon as one of them (or a dependency) has changed.