Wednesday, May 4, 2016

Supporting C in a build system

In a previous post, I showed that the standard build rules for C code are unreliable. Let me describe two ways to do better.

In the interest of brevity, I will describe the build rules using a toy build engine called Blcache (short for "build cache"). I initially tried to write this post using standard tools like Make, Ninja, Bazel, and Nix, but they all have one limitation or another that distracts from the main point of this post.

Example problem

Here is an example problem this post will work against. There is a test.c file, and it includes two separate header files:

// File test.c
#include <stdio.h>
#include "syslimits.h"
#include "messages.h"

int main() {
  printf("%s: %d\n", MSG_LIMIT_THREADS, LIMIT_THREADS);
}

// File localhdrs/messages.h
#define MSG_LIMIT_THREADS "Limit on threads:"

// File hdrs/syslimits.h
#define LIMIT_THREADS 10

After compiling this code, a third header file is added as follows:

// File localhdrs/syslimits.h
#define LIMIT_THREADS 500

The challenge is for the addition of this header file to trigger a rebuild, while still making the build as incremental as possible.

Method 1: Compile entire components

The simplest way to get correct C compiles is to compile entire components at a time, rather than to set up build rules to compile individual C files. I've posted before on this strategy for Java, and it applies equally well to C.

This approach seems to be unusual, but my current feeling is that it should work well in practice. When you are actively working on a given component, you should almost always use an IDE or other specialized tool to compile that component, so the build system does not need to concern itself with fine-grained incremental rebuilds of individual C files. Rebuilds of whole components--executables, libraries, and shared libraries--should be plenty, and even when using a system like Gyp, there are advantages to having the low-level build graph be simple enough to read through and debug by hand.

Using such an approach, you would set up a single build rule that goes all the way from C and H files to the output. Here it is in JSON syntax, using Blcache:

{
    "environ": [
        "PATH"
    ],
    "rules": [
        {
            "commands": [
                "gcc -Ilocalhdrs -Ihdrs  -o test test.c"
            ], 
            "inputs": [
                "test.c", 
                "localhdrs",
                "hdrs"
            ], 
            "name": "c/test", 
            "outputs": [
                "test"
            ]
        }
    ]
}

The "environ" part of this build file declares which environment variables are passed through to the underlying commands. In this case, only PATH is passed through.

There is just one build rule, and it's named c/test in this example. The inputs include the one C file (test.c) as well as two entire directories of header files (localhdrs and hdrs). The build command for this rule is very simple: it invokes gcc on test.c, with both header directories on the include path, and has it build the final executable directly.

With the build rules set up like this, any change to any of the declared inputs will cause a rebuild to happen. For example, here is what happens in an initial build of the tool:

$ blcache c/test
Started building c/test.
Output from building c/test:
gcc -Ilocalhdrs -Ihdrs  -o test test.c

$ ./target/c/test
Limit on threads: 10

After adding syslimits.h to the localhdrs directory, the entire component gets rebuilt, because the localhdrs input is considered to have changed:

$ blcache c/test
Started building c/test.
Output from building c/test:
gcc -Ilocalhdrs -Ihdrs  -o test test.c

$ ./target/c/test
Limit on threads: 500

As a weakness of this approach, though, any change to any C file or any header file will trigger a rebuild of the entire component.

Method 2: Preprocess as a separate build step

Reasonable people disagree about how fine-grained the build rules for C should be, so let me describe the fine-grained version as well. This version can rebuild more incrementally in certain scenarios, but that benefit comes at the expense of a substantially more complicated build graph. Then again, most developers will never look at the build graph directly, so there is some argument for increasing the complexity here to improve overall productivity.

The key idea with the finer-grained dependencies is to include a separate build step for preprocessing. Here's a build file to show how it can be done:

{
    "environ": [
        "PATH"
    ],
    "rules": [
        {
            "commands": [
                "gcc -c -o test.o target/c/preproc/test.i"
            ], 
            "inputs": [
                "c/preproc/test:test.i"
            ], 
            "name": "c/object/test",
            "outputs": [
                "test.o"
            ]
        },

        {
            "commands": [
                "gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c"
            ],
            "inputs": [
                "test.c", 
                "localhdrs",
                "hdrs"
            ], 
            "name": "c/preproc/test",
            "outputs": [
                "test.i"
            ]
        },

        {
            "commands": [
                "gcc -o test target/c/object/test.o"
            ],
            "inputs": [
                "c/object/test:test.o"
            ],
            "name": "c/test",
            "outputs": [
                "test"
            ]
        }
    ]
}

This file has three rules in it that chain together to produce the final output file. When you build the test executable for the first time, all three rules will be executed:

$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c

Started building c/object/test.
Output from building c/object/test:
gcc -c -o test.o target/c/preproc/test.i

Started building c/test.
Output from building c/test:
gcc -o test target/c/object/test.o

$ ./target/c/test
Limit on threads: 10

First, the test.c file is preprocessed, yielding test.i. Second, the test.i file is compiled to test.o. Finally, test.o is linked into the final test executable.

Adding the new syslimits.h file behaves as expected, causing the full chain of recompiles.

$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c

Started building c/object/test.
Output from building c/object/test:
gcc -c -o test.o target/c/preproc/test.i

Started building c/test.
Output from building c/test:
gcc -o test target/c/object/test.o

$ target/c/test
Limit on threads: 500

Modifying an irrelevant header file, on the other hand, only causes the preprocessing step to run. Since preprocessing yields the same result as before, rebuilding stops at that point.

$ touch localhdrs/irrelevant.h
$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c

Using cached results for c/object/test.
Using cached results for c/test.

It's not shown in this example, but since each C file is compiled individually, a change to a C file will only trigger a rebuild of that one file. Thus, the technique here is fine-grained in two different ways. First, changes to one C file only trigger a recompile of that one file. Second, changes to the H files only trigger preprocessing of all C files, and then only compilation of those C files that turn out to be affected by the H files that were changed.

By the way, there's a trick here that generalizes to a variety of cached computations. If you want to add a cache for a complicated operation like a C compile, then don't try to have the operation itself be directly incremental. It's too error prone. Instead, add a fast pre-processing step that accumulates all of the relevant inputs, and introduce the caching after that pre-processing step. In the case of this example, the fast pre-processing step is, well, the actual C preprocessor.
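
To illustrate, here is a minimal Scala sketch of that caching pattern. The names (StepCache, cached) are invented for this post and are not part of Blcache or any real tool; the expensive step re-runs only when its fully expanded input--the analogue of the preprocessed .i file--has changed:

// Minimal sketch of "cache after a cheap input-expanding step"; all names are illustrative.
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object StepCache {
  private val cacheDir = Paths.get("target/cache")

  private def sha256(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

  // Re-run `compute` only if `expandedInput` (e.g. the preprocessed .i text) has changed.
  // Usage might look like: StepCache.cached("c/object/test", preprocessedBytes)(() => runCompiler())
  def cached(stepName: String, expandedInput: Array[Byte])(compute: () => Array[Byte]): Array[Byte] = {
    val keyFile = cacheDir.resolve(stepName + ".key")
    val outFile = cacheDir.resolve(stepName + ".out")
    val key = sha256(expandedInput)
    if (Files.exists(keyFile) && new String(Files.readAllBytes(keyFile)) == key) {
      Files.readAllBytes(outFile)           // cache hit: reuse the previous output
    } else {
      Files.createDirectories(cacheDir)
      val out = compute()                   // cache miss: run the expensive operation
      Files.write(outFile, out)
      Files.write(keyFile, key.getBytes("UTF-8"))
      out
    }
  }
}

The important design point is that the cache key is computed after the cheap input-expanding step, so a change to an irrelevant header still produces the same key and therefore a cache hit.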

Coda

Before realizing the problem with C compilation, I used C as an example of why you might want to break the rules a little bit about the most strict and simplistic version of a build cache. However, now it seems to me that you find the best set of build rules if you strictly adhere to a build-cache discipline. I'm sorry, build cache. I should never have doubted you.

Standard build rules for C are unreliable

The standard way of integrating C into a build system is to use automatic dependencies generated by the compiler. GCC and Clang can emit a list of the header files they read if you run them with the -M option. Visual Studio can do it as well, using the /showIncludes option. What I will call the "standard approach" in this post is to use the dependencies the user explicitly declared, and then to augment them with automatic dependencies generated by options like -M or /showIncludes.

Until a few years ago, I just took this approach as received wisdom and didn't think further about it. It's a neat trick, and it works correctly in the most obvious scenarios. Unfortunately, I have learned that the technique is not completely reliable. Let me share the problem, because I figure that other people will be interested as well, especially anyone else who ends up responsible for setting up a build system.

The root problem with the standard approach is that sometimes a C compile depends on the absence of a file. Such a dependency cannot be represented, and indeed goes unnoticed, in the standard approach to automatic dependencies. The standard approach involves an "automatic dependency list", which is a file listing out the automatically determined dependencies for a given C file. By its nature, a list of files only includes files that exist. If a given file goes from not existing to existing, the standard approach will overlook the change and skip a rebuild that should have happened.

To look at it another way, the job of an incremental build system is to skip a compile if running it again would produce the same results. Take a moment to consider what a compiler does as it runs. It does a number of in-memory operations such as AST walks, and it does a number of IO operations, including reading files into memory. Among those IO operations are things like "list a directory" and "check if a file exists". If you want to prove that a compiler is going to do the same thing on a second run as it did on the first, then you want to prove that those IO operations are going to do the same thing on a second run. That means all of the IO operations, not just the ones that read a file into memory.

Such a situation may seem exotic. At least one prominent source has declared that the standard approach is "correct" up to changes in the build command, which suggests to me that the author did not consider this scenario at all. It's not just a theoretical problem, though. Let me show a concrete example of how it can arise in practice.

Suppose you are compiling the following collection of files, including a single C file and two H files:

// File test.c
#include <stdio.h>
#include "syslimits.h"
#include "messages.h"

int main() {
  printf("%s: %d\n", MSG_LIMIT_THREADS, LIMIT_THREADS);
}

// File localhdrs/messages.h
#define MSG_LIMIT_THREADS "Limit on threads:"

// File hdrs/syslimits.h
#define LIMIT_THREADS 10

Using automatic dependencies, you set up a Makefile that looks like this:

CFLAGS=-Ilocalhdrs -Ihdrs

test.o test.d : test.c
	gcc $(CFLAGS) -M test.c > test.d
	gcc $(CFLAGS) -c test.c

test: test.o
	gcc -o test test.o

-include test.d

You compile it and everything looks good:

$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
$ ./test
Limit on threads: 10

Moreover, if you change any of the input files, including either of the H files, then invoking make test will trigger a rebuild as desired:

$ touch localhdrs/messages.h
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
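
For concreteness, the generated test.d is just a Makefile fragment listing the files the compiler actually read. With the setup above, it will contain something along these lines (the long chain of system headers pulled in by stdio.h is omitted here, and exact paths vary by platform):

test.o: test.c /usr/include/stdio.h \
  hdrs/syslimits.h localhdrs/messages.h

By its nature, such a list can only name files that exist. Nothing in it can express that the compile also depended on localhdrs/syslimits.h not existing.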

What doesn't work so well is creating a new header file that shadows an existing one. Suppose you now add a second syslimits.h, this time in localhdrs, shadowing the one in hdrs:

// File localhdrs/syslimits.h
#define LIMIT_THREADS 500

Make should now recompile the executable, but it doesn't:

$ make test
make: 'test' is up to date.
$ ./test
Limit on threads: 10

If you force a recompile, you can see that the behavior changed, so Make really should have recompiled it:

$ rm test.o
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
$ ./test
Limit on threads: 500

It may seem picky to discuss a scenario as tricky as this one, with header files shadowing other header files. Imagine a developer in the above scenario, though. They are doing something tricky, yes, but it's a tricky thing that is fully supported by the C language. If this test executable is part of a larger build, the developer can be in for a really difficult debugging exercise trying to understand why the built executable does not behave the way the source code says it should. I dare say, it is precisely in such tricky situations that people rely the most on their tools behaving intuitively.

I will describe how to set up better build rules for this scenario in a followup post.

Tuesday, April 12, 2016

Two little things I wish Java would add

When geeking out about language design, it's tempting to focus on the things that require learning something new to even understand how it works. SAM types require understanding target typing, and type members require understanding path-dependent types. Fun stuff.

Aside from these things that are fun to talk about over beers, I really wish Java would pick up a few things from Scala that are just plain more convenient.

Multi-line string literals

A great way to structure a unit test is to feed in a chunk of text, run some processing that you want to verify, convert the actual output to text, and then compare it against another chunk of text that's included in the test case. Compared to a dense string of assertEquals calls, this testing pattern tends to be much easier to read and understand at a glance. When such a test fails, you can read a text diff at a glance and possibly see multiple different kinds of failure that happened with the test, rather than stare into the thicket of assertEquals calls and try to deduce what is being tested by the particular one that failed.

The biggest weakness of this style is very mundane: it's hard to encode a multi-line chunk of text in Java. You have to choose between putting the text in an external file, or suffering through strings that have a lot of "\n" escapes in them. Both choices have problems, although the latter option could be mitigated with a little bit of IDE support.

In Scala, Python, and many other languages, you can write a multi-line string by opening it with triple quotes (""") rather than a single double-quote character ("). It's a trivial feature that adds a lot to the day-to-day convenience of using the language.
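
For example, here is the pattern in Scala, reusing the "Limit on threads" text from the build-system posts purely as sample data:

// Expected output written inline, with no "\n" escapes.
val expected = """Limit on threads: 10
Limit on threads: 500"""

val actual = Seq(10, 500).map(n => "Limit on threads: " + n).mkString("\n")
assert(actual == expected)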

As one trick to be aware of, it's important to help people out with indentation when using triple quotes. In Scala, I lobbied for the stripMargin approach to dealing with indentation, where you put a pipe at the start of each continuation line, and anything up to the pipe is considered leading indentation and removed. In retrospect, I wish I had pushed for that to simply be the default behavior. If you need a line to start with a literal pipe, you can always write it twice. Making people write stripMargin on almost every multi-line string is a form of boilerplate.
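
For reference, this is what stripMargin usage looks like in Scala today; everything up to and including the leading pipe on each line is stripped, so the literal can be indented along with the surrounding code:

val expected =
  """|Limit on threads: 10
     |Limit on threads: 500
     |""".stripMargin
// expected == "Limit on threads: 10\nLimit on threads: 500\n"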

Case classes

There are philosophers who disagree, but I find them a little too philosophical for my taste. Sometimes you really want to write a class that has no hidden internal state. Sometimes it would be a breach of the API to retain any internal state, or to implement the public API as anything other than plain old final fields. Some motivating examples are: tiny types, data structure nodes such as links in a linked list, and data-transfer objects.
In such a case, it takes a tremendous amount of code in Java to implement all the odds and ends you would really like for such a class. You would really like all of the following, and they are all completely mechanical (compare the Scala sketch after this list):
  • Constructors that copy their parameters to a series of final fields.
  • A toString() implementation.
  • Comparison operations: equals(), hashCode(), and compareTo(). Ideally also helpers such as isLessThan().
  • Copy constructors that make a new version by replacing just one of the fields with a new value.
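
For comparison, a Scala case class provides nearly all of the items above for free; the type and fields here are invented for illustration, and ordering (compareTo and helpers like isLessThan) is the one item that still has to be written separately:

case class Limit(name: String, threads: Int)   // constructor copying parameters to final fields

val a = Limit("db", 10)
val b = a.copy(threads = 500)          // new value with just one field replaced
val c = Limit("db", 10)
assert(a == c)                         // structural equals() and hashCode()
assert(a.toString == "Limit(db,10)")   // generated toString()
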
The equals() method is particularly painful in Java because there is a lot of inconsistent advice going around about how to write one. I've been dragged into multi-day debates on equals() methods where people cite things I published in the past to use against me; I'm pretty sure I meant what I said then and mean what I say now. Above all, though, I'd rather just have a reasonable equals() method and not spend time talking about it.

Thursday, November 5, 2015

The mystics are coming out

Discussion on the Scala collections revamp is starting to get mystical. It really bugs me: good language design makes a huge difference, but it's hard to see unless you actually deal with thousands of developers and millions of lines of code. Casual observers of language design discussions don't see it themselves, so they don't realize what these problems look like to the other people in the discussion. They end up thinking that everyone is just goofing around, throwing out random ideas because they are possible.

I started to follow up on the issue itself, but I fear making the problem worse. So I'll post a couple of rebuttals here. I don't really think the people actually involved in the redesign will be distracted by the mystics, anyway. They are like gold-sellers in MMOs or beggars in San Francisco; unless you are specifically trying to engage with them, you just learn to tune them out. Well, okay, they are like really nerdy gold sellers who like to talk about higher math. Okay I better just drop this attempt at an analogy.

First, Python's indexing operator has a lot of practical problems. Here are a few concrete examples: https://plus.google.com/+MattMight/posts/BVSmNadKni4 . Thinking backward from those puzzlers, I think the root of the problem is that the [a:b] slicing operator means something different depending on whether a and b are positive, and whether a is smaller or larger than b. This is foundational syntax used everywhere, and if you want code to be readable, people need to know whether it's doing a forward or reverse slice without having to do mental data flow analysis on the parameters. Java avoids this trap, and Scala should, too. The sublist operation should only work on non-negative arguments, and only when the first argument is smaller than the second.

The other thing I will say is that exceptions, null, and -1 are also all fine, when used by a tasteful library designer. At Semmle, we tried to get people to use Option more, and we found that it harmed our reputation as a quality lint tool. I can only speak publicly about open-source code, but to give an example, Apache Spark has over a thousand places where it uses null but, with only local changes, could use Option instead. That's too many. It means that the basic premise of the programming advice has some sort of problem.

As one stab at it--though it's really a huge topic--you have to think about what you want callers to do to defend against a missing value. If Scala's basic get/apply methods start returning an Option, then people will just litter their code with calls to .get, so the net result is that the code is more bloated but otherwise behaves exactly the same. Even in pure math, people will write notation like f'(x) for the derivative of f, but you know, the derivative isn't always defined. So should smart mathematicians instead write get(f')(x)? Or (f') match { ... }?
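
Here is the kind of thing I mean, using an ordinary Scala Map with made-up data:

val limits = Map("threads" -> 10)

val n1 = limits("threads")           // today: throws NoSuchElementException if the key is absent
val n2 = limits.get("threads").get   // Option-returning API, defended against with a reflexive .get
// Both lines throw NoSuchElementException when the key is missing; the second is just noisier.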

That's my try, but you don't even have to understand this if you are willing to ape mature libraries in areas that they are working okay. It's not a practical problem in Java that the various .get() methods return null or throw exceptions; even if you say it's not perfect, it's certainly not bad. It is, however, a very big problem that Java collections are overly mutable. For example, see this Stack Overflow question: https://stackoverflow.com/questions/2842169/why-are-public-static-final-array-a-security-hole. Scala will be more attractive to more developers if it focuses on these widely recognized pain points. Which is why the mystics drive me crazy--if they get their way they will find that their playground is gradually becoming a ghost town, but only after it's too late.

Monday, October 12, 2015

Initial input on the Scala collections redesign

It looks like Martin Odersky is considering an overhaul of Scala's collections:
A redesign of the standard library is on the roadmap for one of the next Scala versions (could be as early as 2.13). So I think now is the time to start thinking about what we want to change, in particular in what concerns collections!

Some more details are available on the Dotty issue tracker. Paul Phillips has weighed in on this issue with a very interesting slide deck.

I think it will be a very positive change if done carefully. It's painful to change such a fundamental library, but the current state is a real pain point for practical Scala code. If no other contender can get full support, I would even go so far as to suggest going back to Matthias Zenger's original version and working forward from there more conservatively this time. It was really a gem of useful functionality, in particular the "persistent" collections, which were controversial back then. Since that initial version, there have been several waves of changes that added complexity while only supporting minority use cases: lazily evaluated views, concurrent collections, and the CanBuildFrom magic. These are all valuable, but they should all be done in a separate library that you opt into when you need it.

I cannot put together a complete straw man implementation given my professional duties right now. I can make a few high-level suggestions, though, based on my experience in developer tools at Google and LogicBlox. Mind you, these are just initial reactions. I may well be overlooking important implementation details. Also, I haven't had the pleasure of using Scala for the new "big data" growth area, so I may be overlooking some concerns compared to the kinds of code I am used to.

At a high level, I'd want to see the following in an updated collection library:

  • A focus on just the core collection types: maps, sets, linked lists, and some form of array-like sequences (maybe Array itself). Other collection types do come up in practice, but in a minority of contexts, and for those cases it should be fine to use a concrete collection type that does not necessarily inherit from any standard-library collection traits.
  • Simple old-fashioned classes and methods, without any implicit parameters or type members. The status quo is bad in numerous ways, including problems with all of the following: IDE auto-completion, single-step debugging, and compiler error messages.

Non-goals include the following:

  • User extension of the library with new collection types. This is a small enough use case that it doesn't merit giving up on any other design goals of the library, e.g. performance.
  • Supporting the majority of the collection types that are available now. I wrote a 1-2 page description of each of them for the latest Programming in Scala, and I found it quite the slog. The library has simply grown very large over time. It would be better if the less common collection types were more obviously in a side library. They're all useful sometimes, but the grand total is just overwhelming.
  • Infinite collection types, in particular the lazy Stream type.

Some other things are desirable but might be unreasonable:

  • High performance in common use cases involving a for() loop. This is a complicated problem, but it is a persistent problem area that drives people away from writing idiomatic Scala code. If nothing else, perhaps for() should translate to iterators instead of to higher-order functions (see the sketch after this list). Another approach would be to have the compiler recognize and optimize standard types like List, Set, and Map; this will work better if those types are sealed rather than user extensible.
  • Rename the three versions of Set and Map to have different names. For example, maybe AbstractSet, HashSet, and Set. There are numerous reasons for this; for example, automatic IDE import doesn't work as well if you have multiple classes with the same name. I put this in the "maybe" list, though, because it would require a lot of code to be rewritten. Even taking care to phase the change in over a period of months, it's a lot of code that would need to be turned over.
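
To make the for() point in the list above concrete, here is roughly what is at stake; this is a sketch of the existing desugaring plus a hand-written alternative, not a proposal for exact compiler output:

val xs = List(1, 2, 3)

for (x <- xs) println(x)     // today this desugars to a higher-order call:
xs.foreach(x => println(x))

// ...whereas iterating by hand avoids going through a function value on each element:
val it = xs.iterator
while (it.hasNext) println(it.next())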

On a minor note, the Seq type seems pretty useless. I know that Java has an interface for this, but maybe this is a place to take a stand. For well-written code, it doesn't help anything to know that a type is a Seq rather than just an Iterable. So delete the useless intermediate trait.

Wednesday, August 5, 2015

Ken Clark on shame mobs

Ken Clark has posted the top seven things he likes about shame mobs. Here's a taste:

5) Internet shame mobs weigh the evidence carefully and deliberately before attacking, so they only happen to people who deserve them. [...] 3) Internet shame mobs always make sure that the punishment is proportional to the crime.

There's a larger phenomenon here where problematic information spreads faster than the correction to it. If it spreads fast enough, then it can even pass a tipping point where it becomes very hard to get a hearing to object to any part of it. Everyone has already heard the idea from, well, everyone else, so they quickly dismiss anyone who objects without even really considering it.

The key to stopping such memetic chain reactions is to apply some filtering before propagating information that you read. It's still early days for the Internet, though, and we are all still learning to inoculate ourselves against being the wrong kind of carrier.

There is some reason to have hope. Chain emails used to flourish, but they are now mostly stamped out. In their heyday, 15-20 years ago, it was fairly common to open your mail program and see numerous messages that said something very exciting, and furthermore that the message should be passed on to everyone you know as fast as possible. Nowadays, the people I interact with just delete any such email. If an email explicitly says that it should be forwarded to everyone you know, then it triggers something like an antibody response: the email starts looking very bogus, and it gets deleted quickly and possibly even flagged for follow-up by the email provider.

Intriguingly, people would likely not have developed that response had they not gone through the misery of dealing with chain emails earlier on. There are clear parallels to viruses and to building up antibodies!

Shame mobs are a place where it still goes on, though. I'm not entirely sure why it happens. In part, people just want to defend an idea, and they are happy to use real people as an example no matter the consequences. In part, people just enjoy being part of a mob. I hope that shame mobs go the same way as the chain email. We shall see.

Wednesday, July 29, 2015

Finding and fixing complicated code

It's hard to beat lines of code as a complexity metric: the subjective complexity of a piece of code seems to correlate very well with its length. Over the years, many other metrics have been devised and explored, but they tend to be more complicated yet less effective than the simple old lines-of-code.

One exception to the rule is McCabe Cyclomatic Complexity, a metric almost exactly as old as I am. McCabe can often find methods that aren't that many lines long, but that a typical maintainer would nonetheless identify as complicated and difficult to maintain. Small wonder that GNU Complexity puts so much emphasis on it.

What McCabe does is count how many independent paths are needed to cover a method from beginning to end. The way McCabe described it, in his delightfully readable paper on the subject, is that it's the number of test cases you need to get statement coverage for a given method. It sounds like something that would be impractically hard to calculate--you are minimizing over arbitrary test cases!--but it's not complicated at all. He proved that, with a few extra assumptions, all you really have to do is count the number of branching options in a given body of code.
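
As a small illustration of the counting rule, here is a made-up Scala method with four branching options--one while loop and three ifs--so its McCabe complexity is five:

def classify(xs: List[Int]): String = {
  var negatives = 0
  var rest = xs
  while (rest.nonEmpty) {                            // branch 1
    if (rest.head < 0) negatives += 1                // branch 2
    rest = rest.tail
  }
  if (negatives == 0) "none negative"                // branch 3
  else if (negatives == xs.length) "all negative"    // branch 4
  else "mixed"
}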

McCabe complexity is very effective in my experience. A method with complexity 10 is usually a dense, gnarly tangle of code, even if it's only 20 lines long. Yet 20 lines of code is not enough for a method to be reliably complicated on length alone; if it's a simple linear sequence of non-branching code, it's usually reasonable to leave it alone. McCabe Complexity is therefore a real coup.

Even better, there's usually a straightforward way to act on code that has a poor McCabe complexity measure. If you see a method with a lot of branches in it, you can usually make the code more maintainable by factoring out some of the clusters of branches to their own smaller methods. The main method gets easier to understand, and if you do it tastefully, the smaller factored-out methods will also be easier to understand.

One place McCabe complexity falls down, though, is large switch statements. If you write network code that receives a message and dispatches it, it's perfectly fine style to write a switch statement for the dispatch with one case for each kind of message. This is especially true if each case simply calls a method that does the real work. Such a switch statement will easily have a dozen cases, one for each message type, but what alternative would be any more maintainable? Factoring out methods for subsets of the cases doesn't seem obviously better, and reducing the number of cases is often impossible.
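
For example, a dispatcher along these lines (the message types are invented for illustration) gets a high McCabe score even though every case is trivial:

sealed trait Message
case class Connect(address: String)  extends Message
case class Data(payload: Seq[Byte])  extends Message
case object KeepAlive                extends Message
// ...in real code, one case class per message type

def dispatch(msg: Message): Unit = msg match {
  case Connect(address) => println(s"connecting to $address")     // or: call a handler method
  case Data(payload)    => println(s"received ${payload.length} bytes")
  case KeepAlive        => ()                                      // nothing to do
}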

McCabe is even worse for switch statements that consider pairs of types that are each drawn from some enum. To see some typical examples, look at Cast.scala in the Apache Spark source code. All the big switch statements in there are enumerating pairs of types, and they strike me as a very reasonable way to manage a problem that is just fundamentally complicated. This organizational style reminds me of how people organize definitions on paper when they are studying and explaining language semantics. Such people will typically define their relations and functions using a dozen or so cases, each of which has a number of assumptions (x is a class type, y is a class type, and x inherits from y), and a conclusion you can draw if all of the assumptions hold (x is a subtype of y). If this style is reasonable for theory papers, where people are expending enormous time and energy to bring a complicated subject within the reach of the human mind, it seems only reasonable to write such definitions in code when possible.
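
Transplanting that shape into a tiny made-up example (these rules are invented for illustration and are not Spark's actual casting rules):

sealed trait Ty
case object IntTy    extends Ty
case object LongTy   extends Ty
case object StringTy extends Ty

// One case per set of assumptions, with the conclusion on the right,
// much like the case-by-case definitions in a semantics paper.
def canCast(from: Ty, to: Ty): Boolean = (from, to) match {
  case (a, b) if a == b => true    // a type always casts to itself
  case (IntTy, LongTy)  => true    // widening
  case (_, StringTy)    => true    // anything can be rendered as a string
  case _                => false
}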

Overall, McCabe complexity is excellent and has stood the test of time. I wonder, though, if "number of branching statements" might be even more useful for anyone trying to simplify their code. "Number of branching statements" is the same as McCabe for if statements and while loops, but for switch statements, it only adds 1. Such a measure seems like it would better focus on code that is not merely complicated, but also possible to simplify.