Friday, September 1, 2023

Task-based build systems

Today I ran into a great book chapter by Erik Kuefler and Lisa Carey about Google's build system philosophy. It's all very good, but one part that stood out to me is the distinction between a task-based build system and an artifact-based build system. Ant is a task-based build system and lets you define a task in terms of other tasks. Google's build system, partially open-sourced as the "Bazel" project, is based on artifacts. An important property of an artifact-based system, they explain, is that it supports remote caching. Specifically, they write:
For a remote caching system to work, the build system must guarantee that builds are completely reproducible. That is, for any build target, it must be possible to determine the set of inputs to that target such that the same set of inputs will produce exactly the same output on any machine. This is the only way to ensure that the results of downloading an artifact are the same as the results of building it oneself. Fortunately, Bazel provides this guarantee and so supports remote caching. Note that this requires that each artifact in the cache be keyed on both its target and a hash of its inputs—that way, different engineers could make different modifications to the same target at the same time, and the remote cache would store all of the resulting artifacts and serve them appropriately without conflict.
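
To make that concrete, here is a minimal sketch, in Python, of how such a cache key might be computed. This is my illustration, not Bazel's actual scheme, and the names are made up:

    import hashlib

    def cache_key(target_name, input_hashes):
        # Key on both the target and a hash of all its inputs, so two
        # different modifications to the same target get two different
        # cache entries rather than conflicting.
        h = hashlib.sha256()
        h.update(target_name.encode())
        for digest in sorted(input_hashes):
            h.update(digest.encode())
        return h.hexdigest()

    # Two engineers building the same target from different inputs
    # get distinct keys, so the remote cache stores both artifacts:
    key_a = cache_key("//server:app", ["hash-of-main.c", "hash-of-util.c"])
    key_b = cache_key("//server:app", ["hash-of-main.c", "hash-of-util-edited.c"])
    assert key_a != key_b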

Sunday, June 4, 2023

Why the M4 macro syntax goes so wrong

I've been hacking in M4 recently, and I ran across a great article about M4 by Michael Breen. I highly recommend it for anyone either using M4 (e.g. for Autotools or for Bison), or considering M4 for a new project (probably a bad idea). This observation caught my eye:
While m4's textual rescanning approach is conceptually elegant, it can be confusing in practice and demands careful attention to layers of nested quotes.
Christman Brown writes something similar, in his review of M4:
It is difficult to debug. I quickly found my default behavior when encountering something unexpected is to just increase the escape quotes around something. That often fixed it.

What's going on here? A macro language is supposed to let you write normal text and then sprinkle some macro expansions here and there. It's supposed to save you from the tedium of dealing with a full-fledged general-purpose programming language. And sometimes this strategy works out: with the C preprocessor, you write ordinary C code most of the time, and it works totally fine to occasionally call a macro. Why does this approach work in C but not in M4?

I think Michael Breen is onto something. Macros are a form of subroutine, and with a well-designed syntax for subroutine calls, it should be straightforward and thoughtless to invoke a subroutine and pass it some arguments. You want the arguments to feel just like text in any other part of your system. Think how it feels to write HTML and to put a div around part of your page: you don't have to do any special encoding to the stuff you put inside the div, and you can, any time you want, take any chunk of your code with balanced tags and put it inside another pair of tags. With M4, the basic facility of a subroutine call, which you use all over the place, is somehow tricky to use.

M4 is not wrong to have a quoting mechanism, but where it goes wrong is to require quoting on the majority of subroutine calls. Here's what that looks like in practice. M4 uses function-call syntax to invoke macros, so a call looks like foo(a, b, c). That's a great syntax to try, because function calls are a ubiquitous syntax that users will recognize, but it has a problem for a text macro language: the comma is already a common character in the a, b, and c arguments. Having observed that, M4 could and should have moved away from function-call notation and looked for something else. Instead, the designers stuck with function calls and augmented them with an additional kind of syntax, quotation. In practice, you usually quote all of your arguments when you pass them to an M4 macro, like this: foo([a], [b], [c]). Only usually, however, and you have to think about it every time. The quotation and the macro call are two different kinds of syntax, and the user has to control them individually.
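
To see mechanically why commas are a problem, here is a toy Python sketch of what a naive argument splitter does; this is an illustration of the design issue, not M4's actual implementation:

    def split_args_naive(arg_text):
        # Split macro arguments on every comma, the way a naive
        # function-call parser would.
        return [a.strip() for a in arg_text.split(",")]

    # The user meant to pass two arguments, but the second argument
    # contains a comma of its own:
    print(split_args_naive("greeting, Hello, world"))
    # => ['greeting', 'Hello', 'world'] -- three arguments, not two

Quoting exists precisely to protect that inner comma, but the user has to remember to apply it on every call.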

The reason it works out better for C's preprocessor is that the C language already has a way to quote any commas that occur in the arguments, and the preprocessor understands those quoting mechanisms. For example, with sqrt(foo(x, y)), the preprocessor understands that the comma inside the (x, y) part should not count as separating the parameters to sqrt. Programmers already write those parentheses without thinking about it, because the function-call notation for foo(x, y) requires them anyway. Unfortunately, C does not do the right thing for an example like swap(a[i,j], a[i,j+1]), because it does not treat square brackets the way it treats parentheses. It could and it should, however. None of this maps over to M4 very well, because usually the arguments to an M4 macro are not code, so the author isn't going to naturally escape any commas that occur. Function-call syntax just doesn't work well for the situation M4 is intended for.
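
Here is the same toy splitter, extended as a sketch of what the C preprocessor does: it skips commas nested inside parentheses, but deliberately ignores square brackets, which reproduces the swap(a[i,j], a[i,j+1]) problem.

    def split_args_cpp_style(arg_text):
        # Split on commas, but only at parenthesis depth zero, mirroring
        # how the C preprocessor delimits macro arguments.
        args, depth, current = [], 0, []
        for ch in arg_text:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            if ch == "," and depth == 0:
                args.append("".join(current).strip())
                current = []
            else:
                current.append(ch)
        args.append("".join(current).strip())
        return args

    print(split_args_cpp_style("foo(x, y), z"))
    # => ['foo(x, y)', 'z']            -- parentheses protect the comma
    print(split_args_cpp_style("a[i,j], a[i,j+1]"))
    # => ['a[i', 'j]', 'a[i', 'j+1]'] -- square brackets do not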

Fixing the local problem

If we wanted to write a next-generation M4, here are some observations to start from:

  • It is better if the quoting syntax is built into the subroutine call syntax. That way, users don't have to independently reason about both calls and quotes, and instead can just think about call-and-quote as a single thing that they do.
  • Look to markup languages for inspiration, for example XML or Latex. The general feel we are going for is that you mostly write text, and then occasionally you sprinkle in some macro calls. That's a markup language!

Based on these, a good thing to try for an M4-like system would be to use XML tags for macro invocation. XML is the culmination of a line of tag-based markup languages running through SGML and HTML, and it is generally the state of the art for that kind of language. Among other advantages, XML is a small, minimal language that you can learn quickly, and it has explicit syntax for self-closing tags, rather than making some tags self-closing and others not in a context-dependent way that depends on whatever schema is in effect for a given file.

Latex's macro syntax is also very interesting, and it has a big advantage in usually saying each tag name just once (\section{foo}) rather than twice (<section>foo</section>). However, my experience with Latex is that I am in constant doubt about how much lookahead text a macro invocation will really consume; the braces-based syntax is just a convention, and you never know for sure which macros actually honor it. That said, the general syntax looks like a promising idea to me if it were locked down a little more, rather than being based on the Tex macro language. A similar approach was used in Scribe, a markup language designed by Brian Reid in the 1970s.

What to use for now

As things stand, I don't think M4 really has a sweet spot. Old projects that want to have an ongoing roadmap should probably move away from M4. New projects should never use it to begin with. What are the options right now, without having to build a new textual macro language?

It's not a bad option to use a general-purpose language like Python or Java. If you follow the links from the PP generic preprocessor used in Pandoc, they tell you they are replacing their templating with more and more Lua, a general-purpose language. When you use a general-purpose language to generate text, you get the normal library routines your language already supports, with mature lists, maps, structs, and iteration routines on top of them. An example of this direction is the Jooq library for generating SQL code.
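
As a small illustration of that direction, here is what text generation looks like in plain Python, with ordinary lists and loops doing the work a macro language would otherwise do; the table and column names are made up:

    # Generate a SQL snippet with ordinary language features rather
    # than a textual macro processor.
    columns = ["id", "name", "created_at"]
    table = "users"

    select = "SELECT {}\nFROM {};".format(",\n       ".join(columns), table)
    print(select)
    # SELECT id,
    #        name,
    #        created_at
    # FROM users;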

Another strong approach is to use XML, possibly augmented by XSLT. An example would be the query help format of the GitHub Code Scanner, a format that I designed many years ago at a startup called Semmle. We had an existing syntax based on HTML with regex-based rewrites applied to the HTML file, and we had a problem that people were typo-ing the syntax without realizing it, resulting in help files that were sometimes unreadable and often missing a large portion of the text. I explored a few options for getting us a more rigorous format with tool support for catching common errors, and I landed on XML, which I feel worked out pretty well. In addition to the format itself being nice to work with, we got to tap into the existing XML ecosystem, for example using Eclipse's excellent XML editor.

I briefly explored JSON as well, which is another minimal syntax that is easy to learn, but I quickly realized why they call XML a markup language. Unlike JSON, XML lets you mostly write normal text, and then, as a secondary thing, add special syntax; hence, "marking up" your text. XML is also a very mature system in general, so for example we could configure Eclipse (which was a viable tool back then!) to auto-complete tags and flag errors within the editor if you used tags that aren't allowed. If I were to rewrite how Bison's skeletons work, I think something based on XML would be tempting to try; Bison already uses this approach for its debug output. I'm not sure, though; XSLT looks pretty voluminous in practice.

Some of the best options are embedded in an existing general-purpose language. JSX is embedded in JavaScript, and Scribe is embedded in Scheme. I'm not sure how practical these are if you aren't already working in those environments, but if you are, look for one that works with your current source language.

The larger lesson

An effective tool has to be evaluated in the context it will be used in. Both C and M4 use function-call notation for macro invocation, but in C it works well, while in M4 it becomes a nightmare. Achieving an effective tool therefore requires design thinking. You need to learn about the situation you are targeting, adjust your solution based on what you learn, and above all be ready to critique your solutions and iterate on improvements. The critique can take many forms, and a really important one is to watch what your users are actually doing, reflect on why it's happening, and ask how it could be different.

Surprises can happen anywhere, and you'll support your users more if you can act on those surprises and try something different.

Tuesday, May 9, 2023

A Sampling Trick

Here's a sampling trick that comes up all the time in industry, but that took me a while to work out. The problem goes like this:
Suppose a server is receiving a sequence of events, for any kind of event you might be interested in, and you want to save off a sample of the events for later analysis. How do you select which events to sample?

Let me show some solutions that I see people try, and then describe a way that seems to beat all those out.

Solutions that don't work well

Choose a fixed percentage. One approach is to choose a percentage, say 5%, and to log that percentage of incoming events. For each event that arrives, you check a random number generator against your percentage, and independently decide for each event whether to log a sample or not. This approach is pretty good except that it's often hard to choose a percentage ahead of time. If you get thousands of events per second, then 5% is going to be way too many. If you get one event per minute, then 5% is going to be way too few. In many cases, you are going to use the same sampling code to run against multiple streams of events, for example to sample access to multiple databases, or to sample requests to multiple RPCs on the same server. As another example, you may run the code in both staging and production environments, and you'll need a different percentage for each environment.
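
In code, the fixed-percentage approach is as simple as it sounds. Here is a sketch, with the rate and the event handling made up for illustration:

    import random

    SAMPLE_RATE = 0.05  # the hard part is choosing this number ahead of time

    def maybe_sample(event, samples):
        # Independently decide, per event, whether to keep it.
        if random.random() < SAMPLE_RATE:
            samples.append(event)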

Look at historical data rates. Another way is to use a percentage, but to dynamically adapt the percentage based on historical traffic rates. This works in theory, but in my experience, systems like this tend to be unstable. For example, the system will dial in a low percentage based on overnight traffic, or because the system paused due to garbage collection or a spate of disk I/O on the underlying machine. Then a burst of activity will happen, and the system will use its stale percentage for a while, thus sampling way too much data. Then the percentage adapts downward, but the burst of activity ends, and now the server is sampling at a very low percentage and throwing nearly everything away. In general, I can't say that this method can't work, but in a world where software always has bugs and surprises, it's much better to use algorithms that are clearly stable and predictable.

Use time-based sampling. Instead of sampling a percentage of traffic, the system can select one event for each window of time, for example one event every 10 seconds. This method adapts to different rates of incoming events, including bursty traffic where the rate isn't steady. Now there's a new problem, though: the events are not randomly selected any more! This method works fine if the server is receiving a homogeneous series of events that are all spaced evenly apart, but that may or may not be the case. A lot of the time, there is some external process, perhaps driven by an end user, where one external event creates a clump of events that arrive at your server at almost the same time. If you select one event every 10 seconds, you will tend to sample events at the beginning of the clumps, making some events more likely to be sampled than others.

A dumb way that works

Here's an approach that will reliably work, but at the expense of a lot of memory. Collect all of the events for a 10-second window into an in-memory buffer. Each time a new event arrives, you don't think twice; you simply add it to the buffer to be looked at later on. Once the 10-second window expires, you look through the buffer and randomly select one of the events in it. Then you reset the buffer and start again for the next 10-second window.
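
Here is a sketch of that brute-force version in Python; the class and method names are mine:

    import random

    class BufferedWindowSampler:
        """Collect every event in the window, then pick one at random."""

        def __init__(self):
            self.buffer = []

        def on_event(self, event):
            # Memory grows with the event rate -- the whole problem.
            self.buffer.append(event)

        def end_of_window(self):
            sample = random.choice(self.buffer) if self.buffer else None
            self.buffer = []
            return sample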

This way meets all of the constraints, except that it uses a lot of memory. Is there a way to do the same thing with a smaller buffer? It would need to be some kind of algorithm that discards events as it goes, rather than waiting until the end of the 10-second window to do a whole lot of discarding all at once.

The sampling trick

Here's a way that works. Keep the following two pieces of information in memory.

  • The number of events that have arrived in the current 10-second window.
  • One single event from the window. This is the event that will be sampled for sure if no other events arrive.

Given this setup, here's what you do when a new event arrives.

  1. Increase the count of events by one.
  2. Select a random number from 1 to the new, updated count of events.
  3. If the random number is 1, then replace the event in the buffer with the new event; the new event is the one you need, and you know for sure you don't need the old event.
  4. If the random number is anything other than 1, then discard the new event; you're done with that event for good and aren't going to use it.

When the 10-second window elapses, emit your current sample and then reset the buffer to its original state: 0 events that have arrived, and null as the currently stored event.
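
Here is a sketch of the trick in Python; for what it's worth, this is essentially reservoir sampling with a reservoir of size one. The class and method names are mine:

    import random

    class WindowSampler:
        """Keep one uniformly chosen event using O(1) memory."""

        def __init__(self):
            self.count = 0
            self.sample = None

        def on_event(self, event):
            self.count += 1
            # The new event takes over the buffer with probability 1/count;
            # otherwise it is discarded for good.
            if random.randint(1, self.count) == 1:
                self.sample = event

        def end_of_window(self):
            sample = self.sample
            self.count, self.sample = 0, None  # reset for the next window
            return sample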

An example walk-through

To see why this works, let's start with a scenario in which exactly three events arrive during the window. Here's what happens.

Event 1 arrives. When the first event arrives, increase the count to 1. Roll a number from 1 to 1, which will always result in 1, so store the first event into the buffer. Here's what is in the buffer now:

  • Event 1, for sure.

Event 2 arrives. When the second event arrives, increase the count to 2. Roll a number from 1 to 2. This roll has a probability of 1/2 of being a 1, in which case event 2 will replace the first one. At this point, there is a 1/2 probability that event 2 took over the buffer, and a 1/2 probability that event 1 is still in there.

  • Event 1, at probability 1/2.
  • Event 2, at probability 1/2.

Event 3 arrives. Increase the count to 3, and roll a number from 1 to 3. This roll has a probability of 1/3 of being a 1 and causing event 3 to take over the buffer. At probability 2/3, some other number is rolled, and the existing event is brought forward. Therefore, the probability for event 1 to still be in the buffer is the 1/2 chance that the event already had, times the 2/3 chance that it survives this latest roll of the die. For event 2, the same reasoning applies as for event 1.

  • Event 1, at probability 1/2 * 2/3 = 1/3.
  • Event 2, at probability 1/2 * 2/3 = 1/3.
  • Event 3, at probability 1/3.
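
If you'd rather check this empirically than by hand, a quick simulation using the WindowSampler sketch above shows each of the three events winning about a third of the time:

    from collections import Counter

    tallies = Counter()
    for _ in range(30000):
        sampler = WindowSampler()
        for event in ["event1", "event2", "event3"]:
            sampler.on_event(event)
        tallies[sampler.end_of_window()] += 1
    print(tallies)  # roughly 10,000 apiece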

Proof of the general case

Property. When using the sampling trick described above, after the arrival of N events, each event has a 1/N probability of being the one in the sample buffer.

The world's smallest proof by induction. In the base case, suppose N=1. Then there is just one event, and it is always selected. The one event has probability 1/1, which is what we wanted to prove.

For the inductive case, assume the property is correct for N-1, and prove it for N.

For the Nth event: it was the last event to arrive, and the rules of the algorithm say it replaces the buffered event with probability 1/N. So the property holds for the Nth event.

For any other event, we know by the inductive assumption that it has a 1/(N-1) chance of being in the buffer when the Nth event arrives. The Nth event will replace the buffered event with probability 1/N, so the probability of the existing event being left alone is (N-1)/N. Therefore, the probability that the prior event is both selected from the first N-1 rounds and also stays in the buffer after the Nth round is 1/(N-1) * (N-1)/N, which is 1/N.

Q.E.D.

How do you use this in practice?

Sampling is helpful for always-on computer systems that can never have any down time. In a case like that, you need to not only make the server internally robust to every kind of input it might receive, but also keep it externally compatible with all the other software around it in the current production ecosystem. A sample of real traffic shows you what that ecosystem is actually sending you, rather than what you assumed it would send.

There are a lot of tricks and know-how for how to use the sample data once you have it, but the first step is to capture the data at all.