Sunday, March 24, 2024

Coevolving with new technology

A little bit of history

I remember the sinking feeling I had when the Communications Decency Act (CDA) passed, back in the 1990s. I was stunned. A veil was ripped off for me. I had grown up with the U.S. government doing lots of things I didn't understand but that sounded like things that were going to protect me. Our president was building up our defenses while also negotiating with the leader on the other side for disarmament and peace. (Though, our side didn't disarm?) I heard tales about how recreational drugs were going to destroy us, and the government was running these big campaigns to protect us all, because one taste was all it took before your life would be destroyed. I was enough of a STEM student to be very confused about why "drugs" would be the end of your life if you had one single taste, but everywhere around me people were drinking and smoking and not having such a bad result; where do you draw the line on such a thing? I went along with it, though, and sort of thought that smarter people than me must know the reasons for all this. Plus, as Paul Graham says, I was mindful of What you can't say. The topics I mention would get you instantly socially ostracized if you weren't on the right side of them.

The CDA was different, in that it covered a topic I do know about. I knew enough about computer software to see that one individual could set up and run a bulletin board system that could then host a billion people using it. I could picture how this would work, and whenever people talked about social forums, my mind's eye would fill in the gaps of what they were saying with a number of the details--servers, networks, databases, and so on. In the world of the CDA, those rooms were supposed to be monitored, so the picture in my mind could no longer be implemented by one person. The CDA vision was that a chat room would have maybe 1 chaperone for every 1000 users, so to support a chat room with a billion users, you would need one developer and one million chaperones. The CDA vision would make a basic chat room literally one million times more expensive to run than it was before the CDA.

The cost to humanity can be very high. I can't say what we are missing from social networking, because we live in a post-CDA world, and we don't know what we are missing out on by raising the bar on who can even run an experiment. I can give an example from another kind of high, however. Research and limited trials have begun, today, for mind-altering substances including a parade of the villains from my 1980s classrooms: ketamine, psilocybin, and LSD. This is a complex subject, but imagine for a moment that the new trials are at least sometimes heading in the right direction. If that is true, then the American people lost 60 years in which we could not access something that, it is now looking like, can turn people's lives completely around. 60 years is a long time. My brother was born and then died during that 60-year window of dark-out.

When I think about how to regulate AI, I see us in a similar situation to the chat rooms of the 90s or the psychedelics of the 60s. We don't yet know what is possible. We don't know what people will try, or what the results will be. Some of those experiments will lead to good results, and some of them will lead to bad results. How do we pursue these experiments while keeping ourselves safe?

My general sense, based on the history of technology and of governance to date, is that it's premature for any blanket AI regulation right now to do more good than harm. Instead, it is better to wait for the slow-burn problems: the ones that take a while to show up, but that don't get phased out all by themselves. Let me paint a picture of what this can look like, from my point of view as a technologist of four decades.

The concept of coevolution

I believe we can think of AI-based technology as co-evolving with humanity. We can think of technology as moving forward in a way akin to evolution, and we can approach our responses to that evolution based on our experience with other instances of coevolution that are by now very familiar.

The mycelial network. I enjoyed the new form of space travel in Star Trek: Discovery, but I have learned that there is a real-world mycelial network that is even more interesting. It turns out that fungi are everywhere. Tree root systems of the real world are permeated by mycelia that stabilize the soil and provide vital nutrients. All living plants include endophytic fungi that they can't live without. The larger ecosystem would not function as we know it without fungi recycling nutrients from dead plants and animals.

The living world would simply not exist without this weird third kingdom of life that permeates it in many different ways. Imagine, now, a national government trying to regulate the evolution of fungi, and the challenges they would face. We need fungi for human life as we know it to exist, but also fungi can harm us. There's no one simple way that fungi interact with us and with our environment, so there's also no one simple rule that fungi need to follow in order to be safe for us. In a territory like this, the perspective that seems best to me is one of growing together. We are here, and fungi are here, and we need to move forward by inches, with diversity in each kingdom and some way to respond if a change looks harmful.

The Butlerian Jihad. Movies like Dune are always grasping for a reason to make heroes with fantastically advanced technology still resort to low-grade sword fighting. In the case of the Dune universe, the in-story answer is called the Butlerian Jihad, and it's based on a real-world novelist named Samuel Butler. Butler explored the idea that the machines around us are developing from a selection mechanism that is similar to the natural selection of Charles Darwin.

Butler didn't limit this idea to AI. He meant it very broadly, as applying to all of the machines we create and market to each other in order to automate our lives. People are always making machines that don't work very well; for a humorous example, check out Thoren Bradley's review of a 4-way splitting wedge. The machines that catch on are produced more widely and become the basis of the next round of machines. Any software engineer can tell you that new software is created by modifying old software. As a result, the development of machines has the two ingredients that are needed for natural selection to occur: new entities are replicated from old ones, with a modest rate of mutation; and entities are selected for continued existence by some kind of selection criterion.

Butler's idea seems right to me. Machines are evolving along with us, and it doesn't really change anything that newer machines are doing tasks that used to be things only a human could do. If anything, the most helpful machines for humanity are those that remove some of our toil.

Of these two ideas, fungi are the scarier one to me. I know how to turn off a machine, but fungi work at a molecular level that cannot be directly controlled. And yet, fungi have not just turned out okay; they are an essential part of the world as we know it. Life could exist without fungi, but humans wouldn't be part of it. So far, both fungi and machines have made human life much, much better than it ever was before.

How to regulate coevolving systems

If we follow the idea of coevolution, then we can think through the conditions where a U.S. intervention is likely to help more than it hurts.

One of the biggest reasons that we can feel safe about fungi, and less safe about other things, is that fungi exist in a large, diverse ecosystem where any damage has a limited distance it can travel. It scares me more than any specific technological risk that several categories of technology are near monocultures right now. Chrome, Firefox, and Edge are the only real desktop web browsers; Facebook is ascendant for a certain kind of social media; Gmail is about the only email reader; and Amazon is by far the dominant online market for physically shipped goods. These companies can cause tremendous harm with any mistakes they make, because those mistakes will propagate to all 8 billion of us within seconds of being launched. Monocultures are death traps, and I worry that we have so many of them right now.

For specific technological developments, it seems like there are three categories to bear in mind.

  • In some cases, direct contact between a human and the new machine is so obviously bad that the machine quickly dies out. In a case like this, there's nothing useful for regulation to do, because the machine will go away all by itself. An example would be the Clippy assistant for Microsoft Word. It died all on its own.
  • Many developments are unambiguously simple and positive. For these, as well, any attempt at regulation is just going to reduce some of the benefits we gain. An example would be the original Google Web Search. Unless you've used some intranet search engine such as Atlassian's, you may have a hard time imagining how bad web search engines used to be. It's a very good thing for all of us that Page and Brin were able to do their web search experiments without needing a lot of approvals. Imagine what kind of search engine we would have if it took 5-10 years to get an experiment approved, and if the search engine were only able to present results that incorporate today's suite of politically correct speech and representation. It would never have gotten off the ground, and very likely most people would assume that an effective web searcher just can't be made.
  • That leaves the category that is neither of the above: something that is harmful, but not harmful enough that people immediately balk in terror and shut the whole thing down voluntarily. An example in my mind is the slot machine. The slot machine doesn't do anything dramatic like shoot out laser beams or organize armies to go marching around. Yet it does enough harm, to a significant number of people, that a little bit of regulation could provide a significant practical benefit.

Regulating slot machines is tough even after the fact, but take a moment to imagine if someone had tried to regulate them before they took off. Imagine looking at a slot machine before it really caught on, and imagine deciding just based on its design that it might cause some harm. I posit that essentially no one could effectively figure out what the rules should be. The negative effects of slot machines cannot be understood from the technology, which after all is pretty simple. To understand the effects, and therefore to know what to do about them, one has to reason about things like: human psychology; the kinds of establishments that slot machines end up at; the kind of clientele that end up at those establishments; and the larger market developments that lead to mass-produced slot machines having a large enough effect to be worth doing anything about. Based on all of that, a lot of the regulation wouldn't even be about the technology itself; some of the slot machine regulation on the books today is about fairness and about the maximum rate at which a player can lose money. It seems better to me, though, to do things like require payment up front, and to limit payments to once per hour; likewise, it seems likely helpful to put limits on the visual effects and on the advertising claims that surround these machines. It's tricky and subtle, it can only be done after the fact, and even after the fact, a lot of the attempts aren't going to go well. I would say that even today, after decades of tinkering, regulation around slot machines is not really figured out.

The good news is that slot machines haven't destroyed the world just yet. They have been a slow burn, long in development, because anything hyper-bad simply wouldn't have caught on at all. As well, our existing social immune systems are picking up a lot of the slack. Each of us that sees someone else in the herd get caught up in a gambling addiction is quick to raise an alarm to the rest of us. The regulation can be better, but the problem burns slowly enough that we can settle in and spend multiple decades working out what that regulation should look like.

It strikes me that generative AI can be thought of in the same way. Lots of things will be really good, and we don't want to wait one year to get them, much less 5, 10, or, based on U.S. history, 60 years. Lots of things will be super-bad, and almost everyone will stop using them immediately. Lots of things will seem somewhat bad, and most people will stop using them, but a few will carry on; that small fraction is actually good for humanity as a whole, because it increases diversity in the gene pool. Then, in some tiny remaining pie slice, after all the previous cases are covered, there will be a harmful technology that didn't weed itself out and that wasn't stopped by our more general-purpose defense mechanisms. That tiny pie slice is where regulation can help, and it won't be machines shooting laser beams. It will be something we haven't thought of, that hardly anyone is talking about, because we haven't even tried yet and don't have any information to go on.

Saturday, March 2, 2024

The case for IPv4.1

In 2003, Dan Bernstein wrote about the IPv6 mess, focusing primarily on the lack of a viable migration plan. Avery Pennarun wrote again about the problem in 2011, also with a focus on the lack of a viable migration plan. Now it's been more than another decade, for a total of around 30 years on this project.

With the benefit of 20/20 hindsight, we now know that the IPv6 plan was never a very good one. If IPv6 had been accurately estimated to take 30+ years of migration, it would not have achieved the buy-in that it did, and most likely the designers would have kept looking for something better. The individuals involved are all top of their class from what I can tell, but sometimes good people produce bad results, especially in a group context. I'm reminded, in all of this, of the Ada programming language, with the unusual way it was designed. I'm even more reminded of a certain saying about committees: "none of us is as dumb as all of us".

Big migrations do not always go this way. The World Wide Web only needed nine years, if we count from Tim Berners-Lee's memo to the launch of the Google search engine. Under the right conditions, the whole Internet can upgrade in less than ten years.

The difference in these projects is easy to understand. Avery describes it as follows:

In short, any IPv6 transition plan involves everyone having an IPv4 address, right up until everyone has an IPv6 address, at which point we can start dropping IPv4, which means IPv6 will start being useful. This is a classic chicken-and-egg problem, and it's unsolvable by brute force; it needs some kind of non-obvious insight. djb apparently hadn't seen any such insight by 2002, and I haven't seen much new since then.

Here in 2024, it's not really too late to start on a backup plan. It's anyone's guess whether a more modest IPv4.1 proposal would finish faster than IPv6, but we can all be sure of one thing. Our chances are better if we have two dogs in the race.

How to do it

Here's the general idea of IPv4.1, just to show what I mean. It's been posted about before, so consider this to be my explanation rather than anything fundamentally new.

The core change for IPv4.1 is that the IP packet header needs room for a larger address. The packet format can be the one from the SIPP proposal, but with version "7" instead of "6", since 6 is now taken. When an address is written down, for example in a router's configuration UI, an 8-byte address can use the same dotted notation we are used to, just with 8 numbers instead of 4. Existing addresses can be zero-padded on the left, e.g. 0.0.0.0.127.0.0.1.

Whenever two machines talk to each other, they use the existing IPv4 packet format if they both have 4-byte addresses. If either of them has an 8-byte address, then they switch to the IPv4.1 format. In this case, though, the packets still mean the same thing they always did! Part of the key to achieving a large migration is not to saddle it with additional changes.

8-byte addresses should be sufficient for this problem. The reason that IPv6 addresses use 16 bytes is that IPv6 plans for the machines on people's private internal networks to avoid reusing an address from anyone else's private network. This property will not be pursued for IPv4.1, and it would not do anything useful if it were, because per the previous paragraph, all the packets in IPv4.1 are supposed to mean the same thing they did in IPv4.
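
To make the notation and the format-switching rule concrete, here is a rough sketch in Python. This is just my own illustration of the idea described above, not part of any proposal, and the function names are made up.

def parse_addr(text):
    """Parse a dotted IPv4 or IPv4.1 address into 8 bytes.

    Plain 4-number addresses are zero-padded on the left, so
    "127.0.0.1" is the same address as 0.0.0.0.127.0.0.1.
    """
    parts = [int(p) for p in text.split(".")]
    if len(parts) not in (4, 8) or not all(0 <= p <= 255 for p in parts):
        raise ValueError("expected 4 or 8 dotted numbers in 0..255: %r" % text)
    return bytes([0] * (8 - len(parts)) + parts)

def format_addr(addr):
    """Write 8 address bytes back out in dotted form."""
    return ".".join(str(b) for b in addr)

def fits_in_ipv4(addr):
    """True if the high 4 bytes are zero, i.e. a plain IPv4 address."""
    return addr[:4] == bytes(4)

def pick_packet_format(src, dst):
    """Use the existing IPv4 format only when both ends have 4-byte addresses."""
    return "IPv4" if fits_in_ipv4(src) and fits_in_ipv4(dst) else "IPv4.1"

assert parse_addr("127.0.0.1") == parse_addr("0.0.0.0.127.0.0.1")
assert pick_packet_format(parse_addr("10.0.0.1"), parse_addr("10.0.0.2")) == "IPv4"
assert pick_packet_format(parse_addr("10.0.0.1"), parse_addr("1.2.3.4.5.6.7.8")) == "IPv4.1"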

Rollout

During the initial rollout of IPv4.1, many machines will not yet be able to talk to each other with the extended addresses. This will be much like the case with IPv6 right now.

From there, it will be a race. It's speculation which approach will go faster, but IPv4.1 has some big advantages. It doesn't take any configuration file changes, and it doesn't require redesigning IP address blocks. It definitely doesn't require thinking about link-local addresses, updates to how DHCP works, or any other fundamental aspects of networking. All that it takes for IPv4.1 to win is for software providers to add it to their latest versions, and for the major nodes on the Internet to upgrade their software some time over the next 5-10 years. Major Internet nodes usually have to upgrade their software once in a while anyway, due to security and compliance requirements.

We don't have to know for sure that IPv4.1 will win the race before we can make a decision about trying. We can simply observe that the problem is important, and we can observe that our chances are better with two ways to win.

Friday, September 1, 2023

Task-based build systems

Today I ran into a great book chapter by Erik Kuefler and Lisa Carey about Google's build system philosophy. It's all very good, but one part that stood out to me is the distinction between a task-based build system and an artifact-based build system. Ant is a task-based build system and lets you define a task in terms of other tasks. Google's build system, partially open sourced as the "Bazel" project, is based on artifacts. An important part of an artifact-based system, they explain, is that it honors build cache semantics. They write specifically the following:
For a remote caching system to work, the build system must guarantee that builds are completely reproducible. That is, for any build target, it must be possible to determine the set of inputs to that target such that the same set of inputs will produce exactly the same output on any machine. This is the only way to ensure that the results of downloading an artifact are the same as the results of building it oneself. Fortunately, Bazel provides this guarantee and so supports remote caching. Note that this requires that each artifact in the cache be keyed on both its target and a hash of its inputs—that way, different engineers could make different modifications to the same target at the same time, and the remote cache would store all of the resulting artifacts and serve them appropriately without conflict.
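
To make the quoted idea concrete, here is a minimal sketch, in Python, of keying a cached artifact on its target plus a hash of its inputs. This is my own illustration, not Bazel's actual implementation; a real system would also fold in the command line, the toolchain, and the declared environment, since those are inputs too.

import hashlib

def cache_key(target, input_paths):
    """Key a cached artifact on the target name plus a hash of its input bytes.

    Two builds that agree on the target and on every input byte can then
    safely share the cached output, no matter which machine produced it.
    """
    h = hashlib.sha256()
    h.update(target.encode("utf-8"))
    for path in sorted(input_paths):  # a stable order keeps the key deterministic
        h.update(b"\0" + path.encode("utf-8") + b"\0")
        with open(path, "rb") as f:
            h.update(f.read())
    return "%s@%s" % (target, h.hexdigest())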

Sunday, June 4, 2023

Why the M4 macro syntax goes so wrong

I've been hacking in M4 recently, and I ran across a great article about M4 by Michael Breen. I highly recommend it for anyone either using M4 (e.g. for Autotools or for Bison), or considering M4 for a new project (probably a bad idea). This observation caught my eye:
While m4's textual rescanning approach is conceptually elegant, it can be confusing in practice and demands careful attention to layers of nested quotes.
Christman Brown writes something similar, in his review of M4:
It is difficult to debug. I quickly found my default behavior when encountering something unexpected is to just increase the escape quotes around something. That often fixed it.

What's going on here? A macro language is supposed to let you write normal text and then sprinkle some macro expansions here and there. It's supposed to save you from the tedium of dealing with a full-fledged general-purpose programming language. And sometimes this strategy works out: with the C preprocessor, you write ordinary C code most of the time, and it works totally fine to occasionally call a macro. Why does this approach work in C but not in M4?

I think Michael Breen is onto something. Macros are a form of subroutine, and with a well-designed syntax for subroutine calls, it should be straightforward and thoughtless to invoke a subroutine and pass it some arguments. You want the arguments to feel just like text in any other part of your system. Think how it feels to write HTML and to put a div around part of your page. You don't have to do any special encoding to the stuff you put inside the div; you can, any time you want, take any chunk of your code with balanced tags and put that inside another pair of tags. With M4, the basic facility of a subroutine call, which you use all over the place, is somehow tricky to use.

M4 is not wrong to have a quoting mechanism, but where it goes wrong is to require quoting on the majority of subroutine calls. Here's what that looks like in practice. M4 uses function-call syntax to invoke macros, so a call looks like foo(a, b, c). That's a great syntax to try, because function calls are a ubiquitous notation that users will recognize, but it has a problem for a text macro language in that the comma is already a common character to use in the a, b, and c arguments. Having observed that, M4 could and should have moved away from function-call notation and looked for something else. Instead, the designers stuck with the function calls and augmented them with an additional kind of syntax, quotation. In practice, you usually quote all of your arguments when you pass them to an M4 macro, like this: foo([a], [b], [c]). Only usually, however, and you have to think about it every time. The quotation and the macro call are two different kinds of syntax, and the user has to control them individually.

The reason it works out better for C's preprocessor is that the C language already has a way to quote any commas that occur in the arguments, and the preprocessor understands those quoting mechanisms. For example, with sqrt(foo(x, y)), the preprocessor understands that the comma inside the (x, y) part should not count as separating the parameters to sqrt. Programmers already write those parentheses without thinking about it, because the function-call notation for foo(x, y) already requires parentheses. Unfortunately, C does not do the right thing for an example like swap(a[i,j], a[i,j+1]), because it does not treat square brackets the way it treats parentheses. It could and it should, however. None of this maps over to M4 very well, because usually the arguments to an M4 macro are not code, and so the author isn't going to naturally escape any commas that occur. The function-call syntax just doesn't work well for the situation M4 is intended to be used in.

Fixing the local problem

If we wanted to write a next-generation M4, here are some observations to start from:

  • It is better if the quoting syntax is built into the subroutine call syntax. That way, users don't have to independently reason about both calls and quotes, and instead can just think about call-and-quote as a single thing that they do.
  • Look to markup languages for inspiration, for example XML or Latex. The general feel we are going for is that you mostly write text, and then occasionally you sprinkle in some macro calls. That's a markup language!

Based on these, a good thing to try for an M4-like system would be to use XML tags for macro invocation. XML is a culmination of a line of tag-based markup languages starting with SGML and HTML, and it is generally state of the art for that kind of language. Among other advantages, XML is a small, minimal language that you can learn quickly, and it has explicit syntax for self-closing tags rather than some tags being self-closing and others not, in a context-dependent way depending on the schema that is currently in effect for a given file.
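
As a toy illustration of what tag-based macro invocation could feel like, here is a short Python sketch that expands made-up <greet> and <emph> tags embedded in ordinary text. The macro names and expansion rules are invented for the example, and it doesn't handle macros nested inside other macros; it's only meant to show the "mostly text, occasionally a tag" shape of the idea.

import xml.etree.ElementTree as ET

# Hypothetical macros; in a real system these would be user-defined templates.
MACROS = {
    "greet": lambda attrs, body: "Hello, %s!" % attrs["name"],
    "emph": lambda attrs, body: "*%s*" % body,
}

def expand(text):
    """Expand macro tags in a fragment of mostly ordinary text."""
    root = ET.fromstring("<doc>%s</doc>" % text)
    out = [root.text or ""]
    for child in root:
        out.append(MACROS[child.tag](child.attrib, child.text or ""))
        out.append(child.tail or "")
    return "".join(out)

print(expand('Dear reader, <greet name="world"/> this is <emph>marked up</emph> text.'))
# Prints: Dear reader, Hello, world! this is *marked up* text.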

Latex's macro syntax is also very interesting, and it has a big advantage in usually saying each tag name just once (\section{foo}) rather than twice (<section>foo</section>). However, my experience with Latex is that I am in constant doubt about how much lookahead text a macro invocation will really look at; the braces-based syntax is just a convention, and you never know for sure which macros really follow that convention and which don't. That said, the general syntax looks like a promising idea to me if it were locked down a little more rather than being based on the Tex macro language. A similar approach was used in Scribe, a markup language designed by Brian Reid in the 70s.

What to use for now

As things stand, I don't think M4 really has a sweet spot. Old projects that want to have an ongoing roadmap should probably move away from M4. New projects should never use it to begin with. What are the options right now, without having to build a new textual macro language?

It's not a bad option to use a general-purpose language like Python or Java. If you follow the links from the PP generic preprocessor that is used in Pandoc, they tell you they are replacing their templating with more and more usage of Lua, a general-purpose language. When you use a general-purpose language to generate text, you can use the normal library routines your language already supports, plus a mature library of lists, maps, structs, and iteration routines on top of them. An example of this direction is the Jooq library for generating SQL code.

Another strong approach is to use XML, possibly augmented by XSLT. An example would be the query help format of the GitHub Code Scanner, a format that I designed many years ago at a startup called Semmle. We had an existing syntax based on HTML with regex-based rewrites applied to the HTML file, and we had a problem that people were typo-ing the syntax without realizing it, resulting in help files that were sometimes unreadable and were often missing a large portion of the text. I explored a few options for getting us a more rigorous format with tool support for common errors, and I landed on XML, which I feel like worked out pretty well. In addition to the format itself being nice to work with, we got to tap into the existing XML ecosystem, for example to use Eclipse's excellent XML editor.

I briefly explored JSON as well, which is another minimal syntax that is easy to learn, but I quickly realized why they call XML a markup language. Unlike with JSON, XML lets you mostly write normal text, and then as a secondary thing, add special syntax--hence, "marking up" your text. XML is also a very mature system in general, so for example we could configure Eclipse (which was a viable tool back then!) to auto-complete tags and give you errors within the editor if you used tags that aren't allowed. If I were to rewrite how Bison's skeletons work, I think something based on XML would be tempting to try. Bison already uses this approach for its debug output. I'm not sure, though; XSLT looks pretty voluminous in practice.

Some of the best options are embedded in an existing general-purpose language. JSX is embedded in JavaScript, and Scribe is embedded in Scheme. I'm not sure how practical these are if you aren't already working in those environments, but if you are, look for one that works with your current source language.

The larger lesson

An effective tool has to be evaluated in the context it will be used in. Both C and M4 use function-call notation for macro invocation, but in C it works well, while in M4 it becomes a nightmare. Achieving an effective tool, therefore, requires design thinking. You need to learn about the situation you are targeting, you need to adjust your solution based on that, and above all you need to be ready to critique your solutions and iterate on improvements. The critique can take many forms, and a really important one is to watch what your users are actually doing and to reflect on why it's happening and how it could be different.

Surprises can happen anywhere, and you'll support your users more if you can act on those surprises and try something different.

Tuesday, May 9, 2023

A Sampling Trick

Here's a sampling trick that comes up all the time in industry, but that took me a while to work out. The problem goes like this:
Suppose a server is receiving a sequence of events, for any kind of event you might be interested in, and you want to save off a sample of the events for later analysis. How do you select which events to sample?

Let me show some solutions that I see people try, and then describe a way that seems to beat all those out.

Solutions that don't work well

Choose a fixed percentage. One approach is to choose a percentage, say 5%, and to log that percentage of incoming events. For each event that arrives, you check a random number generator against your percentage, and independently decide for each event whether to log a sample or not. This approach is pretty good except that it's often hard to choose a percentage ahead of time. If you get thousands of events per second, then 5% is going to be way too many. If you get one event per minute, then 5% is going to be way too few. In many cases, you are going to use the same sampling code to run against multiple streams of events, for example to sample access to multiple databases, or to sample requests to multiple RPCs on the same server. As another example, you may run the code in both staging and production environments, and you'll need a different percentage for each environment.

Look at historical data rates. Another way is to use a percentage, but to dynamically adapt the percentage based on historical traffic rates. This way works in theory, but in my experience, systems like this tend to be unstable. For example, the system will dial in a high percentage based on light overnight traffic, or because the system paused due to a garbage collection or a spate of disk I/O on the underlying machine. Then a burst of activity will happen, and the system will keep using its old percentage for a while, thus sampling way too much data. Then the percentage adjusts downward, but the burst of activity disappears. Now the server is sampling at a very low percentage and throwing everything away, because the burst of traffic is gone. In general, I can't say that this method can't work, but in a world where software always has bugs and surprises, it's much better to use algorithms that are more clearly stable and predictable.

Use time-based sampling. Instead of sampling a percentage of traffic, the system can select one event for each window of time, for example one event every 10 seconds. This method adapts to different rates of incoming events, including bursty traffic where the rate isn't steady. Now there's a new problem, though: the events are not randomly selected any more! This method will work fine if the server is receiving a homogeneous series of events that are all evenly spaced, but that may or may not be the case. A lot of the time, there is some external process, perhaps driven by an end user, where one external event will create a clump of events that arrive at your server at almost the same time as each other. If you select one event every 10 seconds, you will tend to sample events that are at the beginning of the clumps, thus making some events more likely to be sampled than others.

A dumb way that works

Here's an approach you could follow that will reliably work, but at the expense of a lot of memory usage. What you could do is collect all of the events for a 10-second window into an in-memory buffer. Each time a new event arrives, you don't think twice, you simply add it to the buffer to be looked at later on. Once the ten-second window expires, you look through the buffer and randomly select one of the events in the buffer. Then you reset the buffer and start again for the next 10-second window.

This way meets all of the constraints, except that it uses a lot of memory. Is there a way to do the same thing with a smaller buffer? It would need to be some kind of algorithm that discards events as it goes, rather than waiting to the end of the 10-second window to do a whole lot of discarding all at once.

The sampling trick

Here's a way that works. Keep the following two pieces of information in memory.

  • The number of events that have arrived, in the current 10-second window.
  • One single event from the window. This is the event that will be sampled for sure if no other events arrive.

Given this setup, here's what you do when a new event arrives.

  1. Increase the count of events by one.
  2. Select a random number from 1 to the new, updated count of events.
  3. If the random number is 1, then replace the event in the buffer with the new event; the new event is the one you need, and you know for sure you don't need the old event.
  4. If the random number is anything other than 1, then discard the new event; you're done with that event for good and aren't going to use it.

When the 10-second window elapses, emit your current sample and then reset the buffer to its original state: 0 events that have arrived, and null as the currently stored event.
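
Here is a minimal sketch of those steps in Python. The timer that closes each 10-second window is left out, and the class and method names are just for illustration.

import random

class WindowSampler:
    """Keep one uniformly chosen event per window, using O(1) memory."""

    def __init__(self):
        self.count = 0      # events seen in the current window
        self.sample = None  # the event that will be emitted if nothing replaces it

    def offer(self, event):
        # Step 1: increase the count of events by one.
        self.count += 1
        # Steps 2-4: roll a number from 1 to the updated count; on a 1, the
        # new event replaces whatever is in the buffer, otherwise the new
        # event is discarded for good.
        if random.randint(1, self.count) == 1:
            self.sample = event

    def close_window(self):
        """Emit the current sample and reset for the next window."""
        result = self.sample
        self.count = 0
        self.sample = None
        return result

(This is the classic reservoir sampling idea, with a reservoir of size one.)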

An example walk-through

To see why this works, let's start with a scenario that exactly three events arrive during the window. Here's what happens.

Event 1 arrives. When the first event arrives, increase the count to 1. Roll a number from 1 to 1, which will always result in 1, so store the first event into the buffer. Here's what is in the buffer now:

  • Event 1, for sure.

Event 2 arrives. When the second event arrives, increase the count to 2. Roll a number from 1 to 2. This roll has a probability of 1/2 of being a 1, in which case event 2 will replace the first one. At this point, there is a 1/2 probability that event 2 took over the buffer, and a 1/2 probability that event 1 is still in there.

  • Event 1, at probability 1/2.
  • Event 2, at probability 1/2.

Event 3 arrives. Increase the count to 3, and roll a number from 1 to 3. This roll has a probability of 1/3 of being a 1 and causing event 3 to take over the buffer. At probability 2/3, some other number is rolled, and the existing event is brought forward. Therefore, the probability for event 1 to still be in the buffer is the 1/2 chance that the event already had, times the 2/3 chance that it survives this latest roll of the die. For event 2, the same reasoning applies as for event 1.

  • Event 1, at probability 1/2 * 2/3 = 1/3.
  • Event 2, at probability 1/2 * 2/3 = 1/3.
  • Event 3, at probability 1/3.

Proof of the general case

Property. When using The Sampling Trick described above, after the arrival of N events, each event has 1/N probability of being the one in the sample buffer.

The world's smallest proof by induction. In the base case, suppose N=1. Then there is just one event, and it is always selected. The one event has probability 1/1, which is what we wanted to prove.

For the inductive case, assume the property is correct for N-1, and prove it for N.

For the Nth event, it was the last event that arrived, and the rules of the algorithm say it replaces whatever is in the buffer at probability 1/N. So the property holds for the Nth event.

For any other event, we know by the inductive assumption that it has a 1/(N-1) chance of being in the buffer when the Nth event arrives. The Nth event will replace the buffered event at probability 1/N, so the probability of the existing event being left alone is (N-1)/N. Therefore, the probability that the prior event is both selected from the first N-1 rounds, and also stays in the buffer after the Nth round, is 1/(N-1) x (N-1)/N, which is 1/N.

Q.E.D.

How do you use this in practice?

Sampling is helpful for always-on computer systems that can never have any down time. In a case like that, you need not only to make the server internally robust against every kind of input it might receive, but also to keep it externally compatible with all the other software that's around it in the current production ecosystem. A sample of real events shows you what that ecosystem is actually sending you, rather than what you guessed it would send.

There are a lot of tricks and know-how for how to use the sample data once you have it, but the first step is to capture the data at all.

Wednesday, May 4, 2016

Supporting C in a build system

In a previous post, I showed that the standard build rules for C code are unreliable. Let me describe two ways to do better.

In the interest of brevity, I will describe the build rules using a toy build engine called Blcache (short for "build cache"). I initially tried to write this post using standard tools like Make, Ninja, Bazel, and Nix, but they all have one limitation or another that distracts from the main point of this post.

Example problem

Here is an example problem this post will work against. There is a test.c file, and it includes two separate header files:

// File test.c
#include <stdio.h>
#include "syslimits.h"
#include "messages.h"

int main() {
  printf("%s: %d\n", MSG_LIMIT_THREADS, LIMIT_THREADS);
}

// File localhdrs/messages.h
#define MSG_LIMIT_THREADS "Limit on threads:"

// File hdrs/syslimits.h
#define LIMIT_THREADS 10

After compiling this code, a third header file is added as follows:

// File localhdrs/syslimits.h
#define LIMIT_THREADS 500

The challenge is for the addition of this header file to trigger a rebuild, while still making the build as incremental as possible.

Method 1: Compile entire components

The simplest way to get correct C compiles is to compile entire components at a time, rather than to set up build rules to compile individual C files. I've posted before on this strategy for Java, and it applies equally well to C.

This approach seems to be unusual, but my current feel is that it should work well in practice. It seems to me that when you are actively working on a given component, you should almost always use an IDE or other specialized tool for compiling that given component. The build system should therefore not concern itself with fine-grained incremental rebuilds of individual C files. Rebuilds of whole components--executables, libraries, and shared libraries--should be plenty, and even when using a system like Gyp, there are advantages to having the low-level build graph be simple enough to read through and debug by hand.

Using such an approach, you would set up a single build rule that goes all the way from C and H files to the output. Here it is in JSON syntax, using Blcache:

{
    "environ": [
        "PATH"
    ],
    "rules": [
        {
            "commands": [
                "gcc -Ilocalhdrs -Ihdrs  -o test test.c"
            ], 
            "inputs": [
                "test.c", 
                "localhdrs",
                "hdrs"
            ], 
            "name": "c/test", 
            "outputs": [
                "test"
            ]
        }
    ]
}

The "environ" part of this build file declares which environment variables are passed through to the underlying commands. In this case, only PATH is passed through.

There is just one build rule, and it's named c/test in this example. The inputs include the one C file (test.c), as well as two entire directories of header files (localhdrs and hdrs). The build command for this rule is very simple: it invokes gcc with all of the supplied input files, and has it build the final executable directly.

With the build rules set up like this, any change to any of the declared inputs will cause a rebuild to happen. For example, here is what happens in an initial build of the tool:

$ blcache c/test
Started building c/test.
Output from building c/test:
gcc -Ilocalhdrs -Ihdrs  -o test test.c

$ ./target/c/test
Limit on threads: 10

After adding syslimits.h to the localhdrs directory, the entire component gets rebuilt, because the localhdrs input is considered to have changed:

$ blcache c/test
Started building c/test.
Output from building c/test:
gcc -Ilocalhdrs -Ihdrs  -o test test.c

$ ./target/c/test
Limit on threads: 500

As a weakness of this approach, though, any change to any C file or any header file will trigger a rebuild of the entire component.

Method 2: Preprocess as a separate build step

Reasonable people disagree about how fine-grained build rules for C should be, so let me describe the fine-grained version as well. This version can rebuild more incrementally in certain scenarios, but that benefit comes at the expense of a substantially more complicated build graph. Then again, most developers will never look at the build graph directly, so there is some argument for increasing the complexity here to improve overall productivity.

The key idea with the finer-grained dependencies is to include a separate build step for preprocessing. Here's a build file to show how it can be done:

{
    "environ": [
        "PATH"
    ],
    "rules": [
        {
            "commands": [
                "gcc -c -o test.o target/c/preproc/test.i"
            ], 
            "inputs": [
                "c/preproc/test:test.i"
            ], 
            "name": "c/object/test",
            "outputs": [
                "test.o"
            ]
        },

        {
            "commands": [
                "gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c"
            ],
            "inputs": [
                "test.c", 
                "localhdrs",
                "hdrs"
            ], 
            "name": "c/preproc/test",
            "outputs": [
                "test.i"
            ]
        },

        {
            "commands": [
                "gcc -o test target/c/object/test.o"
            ],
            "inputs": [
                "c/object/test:test.o"
            ],
            "name": "c/test",
            "outputs": [
                "test"
            ]
        }
    ]
}

This file has three rules in it that chain together to produce the final output file. When you build the test executable for the first time, all three rules will be executed:

$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c

Started building c/object/test.
Output from building c/object/test:
gcc -c -o test.o target/c/preproc/test.i

Started building c/test.
Output from building c/test:
gcc -o test target/c/object/test.o

$ ./target/c/test
Limit on threads: 10

First, the test.c file is preprocessed, yielding test.i. Second, the test.i file is compiled to test.o. Finally, test.o is linked into the final test executable.

Adding the new syslimits.h file behaves as expected, causing the full chain of recompiles.

$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c

Started building c/object/test.
Output from building c/object/test:
gcc -c -o test.o target/c/preproc/test.i

Started building c/test.
Output from building c/test:
gcc -o test target/c/object/test.o

$ target/c/test
Limit on threads: 500

Modifying an irrelevant header file, on the other hand, only causes the preprocessing step to run. Since the preprocessing yields the same result as before, rebuilding stops at that point.

$ touch localhdrs/irrelevant.h
$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c

Using cached results for c/object/test.
Using cached results for c/test.

It's not shown in this example, but since each C file is compiled individually, a change to a C file will only trigger a rebuild of that one file. Thus, the technique here is fine-grained in two different ways. First, changes to one C file only trigger a recompile of that one file. Second, changes to the H files only trigger preprocessing of all C files, and then only compilation of those C files that turn out to be affected by the H files that were changed.

By the way, there's a trick here that generalizes to a variety of cached computations. If you want to add a cache for a complicated operation like a C compile, then don't try to have the operation itself be directly incremental. It's too error prone. Instead, add a fast pre-processing step that accumulates all of the relevant inputs, and introduce the caching after that pre-processing step. In the case of this example, the fast pre-processing step is, well, the actual C preprocessor.

Coda

Before realizing the problem with C compilation, I used C as an example of why you might want to bend the rules a little bit on the most strict and simplistic version of a build cache. However, now it seems to me that you find the best set of build rules if you strictly adhere to a build-cache discipline. I'm sorry, build cache. I should never have doubted you.

Standard build rules for C are unreliable

The standard way of integrating C into a build system is to use automatic dependencies generated from the compiler. Gcc and Clang can emit a list of the header files they read if you run them with the -M option. Visual Studio can do it as well, using the /showIncludes option. What I will call the "standard approach" in this post is to use the dependencies the user explicitly declared, and then to augment them with automatic dependencies generated by options like -M or /showIncludes.

Until a few years ago, I just took this approach as received wisdom and didn't think further about it. It's a neat trick, and it works correctly in the most obvious scenarios. Unfortunately, I have learned that the technique is not completely reliable. Let me share the problem, because I figure that other people will be interested as well, especially anyone else who ends up responsible for setting up a build system.

The root problem with the standard approach is that sometimes a C compile depends on the absence of a file. Such a dependency cannot be represented, and indeed goes unnoticed, in the standard approach to automatic dependencies. The standard approach involves an "automatic dependency list", which is a file listing out the automatically determined dependencies for a given C file. By its nature, a list of files only includes files that exist. If you change the status of a given file from not existing to existing, then the standard approach will overlook the change and skip a rebuild that should have happened.

To look at it another way, the job of an incremental build system is to skip a compile if running it again would produce the same results. Take a moment to consider what a compiler does as it runs. It does a number of in-memory operations such as AST walks, and it does a number of I/O operations, including reading files into memory. Among those I/O operations are things like "list a directory" and "check if a file exists". If you want to prove that a compiler is going to do the same thing on a second run as it did on the first, then you need to prove that those I/O operations are going to do the same thing on the second run. That means all of the I/O operations, though, not just the ones that read a file into memory.

Such a situation may seem exotic. At least one prominent source has declared that the standard approach is "correct" up to changes in the build command, which suggests to me that the author did not consider this scenario at all. It's not just a theoretical problem, though. Let me show a concrete example of how it can arise in practice.

Suppose you are compiling the following collection of files, including a single C file and two H files:

// File test.c
#include <stdio.h>
#include "syslimits.h"
#include "messages.h"

int main() {
  printf("%s: %d\n", MSG_LIMIT_THREADS, LIMIT_THREADS);
}

// File localhdrs/messages.h
#define MSG_LIMIT_THREADS "Limit on threads:"

// File hdrs/syslimits.h
#define LIMIT_THREADS 10

Using automatic dependencies, you set up a Makefile that looks like this:

CFLAGS=-Ilocalhdrs -Ihdrs

test.o test.d : test.c
 gcc $(CFLAGS) -M test.c > test.d
 gcc $(CFLAGS) -c test.c

test: test.o
 gcc -o test test.o

-include test.d

You compile it and everything looks good:

$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
$ ./test
Limit on threads: 10

Moreover, if you change any of the input files, including either of the H files, then invoking make test will trigger a rebuild as desired.

$ touch localhdrs/messages.h
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o

What doesn't work so well is creating a new version of syslimits.h that shadows the existing one. Suppose you add a syslimits.h under localhdrs, shadowing the one in hdrs:

// File localhdrs/syslimits.h
#define LIMIT_THREADS 500

Make should now recompile the executable, but it doesn't:

$ make test
make: 'test' is up to date.
$ ./test
Limit on threads: 10

If you force a recompile, you can see that the behavior changed, so Make really should have recompiled it:

$ rm test.o
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
$ ./test
Limit on threads: 500

It may seem picky to discuss such a tricky scenario as this one, with header files shadowing other header files. Imagine a developer in the above scenario, though. They are doing something tricky, yes, but it's a tricky thing that is fully supported by the C language. If this test executable is part of a larger build, the developer can be in for a really difficult debugging exercise, trying to understand why their built executable is not behaving consistently with the source code. I dare say it is precisely such tricky situations where people rely the most on their tools behaving in an intuitive way.

I will describe how to set up better build rules for this scenario in a followup post.