Sunday, June 4, 2023

Why the M4 macro syntax goes so wrong

I've been hacking in M4 recently, and I ran across a great article about M4 by Michael Breen. I highly recommend it for anyone either using M4 (e.g. for Autotools or for Bison), or considering M4 for a new project (probably a bad idea). This observation caught my eye:
While m4's textual rescanning approach is conceptually elegant, it can be confusing in practice and demands careful attention to layers of nested quotes.
Christman Brown writes something similar, in his review of M4:
It is difficult to debug. I quickly found my default behavior when encountering something unexpected is to just increase the escape quotes around something. That often fixed it.

What's going on, here? A macro language is supposed to let you write normal text and then sprinkle some macro expansions here and there. It's supposed to save you from the tedium of dealing with a full-fledged general purpose programming language. Also, sometimes this strategy works out. With the C preprocessor, you write ordinary C code most of the time, and it works totally fine to occassionally call a macro. Why does this approach work in C but not in M4?

I think Michael Breen is onto something. Macros are a form of subroutine, and with a well designed syntax for subroutine call, you want it to be straightforward and thoughtless to invoke a subroutine and pass it some arguments. You want the arguments to feel just like text in any other part of your system. Think how it feels to write HTML and to put a div around part of your page. You don't have to do any special encoding to the stuff you put inside the div; you can, any time you want, take any chunk of your code with balanced tags and put that inside another pair of tags. With M4, the basic facility of a subroutine call, which you use all over the place, is somehow tricky to use.

M4 is not wrong to have a quoting mechanism, but where it goes wrong is to require quoting on the majority of subroutine calls. Here's what that looks like in practice. M4 uses function-call syntax to invoke macros, so a call looks like foo(a, b, c). That's a great syntax to try, because function calls are a ubiquitious syntax that users will recognize, but it has a problem for a text macro language in that the comma is already a common character to use in the a, b, and c arguments. Having observed that, M4 could and should have moved away from the function call notation and looked for something else. Instead, the designers stuck with the function calls and augmented it with an additional kind of syntax, quotation. In practice, you usually quote all of your arguments when you pass them to an M4 macro, like this: foo([a], [b], [c]). Only usually, however, and you have to think about it every time. The quotation and the macro call are two different kinds of syntax, and the user has to control them individually.

The reason it works out better for C's preprocessor is that the C language already has a way to quote any commas that occur in the arguments, and the preprocessor understands those quoting mechanisms. For example, with sqrt(foo(x, y)), the preprocessor understands that the comma inside the (x, y) part should not count as separating the parameters to sqrt. Programmers already write those parentheses without thinking about it, because the function-call notation for foo(x, y) already requires parentheses. Unfortunately, C does not do the right thing for an example like swap(a[i,j], a[i,j+1]), because it does not treat square brackets the way it treats parenthesis. It could and it should, however. None of this maps over to M4 very well, because usually the arguments to an M4 macro are not code, and so the author isn't going to naturally escape any commas that occur. The function-call syntax just doesn't work well for the situation M4 is intended to be used in.

Fixing the local problem

If we wanted to write a next-generation M4, here are some observations to start from:

  • It is better if the quoting syntax is built into the subroutine call syntax. That way, users don't have to independently reason about both calls and quotes, and instead can just think about call-and-quote as a single thing that they do.
  • Look to markup languages for inspiration, for example XML or Latex. The general feel we are going for is that you mostly write text, and then occasionally you sprinkle in some macro calls. That's a markup language!

Based on these, a good thing to try for an M4-like system would be to use XML tags for macro invocation. XML is a culmination of a line of tag-based markup languages starting with SGML and HTML, and it is generally state of the art for that kind of language. Among other advantages, XML is a small, minimal language that you can learn quickly, and it has explicit syntax for self-closing tags rather than some tags being self-closing and others not, in a context-dependent way depending on the schema that is currently in effect for a given file.

Latex's macro syntax is also very interesting, and it has a big advantage in usually saying each tag name just once (\section{foo}) rather than twice (<section>foo</section>). However, my experience with Latex is that I am in constant doubt how much lookahead text a macro invocation will really look at; the braces-based syntax is just a convention, and you never know for sure which macros really look at those conventions or not. That said, the general syntax looks like a promising idea to me if it were locked down a little more rather than being based on the Tex macro language. A similar approach was used in Scribe, a markup language designed by Brian Reid in the 70s.

What to use for now

As things stand, I don't think M4 really has a sweet spot. Old projects that want to have an ongoing roadmap should probably move away from M4. New projects should never use it to begin with. What are the options right now, without having to build a new textual macro language?

It's not a bad option to use a general-purpose language like Python or Java. If you follow the links from the PP generic preprocessor that is used in Pandoc, they tell you they are replacing their templating by more and more usage of Lua, a general purpose language. When you use a general-purpose language to generate text, you can use the normal library routines your language already supports, plus a mature library of lists, maps, structs, and iteration routines on top of them. An example of this direction is the Jooq library for generating SQL code.

Another strong approach is to use XML, possibly augmented by XSLT. An example would be the query help format of the GitHub Code Scanner, a format that I designed many years ago at a startup called Semmle. We had an existing syntax based on HTML with regex-based rewrites applied to the HTML file, and we had a problem that people were typo-ing the syntax without realizing it, resulting in help files that were sometimes unreadable and were often missing a large portion of the text. I explored a few options for getting us a more rigorous format with tool support for common errors, and I landed on XML, which I feel like worked out pretty well. In addition to the format itself being nice to work with, we got to tap into the existing XML ecosystem, for example to use Eclipse's excellent XML editor.

I briefly explored JSON as well, which is another minimal syntax that is easy to learn, but I quickly realized why they call XML a markup language. Unlike with JSON, XML lets you mostly write normal text, and then as a secondary thing, add special syntax--hence, "marking up" your text. XML is also a very mature system in general, so for example we could configure Eclipse (which was a viable tool back then!) to auto-complete tags and give you errors within the editor if you used tags that aren't allowed. If I were to rewrite how Bison's skeletons work, I think something based on XML would be tempting to try. Bison already uses this approach for its debug output. I'm not sure, though; XSLT looks pretty voluminous in practice.

Some of the best options are embedded in an existing general-purpose language. JSX is embedded in JavaScript, and Scribe is embedded in Scheme. I'm not sure how practical these are if you aren't already working in those environments, but if you are, look for one that works with your current source language.

The larger lesson

An effective tool has to be evaluated in the context it will be used in. Both C and M4 use function-call notation for macro invocation, but in C it works well, while with M4 it becomes a nightmare. Achieving an effective tool, therefore, requires design thinking. You need to learn about the situation you are targeting, you need to adjust your solution based on that, and above all you need to be ready to critique your solutions and iterate on improvements. The critique can take many forms, and a really important one is to watch how your users are doing and to really reflect on why it's happening and how it could be different.

Surprises can happen anywhere, and you'll support your users more if you can act on those surprises and try something different.