Lex Spoon: latex

I've been hacking in M4 recently, and I ran across a great article about M4 by Michael Breen. I highly recommend it for anyone either using M4 (e.g. for Autotools or for Bison), or considering M4 for a new project (probably a bad idea). This observation caught my eye:

While m4's textual rescanning approach is conceptually elegant, it can be confusing in practice and demands careful attention to layers of nested quotes.

Christman Brown writes something similar, in his review of M4:

It is difficult to debug. I quickly found my default behavior when encountering something unexpected is to just increase the escape quotes around something. That often fixed it.

What's going on here? A macro language is supposed to let you write normal text and then sprinkle some macro expansions here and there. It's supposed to save you from the tedium of dealing with a full-fledged general-purpose programming language. And sometimes this strategy works out. With the C preprocessor, you write ordinary C code most of the time, and it works totally fine to occasionally call a macro. Why does this approach work in C but not in M4?

I think Michael Breen is onto something. Macros are a form of subroutine, and with a well designed syntax for subroutine call, you want it to be straightforward and thoughtless to invoke a subroutine and pass it some arguments. You want the arguments to feel just like text in any other part of your system. Think how it feels to write HTML and to put a div around part of your page. You don't have to do any special encoding to the stuff you put inside the div; you can, any time you want, take any chunk of your code with balanced tags and put that inside another pair of tags. With M4, the basic facility of a subroutine call, which you use all over the place, is somehow tricky to use.

M4 is not wrong to have a quoting mechanism, but where it goes wrong is to require quoting on the majority of subroutine calls. Here's what that looks like in practice. M4 uses function-call syntax to invoke macros, so a call looks like foo(a, b, c). That's a great syntax to try, because function calls are a ubiquitous syntax that users will recognize, but it has a problem for a text macro language in that the comma is already a common character to use in the a, b, and c arguments. Having observed that, M4 could and should have moved away from the function-call notation and looked for something else. Instead, the designers stuck with function calls and augmented them with an additional kind of syntax, quotation. In practice, you usually quote all of your arguments when you pass them to an M4 macro, like this (with Autoconf-style bracket quotes): foo([a], [b], [c]). Only usually, however, and you have to think about it every time. The quotation and the macro call are two different kinds of syntax, and the user has to control them individually.
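To make the comma problem concrete, here is a toy model in Python of how a comma-separated macro call carves up its arguments, with and without quotes. It is only a sketch: the function name split_m4_args is made up for illustration, the bracket quotes follow the foo([a], [b], [c]) example above (plain m4 defaults to a backquote and an apostrophe), and real m4 does more, such as tracking parentheses and rescanning expansions.

```python
def split_m4_args(text, open_quote="[", close_quote="]"):
    """Split a macro argument string on commas that sit outside quotes.

    Toy model only (hypothetical helper, not part of m4): the outermost
    pair of quotes is stripped, nested quotes are kept, and whitespace
    around each argument is trimmed.
    """
    args, current, depth = [], [], 0
    for ch in text:
        if ch == open_quote:
            depth += 1
            if depth > 1:          # keep nested quotes, drop the outer pair
                current.append(ch)
        elif ch == close_quote and depth > 0:
            depth -= 1
            if depth > 0:
                current.append(ch)
        elif ch == "," and depth == 0:
            # An unquoted comma always ends the current argument.
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    args.append("".join(current).strip())
    return args


# Ordinary prose contains commas, so an unquoted argument falls apart:
print(split_m4_args("Hello, world, goodbye"))
# ['Hello', 'world', 'goodbye']

# The author has to remember to quote each argument:
print(split_m4_args("[Hello, world], [goodbye]"))
# ['Hello, world', 'goodbye']
```

The splitter has no way to know that "Hello, world" was meant as one piece of text, so the burden of saying so lands on the writer at every call site.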

The reason it works out better for C's preprocessor is that the C language already has a way to quote any commas that occur in the arguments, and the preprocessor understands those quoting mechanisms. For example, with sqrt(foo(x, y)), the preprocessor understands that the comma inside the (x, y) part should not count as separating the parameters to sqrt. Programmers already write those parentheses without thinking about it, because the function-call notation for foo(x, y) already requires parentheses. Unfortunately, the C preprocessor does not do the right thing for an example like swap(a[i,j], a[i,j+1]), because it does not treat square brackets the way it treats parentheses. It could and it should, however. None of this maps over to M4 very well, because usually the arguments to an M4 macro are not code, and so the author isn't going to naturally escape any commas that occur. The function-call syntax just doesn't work well for the situation M4 is intended to be used in.
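The preprocessor's rule can be modeled in a few lines of Python: a comma separates arguments only when it is outside all parentheses, and square brackets get no such protection. This is a simplified sketch with a made-up function name (split_cpp_macro_args); it ignores string and character literals, which the real preprocessor also respects.

```python
def split_cpp_macro_args(text):
    """Split the text between a macro's outer parentheses into arguments.

    Simplified model (hypothetical helper): only parentheses protect
    commas. Square brackets and braces do not.
    """
    args, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside every parenthesis ends the argument.
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    args.append("".join(current).strip())
    return args


# The argument of sqrt(foo(x, y)): the inner comma is protected.
print(split_cpp_macro_args("foo(x, y)"))
# ['foo(x, y)']

# The arguments of swap(a[i,j], a[i,j+1]): brackets do not protect commas,
# so a two-parameter macro is handed four arguments.
print(split_cpp_macro_args("a[i,j], a[i,j+1]"))
# ['a[i', 'j]', 'a[i', 'j+1]']
```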

Fixing the local problem

If we wanted to write a next-generation M4, here are some observations to start from:

It is better if the quoting syntax is built into the subroutine call syntax. That way, users don't have to independently reason about both calls and quotes, and instead can just think about call-and-quote as a single thing that they do.
Look to markup languages for inspiration, for example XML or LaTeX. The general feel we are going for is that you mostly write text, and then occasionally you sprinkle in some macro calls. That's a markup language!

Based on these, a good thing to try for an M4-like system would be to use XML tags for macro invocation. XML is a culmination of a line of tag-based markup languages starting with SGML and HTML, and it is generally state of the art for that kind of language. Among other advantages, XML is a small, minimal language that you can learn quickly, and it has explicit syntax for self-closing tags rather than some tags being self-closing and others not, in a context-dependent way depending on the schema that is currently in effect for a given file.
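As a rough illustration of tags as macro calls, here is a small Python sketch built on the standard xml.etree.ElementTree module. The macro names (shout, twice), the MACROS table, and the wrapping doc element are all invented for this example; it shows the general shape of the idea, not a design for a real tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical macro table: tag name -> function from body text to output.
MACROS = {
    "shout": lambda body: body.upper(),
    "twice": lambda body: body + body,
}


def expand(element):
    """Expand an element's children first, then apply its own macro."""
    parts = [element.text or ""]
    for child in element:
        parts.append(expand(child))
        parts.append(child.tail or "")   # plain text following the child
    body = "".join(parts)
    macro = MACROS.get(element.tag)
    # Tags with no macro simply contribute their expanded body.
    return macro(body) if macro else body


def expand_document(source):
    # Wrap the text so the parser sees a single root element.
    return expand(ET.fromstring("<doc>" + source + "</doc>"))


print(expand_document(
    "Commas, like these, are just text. <shout>hello, world</shout>!"
))
# Commas, like these, are just text. HELLO, WORLD!
```

The open and close tags do the job of both the call and the quote, so the commas inside the body never need attention. The cost is that the characters < and & now need escaping instead, which comes up far less often in ordinary prose.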

LaTeX's macro syntax is also very interesting, and it has a big advantage in usually saying each tag name just once (\section{foo}) rather than twice (<section>foo</section>). However, my experience with LaTeX is that I am in constant doubt how much lookahead text a macro invocation will really look at; the braces-based syntax is just a convention, and you never know for sure which macros really follow that convention. That said, the general syntax looks like a promising idea to me if it were locked down a little more rather than being based on the TeX macro language. A similar approach was used in Scribe, a markup language designed by Brian Reid in the 70s.

What to use for now

As things stand, I don't think M4 really has a sweet spot. Old projects that want to have an ongoing roadmap should probably move away from M4. New projects should never use it to begin with. What are the options right now, without having to build a new textual macro language?

It's not a bad option to use a general-purpose language like Python or Java. If you follow the links from the PP generic preprocessor that is used in Pandoc, they tell you they are replacing their templating with more and more Lua, a general-purpose language. When you use a general-purpose language to generate text, you can use the normal library routines your language already supports, plus a mature library of lists, maps, structs, and iteration routines on top of them. An example of this direction is the jOOQ library for generating SQL code.
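For a sense of what that looks like, here is a minimal Python sketch. The helper names (bullet_list, section) and the output layout are invented for illustration. The point is only that arguments are ordinary strings and lists, so commas in the text need no special handling.

```python
def bullet_list(items):
    """Render each item on its own line with a leading dash."""
    return "\n".join(f"- {item}" for item in items)


def section(title, body):
    """Render a title underlined with '=' characters, followed by the body."""
    underline = "=" * len(title)
    return f"{title}\n{underline}\n{body}\n"


# Commas inside the arguments are just characters in a string.
print(section("Fruits, nuts, and seeds",
              bullet_list(["apples, red", "walnuts"])))
```

The trade-off is the mirror image of a macro language: calls and data structures are easy, but the literal text now lives inside string quotes.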

Another strong approach is to use XML, possibly augmented by XSLT. An example would be the query help format of the GitHub Code Scanner, a format that I designed many years ago at a startup called Semmle. We had an existing syntax based on HTML with regex-based rewrites applied to the HTML file, and we had a problem that people were typo-ing the syntax without realizing it, resulting in help files that were sometimes unreadable and were often missing a large portion of the text. I explored a few options for getting us a more rigorous format with tool support for common errors, and I landed on XML, which I feel worked out pretty well. In addition to the format itself being nice to work with, we got to tap into the existing XML ecosystem, for example to use Eclipse's excellent XML editor.

I briefly explored JSON as well, which is another minimal syntax that is easy to learn, but I quickly realized why they call XML a markup language. Unlike with JSON, XML lets you mostly write normal text, and then as a secondary thing, add special syntax--hence, "marking up" your text. XML is also a very mature system in general, so for example we could configure Eclipse (which was a viable tool back then!) to auto-complete tags and give you errors within the editor if you used tags that aren't allowed. If I were to rewrite how Bison's skeletons work, I think something based on XML would be tempting to try. Bison already uses this approach for its debug output. I'm not sure, though; XSLT looks pretty voluminous in practice.

Some of the best options are embedded in an existing general-purpose language. JSX is embedded in JavaScript, and Scribe is embedded in Scheme. I'm not sure how practical these are if you aren't already working in those environments, but if you are, look for one that works with your current source language.

The larger lesson

An effective tool has to be evaluated in the context it will be used in. Both C and M4 use function-call notation for macro invocation, but in C it works well, while with M4 it becomes a nightmare. Achieving an effective tool, therefore, requires design thinking. You need to learn about the situation you are targeting, you need to adjust your solution based on that, and above all you need to be ready to critique your solutions and iterate on improvements. The critique can take many forms, and a really important one is to watch how your users are doing and to really reflect on why it's happening and how it could be different.

Surprises can happen anywhere, and you'll support your users more if you can act on those surprises and try something different.

Lance Fortnow laments that no matter how crotchety he gets he can't seem to give up LaTeX:

LaTeX is a great system for mathematical documents...for the 1980s. But the computing world changed dramatically and LaTeX didn't keep up. Unlike Unix I can't give it up. I write papers with other computer scientists and mathematicians and since they use LaTeX so do I. LaTeX still has the best mathematics formulas but in almost every other aspect it lags behind modern document systems.

I think LaTeX is better than he gives it credit for. I also think it could use a reboot. It really is a system from the 80s, and it's... interesting how many systems from the 70s and 80s are still the best of breed, still in wide use, but still not really getting any new development.

Here's my hate list for LaTeX:

The grammar is idiosyncratic, poorly documented, and context-dependent. There's no need for any of that. There are really good techniques nowadays for having a very extensible language nonetheless have a base grammar that is consistent in every file and supports self-documentation.
You can't draw outside the lines. For all the flexibility the system ought to have due to its macro system, I find the many layers of implementation to be practically impenetrable. Well written software can be picked up by anyone, explored, and modified. Not so with LaTeX--you have to do things exactly the way the implementers imagined, or you are in for great pain and terrible-looking output.
The error messages are often inscrutable. They may as well drop all the spew and just say, "your document started sucking somewhere around line 1308".
The documentation is terrible. The built-in documentation is hard to find and often stripped out anyway. The Internet is filled with cheesy "how to get started" guides that drop off right before they answer whatever question you have.
Installing fonts is a nightmare. There are standalone TrueType fonts nowadays. You should be able to drop in a font and configure LaTeX to use it. That this is not possible suggests that the maintainers are as afraid of the implementation as I am.
Files are non-portable and hard to extract. This problem is tied up in the implementation technology. Layered macros in a global namespace are not conducive to careful management of dependencies, so everything tends to depend on everything.

However, as bad as that list is, the pros make it worth it:

Excellent looking output, especially if you use any math. If you care enough to use something other than ASCII, I would think that the document appearance trumps just about any other concern.
Excellent collaborative editing. You can save files in version control and use file-level merge algorithms effectively. With most document systems, people end up mailing each other the drafts, which is just a miserable way to collaborate.
Scripting and macros. While you can't reasonably change LaTeX itself, what you can easily do is add extra features to the front end by way of scripts and macros.
It uses instant preview instead of WYSIWYG. WYSIWYG editors lead to quirky problems that are easy to miss in proofreading, such as headers being at the wrong level of emphasis. While I certainly want to see what the output will look like all the time, I don't want to edit that version. I want to edit the code. When you develop something you really want to be good, you want very tight control.
Scaling. Many document systems develop problems when a document is more than 10-20 pages long. LaTeX keeps on chugging for at least up to 1000-page documents.

I would love to see a LaTeX reboot. The most promising contender I know of is the Lout document formatting system, but it appears to not be actively maintained.

Lex Spoon

Sunday, June 4, 2023

Why the M4 macro syntax goes so wrong

Friday, August 19, 2011

Why LaTeX?
