Sunday, October 21, 2012

Source Maps with Non-traditional Source Code

I recently explored using JavaScript source maps with a language very different from JavaScript. Source maps let developers debug in a web browser while still looking at original source code, even if that source code is not JavaScript. A lot of programming languages support them nowadays, including Dart, Haxe, and CoffeeScript.

In my case, I found it helpful to use "source" code that is different from what the human programmers typed into a text editor and fed to the compiler. This post explains why, and it gives a few tricks I learned along the way.

Why virtual source?

It's might seem obvious that the source map should point back to original source code. That's what the Closure Tools team designed it for, and for goodness' sake, it's called a source map. That's the approach I started with, but I ran into some difficulties that eventually led me to a different approach.

One difficulty is a technical one. When you place a breakpoint in Chrome on a file mapped via a source map, it places one and only one breakpoint in the emitted JavaScript code. That works fine for a JavaScript-to-JavaScript compiler, but I was compiling from Datalog. In Datalog, there are cases where the same line of source code is used in multiple places in the output code. For example, Datalog rules are run in two different modes: once during the initial bootstrapping of a database instance, and later during an Orwellian "truth maintenance" phase. With a conventional source map, it is only possible to breakpoint one of the instances, and the developer doesn't even know which one they are getting.

That problem could be fixed by changes to WebKit, but there is a larger problem: the behavior of the code is different in each of its variants. For example, the truth maintenance code for a Datalog rule has some variants that add facts and some that remove them. A programmer trying to make sense of a single-stepping session needs to know not just which rule they have stopped on, but which mode of evaluation that rule is currentlty being used in. There's nothing in the original source code that can indicate this difference; in the source code, there's just one rule.

As a final cherry on top of the excrement pie, there is a significant amount of code in a Datalog runtime that doesn't have any source representation at all. For example, data input and data output do not have an equivalent in source code, but they are reasonable places to want to place a breakpoint. For a source map pointing to original source code, I don't see a good way to handle such loose code.

A virtual source file solves all of the above problems. The way it works is as follows. The compiler emits a virtual source file in addition to the generated JavaScript code. The virtual source file is higher-level than the emitted JavaScript code, enough to be human readable. However, it is still low-level enough to be helpful for single-step debugging.

The source map links the two forms of output together. For each character of emitted JavaScript code, the source map maps it to a line in the virtual source file. Under normal execution, web browsers use the generated JavaScript file and ignore the virtual source file. If the browser drops into a debugger--via a breakpoint, for example--then it will show the developer the virtual source file rather than the generated JavaScript code. Thus, the developer has the illusion that the browser is directly running the code in the virtual source file.

Tips and tricks

Here are a few tips and tricks I ran into that were not obvious at first.

Put a pointer to the original source file for any code where such a pointer makes sense. That way, developers can easily go find the original source file if they want to know more context about where the code in question came from. Here's the kind of thing I've been using:

    /* browser.logic, line 28 */

Also, for the sake of your developers' sanity, each character of generated JavaScript code should map to some part of the source code. Any code you don't explicitly map will end up implicitly pointing to the previous line of virtual source that does have a map. If you can't think of anything to put in the virtual source file, then try a blank line. The developer will be able to breakpoint and single-step that blank line, which might initially seem weird. It's less weird, though, than giving the developer incorrect information.

Name your JavaScript variable names carefully. I switched generated temporaries to start with "z$" instead of "t$" so that they sort down at the bottom of the variables list in the Chrome debugger. That way, when an app developer looks at the list of variables in a debugger, the first thing their eyes encounter are their own variables.

Emit variable names into the virtual source file, even when they seem redundant. It provides an extra cue for developers as they mentally map what they see in the JavaScript stack trace and what they see in the virtual source file. For example, here is a line of virtual source code for inputting a pair of values to the "new_input" Datalog predicate; the "value0" and "value1" variables are the generated variable names for the pair of values in question.

    INPUT new_input(value0, value1)

Implementation approach

Implementing a virtual source file initially struck me as a cross-cutting concern that was likely to turn the compiler code into a complete mess. However, here is an approach that makes it not so bad.

The compiler already has an "output" stream threaded through all classes that do any code generation. The trick is to augment the class used to implement that stream with a couple of new methods:

  • emitVirtual(String): emit text to the virtual source file
  • startVirtualChunk(): mark the beginning of a new chunk of output

With this extended API, working with a virtual source file is straightforward and non-intrusive. Most compiler code remains unchanged; it just writes to the output stream as normal. Around each human-comprehensible chunk of output, there is a call to startVirtualChunk() followed by a few calls to emitVirtual(). For example, whenever the compiler is about to emit a Datalog rule, it first calls startVirtualChunk() and then pretty prints the code to the emitVirtual() stream. After that, it emits the output JavaScript.

With this approach, the extended output stream becomes a single point where the source map can be accumulated. Since this class intercepts writes to both the virtual file and the final generated JavaScript file, it is in a position to maintain a mapping between the two.

The main downside to this approach is that the generated file and the virtual source file must put everything in the same order. In my case, the compiler is emitting code in a reasonable order, so it isn't a big deal.

If your compiler rearranges its output in some wild and crazy order, then you might need to do something different. One approach that looks reasonable is to build a virtual AST while emitting the main source code, and then only convert the virtual AST to text once it is all accumulated. The startVirtualChunk() method would take a virtual AST node as an argument, thus allowing the extended output stream to associate each virtual AST node with one or more ranges of generated JavaScript code.