Tuesday, January 22, 2013

Virtual classes

Gilad Bracha has a great post up on virtual classes:
I wanted to share a nice example of class hierarchy inheritance....All we need then, is a slight change to ThreadSubject so it knows how to filter out the synthetic frames from the list of frames. One might be able to engineer this in a more conventional setting by subclassing ThreadSubject and relying on dependency injection to weave the new subclass into the existing framework - assuming we had the foresight and stamina to use a DI framework in the first place.

I looked into virtual classes in the past, as part of my work at Google to support web app developers. Bruce Johnson put out the call to support problems like Gilad describes above, and a lot of us thought hard on it. Just replace "ThreadSubject" by some bit of browser arcana such as "WorkerThread". You want it to work one way on App Engine, and a different way on Internet Explorer, and you want to allow people to subclass your base class on each platform.

Nowadays I'd call the problem one of "product lines", having had the benefit of talking it over with Kurt Stirewalt. It turns out that software engineering and programming languages have something to do with each other. In the PL world, thinking about "product lines" leads you to coloring, my vote for one of the most overlooked ideas in PL design.

Here is my reply on Gilad's blog:

I'd strengthen your comment about type checking, Gilad: if you try to type check virtual classes, you end up wanting to make the virtual classes very restrictive, thus losing much of the benefit. Virtual classes and type checking are in considerable tension.

Also agreed about the overemphasis on type checking in PL research. Conceptual analysis matters, but it's hard to do, and it's even harder for a paper committee to review it.

I last looked into virtual classes as part of GWT and JS' (the efforts tended to go in tandem). Allow me to add to the motivation you provide. A real problem faced by Google engineers is to develop code bases that run on multiple platforms (web browsers, App Engine, Windows machines) and share most of the code. The challenge is to figure out how to swap out the non-shared code on the appropriate platform. While you can use factories and interfaces in Java, it is conceptually cleaner if you can replace classes rather than subclass them. More prosaically, this comes up all the time in testing: how many times have we all written an interface and a factory just so that we could stub something out for unit testing?
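To make the conventional workaround concrete, here is a minimal Java sketch using an interface plus a factory. The names are hypothetical--WorkerThread is borrowed from the example above, and the test-mode switch and implementation classes are invented for illustration--so treat it as a sketch of the pattern rather than any particular code base.

    // The shared code depends only on this interface.
    interface WorkerThread {
        void post(Runnable task);
    }

    // The factory is the single point where the implementation gets chosen.
    final class WorkerThreads {
        private WorkerThreads() {}

        static WorkerThread create() {
            if (Boolean.getBoolean("test.mode")) {
                return new FakeWorkerThread();     // stub for unit tests
            }
            return new ThreadedWorkerThread();     // stands in for the platform-specific version
        }
    }

    // One possible "real" binding; a browser or App Engine build would supply its own.
    final class ThreadedWorkerThread implements WorkerThread {
        public void post(Runnable task) {
            new Thread(task).start();
        }
    }

    // Test stub: run tasks synchronously so tests stay deterministic.
    final class FakeWorkerThread implements WorkerThread {
        public void post(Runnable task) {
            task.run();
        }
    }

The boilerplate is tolerable for one class, but it has to be repeated for every class you might want to swap out, which is exactly the overhead that class replacement would remove.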

I found type checking virtual classes to be problematic, despite having delved into a fair amount of prior work on the subject. From what I recall, you end up wanting *class override* as a concept distinct from *subclassing*, and you want override to be much more restrictive. Unlike with subclassing, an override cannot refine the type signature of any method of the class it overrides. In fact, even *adding* a new method is tricky; you have to be very careful about method dispatch for it to work.

To see where the challenges come from, imagine class Node having both an override and a subclass. Let's call these classes Node, Node', and LocalizedNode, respectively. Think about what virtual classes mean: at run time, Node' should, in the right circumstances, completely replace class Node. That "replacement" includes replacing the base class of LocalizedNode!

That much is already unsettling. In OO type checking, you must verify that a subclass conforms to its superclass. How do you do this if you can't see the real superclass?

To complete the trap, imagine Node has a method "name" that returns a String. Node' overrides this and--against my rules--returns type AsciiString, because its names only have 7-bit characters in them. LocalizedNode, meanwhile, overrides the name method to look up names in a translation dictionary, so it's very much using Unicode strings. Now imagine calling "name" on a variable of static type Node'. Statically, you expect to get an AsciiString back. However, at run time, this variable might hold a LocalizedNode, in which case you'll get a String. Boom.
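To make the trap concrete, here is a rough sketch in plain Java, which has no class override. NodePrime stands in for Node', AsciiString is a hypothetical 7-bit string type, and CharSequence plays the role of the general string type; the comments mark where virtual-class replacement would break the static guarantee.

    class Node {
        CharSequence name() { return "node"; }        // general contract: any string-like value
    }

    class LocalizedNode extends Node {
        @Override
        String name() { return "Knoten"; }            // Unicode, via a translation table
    }

    // A 7-bit string type, standing in for AsciiString.
    class AsciiString implements CharSequence {
        private final String data;
        AsciiString(String data) { this.data = data; }
        public int length() { return data.length(); }
        public char charAt(int i) { return data.charAt(i); }
        public CharSequence subSequence(int a, int b) { return data.subSequence(a, b); }
        public String toString() { return data; }
    }

    // In plain Java, NodePrime is just a sibling subclass of Node, so the
    // covariant return type below is perfectly safe.
    class NodePrime extends Node {
        @Override
        AsciiString name() { return new AsciiString("node"); }
    }

    class VirtualClassTrap {
        public static void main(String[] args) {
            NodePrime n = new NodePrime();
            AsciiString s = n.name();   // statically guaranteed... in plain Java
            // Under virtual classes, NodePrime would *replace* Node, becoming
            // LocalizedNode's superclass. The variable n could then hold a
            // LocalizedNode at run time, whose name() returns a plain String,
            // and the AsciiString expectation above would be violated.
            System.out.println(s);
        }
    }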

Given all this, if you want type checking, then virtual classes are in the research frontier. One reasonable response is to ditch type checking and write code the way you like. Another approach is to explore alternatives to virtual classes. One possible alternative is to look into "coloring", as in Colored FJ.

Monday, January 21, 2013

Andreas Raab

It seems we had a bad week for programmers. I learned via James Robinson that Andreas Raab has died. There is an outpouring of messages on the Squeak mailing list.

I worked with Andreas on the Squeak project several years ago, where I got to see first-hand his outstanding work. Among many other tasks, he played a leading role in actually implementing the Croquet system for eventual consistency. At the time I worked with him, he was developing Tweak, a self-initiated project for rapid GUI development.

On a selfish note, I learned a lot working with Andreas. You only get better in this industry by practicing with good people. Andreas was one of the best.

I'd be remiss not to say he was also a blast to work with. He was always laughing, yet always insightful. He always found ways to get the people around him onto a better path, and to tell them in a way they were happy to hear.

Rest in peace, Andreas. We are lucky to have had you, for as long as you could stay.

Saturday, December 29, 2012

Does IPv6 mean the end of NAT?

I frequently encounter a casual mention that, with the larger address space in IPv6, Network Address Translation (NAT)--a mainstay of wireless routers everywhere--will go away. I don't think so. There are numerous reasons to embrace path-based routing, and I believe the anti-NAT folks are myopically focusing on just one of them.

As background, what a NAT router does is multiplex multiple private IP addresses behind a single public IP address. From outside the subnet, the NAT router looks like a single machine. From inside the subnet, there are a number of machines, each with its own IP address. The NAT router allows communication between the inside and outside worlds by swizzling IP addresses and ports as connections go through it. That's why it is a "net address translator" -- it translates between public IPs and private IPs.
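For illustration only, here is a toy Java sketch of the translation table such a router keeps. Real NAT implementations also track protocols, connection state, and timeouts, none of which appear here; all the names are invented.

    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import java.util.Map;

    // Toy NAT table: maps each private (address, port) pair to a public port.
    final class NatTable {
        private final InetSocketAddress publicAddress;
        private final Map<InetSocketAddress, Integer> outbound = new HashMap<>();
        private final Map<Integer, InetSocketAddress> inbound = new HashMap<>();
        private int nextPublicPort = 40000;

        NatTable(InetSocketAddress publicAddress) {
            this.publicAddress = publicAddress;
        }

        // Outgoing connection: rewrite the private source to the shared public address.
        InetSocketAddress translateOutbound(InetSocketAddress privateSource) {
            Integer port = outbound.get(privateSource);
            if (port == null) {
                port = nextPublicPort++;
                outbound.put(privateSource, port);
                inbound.put(port, privateSource);
            }
            return new InetSocketAddress(publicAddress.getAddress(), port);
        }

        // Incoming reply: the public port tells us which private machine to forward to.
        InetSocketAddress translateInbound(int publicPort) {
            return inbound.get(publicPort);
        }
    }

From the outside, every connection appears to come from the one public address; the mapping from public ports back to private machines never leaves the router.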

My first encounter with NAT was to connect multiple machines to a residential ISP. It was either a cable company or a phone company; I forget which. The ISP in question wanted to charge extra for each device connected within the residential network. That is, if you connect two computers, you should pay more than if you connect one computer. I felt, and still feel, that this is a poor business arrangement. The ISP should concern itself with where I impose costs on it, which is via bandwidth. If I take a print server from one big box and move it onto its own smaller computer, then I need a new IP address, but that shouldn't matter at all to the ISP. By using NAT--in my case, Linux's "masquerading" support--the ISP doesn't even know.

This example broadens to a concern one could call privacy. What an organization does within its own network is its own business. Its communication with the outside world should be through pre-agreed protocols that, to the extent feasible, do not divulge decisions that are internal to the organization. It shouldn't matter to the general public whether each resident has their own machine, or whether they are sharing, or whether the residents have all bought iPads to augment their other devices.

For larger organizations, privacy leads to security. If you want to break into an organization's computer infrastructure, one of the first things you want to do is feel out the topology of its network. Unless you use NAT at the boundary between your organization's network and the general internet, you are exposing your internal network topology to the world. You are giving an attacker an unnecessary leg up.

You could also view these concerns from the point of view of modularity. The public network protocol of an organization is an interface. The internal decisions within the organization are an implementation. If you want everything to hook up reliably, then components should depend on interfaces, not implementations.

Given these concerns, I see no reason to expect NAT to go away, even given an Internet with a larger address space. It's just sensible network design. Moreover, I wish that the IETF would put more effort into direct support for NAT. In particular, the NAT of today is unnecessarily weak when it comes to computers behind different NATing routers making direct connections with each other.

It is an understatement to say that not everyone agrees with me. Vint Cerf gave an interview earlier this year where he repeatedly expressed disdain for NAT.

"But people had not run out of IPv4 and NAT boxes [network address translation lets multiple devices share a single IP address] were around (ugh), so the delay is understandable but inexcusable."

Here we see what I presume is Cerf's main viewpoint on NAT: it's an ugly mechanism that is mainly used to avoid address exhaustion.

Elsewhere in the interview, the interviewer asks directly about NAT and security:

"One of the benefits of IPv6 is a more direct architecture that's not obfuscated by the address-sharing of network address translation (NAT). How will that change the Internet? And how seriously should we take security concerns of those who like to have that NAT as a layer of defense?"

Cerf replies:

"Machine to machine [communication] will be facilitated by IPv6. Security is important; NAT is not a security measure in any real sense. Strong, end-to-end authentication and encryption are needed. Two-factor passwords also ([which use] one-time passwords)."

I respectfully disagree with the comment about security. I suspect his point of view is that you can just as well use firewall rules to block incoming connections. Speaking as someone who has set up multiple sets of firewall rules, I can attest that they are fiddly and error prone. You get a much more reliable guarantee against incoming connections if you use a NAT router.

In parting, let me note a comment in the same interview:

"Might it have been possible to engineer some better forwards compatibility into IPv4 or better backwards compatibility into IPv6 to make this transition easier?"

"We might have used an option field in IPv4 to achieve the desired effect, but at the time options were slow to process, and in any case we would have to touch the code in every host to get the option to be processes... Every IPv4 and IPv6 packet can have fields in the packet that are optional -- but that carry additional information (e.g. for security)... We concluded (perhaps wrongly) that if we were going to touch every host anyway we should design an efficient new protocol that could be executed as the mainline code rather than options."

It is not too late.

Thursday, December 27, 2012

Windows 8 first impressions

An acquaintance of mine got a Windows 8 machine for Christmas, and so I got a chance to take a brief look at it. Here are some first impressions.

Windows 8 uses a tile-based UI that was called "Metro" during development. As a brief overview, the home page on Windows 8 is no longer a desktop, but instead features a number of tiles, one for each of the machine's most prominent applications. Double-clicking on a tile causes it to maximize and take the whole display. There is no visible toolbar, no visible taskbar, no overlapping of windows. In general, the overall approach looks very reasonable to me.

The execution is another story. Let me hit a few highlights.

First, the old non-tiled desktop interface is still present, and Windows drops you into it from time to time. You really can't avoid it, because even the file browser uses the old mode. I suppose Microsoft shipped this way due to concerns about legacy software, but it's really bad for users. They have to learn not only the new tiles-based user interface but also the old desktop one. Thus it's a doubly steep learning curve compared to other operating systems, and it's a jarring user experience as the user goes from one UI to the other.

An additional problem is a complete lack of guide posts. When you switch to an app, you really switch to the app. The tiles-based home page goes away, and the new app fills the entire screen, every single pixel. There is no title bar, no application bar, nothing. You don't know what the current app is except by studying its current screen and trying to recognize it. You have no way at all to know which tile on the home page got you here; you just have to remember. The UI really needs some sort of guide posts to tell you where you are.

The install process is bad. When you start it, it encourages you to create a Microsoft account and log in using that. It's a long process, including an unnecessary CAPTCHA; this process is on the critical path and should be made short and simple. Worse, though, I saw it outright crash at the end. After a hard reboot, it went back into the "create a new account" sequence, but after all the information from before was entered again, it hit a dead end and said the account was already in use on this computer. This error state is bad in numerous ways. It shouldn't even have jumped into the create-account sequence with an account already present. Worse, the error message indicates that the software knows exactly what the user is trying to do. Why provide an error message rather than simply logging them in?

Aside from those three biggies, there are also a myriad of small UI details that seem pointlessly bad:

  • The UI uses a lot of pullouts, but those pullouts are completely invisible unless you know the magic gesture and the magic place on the screen to do it. Why not include a little grab handle off on the edge? It would use a little screen space and add some clutter, but for the main pullouts the user really needs to know they are there.
  • In the web browser, they have moved the navigation bar to the bottom of the screen. This breaks the expectations of anyone who has ever used another computer or smartphone. In exchange for those broken expectations, I can see no benefit; it's the same amount of screen space either way.
  • The "support" tile is right on the home page, which is a nice touch for new users. However, when you click it the first time, it dumps you into the machine registration wizard. Thus, it interrupts your already interrupted work flow with another interruption. It reminds me of an old Microsoft help program that, the first time you ran it, asked you about the settings you wanted to use for the search index.

On the whole, I know I am not saying anything new, but it strikes me that Microsoft would benefit from more time spent on their user interfaces. The problems I'm describing don't require any deep expertise in UIs. All you have to do is try the system out and then fix the more horrific traps that you find. I'm guessing the main issue here is in internal budgeting. There is a temptation to treat "code complete" as the target and to apportion your effort toward getting there. Code complete should not be the final state, though; if you think of it that way, you'll inevitably ship UIs that technically work but are full of land mines.

Okay, I've tried to give a reasonable overview of first impressions. Forgive me if I close with something a little more fun: Windows 8 while drunk.

Saturday, December 15, 2012

Recursive Maven considered harmful

I have been strongly influenced by Peter Miller's article Recursive Make Considered Harmful. Peter showed that if you used the language of make carefully, you could achieve two very useful properties:
  • You never need to run make clean.
  • You can pick any make target, in your entire tree of code, and confidently tell make just to build that one target.

Most people don't use make that way, but they should. More troubling, they're making the same mistakes with newer build tools.

What most people do with Maven, to name one example, is to add a build file for each component of their software. To build the whole code base, you go through each component, build it, and put the resulting artifacts into what is called an artifact repository. Subsequent builds pull their inputs from the artifact repository. My understanding of Ant and Ivy, and of SBT and Ivy, is that those build-tool combinations are typically used in the same way.

This arrangement is just like recursive make, and it leads to just the same problems. Developers rebuild more than they need to, because they can't trust the tool to build just the right stuff, so they waste time waiting on builds they didn't really need to run. Worse, these defensive rebuilds get checked into the build scripts, so as to "help" other programmers, making build times bad for everyone. On top of it all, even for all this defensiveness, developers will sometimes fail to rebuild something they needed to, in which case they'll end up debugging with stale software and wondering what's going on.

On top of the other problems, these manually sequenced builds are impractical to parallelize. You can't run certain parts of the build until certain other parts are finished, but the tool doesn't know what the dependencies are. Thus the tool can't parallelize it for you, not on your local machine, not using a build farm. Using a standard Maven or Ivy build, the best, most expensive development machine will peg just one CPU while the others sit idle.

Fixing the problem

Build tools should use a build cache, emphasis on the cache, for propagating results from one component to another. A cache is an abstraction that allows computing a function more quickly based on partial results computed in the past. The function, in this case, is for turning source code into a binary.

A cache does nothing except speed things up. You could remove a cache entirely and the surrounding system would work the same, just more slowly. A cache has no side effects, either. No matter what you've done with the cache in the past, the same query will give back the same value in the future.

The Maven experience is very different from what I describe! Maven repositories are used like caches, but without having the properties of caches. When you ask for something from a Maven repository, it very much matters what you have done in the past. It returns the most recent thing you put into it. It can even fail, if you ask for something before you put it in.

What you want is a build cache. Whereas a Maven repository is keyed by component name and version number (and maybe a few more things), a build cache is keyed by a hash code over the input files and a command line. If you rebuild the same source code but with a slightly different command, you'll get a different hash code even though the component name and version are the same.

To make use of such a cache, the build tool needs to be able to deal sensibly with cache misses. To do that, it needs a way to see through the cache and run recursive build commands for things that aren't already present in the cache. There are a variety of ways to implement such a build tool. A simple approach, as a motivating example, is to insist that the organization put all source code into one large repository. This approach easily scales to a few dozen developers. For larger groups, you likely want some form of two-layer scheme, where a local check-out of part of the code is virtually overlaid over a remote repository.
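Here is a rough Java sketch of that keying scheme, with names and an on-disk layout made up for illustration. The point is just that the key is derived from the command line plus the exact contents of the input files, so any change to either one produces a different key; on a miss, the build tool runs the build and stores the result under the same key.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.List;

    final class BuildCache {
        private final Path cacheDir;

        BuildCache(Path cacheDir) { this.cacheDir = cacheDir; }

        // Key = hash(command line + contents of every input file).
        static String cacheKey(List<String> commandLine, List<Path> inputs)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String arg : commandLine) {
                digest.update(arg.getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0);
            }
            for (Path input : inputs) {                 // assume a deterministic ordering
                digest.update(input.toString().getBytes(StandardCharsets.UTF_8));
                digest.update(Files.readAllBytes(input));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        // Hit: reuse the artifact. Miss: the build tool runs the (possibly
        // recursive) build command, then calls store() with the result.
        Path lookup(String key) {
            Path artifact = cacheDir.resolve(key);
            return Files.exists(artifact) ? artifact : null;
        }

        void store(String key, Path builtArtifact) throws IOException {
            Files.copy(builtArtifact, cacheDir.resolve(key));
        }
    }

Note that nothing here depends on component names or version numbers; two checkouts that build the same bits get the same key, and a one-character source change gets a new one.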

Hall of fame

While the most popular build tools do not have a proper build cache, there are a couple of lesser known ones that do. One such is the internal Google Build System. Google uses a couple of tricks to get the approach working well for themselves. First, they use Perforce, which allows having all code in one repository without all developers having to check the whole thing out. Second, they use a FUSE filesystem that allows quickly computing hash codes over large numbers of input files.

Another build tool that gets this right is the Nix build system. Nix is a fledgling build tool that began as a Ph.D. project at Utrecht University. It's available open source, so you can play with it right now. My impression is that it has a good core but that it is not very widely used, and thus that you might well run into sharp corners.

How we got here

Worth pondering is how decades of build tool development have left us all using such impoverished tools. OS kernels have gotten better. C compilers have gotten better. Editors have gotten better. IDEs have gotten worlds better. Build tools? Build tools are still poor.

When I raise that question, a common response I get is that build tools are simply a fundamentally miserable problem. I disagree. I've worked with build tools that don't have these problems, and they simply haven't caught on.

My best guess is that, in large part, developers don't know what they are missing. There's no equivalent, for build tools, of using Linux in college and then spreading the word once you graduate. Since developers don't even know what a build tool can be like, they instead work on adding features. Thus you see build tool authors advertising that they support Java and Scala and JUnit and Jenkins and on and on and on with a very long feature list.

Who really cares about features in a build tool, though? What I want in a build tool is what I wrote to begin with, what was described over a decade ago by Peter Miller: never run a clean build, and reliably build any sub-target you like. These are properties, not features, and you don't get properties by accumulating more code.

Saturday, November 24, 2012

Changing views toward recorded music

I frequently encounter the following argument, in this case voiced by Terence Eden:
Imagine, just for a moment, that your Sony DVD player would only play Sony Movies' films. When you decided to buy a new DVD player from Samsung, none of those media files would work on your new kit without some serious fiddling. That's the walled garden that so many companies are now trying to drag us into. And I think it stinks.

I agree as far as it goes. Many people are being drawn into walled gardens, and walled gardens aren't as good as open alternatives. I am particularly worried about the rise of Facebook, a site that is openly dismissive of rights such as privacy and pseudonymity.

I am less worried about walled gardens for music because I think about music differently. Let me describe two relevant changes.

First, copies of music are now very easy to replace. Aside from the price being low, delivery is now instant: you can click on a song on Amazon or iTunes and have it right away. As such, the value of a stockpile of music copies is much lower than it used to be; I haven't pulled out my notebook of carefully accumulated and alphabetized CDs in well over a year.

I saw the same thing happen a decade ago in a much smaller media market: academic papers. For most of the 20th century, anyone who followed academic papers kept a shelf full of journals and a filing cabinet full of individual papers. That changed about a decade ago, when I started encountering one person after another who had a box full of papers that they never looked into. Note I said box, not cabinet: they had moved offices more recently than they'd gone fishing for a printed copy, so the papers were all still in a big box from their last move.

The second change is that I have been mulling over how a reasonable IP regime might work for music. While selling copies has been a big part of the music market in our lifetimes, it's a relatively recent development in the history of professional music. We shouldn't feel attached to it in the face of technological change. There are a number of models that work better for music than buying copies, including Pandora and--hypothetically--Netflix for music.

Selling copies has not been particularly good for music in our culture. Yes, it provides a market at all, and for that I am grateful. However, it's a market at odds with how music works. Music is transient, something that exists in time and then goes away. Copies are not: they are enshrined forever in their current form, like a photograph of a cherished moment. On the listeners' side, the copy-based market has led us to listen to the same recordings over and over. On the performers' side, we have a winner-takes-all market where the term "rock star" was born.

We would be better off with a market for music that is more aligned with performance than with recordings. Imagine we switched to something like Pandora and completely discarded digital copyright. Musicians would no longer be able to put out a big hit and then just rake in the money indefinitely. They'd have to keep performing, and they'd have to compete with other performers that are covering their works for free. I expect a similar amount of money would be in the market, just spread more evenly across the producers. Listeners, meanwhile, would have a much more dynamic and vibrant collection of music to listen to--a substantial public good. Yes, such a scenario involves walled gardens, but that's a lesser evil than digital copyright.

Sunday, October 21, 2012

Source Maps with Non-traditional Source Code

I recently explored using JavaScript source maps with a language very different from JavaScript. Source maps let developers debug in a web browser while still looking at original source code, even if that source code is not JavaScript. A lot of programming languages support them nowadays, including Dart, Haxe, and CoffeeScript.

In my case, I found it helpful to use "source" code that is different from what the human programmers typed into a text editor and fed to the compiler. This post explains why, and it gives a few tricks I learned along the way.

Why virtual source?

It might seem obvious that the source map should point back to original source code. That's what the Closure Tools team designed it for, and for goodness' sake, it's called a source map. That's the approach I started with, but I ran into some difficulties that eventually led me to a different approach.

One difficulty is a technical one. When you place a breakpoint in Chrome on a file mapped via a source map, it places one and only one breakpoint in the emitted JavaScript code. That works fine for a JavaScript-to-JavaScript compiler, but I was compiling from Datalog. In Datalog, there are cases where the same line of source code is used in multiple places in the output code. For example, Datalog rules are run in two different modes: once during the initial bootstrapping of a database instance, and later during an Orwellian "truth maintenance" phase. With a conventional source map, it is only possible to breakpoint one of the instances, and the developer doesn't even know which one they are getting.

That problem could be fixed by changes to WebKit, but there is a larger problem: the behavior of the code is different in each of its variants. For example, the truth maintenance code for a Datalog rule has some variants that add facts and some that remove them. A programmer trying to make sense of a single-stepping session needs to know not just which rule they have stopped on, but which mode of evaluation that rule is currently being used in. There's nothing in the original source code that can indicate this difference; in the source code, there's just one rule.

As a final cherry on top of the excrement pie, there is a significant amount of code in a Datalog runtime that doesn't have any source representation at all. For example, data input and data output do not have an equivalent in source code, but they are reasonable places to want to place a breakpoint. For a source map pointing to original source code, I don't see a good way to handle such loose code.

A virtual source file solves all of the above problems. The way it works is as follows. The compiler emits a virtual source file in addition to the generated JavaScript code. The virtual source file is higher-level than the emitted JavaScript code, enough to be human readable. However, it is still low-level enough to be helpful for single-step debugging.

The source map links the two forms of output together. For each character of emitted JavaScript code, the source map maps it to a line in the virtual source file. Under normal execution, web browsers use the generated JavaScript file and ignore the virtual source file. If the browser drops into a debugger--via a breakpoint, for example--then it will show the developer the virtual source file rather than the generated JavaScript code. Thus, the developer has the illusion that the browser is directly running the code in the virtual source file.

Tips and tricks

Here are a few tips and tricks I ran into that were not obvious at first.

Put a pointer to the original source file for any code where such a pointer makes sense. That way, developers can easily go find the original source file if they want to know more context about where the code in question came from. Here's the kind of thing I've been using:

    /* browser.logic, line 28 */

Also, for the sake of your developers' sanity, each character of generated JavaScript code should map to some part of the source code. Any code you don't explicitly map will end up implicitly pointing to the previous line of virtual source that does have a map. If you can't think of anything to put in the virtual source file, then try a blank line. The developer will be able to breakpoint and single-step that blank line, which might initially seem weird. It's less weird, though, than giving the developer incorrect information.

Choose your JavaScript variable names carefully. I switched generated temporaries to start with "z$" instead of "t$" so that they sort down at the bottom of the variables list in the Chrome debugger. That way, when an app developer looks at the list of variables in a debugger, the first thing their eyes encounter are their own variables.

Emit variable names into the virtual source file, even when they seem redundant. It provides an extra cue for developers as they mentally map what they see in the JavaScript stack trace and what they see in the virtual source file. For example, here is a line of virtual source code for inputting a pair of values to the "new_input" Datalog predicate; the "value0" and "value1" variables are the generated variable names for the pair of values in question.

    INPUT new_input(value0, value1)

Implementation approach

Implementing a virtual source file initially struck me as a cross-cutting concern that was likely to turn the compiler code into a complete mess. However, here is an approach that makes it not so bad.

The compiler already has an "output" stream threaded through all classes that do any code generation. The trick is to augment the class used to implement that stream with a couple of new methods:

  • emitVirtual(String): emit text to the virtual source file
  • startVirtualChunk(): mark the beginning of a new chunk of output

With this extended API, working with a virtual source file is straightforward and non-intrusive. Most compiler code remains unchanged; it just writes to the output stream as normal. Around each human-comprehensible chunk of output, there is a call to startVirtualChunk() followed by a few calls to emitVirtual(). For example, whenever the compiler is about to emit a Datalog rule, it first calls startVirtualChunk() and then pretty prints the code to the emitVirtual() stream. After that, it emits the output JavaScript.

With this approach, the extended output stream becomes a single point where the source map can be accumulated. Since this class intercepts writes to both the virtual file and the final generated JavaScript file, it is in a position to maintain a mapping between the two.
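As a sketch of what that single point might look like, here is a small Java class with made-up names; only emitVirtual and startVirtualChunk come from the description above. It buffers both outputs and records one mapping entry per chunk, which a later pass would serialize into the actual source map format.

    import java.util.ArrayList;
    import java.util.List;

    final class MappingOutputStream {
        static final class Mapping {
            final int generatedLine;   // line in the generated JavaScript
            final int virtualLine;     // line in the virtual source file
            Mapping(int generatedLine, int virtualLine) {
                this.generatedLine = generatedLine;
                this.virtualLine = virtualLine;
            }
        }

        private final StringBuilder generatedJs = new StringBuilder();
        private final StringBuilder virtualSource = new StringBuilder();
        private final List<Mapping> mappings = new ArrayList<>();
        private int generatedLine = 0;
        private int virtualLine = 0;

        // Ordinary code generation calls this, exactly as before.
        void emit(String js) {
            generatedJs.append(js);
            generatedLine += countNewlines(js);
        }

        // Human-readable text destined for the virtual source file.
        void emitVirtual(String text) {
            virtualSource.append(text);
            virtualLine += countNewlines(text);
        }

        // Everything emitted from here until the next chunk maps back to the
        // current position in the virtual source file.
        void startVirtualChunk() {
            mappings.add(new Mapping(generatedLine, virtualLine));
        }

        private static int countNewlines(String s) {
            int count = 0;
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) == '\n') count++;
            }
            return count;
        }
    }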

The main downside to this approach is that the generated file and the virtual source file must put everything in the same order. In my case, the compiler is emitting code in a reasonable order, so it isn't a big deal.

If your compiler rearranges its output in some wild and crazy order, then you might need to do something different. One approach that looks reasonable is to build a virtual AST while emitting the main source code, and then only convert the virtual AST to text once it is all accumulated. The startVirtualChunk() method would take a virtual AST node as an argument, thus allowing the extended output stream to associate each virtual AST node with one or more ranges of generated JavaScript code.