Thursday, June 9, 2011

Two kinds of type inference

There are two separate lines of work on type inference. While they are superficially similar, they face very different design constraints. Let me explain a few of those differences. To begin with, consider the following program. The two kinds of inference reach different conclusions about it.
  class Foo {
    var x: Object = "a string"
    var y: String = "initial value"
    def copy() {
      y = x    // type error, or no?
    }
  }

Should this example type check? There are two points of view.

One point of view is that the compiler is able to prove that x only ever holds strings. Therefore y only ever holds strings. Thus, there will never be a type error at run time, and so the code can be allowed to run as is. This point of view might be called information gathering. The tool analyzes the code, typically the whole program, and learns type information about that code.

Another point of view is that x holds objects and y holds strings, so the "x = y" line is a problem. Yes, the current version of x only holds circles. However, the "x = y" line is using x outside of its spec, and you don't want to use a variable outside of its spec even if you can temporarily get away with it. This point of view might be called slots and tabs. The slot y is not specced to hold the tab x. From this point of view, the indicated line is an error even though the program doesn't have any real problems.

Every user-facing type checker I am familiar with is based on the slots and tabs point of view. The idea is that programmers use types to provide extra structure to their programs. They don't want even nominal violations of the structure; they really want their types to line up the way they say they do. As a concrete example, imagine the author of Foo checks in their code, and someone else adds to the program the following statement: "foo.x = Integer.valueOf(12)". Now the nominal type error has become a real one, but the author of Foo has already gone home. It's better if the author of Foo found out the problem rather than someone else.

That's one example difference between the two kinds of type inference. A slots-and-tab checker will flag errors that an information-gatherer would optimize away. Here are three other design constraints that differ between the two.

Declarations are important for a type checker. For the type checker to know what the slots and tabs are specced as, it must have declared types. In the above example, if x and y did not have declared types on them, then the type checker for class Foo could not determine that there is a problem. To contrast, an information gatherer doesn't necessarily pay much attention to declarations. It can usually infer better information by itself, anyway.

Changing a published type checker breaks builds. For a language under development, once a type checker has been published, it takes great care to change it without breaking any existing builds. Consider the addition of generics to Java 1.5, where it took a great deal of energy and cleverness to make it backwards compatible with all the existing Java code in the world. To contrast, an information gathering type inference can be swapped around at whim. The only impact will be that programs optimize better or worse or faster or slower than before.

Type checkers must be simple. The type system of a slots-and-tabs type checker is part of the contract betwen the compiler and a human developer. Human beings have to understand these things, human beings that for the most part have something better to do with their time than study types. As a result, there is tremendous design pressure on a slots-and-tabs type checker to make the overall system simple to understand. To contrast, the sky is the limit for an information gatherer. The only people who need to understand it are the handful of people developing and maintaining it.

Overall, I wish there was some term in common use to distinguish between these two kinds of type inferencers. Alas, both kinds of them infer things, and the things both of them infer are types, so the terminology seems inevitable. The best that can be done is to strive to understand which kind of type inferencer one is working with. Developers on one or the other face different design constraints, and they will find different chunks of the published literature to be relevant.

No comments: