Friday, June 24, 2011

Against manual formatting

Programming languages are practically always expressed in text, and thus they provide the programmer with a lot of flexibility in the exact sequence of characters used to represent any one program. Examples of manual formatting are:
  • How many blank lines to put between different program elements
  • Whether to put the body of an if on the same line or a new line
  • Whether to sort public members before private members
  • Whether to sort members of a class top-down or bottom-up, breadth-first or depth-first
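To make the flexibility concrete, here is a small made-up Scala snippet showing the same class laid out two different ways; the class and its members are hypothetical, and neither layout is "the" right one:

  // Hypothetical example: two layouts of the same class, both legal.

  // Layout A: method first, blank lines between members, if body on its own line.
  class Counter {
    def increment(): Unit = {
      if (enabled)
        count += 1
    }

    private var count = 0
    private var enabled = true
  }

  // Layout B: fields first, no blank lines, if body on the same line.
  class Counter2 {
    private var count = 0
    private var enabled = true
    def increment(): Unit = { if (enabled) count += 1 }
  }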
I've come to think we are better off not taking advantage of this flexibility. Manual formatting takes a significant amount of time, and for well-written code, its benefits are small. First consider the time. Manual formatting costs time first during the initial entry of code. Programmers get their code to work, and then they have to spend time deciding what order to put everything in. It's perhaps not a huge amount of time, but it is time nonetheless. It costs time again when people edit the code. The extract-method refactoring takes very little time for a programmer using an IDE, but if the code is manually formatted, the programmer must then consider where to place the newly created method. It will take much longer to place the new method than it did to create it.

Worse, sometimes the presentation the first programmer used no longer makes sense after the edits the second programmer has made. In that case, the second programmer has to come up with a new organization. They also have to spend time evaluating whether a new organization is worthwhile at all; to do that, they first have to spend time with the existing format, trying to make it work. Time is being taxed away all over the place.

Meanwhile, what is the benefit? I posit that in well-written code, any structural unit should have on the order of 10 elements within it. A package should have about 10 classes, a class should have about 10 members, and a method should have about 10 statements. If a class has 50 members, it's way too big. If a method has 50 statements, it, too, is way too big. The code will be improved if you break the large units up into smaller ones.

Once you've done that, the benefit of manual formatting becomes very small. If you are talking about a class with ten members, who cares what order they are presented in? If a method has only five statements, does it matter how much spacing there is between them? Indeed, if the code is auto-formatted, those five statements can be mentally parsed especially quickly. The human mind is an extraordinary pattern matcher, and it matches patterns it has seen many times before all the faster.

I used to argue that presentation is valuable because programs are read more than written. However, then I tried auto-formatting and auto-sorting for a few months, and it was like dropping a fifty-pound backpack that I'd been carrying around. Yes, it's possible to walk around like that, and you don't even consciously think about it after a while, but it really slows you down. What I overlooked in the past was that lexical formatting is not the only way to improve the presentation of a program. Instead of carefully formatting a large method, good programmers already divide large methods into smaller ones. Once that is done, manual formatting just doesn't have much left to do. So don't bother. Spend the time somewhere that has a larger benefit.

Wednesday, June 15, 2011

Universities behind borders

Ronaldo Lemos has written a thought-provoking article about the role of non-Brazilians in a Brazilian university. He interviews Volker Grassmuck, a German professor working at a Brazilian university, who feels that the result is intellectually insular:
People read the international literature in the fields I’m interested in. But without having actual people to enter into a dialogue with, this often remains a reproduction or at best an application of innovations to Brazil.

That is what I would expect. Academics want to work with other academics in the same specialty, and you aren't going to be able to build such groups if you can only hire from the immediate locale. The people you really want will be elsewhere, and the people you get will spend their time trying to emulate them.

I feel that the U.S. is currently too strict on foreign workers for our own good, but we seem to be better off than Brazil. In the U.S., once they finish groping you and making sure you aren't going to take a job in manual labor, you can do as you will. In Brazil, they make you redo your Ph.D. examinations, much like American states do with professional certifications such as dentistry and accounting.

It all seems very dirty to me. If you want a good intellectual atmosphere, you need to admit people from all over.

Friday, June 10, 2011

Peak IPv4?

In its announcement about "IPv6 Day", the Internet Society (ISOC) casually remarked that IPv4 addresses are about to run out:
With IPv4 addresses running out this year, the industry must act quickly to accelerate full IPv6 adoption or risk increased costs and limited functionality online for Internet users everywhere.

This is highly misleading, and the recommended solution is not a good idea. I am sure that, if pressed, the ISOC would respond that by "running out", they mean it in the technical sense that some registrar or another has now given away all of its addresses. However, the actual wording implies something far different. It implies that if you or I try to get an IPv4 address on January 1, 2012, we won't be able to do so. That outcome is highly unlikely.

A better way to think about the issue is via price theory. From the price theory point of view, the number of IPv4 addresses at any time is finite, and each address has a specific owner. At the current time, every valid address is owned by some entity or another. (Thus, in some sense they "ran out" a long time ago.)

When a new person wants to get an IPv4 address for their own use, they must obtain the rights from some entity that already has one. While some large organizations can use political mechanisms to gain an IPv4 address, most people must purchase or rent the address from some entity that already owns one. Typically those IP addresses are bundled with a service contract that provides Internet bandwidth, though in some cases addresses can be purchased by themselves.

The price one pays gives us a way to think about the scarcity of addresses. Diamonds are relatively scarce, and their price is correspondingly high. Boeing 747s are even more scarce, and their price is even higher. For IP addresses, the price is surely rising over time as more and more people hook things up to the Internet. Already the price is high enough that, for example, most home users do not assign a separate publicly routable address to every IP device in their home. They make do with a single IP address from their Internet provider.

What is that price right now? The question is crudely phrased, because some addresses are more valuable than others, and all addresses come with some sort of strings attached. However, we can get a ballpark idea by considering a few data points:
  • Linode offers its subscribers an extra IP address for $1/month.
  • Linode offers an IP address along with an Internet hosting service for under $20/month.
  • Broadband providers such as Comcast and AT&T offer an IP address along with Internet connectivity for on the order of $50/month.
From these observations we can infer that the cost of an IP address is at most a few dollars per month. With the cost this low, I can't see any major site going to IPv6-only any time soon. A few dollars per month is a very low price to pay for a great deal of extra accessibility. With the protocols designed as they are right now, the reason to consider IPv6 is that it's a newer, better protocol, not because it has more available addresses.

I wish public communication about IPv6 would make this more clear. The Internet is important, and as such, it is important that the techies get it right. This isn't a minor technical detail.

Thursday, June 9, 2011

Two kinds of type inference

There are two separate lines of work on type inference. While they are superficially similar, they face very different design constraints. Let me explain a few of those differences. To begin with, consider the following program. The two kinds of inference reach different conclusions about it.
  class Foo {
    var x: Object = "a string"
    var y: String = "initial value"
    def copy() {
      y = x    // type error, or no?
    }
  }

Should this example type check? There are two points of view.

One point of view is that the compiler is able to prove that x only ever holds strings. Therefore y only ever holds strings. Thus, there will never be a type error at run time, and so the code can be allowed to run as is. This point of view might be called information gathering. The tool analyzes the code, typically the whole program, and learns type information about that code.

Another point of view is that x holds objects and y holds strings, so the "y = x" line is a problem. Yes, as the code stands, x only ever holds strings. However, the "y = x" line is using x outside of its spec, and you don't want to use a variable outside of its spec even if you can temporarily get away with it. This point of view might be called slots and tabs. The slot y is not specced to hold the tab x. From this point of view, the indicated line is an error even though the program doesn't have any real problems.

Every user-facing type checker I am familiar with is based on the slots and tabs point of view. The idea is that programmers use types to provide extra structure to their programs. They don't want even nominal violations of the structure; they really want their types to line up the way they say they do. As a concrete example, imagine the author of Foo checks in their code, and someone else later adds the statement "foo.x = Integer.valueOf(12)" to the program. Now the nominal type error has become a real one, but the author of Foo has already gone home. It's better if the author of Foo finds out about the problem rather than someone else.
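To make that scenario concrete, here is a self-contained sketch in Scala. The asInstanceOf cast inside copy stands in for the "y = x" assignment that an information gatherer would have accepted; the Later object and its main method are invented for illustration:

  // Sketch of the scenario. The cast stands in for what an information
  // gatherer would have silently allowed while x only ever held strings.
  class Foo {
    var x: Object = "a string"
    var y: String = "initial value"
    def copy() { y = x.asInstanceOf[String] }
  }

  object Later {
    def main(args: Array[String]) {
      val foo = new Foo
      foo.x = Integer.valueOf(12)  // the later edit; legal by x's declared type
      foo.copy()                   // now fails at run time with a ClassCastException
    }
  }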

That's one example of a difference between the two kinds of type inference. A slots-and-tabs checker will flag errors that an information gatherer would see no problem with. Here are three other design constraints that differ between the two.

Declarations are important for a type checker. For the type checker to know what the slots and tabs are specced as, it must have declared types. In the above example, if x and y did not have declared types on them, then the type checker for class Foo could not determine that there is a problem. By contrast, an information gatherer doesn't necessarily pay much attention to declarations. It can usually infer better information by itself, anyway.
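For contrast, here is a hypothetical variant of Foo with the declared types removed. Both fields are then inferred as String from their initializers, so the assignment checks fine and there is nothing for the checker to report:

  // Hypothetical variant with no declarations: both fields are inferred
  // as String, so "y = x" type checks and there is no mismatch to report.
  class FooInferred {
    var x = "a string"        // inferred as String
    var y = "initial value"   // inferred as String
    def copy() { y = x }      // checks fine
  }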

Changing a published type checker breaks builds. For a language under development, once a type checker has been published, it takes great care to change it without breaking any existing builds. Consider the addition of generics in Java 1.5, where it took a great deal of energy and cleverness to make them backwards compatible with all the existing Java code in the world. By contrast, an information-gathering type inferencer can be swapped around at whim. The only impact will be that programs optimize better or worse, or faster or slower, than before.

Type checkers must be simple. The type system of a slots-and-tabs type checker is part of the contract between the compiler and a human developer. Human beings have to understand these things, human beings who for the most part have something better to do with their time than study types. As a result, there is tremendous design pressure on a slots-and-tabs type checker to keep the overall system simple to understand. By contrast, the sky is the limit for an information gatherer. The only people who need to understand it are the handful of people developing and maintaining it.

Overall, I wish there were some term in common use to distinguish between these two kinds of type inferencers. Alas, both kinds of them infer things, and the things they infer are types, so the ambiguous terminology seems inevitable. The best that can be done is to strive to understand which kind of type inferencer one is working with. Developers of one or the other face different design constraints, and they will find different chunks of the published literature to be relevant.

Thursday, June 2, 2011

Martin Odersky on the state of Scala

Martin Odersky talked today at ScalaDays about the state of Scala. Here are the highlights for me.

Tools and Documentation

He doesn't think documentation is a big issue at this point. People used to complain to him about it all the time, and that has largely died out. I know that this demand for documentation was a large motivation for him to develop a book about Scala. Perhaps it helped.

He still thinks IDE support is an important problem. However, he claims the Eclipse support has gotten good enough for him to switch his compiler hacking from Emacs to Eclipse. That's high praise--Emacs users don't switch! I will have to try it again.

He emphasizes binary compatibility. In practice, libraries need to be recompiled when a new version of Scala comes out, because inevitably some trait or another has a new method in it. He has a number of ways to address this. He's long talked about tools to detect and fix problems by analyzing the bytecodes, and that work is going to be emphasized at TypeSafe. Additionally, new today is that he plans to designate stable releases of Scala that stay valid for a number of years and never have binary-incompatible changes.

He also pointed out that style checking tools would be helpful in larger development groups. It's a good point. Such tools don't take a lot of code, but I guess nobody has gotten interested enough in the problem to whip one up.
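As a rough illustration of how little code such a tool needs, here is a minimal, hypothetical checker in Scala that flags overly long lines and trailing whitespace in the files named on the command line; the rules and the 100-character limit are made up, not anyone's actual style guide:

  // A minimal, hypothetical style checker.
  import scala.io.Source

  object StyleCheck {
    def main(args: Array[String]) {
      for (file <- args; (line, i) <- Source.fromFile(file).getLines().zipWithIndex) {
        if (line.length > 100)
          println(file + ":" + (i + 1) + ": line longer than 100 characters")
        if (line != line.replaceAll("\\s+$", ""))
          println(file + ":" + (i + 1) + ": trailing whitespace")
      }
    }
  }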

Functional programming

Martin went through an extended example based on a 2000 study comparing programming languages. In the study, students implemented a programming problem in one of seven different programming languages. The study is interesting on its own if you haven't read it before, and among other things shows that there is much more variability among programmers than among programming languages. However, we can learn something about programming languages by comparing either the best or the median solutions in each language.

Scala shines on the programming problem used in the study, and it's because of Scala's excellent support for running functions across collections. Such facilities don't work well unless the language has a concise notation for functions. Here is the bottom line on several different solutions:

  • Students using compiled languages: 200-300 lines of code
  • Students using scripting languages: ~100 lines of code
  • Martin's solution in Scala: 20-30 lines of code
  • A Java master's solution in Java: over 100 lines of code

I will attribute the "Java master" if I can find a reliable source, but Martin showed the Java example and it looked pretty reasonable at a glance. The reason it is so long compared to the Scala solution is that instead of using collection operations, it defines several recursive methods that record their progress in extra parameters to the methods. I've written a lot of code like that in the last few years. I think about the beautiful functional solution, and then I start over with an imperative solution because inner classes in Java require ever so much boilerplate.

Parallelism

Martin talked a bit about his latest thinking on making use of multiple cores, a problem that has obsessed programming-language research for several years now. One principle he emphasized is that people are much better at finding one solution that works than at finding all the potential problems that can occur due to non-determinism. Thus, he's interested lately in programming-language constructs that are parallel yet still deterministic. That's a tough principle to achieve in an expressive language! It rules out all of actors (!), agents, and software transactional memory, because they all have state, and that state can change differently depending on the non-deterministic choices the implementation makes.

Aside from the general talk on the challenges of parallelism, Martin talked a bit about the parallel collections in Scala. They're better than I realized. Their implementation uses fork-join with work stealing, rather than blindly creating lots of threads. As an additional twist, they adaptively choose a chunk size based on how much work stealing appears to be happening. If there is no work stealing, then every node must be busy, so increase the chunk size.

To demonstrate how the collections can help, he made two tweaks to the phone-number solution to switch from sequential to parallel collections. After doing so, the program ran 2.5 times faster on a 4-core machine. One can imagine doing better than 2.5 times faster, but it's a very good start given how very easy the change was.
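I don't have Martin's code, but the kind of tweak he described presumably looks something like this; the dictionary and encode below are hypothetical stand-ins for the real program's data and work:

  // Hypothetical stand-ins for the real program's data and encoding step.
  val dictionary: Seq[String] = Seq("an", "example", "word", "list")
  def encode(word: String): String = word.reverse  // placeholder for real work

  // Sequential version.
  val resultsSeq = dictionary.map(encode)

  // Parallel version: the only change is going through .par
  // (and .seq to convert back at the end).
  val resultsPar = dictionary.par.map(encode).seq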

Domain-specific languages

Martin emphasized that Scala is excellent for DSLs. However, he points out, and I agree, that embedded DSLs in Scala mesh so well that they are essentially just Scala code. I vastly prefer this style of DSL to the style where you embed the DSL but its constructs don't map well to constructs in the host language. In that case the code is doing something different from what it looks like it does, so all kinds of weird bugs can arise. Whenever working on a DSL that doesn't embed in a straightforward way, I'd prefer to make it an external DSL with a plain old parser and interpreter.
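Here is a tiny made-up example of the kind of embedding I mean: the "DSL" below for building a query is nothing but ordinary Scala method calls written infix, so it behaves exactly the way it reads. The Query type and from helper are invented for illustration:

  // A made-up embedded DSL: plain case classes and infix method calls.
  case class Query(table: String, filters: List[String] = Nil) {
    def where(condition: String): Query = copy(filters = filters :+ condition)
    def render: String =
      "SELECT * FROM " + table +
        (if (filters.isEmpty) "" else " WHERE " + filters.mkString(" AND "))
  }

  def from(table: String) = Query(table)

  // Reads like a query language, but it is just method chaining.
  val q = from("users") where "age > 18" where "active = true"
  println(q.render)  // SELECT * FROM users WHERE age > 18 AND active = true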

Wednesday, June 1, 2011

Secure DNS supports PROTECT IP

There is some commentary lately about a paper arguing that PROTECT IP is fundamentally incompatible with secure DNS. This argument is misleading in the extreme. The strategy with DNSSEC is to have root authorities digitally sign DNS records, just like with TLS. As such, it is vulnerable in the same place as TLS. Whoever controls the root servers has ultimate control over what Internet-connected computers will consider to be the truth.

Far from making PROTECT IP more difficult, a hypothetical success of DNSSEC would make it easier. With DNS as it currently works, governments must contend with what, from their perspective, are rogue DNS servers that continue to post "false" (meaning correct) addresses. Under DNSSEC, the rogue server's certificate chains will not check out. Whenever a government orders a domain name to be changed, the root servers will not just issue the new address, but presumably also cryptographically revoke the old one. It would all work as if it were the legitimate domain owner making the request instead of a government.

I don't think the technical arguments about PROTECT IP are convincing. DNS is by its nature a sitting duck. The technical argument I would make is that a global Hierarchy of Truth is not a good approach to security on the Internet. If you don't like PROTECT IP, then you shouldn't like DNSSEC or DNS as we currently know it.

Given how things technically work right now, however, the best argument against PROTECT IP is simply that we don't want to live that way. Do we really want to live in a world where Sony or Blizzard or MGM can turn off a web site without the site owner getting to defend themselves in court? Is 20th century copyright really worth such heavy-handed measures?