Saturday, April 30, 2011

Maintaining an intellectual nexus

It saddens me to read the comments about Scala Days on scala-lang.org. The only two comments are from people who won't be coming because of U.S. border inspections. It's not possible to tell how many people really avoid the conference because of this problem. What we know, however, is that potential attendees are weighing it in their decision.

Paul Graham has written a great essay describing the conventional wisdom about the intellectual nexus that is Silicon Valley. One aspect he emphasizes is immigration:
I doubt it would be possible to reproduce Silicon Valley in Japan, because one of Silicon Valley's most distinctive features is immigration. Half the people there speak with accents. And the Japanese don't like immigration. When they think about how to make a Japanese silicon valley, I suspect they unconsciously frame it as how to make one consisting only of Japanese people. This way of framing the question probably guarantees failure.

A silicon valley has to be a mecca for the smart and the ambitious, and you can't have a mecca if you don't let people into it.

I've written before that the effective workspaces I've seen all involve a mix of workers from all over the world. Making people work only with others who look like them and have the same accents is much like allowing marriages only within one's own clan. You get better matchups if you let people date more widely.

The same is true for conferences. Ideas feed back on each other, with multiple sparks contributing to starting the fire. It often takes more than one good idea to make progress on something. Any one of those ideas alone doesn't get you part of the progress; it gets you none. Concentrating people together is an essential ingredient of a productive conference.

Unfortunately, border control, xenophobia, and general empowerment of the police have increased with every U.S. president since the fall of the Berlin Wall. You wouldn't know it from the news or from the talking heads, but if you do some brief web research you can verify it is true. The current president brought us airport gropes and is also responsible for the line, "Because of recent circumstances, the underwear was taken away from him as a precaution to ensure that he did not injure himself". I don't detect any trend in a positive direction.

I'm spitting into the wind here, because any change has to start somewhere. Xenophobia is not just wrong, and it's not just unpleasant. It's making us dumber and poorer.

Monday, April 25, 2011

Types are fundamentally good?

Once in a while, I encounter a broad-based claim that it's fundamentally unsound to doubt the superiority of statically typed programming languages. Bob Harper has recently posted just such a claim:
While reviewing some of the comments on my post about parallelism and concurrency, I noticed that the great fallacy about dynamic and static languages continues to hold people in its thrall. So, in the same “everything you know is wrong” spirit, let me try to set this straight: a dynamic language is a straightjacketed static language that affords less rather than more expressiveness.

Much of the rest of the post then tries to establish that there is something fundamentally better about statically typed languages, so much so that it's not even important to look at empirical evidence.

Such a broad-based claim would be easier to swallow if it weren't for a couple of observations standing so strongly against it. First, many large projects have successfully been written without a static checker. An example would be the Emacs text editor. Any fundamental attack on typeless heathens must account for the fact that many of them are not just enjoying themselves, but successfully delivering on the software they are writing. The effect of static types, whether it be positive or negative, is clearly not overwhelming. It won't tank a project if you choose poorly.

A second observation is that in some programming domains it looks like it would be miserable to have to deal with a static type checker. Examples would be spreadsheet formulas and the boot scripts for a Unix system. For such domains, programmers want to write something quickly and then have the program do its job. A commonality among these languages is that live data is often immediately available, so programmers can just as well run the code on live data as fuss around with a static checker.

Armed with these broad observations from the practice of programming and the design of practical programming languages, it's easy to find problems with the fundamental arguments Harper makes:
  • "Types" are static, and "classes" are dynamic. As I've written before, run-time types are not just sensible, but widely used in the discussion of languages. C++ literally has "run-time types". JavaScript, a language with no static types, has a "typeof" operator whose return value is the run-time type. And so on. There are differences between run-time types and static types, but they're all still "types".
  • A dynamic language can be viewed as having a single static type for all values. While I agree with this, it's not a very useful point of view. In particular, I don't see what bearing this single "unitype" has on the regular old run-time types that dynamic languages support.
  • Static checkers have no real restrictions on expressiveness. This is far from the truth. There has been a steady stream of papers describing how to extend type checkers to handle fairly straightforward code examples. In functional languages, one example is the GADTs needed to type check a simple AST interpreter. In object-oriented languages, the lowly flatten method on class List has posed difficulties, because it's an instance method on class List but it only applies if the list's elements are themselves lists (see the sketch after this list). More broadly, well-typed collection libraries have proven maddeningly complex, with each new solution bringing a new host of problems. All of these problems are laughably easy if you don't have to appease a static type checker.
  • Dynamic languages are slow. For some reason, performance always pops up when a computer person argues a position that they think is obvious. In this case, most practitioners would agree that dynamic languages are slower, but there are many cases where the performance is perfectly fine. For example, so long as Emacs can respond to a key press before I press the next key, who cares if it took 10 microseconds or 100 microseconds? For most practical software, there's a threshold beyond which further performance is completely pointless.
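
To make the flatten point concrete, here is a minimal sketch in Scala. The MyList class is hypothetical, not the standard library's List, but it shows the shape of the problem: the method lives on every list, yet it is only meaningful when the elements are themselves lists, so the statically typed version has to thread an evidence parameter through.

    // Hypothetical list class, used only to illustrate the typing problem.
    class MyList[A](val elems: List[A]) {
      // flatten only makes sense when the elements are themselves lists, so the
      // static version demands evidence of that as an implicit A <:< MyList[B]
      // witness. A dynamically typed flatten would simply assume the elements
      // are lists and fail at run time otherwise.
      def flatten[B](implicit ev: A <:< MyList[B]): MyList[B] =
        new MyList(elems.flatMap(a => ev(a).elems))
    }

    // Compiles: the elements are lists, so the evidence is available.
    val nested = new MyList(List(new MyList(List(1, 2)), new MyList(List(3))))
    val flat   = nested.flatten          // MyList[Int] containing 1, 2, 3

    // Rejected at compile time: no evidence that Int is a MyList.
    // new MyList(List(1, 2, 3)).flatten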

Overall, type checkers are wonderful tools. However, any meaningful discussion about just how they are wonderful, and to what extent, needs to do a couple of things. First, it needs some qualifiers about the nature of the programming task. Second, it needs to rest on at least a little bit of empirical data.

Thursday, April 21, 2011

Practical challenges for attribute grammars

I've had occasion recently to work on a compiler implemented with an attribute grammar framework. Such frameworks let you declare attributes on AST nodes that are computed in terms of other attributes. In general, it's a powerful and convenient way to compute information about an AST, and I expect they make a lot of sense in tools that only analyze ASTs and don't generate or rewrite them. They save you from working out a specific computation order, because the tool can look at which attributes refer to which other ones and compute them all on demand.
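
As a rough illustration (the node classes and attribute names here are invented, not taken from any particular framework), a synthesized attribute is computed from a node's children, while an inherited attribute is computed by looking up toward the root; the framework evaluates both on demand rather than in a fixed order.

    // Hypothetical AST with a parent link, so inherited attributes can look upward.
    sealed trait Ast { var parent: Option[Ast] = None }
    final case class Num(value: Int) extends Ast
    final case class Add(left: Ast, right: Ast) extends Ast {
      left.parent = Some(this)
      right.parent = Some(this)
    }

    // Synthesized attribute: defined in terms of the children's attributes.
    def depth(node: Ast): Int = node match {
      case Num(_)    => 1
      case Add(l, r) => 1 + math.max(depth(l), depth(r))
    }

    // Inherited attribute: defined in terms of the parent's attribute.
    def nestingLevel(node: Ast): Int =
      node.parent.map(nestingLevel).getOrElse(0) + 1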

For compilers, however, I no longer think attribute grammars work well. Let me describe three fundamental problems with attribute grammars for use in compilers.

  1. Attributes can be inherited, meaning they look up the AST for information. Examples are the expected type of an expression, the enclosing module of any node, and the set of free variables available in some context. These attributes cannot be computed unless the node in question really is attached to an AST. However, compilers create new AST nodes all the time, and while those new nodes are being set up, they can't possibly be attached yet. Thus, unattached nodes are inconsistent with inherited attributes.

  2. Compilers change the AST. They desugar syntax into lower-level syntax, they optimize, they reorder parts of the program, and so on. Every time the AST changes, some cached attribute values become invalid. At that point, how do you flush the caches that are now invalid? To some extent you can get by with flushing all caches after every compiler phase. However, this means that later parts of a phase will see stale data compared to earlier parts (see the sketch after this list). This situation leads to bugs where, to fix them, you have to reason about the exact order of the traversal and the relationship of the stale data to the current version of the AST.

  3. Attributes have unpredictable performance characteristics. When you see a reference to an attribute, it might be cached and take constant time. On the other hand, it might trigger a cascade of attribute computations that spans the whole program. It's much harder to convince yourself that a particular AST traversal will run in linear time, because you don't know how expensive all the attribute computations are.
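
Here is the sketch promised above for problem 2; the Node class is a stand-in, not code from any real framework. A cache-on-first-use attribute keeps returning its old answer after the tree underneath it has been rewritten in place:

    // Hypothetical node type with one cached attribute.
    final class Node(var children: List[Node]) {
      private var sizeCache: Option[Int] = None
      // Cache on first use, as attribute grammar frameworks typically do.
      def size: Int = sizeCache.getOrElse {
        val s = 1 + children.map(_.size).sum
        sizeCache = Some(s)
        s
      }
    }

    val leaf = new Node(Nil)
    val root = new Node(List(leaf))
    root.size             // 2, and now cached
    root.children = Nil   // a desugaring or optimization rewrites the tree
    root.size             // still 2: the cached attribute is stale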

These problems are all potentially addressable, but they are serious challenges that any practical tool will need to face. As things stand, I haven't seen an attribute grammar framework that is there yet. They don't really deliver their main benefit, which is to save you from reasoning about the evaluation order of attributes and the dependencies between them. You end up reasoning about them anyway to get the bugs out and the performance up.

Instead of storing attributes on the AST nodes and computing them with an attribute grammar framework, a more traditional approach is as follows. Store a few, carefully selected bits of information on the nodes, such as the types of expressions. These must be carefully maintained across all AST rewrites, and the compiler will be buggy if you get that wrong, so don't store too much this way. For most information, have each compiler phase compute and maintain whatever it needs. This results in some redundant computation, but you can usually find a way to calculate whatever you need in linear time. Since any one phase usually requires at least linear time anyway, this is sufficient to keep the compiler reasonably performant before you do your first profiling run.
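
As a sketch of this per-phase style (the phase and all the names are invented for illustration), the phase walks the tree once, builds whatever side table it needs, and throws the table away when it's done:

    import scala.collection.mutable

    sealed trait Expr
    final case class Lit(n: Int)            extends Expr
    final case class Plus(a: Expr, b: Expr) extends Expr

    object LiteralCountPhase {
      // One linear pass; the side table belongs to this phase alone.
      def run(root: Expr): Map[Expr, Int] = {
        val counts = mutable.Map.empty[Expr, Int]
        def visit(e: Expr): Int = {
          val c = e match {
            case Lit(_)     => 1
            case Plus(a, b) => visit(a) + visit(b)
          }
          counts(e) = c
          c
        }
        visit(root)
        counts.toMap
      }
    }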

Wednesday, April 13, 2011

Programming in Scala available online

Artima has graciously agreed to post the first edition of Programming in Scala online for free. It's now easier than ever to learn about Scala.

Monday, April 4, 2011

Appel on seal protocols for electronic voting

Andrew Appel reminds the public that good election protocols were historically rather difficult to develop:
Democratic elections present a uniquely difficult set of problems to be solved by a security protocol. In particular, the ballot box or voting machine contains votes that may throw the government out of office. Therefore, it's not just the government--that is, election officials--that need evidence that no tampering has occurred, it's the public and the candidates.[...] In the late 19th century, after widespread, pervasive, and long-lasting fraud by election officials, democracies such as Australia and the United States implemented election protocols in an attempt to solve this problem. The struggle to achieve fair elections lasted for decades and was hard-fought.

It's a good point. I believe that, with modern mathematical tools, we should be able to develop good protocols much faster than over a period of decades. However, before such problems can be solved, there must be awareness that there is a problem at all. For whatever reason, elections all over the U.S. are using terrible voting machines and protocols that are worse than the old punch-card machines. It's as if all these locales simply bought the voting machines with the lowest bid.

In general, government procurement needs to come with a long list of requirements, because officials largely have to choose the cheapest product that meets them. Anything not in the requirements is likely to be cut in the interest of cost savings. In the case of voting software, a key part of those requirements is sensible voting protocols.

Friday, April 1, 2011

Dan Wallach on fixing the certificate authorities

I like the latest ideas from Dan Wallach about building better infrastructure for browsers to detect impostor web sites.

First, there's this:
A straightforward idea is to track the certs you see over time and generate a prominent warning if you see something anomalous. This is available as a fully-functioning Firefox extension, Certificate Patrol. This should be built into every browser.
This is similar to pet names, though it's closer to the way SSH works. Like pet names, this approach will tell you when you visit a site and its certificate has changed. Unlike pet names, it won't say anything when you visit a new site. There's a trade-off there. Either is a big improvement on the current state, though I suspect pet names could lead to a better overall user interface. The reason is that pet names can be integrated with the browser's bookmarks.
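
Sketching the idea (hypothetical code, not Certificate Patrol's): remember a fingerprint per host, stay quiet on the first visit, and complain loudly only when a host you've seen before presents a different certificate.

    import scala.collection.mutable

    object CertMemory {
      private val seen = mutable.Map.empty[String, String]  // host -> cert fingerprint

      sealed trait Verdict
      case object FirstVisit extends Verdict  // SSH-style: nothing to compare against yet
      case object Unchanged  extends Verdict
      case object Changed    extends Verdict  // show a prominent warning

      def check(host: String, fingerprint: String): Verdict =
        seen.get(host) match {
          case None                        => seen(host) = fingerprint; FirstVisit
          case Some(f) if f == fingerprint => Unchanged
          case Some(_)                     => Changed  // a real tool would let the user accept the new cert
        }
    }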

Second, there's this more speculative request:
In addition to your first-hand personal observations, why not leverage other resources on the network to make their own observations? For example, while Google is crawling the web, it can easily save SSL/TLS certificates when it sees them, and browsers could use a real-time API much like Google SafeBrowsing.

The Y property would give us this effect. What if, when you got a Google search result, it told you not only the URLs for the hits but also the certificates for those pages? You could then only be attacked if the attacker fools both you and every Google web crawler.
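
A sketch of that cross-check (the names here are invented): the search result carries the fingerprint the crawler saw, and the browser refuses to proceed unless the site presents the same one.

    // Hypothetical shape of a search hit that carries the crawler-observed fingerprint.
    final case class SearchHit(url: String, crawlerFingerprint: String)

    // The attacker must now fool both your connection and the crawler.
    def certificateMatches(hit: SearchHit, presentedFingerprint: String): Boolean =
      hit.crawlerFingerprint == presentedFingerprint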

Let me add one thing. If web protocols used these two tricks, how important would certificate authorities be? These two decentralized techniques strike me as so much more effective that certificate authorities look like a waste of time. If you already know that a site is the same one you've visited a dozen times, and you already know it's the same site that Google thinks it is, why would you care what some Iranian agency thinks of the site?