Wednesday, December 28, 2011

All software has bugs

John Carmack has a great article up on his experience with bug-finder software such as Coverity and PC-Lint. One of his observations is this:
The first step is fully admitting that the code you write is riddled with errors. That is a bitter pill to swallow for a lot of people, but without it, most suggestions for change will be viewed with irritation or outright hostility. You have to want criticism of your code.
He feels that the party line for bug finders is true, that you may as well catch the easy bugs:
The value in catching even the small subset of errors that are tractable to static analysis every single time is huge.
I agree. One of the things that makes it easier to talk to more experienced software developers is that they take this view for granted. When I talk to newer developers, or to non-engineers, they seem to think that if we spend enough time on something we can remove all the bugs. That's not possible for any body of code longer than a few thousand lines. Removing bugs is more like purifying water: you can only manage the contaminants, not eliminate them entirely. Thus, software quality should be thought of from an effort/reward point of view.

I also have found the following to be true:

This seems to imply that if you have a large enough codebase, any class of error that is syntactically legal probably exists there.
An example I always come back to is Olin Shivers's double-word finder. The double-word finder scans a text file and detects occurrences of the same word repeated twice in a row, which is usually a grammatical mistake in English (a minimal sketch of such a detector appears below). I have started running it on any multi-page paper I write, and it almost always finds at least one such instance that is legitimately an error. If an error can be made, it will be, so almost any automatic detector is going to find real errors. Another one that jibes with me is:
NULL pointers are the biggest problem in C/C++, at least in our code.
I did a survey once of the forty most recently fixed bugs on the Squeak bug tracker, and I found that the largest single category was null dereferences. They significantly outnumbered type errors, bugs where a value of one type (e.g., a string) was used where another was intended (e.g., an open file).
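
As an aside, the double-word finder mentioned above is the kind of checker that fits on a single page. Here is a minimal sketch in TypeScript; it is my own toy version, not Shivers's program, and it ignores niceties like hyphenation.

```typescript
import { readFileSync } from "fs";

// Report any word that appears twice in a row, ignoring case. Splitting on
// all whitespace means duplicates that straddle a line break ("the\nthe")
// are caught as well.
function findDoubledWords(text: string): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const doubled: string[] = [];
  for (let i = 1; i < words.length; i++) {
    const prev = words[i - 1].toLowerCase().replace(/[^a-z']/g, "");
    const curr = words[i].toLowerCase().replace(/[^a-z']/g, "");
    if (prev.length > 0 && prev === curr) {
      doubled.push(words[i]);
    }
  }
  return doubled;
}

for (const word of findDoubledWords(readFileSync(process.argv[2], "utf8"))) {
  console.log(`doubled word: ${word}`);
}
```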

I do part ways with Carmack on the relative value of bug finders:

Exhortations to write better code, plans for more code reviews, pair programming, and so on just don’t cut it, especially in an environment with dozens of programmers under a lot of time pressure.
If we were to candidly rank methodologies for improving quality, I'd put "write better code" above "use bug finders". In fact, I'd put it second, right after regression testing. I could be wrong, but my intuition is that there are a number of low-effort ways to improve software before it is submitted, and the benefits are often substantial. Things like "use a simpler algorithm" and "read your diff before committing" add just minutes to each patch but often save over an hour of post-commit debugging from a repro case.

All in all it's a great read on the value of bug-finding tools. Highly recommended if you care about high-quality software. Hat tip to John Regehr.

Saturday, December 17, 2011

Blizzard embraces pseudonyms

Blizzard's new BattleTag system lets you use the same name across multiple games and across multiple servers within the same game. Historically, they required you to use a "real name" (in their case, a name on a credit card). This week they announced that they are deploying a new system without that requirement:
A BattleTag is a unified, player-chosen nickname that will identify you across all of Battle.net – in Blizzard Entertainment games, on our websites, and in our community forums. Similar to Real ID, BattleTags will give players on Battle.net a new way to find and chat with friends they've met in-game, form friendships, form groups, and stay connected across multiple Blizzard Entertainment games. BattleTags will also provide a new option for displaying public profiles.[...] You can use any name you wish, as long as it adheres to the BattleTag Naming Policy.
I am glad they have seen the light. There are all sorts of problems with giving away a real [sic] name within a game.

From a technical perspective, the tradeoffs they choose for the BattleTag names are interesting and strike me as solid:

If my BattleTag isn't unique, what makes me uniquely identifiable? How will I know I'm adding the right friend to my friends list? Each BattleTag is automatically assigned a 4-digit BattleTag code, which combines with your chosen name to create a unique identifier (e.g. AwesomeGnome#3592).
I'll go out on a limb and assume that the user interfaces that use this facility will indicate when you are talking to someone on your friends list. In that case, the system will be much like a pet-names system, just with every name including a reasonable default nickname. When working within such UIs, users get all three properties of Zooko's Triangle. When working outside them, the security aspect is weaker, because attackers can make phony accounts with a victim's nickname but a different numeric code. That's probably not important in practice, so long as all major activities happen within a good UI such as the one inside one of Blizzard's video games.
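
To make the pet-names point concrete, here is a minimal sketch of a friends list that keys everything on the unique BattleTag but displays a locally chosen nickname. The class and method names are my own invention, not Blizzard's API.

```typescript
// A friends list in the pet-names style: globally unique BattleTags
// (e.g. "AwesomeGnome#3592") under the hood, locally chosen names in the UI.
class FriendsList {
  private petnames = new Map<string, string>();

  addFriend(battleTag: string, petname?: string): void {
    // Reasonable default nickname: the memorable half of the BattleTag.
    this.petnames.set(battleTag, petname ?? battleTag.split("#")[0]);
  }

  // What the UI should show for an incoming message.
  displayName(battleTag: string): string {
    const petname = this.petnames.get(battleTag);
    if (petname !== undefined) {
      return `${petname} (friend)`;
    }
    // A stranger who copies a friend's nickname still has a different
    // numeric code, so they land here instead of impersonating the friend.
    return `${battleTag} (not on your friends list)`;
  }
}

const friends = new FriendsList();
friends.addFriend("AwesomeGnome#3592", "Alice");
console.log(friends.displayName("AwesomeGnome#3592")); // Alice (friend)
console.log(friends.displayName("AwesomeGnome#1111")); // flagged as a stranger
```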

Regarding pseudonymity, I have to agree with the commenters on the above post. Why not do it this way to begin with and not bother with Real ID? They can still support real [sic] names for people who want them, simply by putting a star next to the names of people whose online handle matches their credit card. Going forward, now that they've done this right, why not simply scrap Real ID? It looks like face-saving political cover; you have to read the announcement closely even to realize that Real ID is what they are talking about.

Thursday, December 1, 2011

Joshua Gans on ebook lending

Our approach to copyright is outdated now that we have a widespread Internet. What should we do? Joshua Gans proposes an approach based on lending and on tracking usage:
If lending is the appropriate mode for books, then how would the business of publishing look if it is built around lending rather than ownership? So here is my conjecture. All books are read on devices. Imagine that each device has built in a means of tracking what people read and how much. Imagine that it can also do this in a manner that respects privacy. Then the model I have in mind would allow publishers to receive money based on how much of a book people read and to price that at will.

I like the idea. One point of comparison is the way radio works. In radio, the content is not DRMed, and you don't pay for each song you listen to. Instead, you subscribe in bulk to content and then flip around to whatever you feel like listening to. There are a variety of specific payment schemes on both sides of the arrangement. For the customer, I've encountered payment via public taxes (Switzerland), via subscription (Sirius Radio), and via advertising (broadcast radio in the U.S.).

For the content producers, I am less clear about what contracts are out there. At least indirectly, however, they are paid more when there are more users listening to them. I imagine that radio has the same sort of marketing research that television does, and that radio stations know how many people are listening to their station and at what times. They then, through mechanisms that are probably kludgy, buy more of the popular music and less of the unpopular music.

It's a good idea, and I would be happy for it to catch on. Copies are trivial to make, nowadays, so the only ways to control copies are rather draconian. Far better to put a good society first and then find business models that work with it.

Monday, November 28, 2011

danah boyd on Pseudonyms

I'm late to notice, but danah boyd has a good article up on the case for pseudonymity. She emphasizes the safety issues, which I certainly agree with.

Something I hadn't fully processed is that many people are using Facebook as an example that real names work. Perhaps this argument is so popular because the Zuckerbergs have publicly emphasized it. At any rate, it's a weak argument. For one thing, quite a number of Facebook profiles are using pseudonyms. See Lady Gaga, Ghostcrawler, and Anne Rice. If the Zuckerbergs really are trying to shut down pseudonyms, they're doing a terrible job of it. Another reason is that, as Boyd points out, Facebook is unusual for starting as a close-knit group of college grads. The membership it grew from is a group of people relatively uninterested in pseudonyms.

Reading the comments on Boyd's post, it appears that the main reason people oppose pseudonyms is the hope that banning them will improve the level of conversation in a forum. I continue to be mystified by this perspective, but it does appear to be what drives most opponents of pseudonyms. I just don't get it. Partially I'm just used to an Internet full of pseudonyms. Partially it's that it is all too easy to think of perfectly legitimate activities that you wouldn't want popping up when someone does a web search on "Lex Spoon". People interested in that stuff should instead search for Saliacious Leximus. They'll avoid all the nerdy computer talk and get straight to the goods they are looking for.

Overall, pseudonyms appear to be one of those divides where people on each side have a hard time talking across the gulf. Apparently it is perfectly obvious to many people that if Google Plus and Facebook embraced pseudonyms, then their members would get overwhelmed by harassment and spam. Personally, I don't even understand the supposed threat. Why would I circle or friend a complete stranger? And if I had, why wouldn't I simply unfriend them?

Saturday, November 19, 2011

The Web version of interface evolution

Nick Bradbury points out an important issue with web APIs:
I created FeedDemon 1.0 in 2003, and it was the first app I wrote that relied on web APIs. Now those APIs no longer exist, and almost every version of FeedDemon since then has required massive changes due to the shifting sands of the web APIs I've relied on.

There is a tangle of issues involved here.

One is that, for sure, web APIs are not a suitable way to build a self-contained system that will at least remain internally consistent indefinitely. You couldn't build a space probe using any web APIs that you don't control. By the time it got to Jupiter the protocols might well have changed. Even barring an intentional protocol change, the service might upgrade its software and accidentally break you.

Not all software is of this character, however. Most user-facing software is expected to stay compatible with the other software a particular user relies on, and most of it has a mechanism for being updated after it's deployed.

The thought experiment I use here is to consider why the Lisp Machine is of no use nowadays. These systems were very impressive; think Emacs, but with a better programming language, and with multiple for-profit companies writing professional software to run on them. Nonetheless, they are useless nowadays (and Emacs is as awesome as ever). That this is so is obvious, but what is the precise reason for it?

My best answer is that the world changed around them. Even without any explicit API break, Lisp Machines just don't have integration with the other kinds of software that their users would find important. That whole Internet thing is a simple example.

In general, most software is only useful if it is under active maintenance. The difference between zero maintenance and a little bit of maintenance is huge. If you are considering using software that isn't under maintenance, run away! If you continue to use it, you will eventually find that you have become its maintainer yourself.

Which brings me back to APIs. APIs are never perfect, and so they evolve just like any other interface. The only way this is different for web APIs is that the clients do not get to choose when to upgrade. One day, the provider updates their software and it's simply on the new API. Gilad Bracha describes this as "versionless software".

Ideally, this evolution should involve discussion between the service provider and all of the clients. Exactly how those discussions work is a rich question that is similar to any other decision process by a number of stakeholders. For an API like one offered by Google, individual clients have very little influence on the API, so you have to decide whether to take it or leave it; part of that decision involves your expectation that the service provider will treat you well. In other cases, there might be a contractual agreement between a user of the service and its provider; in that case, any API changes could be worked out as part of negotiating the contract. In still other cases, a group of software users might meet in standards committees, as happens to some degree with the HTTP protocol.

In short, I really don't think that constant interface change is a fundamental reason to avoid web APIs. It's a fundamental part of most software development that you have to keep maintaining, whether or not the APIs you program against are accessed over the web. You should, however, be choosy about which specific APIs you depend on. Just as you wouldn't want to depend on an ancient unmaintained hunk of software, you also wouldn't want to use a hyperactively maintained web service whose API is completely different from one week to the next. Think explicitly about the API evolution story for any service you depend on, and use your judgement.

Tuesday, November 8, 2011

Cloud9 is hitting limits with JavaScript

Cloud9 has a well-written call for adding classes to JavaScript:
Adding classes to JavaScript is technically not hard -- yet, its impact can be profound. It lowers the barrier to entry for new JavaScript developers and reduces the incompatibility between libraries. Classes in JavaScript do not betray JavaScript’s roots, but are a pragmatic solution for the developer to more clearly express his or her intent. And in the end, that’s what programming languages are all about.
Their argument about library compatibility seems strong to me. It is reasonable to write a Python or Java library that stands alone and has minimal external dependencies. With JavaScript, however, the temptation is strong to work within a framework like Dojo or jQuery just so that you get basic facilities like, well, classes. If I were working on a large JavaScript code base, though, I'd be strongly tempted to switch over to Dart. It already has the basic language facilities they yearn for, and it's going to move forward much more quickly.
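
To see what "left to convention" looks like in practice, compare the prototype idiom that plain JavaScript forced on library authors with class syntax of the sort Cloud9 is asking for. The example below is my own, written in TypeScript for concreteness, not code from their post:

```typescript
// The old convention: each library rolls its own "class" out of functions and
// prototypes (or its own helper, like dojo.declare), and no two libraries do
// it quite the same way:
//
//   function Point(x, y) { this.x = x; this.y = y; }
//   Point.prototype.distanceFromOrigin = function () {
//     return Math.sqrt(this.x * this.x + this.y * this.y);
//   };
//
// With class syntax, the same intent is stated directly, and two libraries
// that both say "class" are automatically speaking the same dialect.
class Point {
  constructor(public x: number, public y: number) {}

  distanceFromOrigin(): number {
    return Math.sqrt(this.x * this.x + this.y * this.y);
  }
}

const p = new Point(3, 4);
console.log(p.distanceFromOrigin()); // 5
```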

Friday, October 14, 2011

Dennis Ritchie passes

Dennis Ritchie, co-inventor of C and Unix, has passed away.

I first learned C at a summer camp for math enthusiasts. We wrote programs to generate fractals on Sun workstations (or was it SGIs?), and the language we wrote them in was C. I asked around about this interesting language, borrowed a book from someone, and soaked up the language like a sponge. I wish I could remember whether the book was K&R, but I do remember reading through one of the first examples in it. The program read its input character by character and performed some transformation on them, perhaps converting them to uppercase, and printed out the result. I remember thinking it was a hard problem, and I remember a growing delight as the chapter broke the problem down into smaller pieces that mapped perfectly to C. I read that ten-line program over and over, savoring the while loop and the stdio calls and the looping variable named "c" (or was it "ch"?).

Several years later, I gained access to the Unix machines at Clemson, and it was computing nirvana. The machines had a ubiquitous C compiler, a suite of Internet programs, preemptive multitasking, a good scripting language, a good shell, and a windowing system to run it all in. Compared to Windows 3.1, DOS (because Windows was a resource hog), batch files, debug.exe, and Pascal, these machines just felt right. It felt like I'd been walking with a fifty-pound backpack for miles, and now I could set the backpack aside and walk more lightly.

Unix and C are just that good. At the time, those systems were more than a generation ahead of the IBM, Apple, and Commodore computers everyone was using at home. Nowadays, the number of Unix machines has only grown. Android, Apple, and ChromeOS machines all run Unix. If you can't beat 'em, join 'em.

Ritchie was instrumental in this revolution. Though few seem to have actually talked to the guy, the web is filled with testimonials. Here are a few of them that I have run across:

Thursday, October 13, 2011

Schmidt on federal policy

Eric Schmidt testified before Congress on technology policy, and he tells the Washington Post that he is not happy.

Much of the issue he blames on age. For example:

And inevitably what happens is everyone says ‘yes,’ yet inevitably on the Hill you have an older gentleman or lady. The staffers—and the staffers are young—the staffers get it. They’re 25, 30 years old and they all get it. So that’s what we depend on. And of course we’ve hired ex-staffers as well. They all know each other. So that’s how it really works. And I believe what we’re doing is extremely defensible if it’s around ideas. I would have a lot of trouble if we in our industry started following the other kind of lobbying.
I'm not sure I agree. I know a lot of younger people who haven't thought through the horror that search neutrality would be if it goes through. I think there's a more fundamental problem: computers are changing quickly. Search engines didn't exist twenty years ago, and twenty years from now, they will be completely different. How is it realistic for Washington to regulate something that completely changes every couple of decades?

I admit I appreciate his suggestion about how to improve the state of IT in the U.S. even further:

A classic example is H-1B visas. Now, the following arguments are so obvious, it’s hard for me to believe that anyone would believe that they’re false. These industries are full of very smart people. There are very smart people who don’t live in America. They come to America, we educate them at the best universities, they are smarter than I am, and then we kick them out. If they stayed in the country, let’s just review: They would create jobs, pay taxes, have high incomes, pay more taxes than the average American, and generally increase the GDP of the country. I hope my argument is clear, and if it isn’t I’ll start screaming about it. It’s the stupidest policy the government has with respect to high tech. So you have this conversation and people say “yes,” and you say, ”This is the single thing that you can do that will lead to innovation occurring in our country, and the future economic wealth of our country.” And then they don’t act.... It’s stupid. So my point is that if you want to get a sense of how to screw this up, to put it negatively, then make it harder for us to bring in the world’s smartest people.
Tell 'em, Eric!

Monday, October 10, 2011

Dart spec is online

Google has posted a technical overview and a language specification for Dart, their new programming language for web browsers.

In short, the language is a skin around JavaScript. It provides syntax for parts of JavaScript that are left to convention, and it is designed to be easily compilable to JavaScript. It has optional types, class definitions, and a module syntax.

The type system has some controversial aspects, in particular an explicit choice not to bother with soundness. If I understand correctly, assigning an apple to a variable holding bananas would cause an error, but assigning an unknown fruit to a variable holding bananas would not. The idea is to catch the egregious errors and otherwise leave the programmers alone.
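
For a rough sense of the behavior, here is an analogous situation sketched in TypeScript, whose `any` type plays the role of an unknown static type. This is my own analogy, not Dart code, and Dart's actual rules differ in detail:

```typescript
class Apple  { appleOnly(): void {} }
class Banana { bananaOnly(): void {} }

let banana: Banana;

// banana = new Apple();  // rejected: an Apple is known not to be a Banana

const unknownFruit: any = new Apple();
banana = unknownFruit;    // accepted: with no type information, the checker
                          // stays out of the way rather than complain

banana.bananaOnly();      // compiles fine, blows up at runtime -- the
                          // deliberate hole in soundness
```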

Hat tip to Lambda the Ultimate, which has several interesting discussions in the comments.

The jlouis blog has a detailed breakdown of what's in the language. Delightfully, he includes the following homage to Blade Runner:

If all you know is Javascript, the language is probably pretty neat. But for a guy who has seen things you wouldn't believe. Haskell ships off the shoulder of MultiCore CPUs. Watched OCaml glitter in the dark near TAPL. All those moments will be lost in time, like tears in rain. Time to die.

Friday, October 7, 2011

What every guide says about child safety on the Internet

At the same time that Blizzard and Google are fighting for real names only on the Internet, children's advocacy groups are fighting for exactly the opposite. Take a look at the top hits that come up if you do a web search on "advice to children online".

First there is ChildLine, a site targeted directly at children. Here is the entirety of their guide on how to stay safe:

How do I stay safe when playing games online?
  • Don’t use any personal information that might identify you. This could be your full name, home address, phone number or the name of your school.
  • Use a nickname instead of your real name and chose one that will not attract the wrong type of attention.
  • Look out for your mates. Treat your friend’s personal details like you would your own and do not share their details with others.
Not only do they suggest not using real names, it is pretty much the only advice they give.

Next is Safe Kids, a site targeted at parents. This site has a more detailed guide on things you can do to help a child stay safe. Here is their number one suggestion under "Guidelines for parents":

Never give out identifying information—home address, school name, or telephone number—in a public message such as chat or newsgroups, and be sure you’re dealing with someone both you and your children know and trust before giving out this information via E-mail. Think carefully before revealing any personal information such as age, financial information, or marital status. Do not post photographs of your children in newsgroups or on web sites that are available to the public. Consider using a pseudonym, avoid listing your child’s name and E-mail address in any public directories and profiles, and find out about your ISP’s privacy policies and exercise your options for how your personal information may be used.

Third up is BullyingUK, a site dedicated specifically to bullying rather than to child abuse in general. Here are their first two pieces of advice for Internet safety:

  • Never give out your real name
  • Never tell anyone where you go to school

The real names movement is not just out of touch with BBS culture and with norms of publication. It's also out of touch with child safety advocates.

Real names proponents talk about making Internet users accountable. Child advocates, meanwhile, strive for safety. Safety and accountability are in considerable tension. To be safe on a forum, one thing you really want is the ability to exit. You want children to be able to leave a forum that has turned sour and not have ongoing consequences from it. By contrast, real name proponents hope that if someone misbehaves and leaves a forum, there is some outside mechanism to track the person down and retaliate. That might sound good if the person tracked down is really a troll, but it's a chilling prospect if the person being hunted is a child.

Thursday, October 6, 2011

Throttling via the event queue

Here's a solution to a common problem that has some interesting advantages.

The problem is as follows. In a reactive system, such as a user interface, the incoming stream of events can sometimes be overwhelming. The most common example is mouse-move events. If the OS sends an application a hundred mouse-move events per second, and if the processing of each event takes more than ten milliseconds, then the application will drift further and further behind. To avoid this, the application should discard enough events that it stays caught up. That is, it should throttle the event stream. How should it do so?

The solutions I have run into do one of two things. They either delay the processing of events based on wall-clock time, or they require some sophisticated support from the event queue such as the ability to look ahead in the queue. The solutions that use time have the problem that they often introduce a delay that isn't necessary; the user will stop moving the mouse, but the application won't know it, so it will add in a delay anyway. The solutions using fancy event queues are not always possible, depending on the event queue, and anyway they make the application behavior more difficult to understand and test.

An alternative solution is as follows. Give the application a notion of being paused, and have the application queue Unpause events to itself to get out of the paused state. The first time an event arrives, process it as normal, but also pause the application and queue an Unpause event. If any other events arrive while the application is paused, simply queue them on the side. Once the Unpause event arrives, if there are any events on the side queue, drain the side queue, process the last event, and queue another Unpause event. If an Unpause event arrives before any other events are queued, then simply mark the application unpaused.

This approach has many of the advantages of looking ahead in the event queue, but it doesn't require any direct support for doing so. It also has about as good responsiveness under system load as appears possible to achieve. If the system is lightly loaded, then every event is processed, and the system is just as responsive as it would be without the throttling. If the system is loaded, then enough events are skipped that the application avoids falling further and further behind. If the load is temporarily high and then stops, the last event will still be processed promptly.

The one tricky part of implementing this kind of solution is posting the Unpause event from the application back to the application itself. That event needs to be on the same queue that the other work is queuing up on, or the approach will not work. How to do this depends on the particular event queue in question. For the case of a web browser, the best technique I know is to use setTimeout with a timeout of one millisecond.
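
Here is a minimal sketch of this scheme in TypeScript for a browser setting. The class and method names are my own, not part of any library; keeping only the latest deferred event is equivalent to draining a side queue and processing its last element.

```typescript
// Pause/unpause throttling: process an event, go "paused", and post an
// Unpause back onto the same event queue. Events arriving while paused are
// set aside; only the most recent one matters.
class ThrottledHandler<E> {
  private paused = false;
  private pending: E | null = null;

  constructor(private process: (event: E) => void) {}

  // Call this for every incoming event (e.g. from a mousemove listener).
  handle(event: E): void {
    if (this.paused) {
      this.pending = event; // queue on the side, discarding older ones
      return;
    }
    this.process(event);
    this.pause();
  }

  private pause(): void {
    this.paused = true;
    // Post the Unpause event to the same queue the other work arrives on.
    // In a browser, setTimeout with a tiny delay is the way to do that.
    setTimeout(() => this.unpause(), 1);
  }

  private unpause(): void {
    if (this.pending !== null) {
      const event = this.pending;
      this.pending = null;
      this.process(event); // process the last deferred event
      this.pause();        // and stay paused for another round
    } else {
      this.paused = false;
    }
  }
}

// Usage: at most one mousemove is processed per pause interval, with no
// artificial wall-clock delay when the system is keeping up.
const mouseThrottle = new ThrottledHandler<MouseEvent>((e) => {
  console.log(`mouse at ${e.clientX}, ${e.clientY}`);
});
document.addEventListener("mousemove", (e) => mouseThrottle.handle(e));
```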

Kudos to Princeton for an open-access policy

It seems that Princeton has adopted an open-access policy for the papers their faculty publish.
...each Faculty member hereby grants to The Trustees of Princeton University a nonexclusive, irrevocable, worldwide license to exercise any and all copyrights in his or her scholarly articles published in any medium, whether now known or later invented, provided the articles are not sold by the University for a profit, and to authorize others to do the same.... The University hereby authorizes each member of the faculty to exercise any and all copyrights in his or her scholarly articles that are subject to the terms and conditions of the grant set forth above.

The legalese makes my head spin, but I think they are saying that the university gets full access to all faculty publications, and that the university in turn grants all faculty full access to their own publications. As a programmer, I yearn to write it more simply, and probably to drop the "for a profit" part. Still, the spirit is there: anything published by a Princeton faculty member will not be hidden exclusively behind a paywall.
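
Here, with the obvious caveat that I am a programmer and not a lawyer, is roughly how I would state my reading of it. The names are mine, not Princeton's.

```typescript
// My lay reading of the policy, stated as predicates. Not legal advice.
interface Article {
  author: string;                      // a Princeton faculty member
  soldForProfitByUniversity: boolean;
}

// The university may exercise the copyrights (and authorize others to),
// provided it is not selling the article for a profit.
function universityMayDistribute(article: Article): boolean {
  return !article.soldForProfitByUniversity;
}

// The author always keeps full rights to his or her own articles.
function authorMayDistribute(article: Article, person: string): boolean {
  return person === article.author;
}
```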

Hat tip to Andrew Appel, who emphasizes that this policy is compatible with ACM's paywall:
Most publishers in Computer Science (ACM, IEEE, Springer, Cambridge, Usenix, etc.) already have standard contracts that are compatible with open access. Open access doesn't prevent these publishers from having a pay wall, it allows other means of finding the same information.

This is true, but I find it too gentle on the ACM. The ACM is supposed to be the preeminent association of computer-science researchers in the United States. They would serve their members, not to mention science, if they made their articles open access. Charge the authors, not the readers.

Friday, September 30, 2011

Pseudonyms lead to uncivil forums?

I am late to realize it, but apparently Google Plus is requiring real names only. They go so far as to shut down accounts that use a name they are suspicious of, and they're doing a lot of collateral damage to people with legal names that happen to sound funny.

The battle for "real names" is one that I have a hard time understanding. Partially this is because it is impossible to pin down which names are "real". Is it the one on your legal papers? On a credit card or bank account? The one people call you all the time? Partially it is that I started using forums at an impressionable age. Online forums are filled with pseudonyms, and they work just fine. Hobbit and Ghostcrawler are the real names of real people in my world. It's all so normal and good that I have a hard time understanding why someone would want to shut it down.

Let me take a try at it, though, because I think it's important that pseudonymity thrive on the Internet.

The most common defense I hear for a real-names policy is that it improves the quality of posts in a forum. That's the reason Blizzard used when they announced they would require real names only on their official forums. As far as I can understand, the idea is that a "real name" gives some sort of accountability that a pseudonym does not.

There is much to say on this, but often a simple counter-example is the strongest evidence. Here are the first four Warcraft guilds I could find, by searching around on Google, that have online forums viewable by the public.

Feel free to peruse them and see what a forum is like without real names. At a glance, I don't see a single real name. Everyone is posting using names like Brophy, Porcupinez, and Nytetia. As well, after skimming a few dozen posts, I didn't find a single one that is uncivil. In fact, the overall impression I get is one of friendliness. Camaraderie. Just plain fun.

The tone of these forums is not surprising if you think about the relationship the members of a guild have with each other. This is just the sort of thing you see over and over again if you participate in Internet forums. It is just the kind of thing that will be shut down under a real names policy.

Monday, September 26, 2011

Teach your build tool jars, not classes

As far as I can tell, the single compelling feature of ant as a build system is that it makes it easy to compile Java. I just encountered a wiki page where SCons developers are discussing some of the problems.

One of the problems is this:
Currently the DAG built by SCons is supposed to have knowledge about every single generated file so that the engine can work out which files need transformation. In a world where there is 1 -> 1 relationship between source file and generated file (as with C, C++, Fortran, etc.) or where there is 1 -> n relationship where the various n files can be named without actually undertaking the compilation, things are fine. For Java, and even more extremely for languages like Groovy, it is nigh on impossible to discover the names of the n files without running the compiler -- either in reality or in emulation.

It's actually worse. It's not really a 1->n compile. The 1 file on the left can only be compiled by consulting other input files, and if any of those files change, you also need to recompile. Determining the exact dependency graph is a rather complicated problem.

I believe such a graph is indispensable if you want a decent build tool. "Decent" is subjective, but surely anyone would say that rebuilds should be reliable. You don't generally get that with ant. If you are building with ant, you have probably gotten very familiar with "ant clean".

To address Java's build problems without having to use ant, what I do is set up my build files in terms of jar files rather than directories of class files. If you do that, then even though the Java or Scala compiler produces loads of class files, the build tool never actually sees them. They are created in a temporary directory, combined into a jar, and then the temporary directory is deleted. It's true that you don't get the optimal rebuild this way if you change just one file, but I'd usually use an IDE anyway if I were repeatedly editing the files of a single Java or Scala module. If I change just one file at random, I'd prefer a safe rebuild to the absolute fastest one.
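
Here is a sketch of that jar-oriented build step, written as a small Node script purely for illustration; the source directory, module name, and jar path are made up, and it assumes javac and jar are on the PATH.

```typescript
import { execSync } from "child_process";
import { mkdtempSync, rmSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

// Compile a module into a temporary directory the build tool never sees,
// package the result as a single jar, and delete the class files.
function buildJar(sourceDir: string, jarPath: string): void {
  const classesDir = mkdtempSync(join(tmpdir(), "classes-"));
  try {
    execSync(`javac -d ${classesDir} ${sourceDir}/*.java`, { stdio: "inherit" });
    // The jar is the only artifact the build graph knows about.
    execSync(`jar cf ${jarPath} -C ${classesDir} .`, { stdio: "inherit" });
  } finally {
    // Remove the class files so nothing downstream can depend on them.
    rmSync(classesDir, { recursive: true, force: true });
  }
}

buildJar("src/main/java/mymodule", "build/mymodule.jar");
```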

In principle you can update the build tool to accurately model all the class files. Ant's depend task does so, and the Simple Build Tool uses a scalac plugin to track dependencies. While I don't have experience with the SBT version, I have found the ant version to significantly slow down compiles and yet still not be completely reliable. I prefer to stick with jar files and have the build tool be reliable. Besides, you shouldn't have to use a particular build tool just because your project includes some code in some particular language. It doesn't scale as soon as you add a second language to the project.

Tuesday, August 23, 2011

A Scientist's Manifesto?

I was disheartened today to read so many glowing reviews of Robert Winston's "Scientist's Manifesto".

I think of science as a way to search for knowledge. It involves forming explanations, making predictions based on those explanations, and then testing whether those predictions hold true. Scientists make claims, and they fight for those claims using objective evidence from repeatable experiments.

Winston promotes an alternative view of science, that scientists are people who are in a special inner circle. They've gone to the right schools, they've gone through the right processes, and they've had review by the right senior scientists. Essentially, they are priests of a Church of Science. His concern is then with the way in which members of this church communicate with the outside.

If that sounds overblown, take a look at item one in the 14-item manifesto. It even uses the term "layperson":
We should try to communicate our work as effectively as possible, because ultimately it is done on behalf of society and because its adverse consequences may affect members of the society in which we all live. We need to strive for clarity not only when we make statements or publish work for scientific colleagues, but also in making our work intelligible to the average layperson. We may also reflect that learning to communicate more effectively may improve the quality of the science we do and make it more relevant to the problems we are attempting to solve.

Aside from its general thrust, there are many individual items I disagree with. For example, I think of scientists interested in a topic as conferring with each other through relatively specialized channels. Thus item three is odd to me:
The media, whether written, broadcast or web-based, play a key role in how the public learn about science. We need to share our work more effectively by being as clear, honest and intelligible as possible in our dealings with journalists. We also need to recognize that misusing the media by exaggerating the potential of what we are studying, or belittling the work of other scientists working in the field, can be detrimental to science.

Of course, it makes perfect sense if you think of science as received wisdom that is then propagated to the masses.

I also think of science as seeking objective truth. I can't really agree with the claim that it is relative:
We should reflect that science is not simply ‘the truth’ but merely a version of it. A scientific experiment may well ‘prove’ something, but a ‘proof’ may change with the passage of time as we gain better understanding.

I don't even think peer review is particularly scientific. The main purpose of peer review is to provide a mechanism for measuring the performance of academics; in some sense it measures how much other academics like you. Yet item 8 of the manifesto treats peer review as a sacred process that turns ordinary words into something special, much like the process behind a fatwa:
Scientists are regularly called upon to assess the work of other scientists or review their reports before publication. While such peer review is usually the best process for assessing the quality of scientific work, it can be abused....



I have an alternative suggestion to people who want the public to treat scientists with esteem. Stop thinking of yourself as a priest, evangelist, or lobbyist trying to propagate your ideas. Instead, remember what it is that's special about your scientific endeavors. Explain your evidence, and invite listeners to repeat crucial parts of the experiment themselves. Don't just tell people you are a scientist. Show them.

Friday, August 19, 2011

Why LaTeX?

Lance Fortnow laments that no matter how crotchety he gets he can't seem to give up LaTeX:
LaTeX is a great system for mathematical documents...for the 1980s. But the computing world changed dramatically and LaTeX didn't keep up. Unlike Unix I can't give it up. I write papers with other computer scientists and mathematicians and since they use LaTeX so do I. LaTeX still has the best mathematics formulas but in almost every other aspect it lags behind modern document systems.

I think LaTeX is better than he gives it credit for. I also think it could use a reboot. It really is a system from the 80s, and it's... interesting how many systems from the 70s and 80s are still the best of breed, still in wide use, but still not really getting any new development.

Here's my hate list for LaTeX:
  • The grammar is idiosyncratic, poorly documented, and context-dependent. There's no need for any of that. There are really good techniques nowadays for giving a very extensible language a base grammar that is consistent in every file and supports self-documentation.
  • You can't draw outside the lines. For all the flexibility the system ought to have due to its macro system, I find the many layers of implementation to be practically impenetrable. Well written software can be picked up by anyone, explored, and modified. Not so with LaTeX--you have to do things exactly the way the implementers imagined, or you are in for great pain and terrible-looking output.
  • The error messages are often inscrutable. They may as well drop all the spew and just say, "your document started sucking somewhere around line 1308".
  • The documentation is terrible. The built-in documentation is hard to find and often stripped out anyway. The Internet is filled with cheesy "how to get started" guides that drop off right before they answer whatever question you have.
  • Installing fonts is a nightmare. There are standalone TrueType fonts nowadays; you should be able to drop in a font and configure LaTeX to use it. That this is not possible suggests that the maintainers are as afraid of the implementation as I am.
  • Files are non-portable and hard to extract. This problem is tied up in the implementation technology. Layered macros in a global namespace are not conducive to careful management of dependencies, so everything tends to depend on everything.
However, as bad as that list is, the pros make it worth it:
  • Excellent looking output, especially if you use any math. If you care enough to use something other than ASCII, I would think that the document appearance trumps just about any other concern.
  • Excellent collaborative editing. You can save files in version control and use file-level merge algorithms effectively. With most document systems, people end up mailing each other the drafts, which is just a miserable way to collaborate.
  • Scripting and macros. While you can't reasonably change LaTeX itself, what you can easily do is add extra features to the front end by way of scripts and macros.
  • It uses instant preview instead of WYSIWYG. WYSIWYG editors lead to quirky problems that are easy to miss in proofreading, such as headers being at the wrong level of emphasis. While I certainly want to see what the output will look like all the time, I don't want to edit that version. I want to edit the code. When you develop something you really want to be good, you want very tight control.
  • Scaling. Many document systems develop problems when a document is more than 10-20 pages long. LaTeX keeps on chugging for at least up to 1000-page documents.
I would love to see a LaTeX reboot. The most promising contender I know of is the Lout document formatting system, but it appears to not be actively maintained.

Monday, August 15, 2011

Paul Chiusano on software patents

Paul Chiusano reminds us why we would conceivably want software patents:
What I find irritating about all the software patent discussion is that patents are intended to benefit society - that is their purpose, "To promote the Progress of Science and useful Arts". But no one seems to want to reason about whether that is actually happening - that would mean doing things like thinking about how likely the invention was to be independely discovered soon anyway, estimating the multiplier of having the invention be in the public domain, etc. Instead we get regurgitation of this meme about making sure the little guy working in his basement gets compensated for his invention.

It's a good reminder. The point of patents is to make society better off.

The standard argument for patents requires, among other assumptions, that the patented inventions require a significant level of investment that would not otherwise occur. As Paul points out, that is not the case for software:
Software patents rarely make sense because software development requires almost no capital investment, and as a result, it is almost impossible for an individual to develop some software invention that would not be discovered by multiple other people soon in the future. Do you know of any individual or organization that is even capable of creating some software "invention" that would not be rediscovered independently anyway in the next five or ten years? I don't. No one is that far ahead of everyone else in software, precisely because there is no capital investment required and no real barriers to entry.

I agree.

I have read many posts where people try to fine tune software patents to make them less awful. I wish we could instead start by considering the more fundamental issue. Do we want software patents at all?

Wednesday, August 10, 2011

Schrag on updating the IRBs

Zachary Schrag has a PDF up on recent efforts to update institutional review boards (IRBs). Count me in as vehemently in favor of two of the proposals that are apparently up for discussion.

First, there is the coverage of fields that simply don't have the human risks that medical research does:
Define some forms of scholarship as non-generalizable and therefore not subject to regulation. As noted above, the current regulations define research as work “designed to develop or contribute to generalizable knowledge.” Since the 1990s, some federal officials and universities have held that journalism, biography, and oral history do not meet this criterion and are therefore not subject to regulation. However, the boundaries of generalizability have proven hard to define, and historians have felt uncomfortable describing their work as something other than research.

I would add computer science to the list. A cleaner solution is as follows:
Accept that the Common Rule covers a broad range of scholarship, but carve exceptions for particular methods. Redefining “research” is not the only path to deregulation contemplated by the ANPRM, so a third possibility would be to accept Common Rule jurisdiction but limit its impact on particular methods.

Schrag's PDF gives limited attention to this option, but it seems the most straightforward to me. If a research project involves interviews, studies, or workplace observations, then it just shouldn't need ethics review. The potential harms are so minor that it should be fine to follow up on reports rather than to require ahead-of-time review.


Schrag also takes aim at exempt determinations:
Since the mid-1990s, the federal recommendation that investigators not be permitted to make the exemption determination, combined with the threat of federal sanction for incorrect determinations, has led institutions to insist that only IRB members or staff can determine a project to be exempt. Thus, “exempt” no longer means exempt, leaving researchers unhappy and IRBs overwhelmed with work.

Yes! What kind of absurd system declares a project exempt from review but then requires a review anyway?

Monday, August 8, 2011

TechDirt on the latest draft of PROTECT IP

Tech Dirt has an analysis of the latest available version of PROTECT IP.
Yesterday, we got our hands on a leaked copy of the "summary" document put together by those writing the new version of COICA, now renamed the much more media friendly PROTECT IP Act. It looked bad, but some people complained that we were jumping ahead without the actual text of the bill, even if the summary document was pretty straightforward and was put together by the same people creating the bill. Thankfully, the folks over at Don't Censor the Internet have the full text of the PROTECT IP Act, which I've embedded below as well. Let's break it down into the good, the bad and the horribly ugly.

I find it hard to care about the nitty gritty details of the approach. The bill is still fundamentally about taking down DNS names on the mere allegation of infringement, and that seems like a very bad idea to me.

Sunday, August 7, 2011

Brad on foreign CEOs

Brad Templeton describes a good way to explain the current distribution of nationalities in the tech field:
I gave him one suggestion, inspired by something I saw long ago at a high-roller CEO conference for the PC industry. In a room of 400 top executives and founders of PC and internet startups, it was asked that all who were born outside the USA stand up. A considerable majority (including myself) stood. I wished every opponent of broader immigration could see that.

I agree with Brad that, at least in the software field, we benefit tremendously from foreign workers.

I suspect most observers would agree if they thought about it. You don't have to look just at executives. Walk into any software shop and you will see that a large fraction of the workers were born abroad. Furthermore, talk to any software developer about the job market; it's not like they are hurting for work. If we sent all the foreign workers home, it's not like we'd have more American programmers at work. We'd simply have less total computer work being done.

It seems that software is getting swept up in laws and regulation that were developed with other fields in mind. If you follow the political discussions on the topic, it is always about lower-skilled jobs in fields where it is tough to start a new company. This depiction simply does not match computer science.

It's the same sort of thing that happens with research oversight. Research oversight is driven by the needs of medical research, and it just doesn't match the ethical issues that computer researchers face.

Inducing infringement alive and well

Mitch Golden writes, in a good analysis of the legal state of LimeWire's file-sharing software, that inducing infringement was a key part of the October 2010 court case against them:
Interestingly, the court largely sidestepped the technical issues as to whether Gnutella itself had non-infringing uses or not, or whether a Gnutella client can be legally distributed. The court's decision instead turned on evidence submitted by the plaintiffs that LimeWire intended to facilitate filesharing.

I continue to feel that we are much better off leaving content carriers alone. Trying to make content carriers into IP policemen is not going to work out well.

Saturday, July 30, 2011

Cedric on type erasure

I've been meaning to get around to posting on type erasure, and Cedric Beust beat me to it:
The main problem is that reified generics would be incompatible with the current collections.... The extra type information also impacts the interoperability between languages within the JVM but also outside of it.

I completely agree. I used to rail against erasure until I got more experience with it.

The interoperability issue is one big reason I now like erasure. With erased types, the interop layer uses only a very simple type system. Knowledge of complicated type systems stays within the compilers for individual languages.

An additional reason is that it puts the cost of type checking in the compiler rather than in the runtime. With erased types, the compiler works hard to do its type checking, and if it signs off, the code is known to be type safe. At runtime, the types disappear and the code runs at full speed.

This property is more than just pretty. It is very helpful to an engineer trying to build anything using the language. When you write code, you want to know how it is going to perform. With erasure, the things you write convert directly to machine code, just with extra details added such as which variable goes in which register. With reification, you end up with extra crud being inserted everywhere. To understand performance under reified types, you have to reason about this additional type crud. You'd rather not have to.
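
TypeScript is not a JVM language, but it makes the same trade-off, so it gives a quick feel for the point: generic type parameters exist only for the compiler and vanish from the emitted code. The example is mine, not Cedric's.

```typescript
// Checked thoroughly at compile time...
function firstOrDefault<T>(items: T[], fallback: T): T {
  return items.length > 0 ? items[0] : fallback;
}

const n: number = firstOrDefault([1, 2, 3], 0);    // OK
const s: string = firstOrDefault(["a", "b"], "");  // OK
// const bad: string = firstOrDefault([1, 2, 3], 0);  // rejected by the compiler

// ...but the emitted JavaScript carries no trace of T:
//
//   function firstOrDefault(items, fallback) {
//     return items.length > 0 ? items[0] : fallback;
//   }
//
// No runtime checks, no reified type descriptors -- none of the "type crud"
// you would otherwise have to reason about when thinking through performance.
```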

Sunday, July 17, 2011

A major milestone for Scala+GWT

Stephen Haberman has announced that the majority of features in GWT's "Showcase" app are now available to Scala code as well. Aaron Novstrup did the original port to Scala, and Stephen's recent revamp of the GWT import code has gotten this much more functionality working.

Grzegorz Kossakowski has posted a compiled version on Dropbox for anyone that would like to click around in the final deployed version. A few of the things you will see working are:
  • Generators (see the "source code" tab)
  • Internationalization
  • Code splitting
  • Numerous built-in widgets

Scala+GWT still requires a lot of handholding to do anything with it, but this is a major milestone. Kudos to everyone who contributed!

Thursday, July 7, 2011

Professors' letter against PROTECT-IP

A number of professors have signed a letter to the U.S. Congress opposing Protect IP:
The undersigned are 108 professors from 31 states, the District of Columbia, and Puerto Rico who teach and write about intellectual property, Internet law, innovation, and the First Amendment. We strongly urge the members of Congress to reject the PROTECT-IP Act (the "Act"). Although the problems the Act attempts to address -- online copyright and trademark infringement -- are serious ones presenting new and difficult enforcement challenges, the approach taken in the Act has grave constitutional infirmities, potentially dangerous consequences for the stability and security of the Internet's addressing system, and will undermine United States foreign policy and strong support of free expression on the Internet around the world.

The most important point raised in the letter is that the Act violates free speech. Setting aside the constitutional issue in the U.S., isn't it simply a bad way for people to interact online? Shutting down a DNS name is much like cutting off a person's phone service, something that is simply not done unless the person is about to be arrested. The authors accurately call it an "Internet death sentence". It's far overboard.

The letter also raises the issues with secure DNS, but I believe this is a counter-productive argument. Secure DNS is a gift to anyone who wants to cut off DNS records. Sure, PROTECT-IP as it stands might not work, but all that means is that Secure DNS version 2 will be updated to have a government back door. The problems of PROTECT-IP are not technical.

Most of all, I really wish people could be more creative about digital copyright. You can copy bits, but you can't copy skill. Thus, we would do better to sell skill than to sell the bits that result from it. We can make that change, but expect Hollywood to fight it.

Friday, June 24, 2011

Against manual formatting

Programming languages are practically always expressed in text, and thus they provide the programmer with a lot of flexibility in the exact sequence of characters used to represent any one program. Examples of manual formatting are:
  • How many blank lines to put between different program elements
  • Whether to put the body of an if on the same line or a new line
  • Whether to sort public members before private members
  • Whether to sort members of a class top-down or bottom-up, breadth-first or depth-first
I've come to think we are better off not taking advantage of this flexibility. It takes a significant amount of time, and for well-written code, its benefits are small. First consider the time. The first place manual formatting takes time is in the initial entry of code. Programmers get their code to work, and then they have to spend time deciding what order to put everything in. It's perhaps not a huge amount of time, but it is time nonetheless. The second place manual formatting takes time is when people edit the code. The extract-method refactoring takes very little time for a programmer using an IDE, but if the code is manually formatted, the programmer must then consider where to place the newly created method. It will take much longer to rearrange the new method than it did to create it.

Worse, sometimes the presentation the first programmer used no longer makes sense after the edits the second programmer makes. In that case, the second programmer has to come up with a new organization. As well, they have to spend time evaluating whether a new organization is worthwhile at all; to do that, they first have to spend time with the existing format trying to make it work. There is time being taxed away all over the place.

Meanwhile, what is the benefit? I posit that in well-written code, any structural unit should have on the order of 10 elements within it. A package should have about 10 classes, a class should have about 10 members, and a method should have about 10 statements. If a class has 50 members, it's way too big. If a method has 50 statements, it, too, is way too big. The code will be improved if you break the large units up into smaller ones.

Once you've done that, the benefit of manual formatting becomes really small. If you are talking about a class with ten members, who cares what order they are presented in? If a method has only 5 statements, does it matter how much spacing there is between them? Indeed, if the code is auto-formatted, then those 5 statements can be mentally parsed especially quickly. The human mind is an extraordinary pattern matcher, and it more quickly matches patterns that it has seen many times before.

I used to argue that presentation is valuable because programs are read more than written. However, then I tried auto-formatting and auto-sorting for a few months, and it was like dropping a fifty-pound backpack that I'd been carrying around. Yes, it's possible to walk around like that, and you don't even consciously think about it after a while, but it really slows you down. What I overlooked in the past was that it's not just lexical formatting that can improve the presentation of a program. Instead of carefully formatting a large method, good programmers already divide large methods into smaller ones. Once this is done, manual formatting just doesn't have much left to do. So don't bother. Spend the time somewhere that has a larger benefit.

Wednesday, June 15, 2011

Universities behind borders

Ronaldo Lemos has written a thought-provoking article about the role of non-Brazilians in a Brazilian university. He interviews Volker Grassmuck, a German professor working at a Brazilian university, who feels that the result is intellectually insular:
People read the international literature in the fields I’m interested in. But without having actual people to enter into a dialogue with this often remains a reproduction or at best an application of innovations to Brazil.

That is what I would expect. Academics want to work with other academics in the same specialty, and you aren't going to be able to build such units if you can only hire from the immediate locale. The people you really want will be elsewhere, and the people you get will spend their time trying to emulate them.

I feel that the U.S. is currently too strict on foreign workers for our own good, but we seem to be better off than in Brazil. In the U.S., once they finish groping you and finish making sure you aren't going to take a job in manual labor, you can do as you will. In Brazil, they make you redo your Ph.D. examinations, much like American states do with professional certifications such as dentistry and accounting.

It all seems very dirty to me. If you want a good intellectual atmosphere, you need to admit people from all over.

Friday, June 10, 2011

Peak IPv4?

In its announcement about "IPv6 Day", the Internet Society (ISOC) casually remarked that IPv4 addresses are about to run out:
With IPv4 addresses running out this year, the industry must act quickly to accelerate full IPv6 adoption or risk increased costs and limited functionality online for Internet users everywhere.

This is highly misleading, and the recommended solution is not a good idea. I am sure if pressed the ISOC would respond that by "running out", they mean in the technical sense that some registrar or another has now given away all its addresses. However, the actual verbiage implies something far different. It implies that if you or I try to get an IPv4 address on January 1, 2012, we won't be able to do so. That implication is highly unlikely.

A better way to think about the issue is via price theory. From the price theory point of view, the number of IPv4 addresses at any time is finite, and each address has a specific owner. At the current time, every valid address is owned by some entity or another. (Thus, in some sense they "ran out" a long time ago.)

When a new person wants to get an IPv4 address for their own use, they must obtain the rights from some entity that already has one. While some large organizations can use political mechanisms to gain an IPv4 address, most people must purchase or rent the address from some entity that already owns one. Typically those IP addresses are bundled with a service contract that provides Internet bandwidth, though in some cases addresses can be purchased by themselves.

The price one pays gives us a way to think about the scarcity of addresses. Diamonds are relatively scarce, and their price is correspondingly high. Boeing 747s are even more scarce, and their price is even higher. For IP addresses, the price is surely rising over time as more and more people hook things up to the Internet. Already the price is high enough that, for example, most home users do not assign a separate publicly routable address to every IP device in their home. They make do with a single IP address from their Internet provider.

What is that price right now? The question is crudely phrased, because some addresses are more valuable than others, and all addresses come with some sort of strings attached. However, we can get a ballpark idea by considering a few data points:
  • Linode offers its subscribers an extra IP address for $1/month.
  • Linode offers an IP address along with an Internet hosting service for under $20/month.
  • Broadband providers such as Comcast and AT&T offer an IP address along with Internet connectivity for on the order of $50/month.
From these observations we can infer that the cost of an IP address is at most a few dollars per month. With the cost this low, I can't see any major site going to IPv6-only any time soon. A few dollars per month is a very low price to pay for a great deal of extra accessibility. With the protocols designed as they are right now, the reason to consider IPv6 is that it's a newer, better protocol, not because it has more available addresses.

I wish public communication about IPv6 would make this more clear. The Internet is important, and as such, it is important that the techies get it right. This isn't a minor technical detail.

Thursday, June 9, 2011

Two kinds of type inference

There are two separate lines of work on type inference. While they are superficially similar, they face very different design constraints. Let me explain a few of those differences. To begin with, consider the following program. The two kinds of inference reach different conclusions about it.
  class Foo {
    var x: Object = "a string"
    var y: String = "initial value"
    def copy() {
      y = x    // type error, or no?
    }
  }

Should this example type check? There are two points of view.

One point of view is that the compiler is able to prove that x only ever holds strings. Therefore y only ever holds strings. Thus, there will never be a type error at run time, and so the code can be allowed to run as is. This point of view might be called information gathering. The tool analyzes the code, typically the whole program, and learns type information about that code.

Another point of view is that x holds objects and y holds strings, so the "y = x" line is a problem. Yes, the current version of x only ever holds strings. However, the "y = x" line is using x outside of its spec, and you don't want to use a variable outside of its spec even if you can temporarily get away with it. This point of view might be called slots and tabs. The slot y is not specced to hold the tab x. From this point of view, the indicated line is an error even though the program doesn't have any real problems.

Every user-facing type checker I am familiar with is based on the slots and tabs point of view. The idea is that programmers use types to provide extra structure to their programs. They don't want even nominal violations of the structure; they really want their types to line up the way they say they do. As a concrete example, imagine the author of Foo checks in their code, and someone else adds to the program the following statement: "foo.x = Integer.valueOf(12)". Now the nominal type error has become a real one, but the author of Foo has already gone home. It's better if the author of Foo finds out about the problem than if someone else does.

That's one example difference between the two kinds of type inference. A slots-and-tabs checker will flag errors that an information gatherer would optimize away. Here are three other design constraints that differ between the two.

Declarations are important for a type checker. For the type checker to know what the slots and tabs are specced as, it must have declared types. In the above example, if x and y did not have declared types on them, then the type checker for class Foo could not determine that there is a problem. To contrast, an information gatherer doesn't necessarily pay much attention to declarations. It can usually infer better information by itself, anyway.
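To see why, here is the same example with the declarations stripped off (just a sketch; the inferred types are written as comments):
  class Foo {
    var x = "a string"       // inferred as String
    var y = "initial value"  // inferred as String
    def copy() {
      y = x    // nothing to flag: both inferred types agree
    }
  }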

Changing a published type checker breaks builds. For a language under development, once a type checker has been published, it takes great care to change it without breaking any existing builds. Consider the addition of generics to Java 1.5, where it took a great deal of energy and cleverness to make it backwards compatible with all the existing Java code in the world. To contrast, an information gathering type inference can be swapped around at whim. The only impact will be that programs optimize better or worse or faster or slower than before.

Type checkers must be simple. The type system of a slots-and-tabs type checker is part of the contract between the compiler and a human developer. Human beings have to understand these things, human beings that for the most part have something better to do with their time than study types. As a result, there is tremendous design pressure on a slots-and-tabs type checker to make the overall system simple to understand. To contrast, the sky is the limit for an information gatherer. The only people who need to understand it are the handful of people developing and maintaining it.

Overall, I wish there were some term in common use to distinguish between these two kinds of type inferencers. Alas, both kinds of them infer things, and the things both of them infer are types, so the clash in terminology seems inevitable. The best that can be done is to strive to understand which kind of type inferencer one is working with. Developers of one or the other face different design constraints, and they will find different chunks of the published literature to be relevant.

Thursday, June 2, 2011

Martin Odersky on the state of Scala

Martin Odersky talked today at ScalaDays about the state of Scala. Here are the highlights for me.

Tools and Documentation

He doesn't think documentation is a big issue at this point. People used to complain to him about it all the time, and that has largely died out. I know that this demand for documentation was a large motivation for him to develop a book about Scala. Perhaps it helped.

He still thinks IDE support is an important problem. However, he claims the Eclipse support has gotten good enough for him to switch his compiler hacking from Emacs to Eclipse. That's high praise--Emacs users don't switch! I will have to try it again.

He emphasizes binary compatibility. In practice, libraries need to be recompiled when a new version of Scala comes out, because inevitably some trait or another has a new method in it. He has a number of ways to address this. He's long talked about tools to detect and fix problems by analyzing the bytecodes, and that work is going to be emphasized at TypeSafe. Additionally, new today is that he plans to designate stable releases of Scala that stay valid for a number of years and never have binary-incompatible changes.

He also pointed out that style checking tools would be helpful in larger development groups. It's a good point. Such tools don't take a lot of code, but I guess nobody has gotten interested enough in the problem to whip one up.

Functional programming

Martin went through an extended example based on a 2000 study comparing programming languages. In the study, students implemented a programming problem in one of seven different programming languages. The study is interesting on its own if you haven't read it before, and among other things shows that there is much more variability among programmers than among programming languages. However, we can learn something about programming languages by comparing either the best or the median solutions in each language.

Scala shines on the programming problem used in the study, and it's because of Scala's excellent support for running functions across collections. Such facilities don't work well unless the language has a concise notation for functions. Here is the bottom line on several different solutions:

  • Students using compiled languages: 200-300 lines of code
  • Students using scripting languages: ~100 lines of code
  • Martin's solution in Scala: 20-30 lines of code
  • A Java master's solution in Java: over 100 lines of code

I will attribute the "Java master" if I can find a reliable source, but Martin showed the Java example and it looked pretty reasonable at a glance. The reason it is so long compared to the Scala solution is that instead of using collection operations, it defines several recursive methods that record their progress in extra parameters to the methods. I've written a lot of code like that in the last few years. I think about the beautiful functional solution, and then I start over with an imperative solution because inner classes in Java require ever so much boilerplate.
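For a feel of the difference, here is a toy contrast between the two styles. It is not the study's code; the problem and names are made up:
  // Count the words at least min characters long, written two ways.

  // Collection-operation style, which leans on concise function notation:
  def countLong(words: List[String], min: Int): Int =
    words.count(_.length >= min)

  // Recursion that records its progress in an extra parameter, roughly the
  // shape the longer solutions take:
  def countLongRec(words: List[String], min: Int, acc: Int = 0): Int =
    words match {
      case Nil       => acc
      case w :: rest => countLongRec(rest, min, if (w.length >= min) acc + 1 else acc)
    }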

Parallelism

Martin talked a bit about his latest thinking on making use of multiple cores, a problem that has obsessed programming-language research for several years now. One principle he emphasized is that people are much better at finding one solution that works than at finding all the potential problems that can occur due to non-determinism. Thus, he's interested lately in programming-language constructs that are parallel yet still deterministic. That's a tough principle to achieve in an expressive language! It rules out all of actors (!), agents, and software transactional memory, because they all have state, and the state can change differently depending on the non-deterministic choices the implementation makes.

Aside from the general talk on the challenges of parallelism, Martin talked a bit about the parallel collections in Scala. They're better than I realized. Their implementation uses fork-join with work stealing, rather than blindly creating lots of threads. As an additional twist, they adaptively choose a chunk size based on how much work stealing appears to be happening. If there is no work stealing, then every node must be busy, so increase the chunk size.

To demonstrate how the collections can help, he made two tweaks to the phone-number solution to switch from sequential to parallel collections. After doing so, the program ran 2.5 times faster on a 4-core machine. One can imagine doing better than 2.5 times faster, but it's a very good start given how very easy the change was.
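I didn't write down Martin's exact change, but the flavor is a one-call switch along these lines (signature is a stand-in for the real grouping key, and .par is the standard-library call of that era):
  val words = Vector("listen", "silent", "enlist", "inlets")
  def signature(w: String): String = w.sorted  // stand-in helper

  val sequentially = words.groupBy(signature)      // runs on one core
  val inParallel   = words.par.groupBy(signature)  // fork-join with work stealing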

Domain-specific languages

Martin emphasized that Scala is excellent for DSLs. However, he points out, and I agree, that embedded DSLs in Scala mesh so well that they are essentially just Scala code. I vastly prefer this style of DSL to the style where you embed the DSL but the constructs of the DSL don't map well to constructs in the host language. Since the code is doing something different from what it looks like it does, all kinds of weird bugs can arise. Whenever working on a DSL that doesn't embed in a straightforward way, I'd prefer to make it an external DSL with a plain old parser and interpreter.

Wednesday, June 1, 2011

Secure DNS supports PROTECT IP

There is some commentary lately about a paper arguing that PROTECT IP is fundamentally incompatible with secure DNS. This argument is misleading in the extreme. The strategy with DNSSEC is to have root authorities digitally sign DNS records, just like with TLS. As such, it is vulnerable in the same place as TLS. Whoever controls the root servers has ultimate control over what Internet-connected computers will consider to be the truth.

Far from making PROTECT IP more difficult, a hypothetical success of DNSSEC would make it easier. With DNS as it currently works, governments must contend with what, from their perspective, are rogue DNS servers that continue to post "false" (meaning correct) addresses. Under DNSSEC, the rogue server's certificate chains will not check out. Whenever a government orders a domain name to be changed, the root servers will not just issue the new address, but presumably also cryptographically revoke the old one. It would all work as if it were the legitimate domain owner making the request instead of a government.

I don't think the technical arguments about PROTECT IP are convincing. DNS is by its nature a sitting duck. The technical argument I would make is that a global Hierarchy of Truth is not a good approach to security on the Internet. If you don't like PROTECT IP, then you shouldn't like DNSSEC nor DNS as we currently know it.

Given how things technically work right now, however, the best argument against PROTECT IP is simply that we don't want to live that way. Do we really want to live in a world where Sony or Blizzard or MGM can turn off a web site without the site owner getting to defend themselves in court? Is 20th century copyright really worth such heavy handed measures?

Thursday, May 26, 2011

Google Wallet: Why NFC ?

I was excited to read that Google is going to build a payment method based on a smartphone:
Today in our New York City office, along with Citi, MasterCard, First Data and Sprint, we gave a demo of Google Wallet, an app that will make your phone your wallet. You’ll be able to tap, pay and save using your phone and near field communication (NFC). We’re field testing Google Wallet now and plan to release it soon

I have long wished that payments could be made using an active device in the buyer's possession rather than having the buyer type secret information--a PIN--into a device the seller owns. That arrangement requires that a device the buyer has never seen before be diligent about deleting the PIN after it is used. It also requires that a device the buyer has never seen before is making the same request to the bank that it is displaying on its screen. Security is much higher when using a device the buyer owns.

The main flaw with this approach is that it requires people to carry around these active devices. Google's bright idea is to make that device be a smartphone. Brilliant.

The one thing I don't understand is why Google is only supporting it using NFC. I had never heard of NFC until today, and for any readers like me, it is basically a really dumb, short-range, low-bandwidth wireless protocol. It sounds well-suited for the application, but no current phones support it.

An alternative approach that should work with current phones is to use barcode-reading software. The seller's hardware would display a barcode that includes a description of what is being bought, the amount of money, and a transaction ID number. It would simultaneously upload the transaction information to Google. The buyer would scan the barcode, and if the buyer authorizes the payment, the phone would send the authorization to Google. The seller would then receive notification that the payment has been authorized. For larger transactions, further rounds of verification are possible, but for groceries and gas, that could be the end of the story.
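Roughly, the flow I have in mind looks like this (a sketch with made-up names; none of this is a real Google API):
  case class Transaction(id: String, amountCents: Long, description: String)

  // 1. The seller's terminal renders this payload as a barcode and uploads
  //    the same transaction record to the payment service.
  def barcodePayload(t: Transaction): String =
    t.id + "|" + t.amountCents + "|" + t.description

  // 2. The buyer's phone scans the barcode, shows the details, and, only if
  //    the buyer approves, returns an authorization for that transaction ID.
  def authorizationFor(payload: String, buyerApproved: Boolean): Option[String] = {
    val transactionId = payload.takeWhile(_ != '|')
    if (buyerApproved) Some(transactionId) else None
  }

  // 3. The payment service matches the authorization against the uploaded
  //    transaction and notifies the seller.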

Why limit the feature to NFC devices? While NFC solutions look a little more convenient, barcodes don't look bad. Why not offer both?

Wednesday, May 25, 2011

Regehr on bounding the possible benefits of an idea

John Regehr posted a good thought on bounding the possible benefits of an idea before embarking on weeks or months of development:
A hammer I like to use when reviewing papers and PhD proposals is one that (lacking a good name) I call the “squeeze technique” and it applies to research that optimizes something. To squeeze an idea you ask:
  • How much of the benefit can be attained without the new idea?
  • If the new idea succeeds wildly, how much benefit can be attained?
  • How large is the gap between these two?

I am not sure how big of a deal this is in academia. If you are happy to work in 2nd-tier or lower schools, then you probably need to execute well rather than to choose good ideas. However, it's a very big deal if you want to produce a real improvement to computer science.

The first item is the KISS principle: keep it simple, stupid. Given that human resources are usually the most tightly constrained, simple solutions are very valuable. Often doing nothing at all will already work out reasonably. Trickier, there is often a horribly crude solution to a problem that will work rather effectively. In such a case, be crude. There are better places to use your time.

The second item is sometimes called a speed of light bound, due to the speed of light being so impressively unbeatable. You ask yourself how much an idea could help even if you expend years of effort and everything goes perfectly. In many cases the maximum benefit is not that high, so you may as well save your effort. A common example is in speeding up a system. Unless you are working on a major bottleneck, any amount of speedup will not help very much.
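For the speed-up case, the bound is easy to compute. This is just Amdahl's law, sketched here for illustration rather than anything from Regehr's post:
  // If the code being optimized accounts for fraction p of total runtime,
  // then even an infinite speedup of that part caps the overall gain:
  def speedOfLightBound(p: Double): Double = 1.0 / (1.0 - p)

  speedOfLightBound(0.10)  // ~1.11x overall: probably not worth months of work
  speedOfLightBound(0.80)  // 5.0x overall: possibly worth a serious effort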

Thursday, May 19, 2011

IRBs under review

There are several interesting blog entries up at blog.bioethics.gov concerning the ongoing presidential review of Institutional Review Boards.

I liked this line:
“We pushed for an ethical reform of system, real oversight, and now we are left with this bureaucratic system, really a nitpicking monster,” Arras said, addressing Bayer. “And I am as stupefied as you are.”

I am not sure why this pattern would be stupefying. A great many things that people attempt to do don't work out as intended. IRBs are just one more for the list, albeit one that has lingered for decades.

I am not as sanguine as the reviewers are about this conclusion on whether another "Guatemala" could happen:
“Of the many things that happened there, no, it could not happen again because of informed consent,” said Dafna Feinholz, chief of the Bioethics Section, Division of Ethics and Science and Technology, Sector for Social and Human Sciences, United Nations Educational, Scientific and Cultural Organization.

The idea is that since IRBs require informed consent of study participants, the Guatemala experiments could never again happen, because the study participants would know what is going on.

I hope so, but consider the following evil scenarios:
  • A wealthy autodidact negotiates directly with local authorities and runs the experiment on his own dime. No university is involved, so no IRB review even happens.
  • A university researcher learns about a disease outbreak in some part of the world. The researcher waits two years and then applies for a research grant to study the effects of the disease. Since the researcher did nothing for the first two years, there was nothing for the IRB to review.
  • Professor Muckety, holder of the Hubert OldnDusty Chair at BigGiantName University, announces a grand new experiment that he expects will cure cancer. He invites all of the up-and-coming faculty in his area to take part in it, and there will be numerous papers and great acclaim for all the participants. The IRB at BigGiantName U. is stacked with faculty who are totally brainwashed into thinking the experiment is for the greater good. Will they really take a stand against the project?
I would not be so sure that, despite all the efforts of IRBs, an evil experiment couldn't happen again.

Whenever something goes wrong, there is a natural reaction for everyone to yell, "DO SOMETHING!" IRBs are the result of such an outcry. They are there to protect human subjects, but I don't believe they are very effective at that. I believe that the MucketyMucks largely breeze through the red tape doing whatever they like, and instead we are staffing a bunch of bureaucrats to check that the smaller players filed form T19-B in triplicate, double spaced and typed on a manual typewriter.

Carving out a large exempt category would be a major improvement on the current mess. Surveys, observations, and other experiments with minimal opportunity for harm shouldn't need prior review.

Tuesday, May 17, 2011

It was just getting started...

There are many things wrong with California jumping in to regulate Facebook's privacy policies:
  • Facebook is a world-wide service, not a California service. Why is this up to California?
  • Facebook has over five hundred million users. That's five times more than the number of people who watch the Super Bowl. Whatever Facebook is doing, it must be pretty reasonable.
  • Social network sites tend to only last about five years before the next new hotness overtakes them. The odds are against Facebook lasting all that long.

All of these matter, but the last one is most peculiar to Internet services. I really want to see what the next social site is like, and the next site after that. I don't relish a long sequence of watered-down Facebook clones with all of their paperwork properly stamped and in order. How dreary.

Monday, May 16, 2011

A package universe for Lisp

Quicklisp is a package manager for Common Lisp that is popular among Lisp programmers. I'm happy to read that one of their secrets to success is quality control in the central distribution:
Quicklisp has a different approach:
  • Quicklisp does not use any external programs like gpg, tar, and gunzip.
  • Quicklisp stores copies of project archives in a central location (hosted on Amazon S3 for reliability and served with Amazon CloudFront for speed).
  • Quicklisp computes a lot of project info in advance. Projects that don't build or don't work nicely with other projects don't get included.
I would quibble with the order of their bullet points, because the last point is overwhelmingly important. It isn't a little side benefit to have a well-defined distribution and to test the members of that distribution against each other. On the contrary, it's a make or break property of the system if you want users to have some level of confidence in the code they're downloading.

Wednesday, May 11, 2011

Sven Efftinge on Java-killer languages

I just ran across Sven Efftinge's fun post on what he wants to see in a Java-killer language.

My list would be something like: remove boilerplate for common coding arrangements, make things easier to understand, be compatible with existing Java code, and otherwise leave everything alone.

Sven has a more detailed list. Here are his bullet points and some thoughts on them:

1. Don't make unimportant changes. Gosh yes. Changing = to :=, or changing the keywords, adds a barrier to entry for anyone learning the language. Don't do it without a real benefit.

2. Static typing. Static typing is one of those decisions where the up-front choice is far from obvious and involves many intangibles, but once you choose, many of the follow-up choices are fairly clear to people who know the area. I think it is perfectly reasonable to have untyped languages on the JVM, and I think it's perfectly reasonable to have simply typed languages with generics only used for collections. Note that the choice will strongly influence what sorts of applications the language is good for, however. Additionally, I would emphasize that today's type systems have gotten more convenient to use, so the niche for untyped languages is smaller than it used to be.

3. Don't touch generics. Java's type system is long in the tooth. While its basic parametric types are fine, there are parts that are simply bad: raw types, wildcards, arrays, and primitive types. If you are developing a Java killer, improving the type system is one of the ways you can improve the language. You'd be crazy not to consider it.

4. Use type inference. Absolutely. This is a large source of boilerplate in Java.
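As an illustration (mine, not Sven's), compare the declaration you'd write in Java with an inferred one:
  // Java spells the type out on both sides of a declaration:
  //   Map<String, List<Integer>> scores = new HashMap<String, List<Integer>>();
  // With local type inference, a similar declaration repeats nothing:
  val scores = Map("alice" -> List(90, 85), "bob" -> List(72))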

5. Care about tool support (IDE). I agree. When I joined the Scala project in 2005, I was glad to see that the core team was working on a number of tools, including: scaladoc, the scala command (repl, script runner, and object runner), scalap, ant tasks, and the Eclipse plugin. Nowadays there are even more tools, including an excellent IntelliJ plugin and integration with a larger number of build tools.

In a nutshell, making programmers productive requires more than a good programming language. There are huge benefits to good tools and rich libraries. The overall productivity of a programmer is something like the product of language, tools, and libraries.

6. Closures. Yes, please. The main historical reason to leave them out is the lack of garbage collection. I don't understand why Java has been so slow to adopt them, and I was terribly saddened to hear Guy Steele at OOPSLA 1998 pronouncing that Java didn't look like it really needed closures. It was surreal given the content of the talk that he had given just minutes before.

7. Get rid of old unused concepts. Yes, in general. However, this can be hard to do while also maintaining compatibility and generally letting people write things in a Java way if they want. For the specific things Sven lists: totally agreed about fall-through switch; totally agreed about goto, but it's not in Java anyway; not so sure about bit operations. Bit operations are useful on the JVM, and besides, Java's numerics work reasonably already. Better to focus on areas where larger wins are possible.

Free linking on the web?

Lauren Weinstein has a great article up on the efforts of governments around the world to make Internet material disappear. One tactic for this is to go after search engines:
In Europe, one example of this is the so-called Spanish “right to be forgotten” -- currently taking the form of officials in Spain demanding that Google remove specific search results from their global listings that “offend” (one way or another) particular plaintiffs.

I agree with Weinstein's conclusion:
We are at the crossroads. Now is the time when we must decide if the Internet will continue its role as the most effective tool for freedom of information in human history, or if it will be adulterated into a mechanism for the suppression of knowledge, a means to subjugate populations with a degree of effectiveness that dictators and tyrants past could not even have imagined in their wildest dreams of domination.

The U.S. is in a position to affect that future. Currently, it is gradually inserting censorship backdoors into the Internet at the request of its music and film industries. It's not worth the cost. I freely admit that Hollywood is wonderful, but we should remember that Broadway is pretty cool, too. Unlike Hollywood, Broadway has business models that don't require an Internet overload.

Tuesday, May 10, 2011

Externally useful computer science results

John Regehr asks what results in computer science would be directly useful outside the field. I particularly like his description of his motivation:
An idea I wanted to explore is that a piece of research is useless precisely when it has no transitive bearing on any externally relevant open problem.

A corollary of this rule is that the likelihood of a research program ever being externally useful is exponentially decreased by the number of fundamental challenges to the approach. Whenever I hear about a project relying on synchronous RPC, my mental estimate of likely external usability goes down tremendously. As well, there is the familiar case of literary deconstruction.

Regehr proceeds from here to speculate on what results in computer science would truly be useful. I like most of Regehr's list--go read it! I would quibble about artificial intelligence being directly useful; it would be better to be more specific. Is Watson an AI? It's not all that much like human intelligence, so perhaps it's not really AI, but it is a real tour de force of externally useful computer science.

One thing not on the list is better productivity for software developers, including tools, programming languages, and operating systems. When software developers get more done, more quickly, more reliably, anything that includes a computer can be built more quickly and cheaply.

Saturday, April 30, 2011

Maintaining an intellectual nexus

It saddens me to read the comments about Scala Days on scala-lang.org. The only two comments are from people who won't be coming because of U.S. border inspections. It's not possible to tell how many people really avoid the conference because of this problem. What we know, however, is that potential attendees are weighing it in their decision.

Paul Graham has written a great essay describing the conventional wisdom about the intellectual nexus that is Silicon Valley. One aspect he emphasizes is immigration:
I doubt it would be possible to reproduce Silicon Valley in Japan, because one of Silicon Valley's most distinctive features is immigration. Half the people there speak with accents. And the Japanese don't like immigration. When they think about how to make a Japanese silicon valley, I suspect they unconsciously frame it as how to make one consisting only of Japanese people. This way of framing the question probably guarantees failure.

A silicon valley has to be a mecca for the smart and the ambitious, and you can't have a mecca if you don't let people into it.

I've written before that the effective workspaces I've seen all involve a mix of workers from all over the world. Making people work with only people who look like each other and have the same accents is much like only allowing marriages within one's own clan. You get better matchups if you let people date more widely.

The same is true for conferences. Ideas feed back on each other, with multiple sparks contributing to starting the fire. It often takes more than one good idea to make progress on something. Any of those ideas alone doesn't get you part of the progress. It gets you none. Concentrating people together is an essential ingredient to having a productive conference.

Unfortunately, border control, xenophobia, and general empowerment of the police have increased with every U.S. president since the fall of the Berlin Wall. You wouldn't know it from the news or from the talking heads, but if you do some brief web research you can verify it is true. The current president brought us airport gropes and is also responsible for the line, "Because of recent circumstances, the underwear was taken away from him as a precaution to ensure that he did not injure himself". I don't detect any trend in a positive direction.

I'm spitting into the wind here in the hopes that any change must start somewhere. Xenophobia is not just wrong, and it's not just unpleasant. It's making us dumber and poorer.

Monday, April 25, 2011

Types are fundamentally good?

Once in a while, I encounter a broad-based claim that it's fundamentally unsound to doubt the superiority of statically typed programming languages. Bob Harper has recently posted just such a claim:
While reviewing some of the comments on my post about parallelism and concurrency, I noticed that the great fallacy about dynamic and static languages continues to hold people in its thrall. So, in the same “everything you know is wrong” spirit, let me try to set this straight: a dynamic language is a straightjacketed static language that affords less rather than more expressiveness.

Much of the rest of the post then tries to establish that there is something fundamentally better about statically typed languages, so much so that it's not even important to look at empirical evidence.

Such a broad-based claim would be easier to swallow if it weren't for a couple of observations standing so strongly against it. First, many large projects have successfully been written without a static checker. An example would be the Emacs text editor. Any fundamental attack on typeless heathens must account for the fact that many of them are not just enjoying themselves, but successfully delivering on the software they are writing. The effect of static types, whether it be positive or negative, is clearly not overwhelming. It won't tank a project if you choose poorly.

A second observation is that in some programming domains it looks like it would be miserable to have to deal with a static type checker. Examples would be spreadsheet formulas and the boot scripts for a Unix system. For such domains, programmers want to write something quickly and then have the program do its job. A commonality among these languages is that live data is often immediately available, so programmers can just as well run the code on live data as fuss around with a static checker.

Armed with these broad observations from the practice of programming and the design of practical programming languages, it's easy to find problems with the fundamental arguments Harper makes:
  • "Types" are static, and "classes" are dynamic. As I've written before, run-time types are not just sensible, but widely used in the discussion of languages. C++ literally has "run-time types". JavaScript, a language with no static types, has a "typeof" operator whose return value is the run-time type. And so on. There are differences between run-time types and static types, but they're all still "types".
  • A dynamic language can be viewed as having a single static type for all values. While I agree with this, it's not a very useful point of view. In particular, I don't see what bearing this single "unitype" has on the regular old run-time types that dynamic languages support.
  • Static checkers have no real restrictions on expressiveness. This is far from the truth. There has been a steady stream of papers describing how to extend type checkers to check fairly straightforward code examples. In functional languages, one example is the GADTs needed to type check a simple AST interpreter. In object-oriented languages, the lowly flatten method on class List has posed difficulties, because it's an instance method on class List but it only applies if the list's contents are themselves lists (see the sketch after this list). More broadly, well-typed collection libraries have proven maddeningly complex, with each new solution bringing with it a new host of problems. All of these problems are laughably easy if you don't have to appease a static type checker.
  • Dynamic languages are slow. For some reason, performance always pops up when a computer person argues a position that they think is obvious. In this case, most practitioners would agree that dynamic languages are slower, but there are many cases where the performance is perfectly fine. For example, so long as Emacs can respond to a key press before I press the next key, who cares if it took 10 microseconds or 100 microseconds? For most practical software, there's a threshold beyond which further performance is completely pointless.
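Here is the kind of contortion flatten requires, sketched against a simplified list wrapper rather than the real standard library. The implicit evidence parameter is what lets flatten exist as an instance method on every list while only compiling when the elements are themselves lists:
  class MyList[A](val elems: List[A]) {
    // The evidence ev witnesses that A is itself a list type, and it also
    // serves as the function A => List[B] that flatMap needs.
    def flatten[B](implicit ev: A <:< List[B]): List[B] =
      elems.flatMap(ev)
  }

  new MyList(List(List(1, 2), List(3))).flatten   // List(1, 2, 3)
  // new MyList(List(1, 2, 3)).flatten            // rejected by the checker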

Overall, type checkers are wonderful tools. However, any meaningful discussion about just how they are wonderful, and to what extent, needs to do a couple of things. First, it needs some qualifiers about the nature of the programming task. Second, it needs to rest on at least a little bit of empirical data.