Wednesday, March 30, 2011

A computer science journal with open access

Since writing that economics has an online open-access journal, I've been informed that computer science has at least one: Logical Methods in Computer Science. Mea culpa.

Perhaps the ACM can follow their lead.

Tuesday, March 29, 2011

When stateful code is better

One branch of functional-programming enthusiasts has long striven to eliminate state from programming. By doing so, you end up with program code that supports equational reasoning: if you know a = b in one part of the code, then you can freely replace a with b, and b with a, anywhere else in the code. Since there's no state, the program will still behave the same. It's good stuff.
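
Here's a tiny Scala sketch of what that buys you, and what state takes away; the values are made up purely for illustration:
// Without state, a name and the expression that defines it are interchangeable.
val a = List(1, 2, 3).sum   // a = 6
val b = 6
// Anywhere `a` appears, `b` (or the original expression) may replace it.

// With state, the same substitution is no longer safe.
var counter = 0
def next(): Int = { counter += 1; counter }
val x = next()              // x = 1
// Replacing a later use of `x` with `next()` changes the program's behavior.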

Nonetheless, state is essential for good code in most languages. You don't want to live without it. Take a moment to consider the more practical of the functional programming languages, and see how programmers in those languages have voted with their feet. ML, the most popular typed, strict functional language, includes ref cells in the core language. Haskell, the most popular typed, lazy functional language, includes not just the state monad but also unsafePerformIO. Lisp and Scheme, the most popular untyped functional languages, shamelessly include state everywhere. It's a clean sweep. Functional programmers are using languages that have state.

Why is this? Let me describe a couple of programming problems that any practical language needs to be able to solve. Both of these problems are easy with state and hair-pulling without. Any language without state will give programmers a tough time with these problems, so such languages don't become popular. The two problems are: logging, and the model-view architecture.

With logging, what you'd like to do is write your program as normal and then insert log messages here and there in the code. As much as possible, you want to avoid disturbing the core logic of the program with the logging behavior. Solving this problem with state is so easy it's hard to even talk about: all you do is write to some shared piece of state, typically a file on the file system. Every time you want to log something, append the message to that state.
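
For concreteness, here is a minimal Scala sketch of the stateful approach; the log file name and the do-something helpers are hypothetical:
import java.io.{FileWriter, PrintWriter}

// One shared, mutable log for the whole program.
object Log {
  private val out = new PrintWriter(new FileWriter("app.log", true), true)
  def info(message: String): Unit = out.println(message)
}

object App {
  def doFirstThing(): Unit = ()   // hypothetical application steps
  def doSecondThing(): Unit = ()

  // The core logic is undisturbed; each log call is a one-line side effect.
  def doTwoThings(): Unit = {
    Log.info("about to do first thing")
    doFirstThing()
    Log.info("about to do second thing")
    doSecondThing()
  }
}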

What if you want to log without state? In that case, you have to pass around the log as a parameter to every function in the program that might want to log anything. Every function gets an extra parameter, which is the state of the log before the function call. Such functions must also return that extra parameter, suitably updated, when they finish. This approach has two large problems. First, it requires pervasive changes throughout the code base to pass around the latest version of the log. Second, it's highly error-prone. For example, the following function logs two messages but accidentally discards the first one:
def doTwoThings(log: Log): Log = {
  val log1 = write(log, "about to do first thing")
  doFirstThing()
  // Bug: this writes against `log` rather than `log1`, so the first message is silently lost.
  val log2 = write(log, "about to do second thing")
  doSecondThing()
  log2
}
This kind of problem can probably be prevented with linear types, and many functional language researchers would observe this and go do a bunch of research on linear types. Until they come up with something, your best bet is to use state.

A second example is event handling in the model-view architecture that is so pervasive in practical code. In the model-view architecture, you write the program in two layers: one layer for the core model of the software and one layer for the view. Views have a pointer to their models, and whenever the model changes they update themselves. This way, the model code stands alone and can be analyzed and unit tested without needing a user interface. The view, meanwhile, focuses on user interfaces, and can be tested on its own if you stub out the model. It's a fine architecture, well worth its popularity. Here's the challenge for stateless programming: how do you update the model in response to an event?

In a stateful language, what programmers can do is mutate the model in place. Every pointer from the view to the model will still be valid, so the view doesn't need any changes to its structure. Again, it's so simple it is hard to even talk about.
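
Here is a minimal Scala sketch of the stateful version; the Model and View classes are hypothetical stand-ins for real application and UI code:
import scala.collection.mutable.ListBuffer

class Model {
  private var count = 0
  private val listeners = ListBuffer[() => Unit]()

  def value: Int = count
  def onChange(listener: () => Unit): Unit = listeners += listener

  // Handle an event by mutating the model in place, then telling views to refresh.
  def increment(): Unit = {
    count += 1
    listeners.foreach(listener => listener())
  }
}

class View(model: Model) {
  model.onChange(() => render())
  def render(): Unit = println(s"count is now ${model.value}")
}
The view's pointer to the model never changes, so handling an event is just model.increment(); every view registered on that model repaints itself.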

Now consider a stateless language. In a stateless language, you must not only update the model, but must also update any view object that refers to any part of the model that changed. Likewise, you have to update any view object that has a reference to any such view object, transitively. There's no theoretical bar to programming like this. However, your event-handling code ends up taking the view as an argument, just so that it can update all the pointers from the view to the updated model. This approach is tedious and error-prone in the same way as stateless logging. It's very easy to leave some parts of the view pointing to old parts of the model. If you do that, things will mostly work, but there will be stale data in parts of the view.
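
A sketch of the stateless counterpart, with the same hypothetical names; handling the event must return both a new model and a rebuilt view:
final case class CounterModel(count: Int)
final case class CounterView(model: CounterModel, caption: String)

def handleIncrement(model: CounterModel, view: CounterView): (CounterModel, CounterView) = {
  val newModel = CounterModel(model.count + 1)
  // Every view object that refers to the model must be rebuilt as well; miss one,
  // and that part of the view silently keeps displaying the old model.
  val newView = view.copy(model = newModel, caption = s"count is ${newModel.count}")
  (newModel, newView)
}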

In general, stateless code is usually better. However, I can't escape believing that for logging and for the model-view architecture, it's the stateful version that is best. These problems share the aspect that there are references from one component (the main application, the view) to a separate component (the log file, the model), and the second component is undergoing change. By letting the reference be stateful, the two components can work with each other at arm's length. Contrary to its usual reputation, state in such cases is a help rather than a hindrance for building useful abstractions.

Monday, March 28, 2011

Economics now has an online journal

Economics joins the ranks of those fields with an online academic journal. All papers are free for download. They've even posted all of the old papers, going back to 1970. Feel free to click through and browse around. There's no registration or charge.

The next best thing to open access is preprint archives, the most prominent of which is arXiv (pronounced "archive"). ArXiv is infrastructure for uploading papers that are usually also submitted to a journal or conference. I first heard of preprint archives as used among physics researchers. Physics is a natural enough field to kick this off, considering that physicists built the World Wide Web to host their papers. Physicists publish in journals that have long review and publication delays, on the order of 6-12 months, and they seem to have realized that a 6-12 month ping time is not good for a group conversation.

I applaud BPEA for going open access, and I wish computer science proceedings would do the same thing. Currently, all the American conferences are published through the ACM Digital Library, which offers a dizzying array of subscription plans designed to maximize profit. The current model in computer science is that CS research is IP for the ACM to sell for profit, much like a CD or a DVD. I would prefer a model where CS research is meant to advance science. Paywalls have a damping effect on discussion, and science without discussion isn't really science at all.

Thursday, March 24, 2011

Certificate authority compromised

Wired reports:
In a fresh blow to the fundamental integrity of the internet, a hacker last week obtained legitimate web certificates that would have allowed him to impersonate some of the top sites on the internet, including the login pages used by Google, Microsoft and Yahoo e-mail customers.

As a rule of thumb, a system that requires the entire world to cooperate and do things right is unlikely to work very well. This is particularly true for security software, where the very point of the software is to defend against those that misbehave.

The good news is that TLS certificates aren't that effective anyway, so the breach didn't cause much harm. The harm is more like someone slipping through a gauzy curtain than someone breaking into a bank vault. Few people even notice whether they are connected via http or https, and TLS only helps for https connections. As well, if you connect to bankammerica.com instead of bankamerica.com, certificates won't save you. Further, what exactly can a certificate authority certify even if everything checks out? Pretty much all it can do is verify that you are connecting to the owner of the given DNS address. That doesn't mean that bankamerica.com is really the web site of the Bank of America you are trying to contact.

TLS certificates are a case of following a beautiful theory that mismatches reality. The theory is that people gain trust in a web site by having a lot of third-party certificates attesting to that web site's authenticity. The more reputable the sites, the better.

To see that this is an odd theory, consider how it is that we believe a person we are talking to is who we think they are. It's almost never because we checked their ID and are savvy enough to know whether it's a fake ID or not. A more plausible source of trust is that we recognize that we're talking to the same person we talked to yesterday. Another more likely way is that we were introduced to the person by someone else that we trust, so we tentatively start talking to the new person based on that contact.

There are web analogies for both of these processes. If we visit the same site two days in a row, our browsers could tell us so via an improved bookmarking system such as the Pet Names toolbar. If one site links to another, then we gain confidence in the second site in proportion to the confidence we already have in the first. That's hyperlinking, and it could be improved by a system like YURLs.

Neither mechanism, however, is getting much attention. The action is all in certificate chains. For some reason, engineers have fixated on an approach where truth descends down a hierarchy and where end users are able to study and act on these delivered truths. Web protocols would be better, it seems to me, if they relied on more realistic models of identification that mirror what we do in our social lives.

Wednesday, March 23, 2011

Prior permission for indexing books?

Timothy Lee writes that the agreements backing Google Books are undergoing renegotiation. He argues that Google should seek a fundamental legal principle rather than negotiating a contract via class-action law.
Fair use exists as a kind of safety valve for the copyright system, to ensure that it does not damage free speech, innovation, and other values. Although formally speaking judges are supposed to run through the famous four factor test to determine what counts as a fair use, in practice an important factor is whether the judge perceives the defendant as having acted in good faith. Google has now spent three years looking for a way to build its Book Search project using something other than fair use, and come up empty.

I like this approach better myself. It's better to have simple, common-sense rules of engagement than a thousand-page contract that nobody has read in its entirety. For books, part of the common-sense rules would be that indexing is allowed, and that abandonware is largely fair game, at least until the owner shows up again.

By contrast, the current approach has Google negotiating a contract that will bind all authors. That seems a little weird given that the authors aren't all actually present. It doesn't seem like a good fit for contract negotiation. It's one of those rare beasts that is a good fit for our legislative bodies to sort out.

Robin Hanson against IRBs

Robin Hanson makes the case against Institutional Review Boards for research on human subjects:
IRBs seem a good example of concern signaling leading to over-reaction and over-regulation. It might make sense to have extra regulations on certain kinds of interactions, such as giving people diseases on purpose or having them torture others. But it makes little sense to have extra regulation on researchers just because they are researchers. That mainly gets in the way of innovation, of which we already have too little.

I agree with Robin. Mistreatment of fellow humans should certainly be stopped. However, why should academic researchers have to go before a board any time they want to interact with humans, just because they are researchers?

For the majority of legal responses that our society makes, the approach we take is that people act first and then, if there is wrongdoing, the legal system follows up. For example, you don't get interviewed before you buy a gallon of gas. You get interviewed after a house burned down with your car parked outside of it. You don't go before a board before you grade a stack of papers. You go before a board after it is rumored that you told people what other people's grades were. Prior review is stifling.

People who defend IRBs probably assume that they will apply a large dose of common sense about what is dangerous and what is not. For example, surely the IRBs for an area like computer science will simply green light research all day long. In practice, it seems they look for work to do to justify their budgets. Witness the treatment of "exempt" research, where IRBs that have the manpower to do so tend to require review even of "exempt" research projects.

I can only speculate why such a useless and harmful institution persists, but a big part of my guess is that Robin's signalling explanation is correct. If you are the president of a university, could you ever take a stand against IRBs? Such a stand would have the appearance of signalling that you are soft on protection of humans. I wish that people would pay less attention to signals and more attention to results. Pay less attention to how many institutes, regulations, and vice presidencies have been created, and pay more attention to exactly how a university is treating the people it draws data from.

Monday, March 21, 2011

Defining neutral web search

I like South Bend Seven's depiction of that process:
"Let's adjust the learning parameter to .00125 and the momentum factor to .022." "Sure thing, but we better run that by Legal first." There's a recipe for success.

Perhaps in 40-50 years, there will be a stable version of the World Wide Web that it makes sense to clamp down on with regulation, including search neutrality. Right now, however, the web is changing far too fast.

Looking backward, imagine what kind of standards would have been written for web search back when AltaVista was the best. What are the odds that a government effort would have taken a reasonable approach to ranking pages according to linkage patterns? How about deciding what counts as a keyword? Is Zeitgeist one word or two? How about the "did you mean" results? If such an attempt had succeeded, then today we wouldn't even know about these innovations. They would have been squashed by the regulation before anyone could try them.

Looking forward, imagine all the ways the web might change in the coming decade. What if more of the web moves into social spaces with severe privacy needs, such as Facebook and Orkut? What if more of the content uses rich media, as physical magazines do, and is harder to describe using plain text? What if things go the other way, and the web's information becomes a scattering of little sentences that are glued together on the fly based on a particular user's settings?

The search neutrality project can only impoverish our Internet. I'm not clear on why the American government has jurisdiction over the "World" Wide Web, but to the extent they do, I hope they do the right thing and just go take a nap.

Wednesday, March 16, 2011

Battling over the top-level domains

Brad Templeton has some good comments up on top-level domains in the DNS.
Their heart is in the right place, because Verisign’s monopoly on “.com” — which has become the de facto only space where everybody wants a domain name, if they can get it — was a terrible mistake that needs to be corrected. We need to do something about this, but the plan of letting other companies get generic TLDs which are ordinary English words, with domains like “.sport” and “.music” (as well as .ibm and .microsoft) is a great mistake.

There is one option Brad doesn't mention: do away with TLDs. This would have two advantages. First, it would remove needless drag from the current system. Everyone agrees that TLDs are nearly useless and that practically everyone goes for .com anyway. Web browsers even add a .com for you automatically if you leave it off. Why bother adding it at all? Second, it would remove security problems that happen due to confusion about a top-level domain, e.g. mixing up Amazon.fr and Amazon.rf.

More ambitiously, it would be nice to move away from registering English-language words at all. Instead, use IP addresses as the globally unique address. To get English-language names for web sites, use something that is not globally unique, such as pet names. I wish I could point to a concrete implementation of such a system to rely on, but I believe a good system could be designed.
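
To make the idea a little more concrete, here is a toy Scala sketch of a pet-name registry; the names, addresses, and API are all hypothetical:
import scala.collection.mutable

// A globally unique address; an IP address serves as the stand-in here.
final case class Address(ip: String)

// Each user's browser keeps its own private, non-unique mapping from
// friendly names to globally unique addresses.
class PetNames {
  private val names = mutable.Map[String, Address]()
  def assign(petName: String, addr: Address): Unit = names(petName) = addr
  def resolve(petName: String): Option[Address] = names.get(petName)
}

// Usage: the mapping belongs to the user, not to any global registry.
// val mine = new PetNames
// mine.assign("my bank", Address("192.0.2.1"))   // placeholder address
// mine.resolve("my bank")                        // Some(Address("192.0.2.1"))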

If such a system sounds weird, ponder for a moment just how much you trust DNS names anyway. If you want to go to Bank of America's web site, which is more reliable: typing out bankofamerica.com letter-for-letter perfectly, or doing a web search on "Bank of America" and using the top hit? As this example shows, DNS as it stands is not a particularly good solution for naming sites with English-language words. It's merely a tolerable system that sort of works and has become a de facto standard at this point.

Sunday, March 6, 2011

Trial by jury in the U.S.

Pretty sad stuff from the WikiLeaks case:
Pfc. Bradley E. Manning, the Army intelligence analyst accused of leaking government files to WikiLeaks, will be stripped of his clothing every night as a “precautionary measure” to prevent him from injuring himself, an official at the Marine brig at Quantico, Va., said on Friday. Private Manning will also be required to stand outside his cell naked during a morning inspection, after which his clothing will be returned to him, said a Marine spokesman, First Lt. Brian Villiard.
Imagine how he'll be treated if he is actually convicted of anything.

I don't believe that U.S. oversight over most any aspect of the Internet will make things better. I don't expect them to support content carriers in the best of times, and certainly not ones like WikiLeaks that post material embarrassing to the U.S.

Tuesday, March 1, 2011

Avoiding big-bang updates

A big-bang update is a system update where the entire system must be shut down, multiple components are upgraded in parallel, and then the whole system is turned back on. If it goes well, then BANG, the whole system is upgraded in one big swoop. Even when it goes well, you lose a day or two of work from everyone working on that system, which is already pretty bad, but at least it's only a day or two. It often doesn't go well, however. Often there is some dependency that snuck through early testing, and the one or two days stretch into three, four, or five as engineers scramble to patch up the lurking problems.

With a little foresight, big-bang updates can usually be avoided. The basic requirement is that system components can be updated one by one, without needing to update multiple components at the same time. That, in turn, implies that the bulk of the components need to tolerate both the current and the upcoming versions of all other components. Solve the multi-version dependency problem, and you can avoid big-bang updates.

Depending on multiple versions can be tricky. A common situation that arises, for any kind of system, is a desire to change a name that is shared between two different components. If the components are modules of source code, then the name might be a class name that is exported from one module and imported by others. If the components are hooked up using HTTP, then the names might be URLs. The temptation is to change the name, but that leads to a big-bang update. Every component that uses the name has to be updated at the same time. To avoid the big bang, try to find a way to support both the old name and the new name during a transition period. Then all components can be updated one by one to support both names. Once everything has been updated, it is possible to gradually drop support for the old name. It takes longer but it avoids a big bang.
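
As a small illustration, here is one way to keep an old class name alive during the transition in Scala; the class names and version number are hypothetical:
// The new name that callers should migrate to.
class AccountService {
  def balance(accountId: String): BigDecimal = BigDecimal(0) // placeholder body
}

// The old name, kept as a thin alias so callers can move over one at a time.
@deprecated("use AccountService instead", "2.4")
class CustomerService extends AccountService
The same idea applies to URLs: serve the new path alongside the old one, and retire the old path only after every client has switched.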

An instance of the problem in programming languages is that module interfaces often tempt programmers into big-bang updates. For example, experienced Java developers frequently exhort fellow developers not to change any interface once it's published. If you do change such an interface, then all users and all implementers of the interface have to update simultaneously. Bang. With Java as it stands, a better approach is to define a new interface with the desired changes and use dynamic typing to decide whether to use the old or the new interface in any given circumstance. Alternatively, of course, a component-friendly programming language could support interface evolution directly in the language.
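
Here is a sketch of that approach, with hypothetical interface names; the published interface is left untouched, and callers check at run time which version they were handed:
// The original published interface stays exactly as it was.
trait Frobber {
  def frob(x: Int): Int
}

// The desired change goes into a new interface instead of editing the old one.
trait Frobber2 extends Frobber {
  def frobAll(xs: List[Int]): List[Int]
}

// A run-time check decides which version to use.
def frobEverything(f: Frobber, xs: List[Int]): List[Int] = f match {
  case f2: Frobber2 => f2.frobAll(xs)
  case _            => xs.map(f.frob)  // fall back to the old interface
}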