Wednesday, December 22, 2010

That's not how it works

James Robertson shares this depressing quote from the FCC:
"A commercial arrangement between a broadband provider and a third party to directly or indirectly favor some traffic over other traffic in the connection to a subscriber of the broadband provider (i.e., 'pay for priority') would raise significant cause for concern," the Commission then elaborates. This is because "pay for priority would represent a significant departure from historical and current practice."

Follow the link for analysis.

Let me focus just on this part. The FCC, here, joins the ranks of those who think the Internet is a star topology. The apparent model is that there's an Internet, and then everyone plugs their computer into the Internet. When one user routes a packet to another user, it takes two hops: one to the center node, and one to the other user. Everything that happens within this mythical center node is abstracted away.

As an aside, the FCC also presents a view of the Internet where a handful of providers are sending broadcasts to the masses. Individuals don't contract for Internet services. They are "consumers", and they "subscribe" to the feeds. Leave that aside for now.

The Internet is not a star topology, but a general network. When you send a packet to someone else, it usually takes a dozen or two hops to get to them. How fast it gets to them depends enormously on the intermediate nodes that are taken along the way. I used to play around with traceroute and watch just what routes the packets take under various circumstances. I saw some particularly striking examples when I worked on a Department of Defense bulletin board and watched how packets route between a university network and a DoD machine. Let's just say the routes favored security over latency. They'd go a LONG way in order to go through carefully controlled choke points.

Because the Internet works this way, people who provide Internet services work hard to make sure their servers are well connected with respect to their users. For example, if you want to provide service to British folks, then you really want to get a server up on the island. It wasn't so long ago that all major ftp sites had clones in the UK. Sending data across the English Channel, much less the Atlantic Ocean, was just horrendously slow. When you install an extra server in the UK, you must pay for it.

Relocating a server is just one option. It's also possible to lease network connections between where your server is and where you want the IP traffic to route to. When you do that, you will have to pay whomever you are leasing the bandwidth from.

In short, if you want better connectivity, you have to pay for it. The more you pay, the better the connectivity you get. What the FCC calls a disturbing development is a hair's breadth away from how things already work. The distinction seems to ride on whether you pay a broadband provider or some other entity. I fail to see what big difference that makes.

Let's try a few thought experiments and compare them to the star-topology model. Suppose Netflix pays Comcast to let them install some servers in the same building as a major Comcast hub. Is anything wrong with that? I don't see why. They'll get better bandwidth, but they're paying for all the expenses. Similarly, suppose Netflix, on their own dime, installs new network fiber from their data center to a major Comcast hub. Is there anything wrong with that? Again, I don't see it. After Netflix lays that network, would there be anything wrong with Comcast plugging into it and routing traffic to and from it? Again, I can't see how it would help users for them to decline.

Where the FCC seems to draw the line is when you go past barter and use more fungible resources. What if, instead of Netflix installing new network fiber itself, it pays Comcast to do it? And what if, instead of Comcast laying new fiber for each customer, they split the cost across customers, giving more access to those who pay more? From the FCC's view, this goes from totally normal to something they've never seen in the past. From my view, this is how things already work. You pay more to get more bandwidth.

I wish the FCC would just abandon trying to regulate Internet service. I want a neutral network, but I don't see how the FCC is going to do anything but hurt. I want the Internet we have, not something like broadcast TV, cable, wired telephony, or cellular telephony. I don't think it is a coincidence that the Internet is both less regulated and far more neutral than these other networks.

Friday, December 17, 2010

Every paper and book on our laptops?

Dick Lipton speculates on that question:
Today there are applications like Citeseer that contain about one million papers. The total storage for this is beyond the ability of most of us to store on our laptops. But this should change in the near future. The issue is that the number of papers will continue to grow, but will unlikely grow as fast as memory increases. If this is the case then an implication is that in the future we could have all the technical papers from an area on our own devices. Just as realtime spelling is useful, realtime access to technical papers seems to be a potentially exciting development.[...]
Right now there are too many books, even restricted to a subfield like mathematics, to have all of them stored on a single cheap device. But this large—huge—amount of memory could easily become a small one in the future.

I agree for Citeseer, and I agree for the local library. Very soon, if not already, we will have laptops that can hold the entirety of Citeseer and the entirety of the local library's collection of books. I was impressed when I looked at the file sizes for Project Gutenberg. Shakespearean plays take a few tens of kilobytes, and the largest archive they supply is a dual-sided DVD with over 29,000 books. I still remember the shock when I looked at a directory listing on their web site and the file sizes looked so small I thought the software must be broken.
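The arithmetic behind that DVD figure is easy to sanity-check. A rough estimate, using round numbers rather than exact Project Gutenberg figures:

```python
# Back-of-envelope check: a dual-layer DVD holds roughly 8.5 GB, so
# 29,000 plain-text books leave a budget of a few hundred KB apiece,
# comfortably above the few tens of KB a compressed play needs.
# Capacities here are approximate, not exact Gutenberg numbers.
dvd_bytes = 8.5e9       # approximate dual-layer DVD capacity
books = 29_000
per_book = dvd_bytes / books
print(f"average budget per book: {per_book / 1e3:.0f} KB")
# prints: average budget per book: 293 KB
```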

As an aside, I wish I could say that the Association for Computing Machinery thought this way. Their current thinking is that they'll have an online digital library that they charge a toll on. If they really wanted to help science, they'd mail you a pre-indexed thumb drive you could load onto your laptop and have all papers up until that date. I would bet that someone in physics works this out long before the ACM does. Who knows, though.

All this said, papers and books are backward looking. Nowadays, papers and books are developed as electronic content and then, only at delivery time, printed onto paper. An increasing amount of interesting material is simply never printed at all. Want a copy of the Scala Language Specification? It's essentially a book, but you won't find it at the local library. Over time, the printed word is becoming a niche application. You only need it for reading something in depth, or for physically handing it to someone. For the former, print on demand works more and more often, and for the latter, the occasions are becoming rarer. As well, electronic ink just keeps getting better.

From the perspective of interesting words, as opposed to printed papers and books, it will take longer before personal computers can hold all the, ahem, material that is out there. It includes not just papers and books written by mathematicians, but also forum messages, blog posts, and even Facebook and Twitter messages written by all manner of people. Perhaps our machines already have enough storage even for that, but it's certainly a lot more data than just Citeseer and the library.

Of course, most people are only interested in a tiny fraction of all that information. Perhaps Dick Lipton really only cares about math papers from famous mathematicians. If the precise data interesting to someone can be identified, then the storage requirements for keeping a personal copy are much more reasonable, and in fact we probably are already there. However, identifying that subset of the data is, in general, entirely non-trivial.

Wednesday, December 15, 2010

Checked in debugging and profiling?

Engineers of all stripes insert extra debugging probes into what they're building. The exterior surface of an artifact is often silent about what goes on inside, so the extra probes give engineers important extra information. They can use that information to test theories more quickly about what the artifact is doing. They use these tests to improve performance, debug faulty behavior, or gain extra assurance that the artifact is behaving correctly.

Software engineering is both the same and different in this regard. Software certainly benefits from extra internal probes, and for all of the standard reasons. A common way to insert such probes is to insert debug or trace messages that get logged to a file. If the messages have a timestamp on them, then they help in profiling for performance. For fine-grained profiling, log messages can be so slow as to affect the timing. In such a case, the timing data might be gathered internally using cheap arithmetic operations, and then summary information emitted at the end. This is all the same, I would imagine, in most any form of engineering.
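As a sketch of that fine-grained approach, here is one way such an internal probe might look in Python. The names and structure are hypothetical, not taken from any particular project; the point is that the hot path does only cheap arithmetic, and the summary is emitted once at the end:

```python
import time

class Probe:
    """Accumulates timings with cheap arithmetic; no I/O in the hot path."""

    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.total = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        # Two cheap operations per event, instead of a log write.
        self.calls += 1
        self.total += time.perf_counter() - self._start
        return False

    def summary(self):
        # Emitted once at the end, where the cost no longer matters.
        avg = self.total / self.calls if self.calls else 0.0
        return f"{self.name}: {self.calls} calls, avg {avg * 1e6:.1f} us"

parse_probe = Probe("parse")
for _ in range(1000):
    with parse_probe:
        sum(range(100))  # stand-in for the code being profiled
print(parse_probe.summary())
```

A timestamped log line per event would serve the same diagnostic purpose, but for events this short the logging itself would dominate the measurement.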

What's different with software is that it's soft. Software can be changed very easily, and then changed back again. If you're building something physical, then you can change the spec very easily, but building a new physical device to play with involves a non-trivial construction step. With software, the spec is the artifact. Change the spec and you've changed the artifact.

As such, a large number of debugging and profiling probes are not worth checking into version control and saving. Several times now I've watched a software engineer work on a performance problem, develop profiling probes as they do so, and then check in those probes when they are finished. The odd thing I've noticed, however, is that usually that same programmer started by disabling or deleting the stuff checked in by the previous programmer. Why is that? Why keep checking in code that the next guy usually deletes anyway?

This question quantifies over software projects, so it's rather hard to come up with a robust reason why it happens. Let me hazard two possible explanations.

One explanation is that it's simply very easy to develop new probes that do exactly what is desired. When the next engineer considers their small build-or-buy decision--build new probes, or buy the old ones--the expense of rebuilding is so small that the buy option is not very tempting.

The other explanation is that it's a problem of premature generalization. If you build a profiling system based on the one profiling problem you are facing right now, you are unlikely to support the next profiling problem that comes up. This is only a partial explanation, though. I see new profiling systems built all the time when the old one could have been made to work. Usually it's just easier to build a new one.

Whatever the reason, I am currently not a fan of checking lots of debugging and profiling statements into version control. Keep the checked-in code lean and direct. Add extra probes when you need them, but take them back out before you commit. Instead of committing the probes to the code base, try to get your technique into your group's knowledge base. Write it up in English and then post it to a wiki or a forum. If you are fortunate enough to have a Wave server, post it there.

Tuesday, December 14, 2010

Typing arithmetic in Datalog

Unlike in imperative languages, arithmetic in Datalog can execute in different orders. If you write a formula z=x+y in an imperative language, then the meaning is that you compute x, you compute y, and then you add them together to produce z. If you want to understand the typing of such an expression, you first find the types of x and y, and then look at the numeric conversion rules to see what the type of z must be. If x is a 32-bit integer and y is a 64-bit floating-point number, then z will be a 64-bit floating-point number.

In Datalog, there are additional options for using the same formula. One option is to do as in an imperative language: compute x and y, and then use the arithmetic formula to compute z. However, you could just as well compute x and z and then subtract to find y. In total there are four different ways to use the formula: you could use it to compute x, y, or z, or if all three are already computed, you could add them up and verify that they are in a proper relation with each other. How can we reason about the typing implications of an arithmetic formula matching?
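To make the four binding patterns concrete, here is a sketch in Python of how an engine might dispatch on which variables of z=x+y are already bound, with None standing in for "unbound". The function name and representation are hypothetical:

```python
def use_sum(x, y, z):
    """Use the formula z = x + y in whichever direction the bindings allow.

    Returns the fully bound (x, y, z) triple, or None if the verification
    of fully bound inputs fails.
    """
    known = [v is not None for v in (x, y, z)]
    if known == [True, True, False]:
        return x, y, x + y                      # compute z
    if known == [True, False, True]:
        return x, z - x, z                      # compute y
    if known == [False, True, True]:
        return z - y, y, z                      # compute x
    if all(known):
        return (x, y, z) if x + y == z else None  # verify the relation
    raise ValueError("formula is unusable with this binding pattern")
```

With fewer than two variables bound, the formula cannot be used at all, which foreshadows the compile-time error discussed below for underdetermined variables.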

Call such a constraint an arithmetic type constraint. Such a constraint must treat x, y, and z symmetrically, so think of them as being in a set. That is, an arithmetic type constraint is defined by a set of variables. If the variables are x, y, and z, then the arithmetic constraint for them could be written arith({x,y,z}).

It appears to work well to treat an arithmetic constraint as equivalent to the following set of individual constraints:

  // constraints for arith({x,y,z}) are:
  x <= lub(y, z)
  y <= lub(x, z)
  z <= lub(x, y)
In these constraints, "<=" means subtype-of, and "lub" means least upper bound. The lub of two types is the smallest type that is a supertype of both of them. For numeric types, humor me and think of subtyping as including valid numeric upcasts, so int32 is a subtype of float64. This post is already too long to get into subtyping versus numeric conversion.

Try these equations on a few examples, and it appears to work well. For example, if x and y are already known to be int32, then these constraints imply that z is no larger than lub(int32,int32), which is just int32. As another example, if x is an int32 but y and z are completely unknown, then these constraints give no new bound to anything. As a final example, if all three variables already have a known type, but one of them is a larger type than the other two, then the third one can be shrunk to the lub of the other two.
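As a concrete sketch of this propagation, here is a small fixpoint loop over a toy chain-shaped lattice. A real numeric subtype lattice is richer than a chain, and all names here are illustrative:

```python
# Toy numeric lattice: a simple chain, ordered by subtyping.
CHAIN = ["int8", "int16", "int32", "int64", "float64"]
RANK = {t: i for i, t in enumerate(CHAIN)}

def lub(*types):
    """Least upper bound; None means 'unknown', which imposes no bound."""
    if any(t is None for t in types):
        return None
    return CHAIN[max(RANK[t] for t in types)]

def meet(a, b):
    """Greatest lower bound on the chain; used to tighten a known type."""
    if a is None:
        return b
    if b is None:
        return a
    return CHAIN[min(RANK[a], RANK[b])]

def arith(env, vars):
    """Apply v <= lub(others) for each variable, to a fixpoint."""
    changed = True
    while changed:
        changed = False
        for v in vars:
            others = [env[w] for w in vars if w != v]
            tightened = meet(env[v], lub(*others))
            if tightened != env[v]:
                env[v] = tightened
                changed = True
    return env
```

Running it on the examples above behaves as described: two int32s pin the third variable to int32, a single known type propagates nothing, and an over-large third type shrinks to the lub of the other two.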

In general, these constraints will only allow the types of three variables to be in one of the following four combinations:

  • All three types are completely unknown.
  • One type is known but the other two are unknown.
  • All three types are the same.
  • Two of the types are the same, and the third type is a subtype of the other two.
The first two cases will eventually lead to a compile-time error, because the programmer has introduced variables but not provided a way to reliably bind those variables to values. The third option is a normal, happy case, where there are no numeric conversions being applied. The fourth option is most interesting, in my view, because there is an operation between two types and the result is the lub of those two types. Notably absent from the list is any combination where all three types are different.

As one final wrinkle, it's arguably more programmer friendly to always perform arithmetic on at least 32-bit if not 64-bit values. Otherwise, when a programmer writes 127+1, they might get 8-bit arithmetic and thus an overflow. To make arithmetic work that way, it's possible to add a minimum type to an arithmetic constraint. For example, arith({x,y,z},int64) would be an arithmetic constraint over x, y, and z, and a minimum result type of 64-bit integers. This would be equivalent to the following combination of constraints:

  // constraints for arith({x,y,z}, int64) are:
  x <= lub(y, z, int64)
  y <= lub(x, z, int64)
  z <= lub(x, y, int64)
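A self-contained sketch of this variant, again over a toy chain lattice with illustrative names; the minimum type simply joins the lub on the right-hand side of each bound:

```python
CHAIN = ["int8", "int16", "int32", "int64", "float64"]
RANK = {t: i for i, t in enumerate(CHAIN)}

def lub(*ts):
    # None means 'unknown': no bound can be derived.
    return None if None in ts else CHAIN[max(RANK[t] for t in ts)]

def arith_min(env, vars, floor):
    """One propagation pass of v <= lub(others..., floor) for each v."""
    for v in vars:
        others = [env[w] for w in vars if w != v]
        bound = lub(*others, floor)
        if bound is not None and (env[v] is None or RANK[bound] < RANK[env[v]]):
            env[v] = bound
    return env
```

With floor int64, two int32 inputs now bound the result at int64 rather than int32, so the 127+1 example above gets at least 64-bit arithmetic.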

Thursday, December 9, 2010

Published literature as fencing?

"Any discourse will have to be peer-reviewed in the same manner as our paper was, and go through a vetting process so that all discussion is properly moderated," wrote Felisa Wolfe-Simon of the NASA Astrobiology Institute. "The items you are presenting do not represent the proper way to engage in a scientific discourse and we will not respond in this manner."
Felisa Wolfe-Simon is responding here to attacks on a paper she recently published. This is a widely held view, that science takes place on some higher plane of discourse. In this view, ordinary speech is not enough to move the discussion forward. You must go through the publication process just in order to state your counter-argument. Science then progresses by an exchange of one high-minded paper after another.

Hogwash. This romantic picture has no relation to science in the fields I am familiar with.

A killer mismatch between this picture and reality is that counter-arguments are not publishable. If someone publishes the results of a horribly botched experiment, it would serve science to dissect that experiment and show the problem. However, there aren't any peer-reviewed journals to publish it in. If you take the quoted stance seriously, then you must believe it's not proper to criticize published research at all.

A second mismatch is that, in the fields I am familiar with, nobody in the field learns a major new result through the publication process. When someone has a new idea, they talk to their colleagues about it. They speak at seminars and workshops. They write messages to mailing lists about it. They recruit students to work on it, and students post all over the place. Everyone knows what everyone is working on and the way they are doing it. Everyone knows the new ideas long before they have any evidence for them, and they learn about the new pieces of evidence pretty quickly as well.

Researchers debate all right, but not via publication. They post to mailing lists. They write to each other. They give public talks attacking each other's ideas. Others in the field do all of the same, and they are often more convincing for being less invested in the conclusions.

In short, declining to participate in discussions outside the publication process is often presented as some sort of high ground. This is a backwards and dangerous notion. It means that you are not defending your ideas in the forums that convince the experts.