Wednesday, July 29, 2015

Finding and fixing complicated code

It's hard to beat lines of code as a complexity metric: the subjective complexity of a piece of code seems to correlate very well with its length. Over the years, many other metrics have been devised and explored, but they tend to be more complicated yet less effective than the simple old lines-of-code.

One exception to the rule is McCabe Cyclomatic Complexity, a metric almost exactly as old as I am. McCabe can often find methods that aren't that many lines long, but that a typical maintainer would nonetheless identify as complicated and difficult to maintain. Small wonder that GNU Complexity puts so much emphasis on it.

What McCabe does is count the number of independent paths that are needed to cover a method from beginning to end. The way McCabe described it, in his delightfully readable paper on the subject, is that it's the number of test cases you need to get statement coverage for a given method. It sounds like something that would be impractically hard to calculate--you are minimizing over arbitrary test cases!--but it's not complicated at all. He proved that, with a few extra assumptions, all you really have to do is count the number of branch points in a given body of code.
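McCabe's shortcut can be sketched in a few lines of Python. The set of node types counted as branch points below is an assumption; real tools differ on exactly which constructs they count:

```python
import ast

# Node types treated as branch points -- an assumption; real tools
# differ on exactly which constructs contribute to the count.
BRANCH_NODES = (ast.If, ast.While, ast.For, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source):
    """Approximate McCabe complexity as 1 + number of branch points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))
```

A straight-line method scores 1, and each `if` or loop adds one to the count.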

McCabe complexity is very effective in my experience. A method with complexity 10 is usually a dense, gnarly tangle of code, even if it's only 20 lines long. Yet 20 lines is not enough, on its own, to reliably flag a method as complicated; if it's a simple linear sequence of non-branching code, it's usually reasonable to leave it alone. McCabe complexity is therefore a real coup.

Even better, there's usually a straightforward way to act on code that has a poor McCabe complexity measure. If you see a method with a lot of branches in it, you can usually make the code more maintainable by factoring out some of the clusters of branches to their own smaller methods. The main method gets easier to understand, and if you do it tastefully, the smaller factored-out methods will also be easier to understand.
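As an illustration of that refactoring (the function names here are hypothetical), pulling a cluster of branches into its own helper leaves the main method reading as a straight line:

```python
# Hypothetical before/after: the branchy surcharge logic moves into
# its own small helper, lowering the main method's McCabe score.
def surcharge(express, international):
    extra = 0
    if express:
        extra += 10
    if international:
        extra += 20
    return extra

def shipping_cost(weight, express, international):
    # Complexity 1 on its own; the branches now live in surcharge().
    return weight * 2 + surcharge(express, international)
```

Done tastefully, both the main method and the extracted helper end up easier to read than the original tangle.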

One place McCabe complexity falls down, though, is large switch statements. If you write network code that receives a message and dispatches it, it's perfectly fine style to write a switch statement for the dispatch with one case for each kind of message. This is especially true if each case simply calls a method that does the real work. Such a switch statement will easily have a dozen cases, one for each message type, but what alternative would be any more maintainable? Factoring out methods for subsets of the cases doesn't seem obviously better, and reducing the number of cases is often impossible.

McCabe is even worse for switch statements that consider pairs of types that are each drawn from some enum. To see some typical examples, look at Cast.scala in the Apache Spark source code. All the big switch statements in there are enumerating pairs of types, and they strike me as a very reasonable way to manage a problem that is just fundamentally complicated. This organizational style reminds me of how people organize definitions on paper when they are studying and explaining language semantics. Such people will typically define their relations and functions using a dozen or so cases, each of which has a number of assumptions (x is a class type, y is a class type, and x inherits from y), and a conclusion you can draw if all of the assumptions hold (x is a subtype of y). If this style is reasonable for theory papers, where people are expending enormous time and energy to bring a complicated subject within the reach of the human mind, it seems only reasonable to write such definitions in code when possible.

Overall, McCabe complexity is excellent and has stood the test of time. I wonder, though, if "number of branching statements" might be even more useful for anyone trying to simplify their code. "Number of branching statements" is the same as McCabe for if statements and while loops, but for switch statements, it only adds 1. Such a measure seems like it would better focus on code that is not merely complicated, but also possible to simplify.

Wednesday, March 18, 2015

Every build is special

Since Semmle's business involves integrating with our customers' builds, I'm very sympathetic to Tim Boudreau's perspective on simple plain-jane builds that just work:
But with time and enough consulting customers, you grow a real appreciation for projects you can simply check out from version control and build.

It's always a relief when someone wants to, say, build a stock iPhone app using xcodebuild. It's often fine when they want to build a C program using msbuild. Third place is Maven, for a simple Java program; I say third place, because hardly anyone uses it without loading it with a bunch of plugins that fundamentally change how it works. But still, at least the overall project structure is somewhat standardized.

In my modest experience, I only really see the above for relatively small and simple projects. As "Scott" puts it in the first comment at the above link:

The idea that your build is "not as special as you think" is more often false than it is true.

I just skimmed a code base I have handy, and here are some of the things I see in there that break the mold of any off-the-shelf build tool:

  • It uses parser generators, such as ANTLR.
  • It's mainly a Java code base, but there is a little bit of code in all of: C, Objective C, JavaScript, Scala, C#, and Python.
  • It uses a home-grown stack trace obfuscator and de-obfuscator that involves bytecode rewrites on the deployed jars.
  • It includes its own version of zipmerge, with options not available in the open-source version.
  • It uses the JIT support in mini-Lua to generate machine code that is then compiled into string literals in a C program.
  • It includes a proprietary query language and numerous queries in that language. Those queries get compiled during the build, using a query engine that is also compiled during the same build.

Wednesday, January 14, 2015

Surveillance states are possible

While clamping down on private encryption is bad policy, both for the economy and for privacy, I don't think it's technically impossible to implement. Let me draw a couple of comparisons to show why.

As background, here is Cory Doctorow explaining, like many other commenters, that the Internet is too wild and woolly for major governments to possibly implement widespread surveillance:

For David Cameron's proposal to work, he will need to stop Britons from installing software that comes from software creators who are out of his jurisdiction. The very best in secure communications are already free/open source projects, maintained by thousands of independent programmers around the world. They are widely available, and thanks to things like cryptographic signing, it is possible to download these packages from any server in the world (not just big ones like Github) and verify, with a very high degree of confidence, that the software you've downloaded hasn't been tampered with.

With cellular phones, any phone that uses the relevant chunks of bandwidth is legally required to use certain protocols that are registered with the government. This has been bad economically, in that the telephone network has developed much more slowly than the relatively unregulated Internet. However, being bad economically has never exactly stopped rules from being put into place.

Yes, you can rig up a wireless network in your garage that breaks the rules. However, as soon as you try to use it over a wide geographic region, you're going to be relatively easy to catch. You will have to either broadcast a strong signal, or make use of the existing telephone backbone, or both.

To draw another comparison, consider the income tax. Income tax is easy to avoid with small operations, because you can just pay cash under the table. However, larger operations have to file a variety of paperwork, and the interlocking paperwork is what will get you. The more you take part in the above-ground economy, the harder it is to spin a big enough web of lies to get out of your taxes.

To get back to Internet protocols, it will certainly always be possible to break the rules on an isolated darknet you assemble in your garage. However, as soon as you send packets across the Internet backbone, any use of unregistered protocols is going to be very easy to detect.

To rub the point in further, don't forget that the authorities have no requirement to go after everyone who they detect doing something fishy. If they are anything like the American tax service, they'll randomly (or politically....) select people to target, and those people will then be required to undergo an audit at their own expense. If they survive the audit, the tax service just says "I'm sorry" and moves on to the next victim. Because of selective enforcement, law enforcement has no requirement to go after everyone using illegal encryption.

Of course all this is bad for the economy and for humanity's development at large. Don't oppose a cryptography clampdown because it's technically impossible, or you will look just as silly as the people that say DNS takedowns are technically impossible. Rather, oppose a cryptography clampdown because we don't want to live like that. We want to have private communications, and we want to allow innovation on the Internet. It's brand new, and if we clamp down on it, it will ossify in its current state the same way that the telephone network did.

Friday, December 5, 2014

Goodbye, Spark Plug

Don't get me wrong. I know a dog is just a dog, and a pet is just a pet. There are people reading this who have cancer, and there are some who have outlived their human children. On the scale of life challenges, I've just had maybe a 3/10.

Still, I would like to write a few words. It's a way to organize my thoughts, and a way to say goodbye. I promise the next post will be about programming or law or identity or the web, but that all seems rather dry to me today.

As all you pet owners know, you get a little Pavlovian jolt each time you help out your little ward and they reward you for it. For example, when they greet you at the door and run in circles. Or, when they learn your gait well enough to jump on each leg in time and then jump to the other one before you pick that foot up. When they're cold, and you blanket them up, and they let out a deep sigh of contentment. When there's a burr in their foot, and they plaintively hold it out so that someone with thumbs can do something about it.

Over time it really adds up. You become attuned to where you expect them to be when you walk into a room. You gain a second sense about which ruffles in a couch blanket have a dog under them. You expect if you plink through a few measures of a waltz, that you'll see two brown eyes peek around the corner to see what you're doing. After 18 years of that and then nothing, you are left with a lot of little voids that add up to one great big void.

Some animals go and hide when they become deathly sick, but this one did not. In his last hours he came to me to fix it. Dog or no, it was crushing to see such hope and confusion, yet so little ability to do anything about it.

To anyone out there facing this kind of question, let me tell you that I feel no unease at all about the decision to eschew blood samples, fluid IVs, antibiotics, and whatever else I didn't even ask about, to try to give him a little more time. I keep thinking: he was 18, kidneys don't get better, and he had multiple other problems anyway. Indeed, what I really wish I could go back in time on is delaying euthanasia for too long. I even had my mind made up, and I went to a 24-hour vet to do it, but I backed down when confronted with a thorough vet who wanted to rehash the dog's entire medical history. I thought I could just take him to our regular vet the next day, but the sun never rose for him again. Yes, hindsight is 20/20, but I wish I had insisted.

Goodbye, Spark Plug. I hope we did well for you.

P.S. -- Mom, you are very optimistic to think we can get this plant to bloom every December. We'll give it a try!

Sunday, November 23, 2014

Is this the right server?

It's nice to see someone else reach the following conclusion:

"For those familiar with SSH, you should realize that public key pinning is nearly identical to SSH's StrictHostKeyChecking option. SSH had it right the entire time, and the rest of the world is beginning to realize the virtues of directly identifying a host or service by its public key."

Verifying a TLS certificate via the CA hierarchy is better than nothing, but it's not really all that reassuring. Approximately, what it tells you is that there is a chain of certification leading back to one or more root authorities, which, for reasons we all try not to think about too much, are granted ultimate authority over the legitimacy of web sites. I say "approximately" because fancier TLS verifiers can and do incorporate additional information.

The root authorities are too numerous to really have faith in, and they have been compromised in the past. In general, they and their delegates have little incentive to be careful about what they are certifying, because the entities they certify are also their source of income.

You can get better reliability in key verification if you use information that is based on the interactions of the actual participants, rather than on any form of third-party security databases. Let me describe three examples of that.

Pin the key

For many applications, a remotely installed application needs to communicate only with a handful of servers back at a central site you control. In such a case, it works well to pin the public keys of those servers.

The page advocates embedding the public key directly in the application. This is an extremely reliable way of obtaining the correct key. You can embed the key in the app's binary as part of your build system, and then ship the whole bundle over the web, the app store, or however else you are transmitting it to the platform it will run on. Given such a high level of reliability, there is little benefit from pulling in the CA hierarchy.

As linked above, you can implement pinning today. It appears to be tricky manual work, though, rather than something that is built into the framework. As well, you don't get to ignore the CA hierarchy by doing this sort of thing. So long as you use standard SSL libraries, you still have to make sure that your key validates in the standard ways required by SSL.
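The comparison at the heart of pinning is small enough to sketch. This is a sketch under stated assumptions: the key bytes are stand-ins, and a real application would bake in the SHA-256 of its server's actual public key at build time:

```python
import hashlib

# Stand-in for the server public key you would embed at build time.
SERVER_PUBLIC_KEY = b"stand-in public key bytes"

# The pin shipped inside the application binary: a hash of the key.
PINNED_FINGERPRINT = hashlib.sha256(SERVER_PUBLIC_KEY).hexdigest()

def key_matches_pin(presented_key):
    """Compare the key presented during the TLS handshake to the pin."""
    return hashlib.sha256(presented_key).hexdigest() == PINNED_FINGERPRINT
```

The fiddly part in practice is not this comparison but wiring it into your TLS library's verification callback.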

Associate keys with links

The Y property deserves wider recognition, given how important hyperlinks are in today's world. Put simply, if someone gives you a hyperlink, and you follow that hyperlink, you want to reliably arrive at the same destination that the sender wanted you to get to. That is not what today's URLs give you.

The key to achieving this property is that whenever you transmit a URL, you also transmit a hash of the expected host key. There are many ways to do this, including the ones described at the above hyperlink (assuming you see the same site I am looking at as I write this!). Just to give a very simple example, it could be as simple as using URLs of the following form:
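One plausible form, offered here purely as an assumption (the `key-hash` name is hypothetical), puts the hash in the URL's fragment:

```python
import hashlib
from typing import Optional
from urllib.parse import urlparse

def make_yurl(url, host_key):
    # Carry the expected host-key hash in the fragment; software that
    # doesn't recognize the fragment simply ignores it.
    return url + "#key-hash=sha256:" + hashlib.sha256(host_key).hexdigest()

def expected_key_hash(yurl):
    # Returns the declared hash, or None for a plain URL.
    frag = urlparse(yurl).fragment
    if frag.startswith("key-hash="):
        return frag[len("key-hash="):]
    return None
```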

This particular example is interesting for being backward compatible with software that doesn't know what the hash means.

I don't fully know why this problem is left languishing. Part of it is probably that people are resting too easy on the bad assumption that the CA hierarchy has us covered. There's a funny mental bias where if we know nothing about a subject, and we see smart people working on it, the more optimistic of us just assume that it works well. Another part of the answer is that the core protocols of the world-wide web are implemented in many disparate code bases; SSH benefited from having an authoritative implementation of both the client and the server, especially in its early days.

As things stand, you can implement "YURLs" for your own software, but they won't work as desired in standard web browsers. Even with custom software, they will only work among organizations that use the same YURL scheme. This approach looks workable to me, but it requires growing the protocols and adopting them in the major browsers.

Repeat visits

One last source of useful information is the user's own previous interactions with a given site. Whenever you visit a site, it's worth caching the key for future reference. If you visit the "same" site again but the key has changed, then you should be extremely suspicious. Either the previous site was wrong, or the new one is. You don't know which one is which, but you know something is wrong.
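A trust-on-first-use cache of this kind takes very little code. In this sketch the fingerprints live in an in-memory dict, where a real client would persist them the way SSH persists known_hosts:

```python
import hashlib

class KeyMemory:
    """Trust-on-first-use: remember each host's key fingerprint and
    flag any later change, much like SSH's known_hosts file."""

    def __init__(self):
        self.known = {}  # host -> hex fingerprint

    def check(self, host, key):
        fingerprint = hashlib.sha256(key).hexdigest()
        if host not in self.known:
            self.known[host] = fingerprint
            return "new"       # never seen before: ask the user
        if self.known[host] == fingerprint:
            return "match"     # same key as previous visits
        return "mismatch"      # key changed: something is wrong
```

Note that on a mismatch you only know that one of the two keys is wrong, not which one.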

Think how nice it would be if you tried to log into your bank account and the browser said, "This is a site you've never seen before. Proceed?"

You can get that already if you use pet names, which have been implemented as an experimental browser extension. It would be great if web browsers incorporated functionality like this, for example turning the URL bar and browser frame yellow if they see that a site is new, judging by its certificate. Each browser can add this sort of functionality independently, as a matter of quality of implementation.

In your own software, you can implement key memory using the same techniques as for key pinning, as described above.

Key rotation

Any real cryptography system needs to deal with key revocation and with upgrading to new keys. I have intentionally left those topics out to keep the discussion simple, but I do believe they can be worked into the above systems. It's important to have a way to sign an official certificate upgrade, so that browsers can correlate new certificates with old ones during a graceful phase-in period. It's also important to have some kind of channel for revoking a certificate, in case one has been compromised.

For web applications and for mobile phone applications, you can implement key rotation by forcing the application to upgrade itself. Include the new keys in the newly upgraded version.

Thursday, November 20, 2014

FCC inches away from neutrality

The FCC’s latest proposal for network neutrality rules creates space for broadband carriers to offer “paid prioritization” services.[11] While the sale of such prioritization has been characterized as a stark and simple sorting into “fast” and “slow” traffic lanes,[12] the offering is somewhat more subtle: a paid prioritization service allows broadband carriers to charge content providers for priority when allocating the network’s shared resources, including the potentially scarce bandwidth over the last-mile connection between the Internet and an individual broadband subscriber. Such allocation has historically been determined by detached—or “neutral”—algorithms. The Commission’s newly proposed rules, however, would allow carriers to subject this allocation to a content provider’s ability and willingness to pay.

That's from a piece in the Stanford Law Review a few months ago. I think this evolution in the FCC's approach will benefit the public.

It seems important to consider realistic developments of the Internet. Here's a thought experiment I've used for a long time, and that seems to be happening in practice. Try to imagine what goes wrong if a site like YouTube or Netflix pays--with its own money--to install some extra network infrastructure in your neighborhood, but only allows its own packets to go across that infrastructure. Doing so is a flagrant violation of network neutrality, because packets from one site will get to you faster than packets from another site. Yet, I can't see the harm. It seems like a helpful development, and just the sort of thing that might get squashed by an overly idealistic commitment to neutrality.

As a follow-on question, what changes if instead of Netflix building the infrastructure itself, it pays Comcast to do it? It's the same from a consumer's view as before, only now the companies in question are probably saving money. Thus, it's even better for the general public, yet it's an even more flagrant violation of network neutrality. In this scenario, Netflix is straight-up paying for better access.

It seems that the FCC now agrees with that general reasoning. They not only support content delivery networks in general, but now they are going to allow generic ISPs to provide their own prioritized access to sites that pay a higher price for it.

I believe "neutrality" is not the best precise goal to go for. Rather, it's better to think about a more general notion of anti-trust.

Tuesday, October 14, 2014

Three tiers of classrooms

Via Arnold Kling, I see Jesse Rothstein trying to prove that you can't measure teaching ability, or perhaps even that teaching ability doesn't matter:
Like all quasi-experiments, this one relies on an assumption that the treatment – here, teacher switching – is as good as random. I find that it is not: Teacher switching is correlated with changes in students’ prior-year scores.

It's important to figure out which kind of classroom we are talking about. There are at least three tiers of classroom styles. If you measure only in the middle tier, then I can believe that teacher skill would have only a small effect. However, it's really easy to tell the difference between the tiers if you look, especially for the bottom-most tier compared to the other ones.

At the bottom tier, some classes are just zoos. The teacher is ignored, and the students talk to each other. At best, they work on material for another class. Teacher skill doesn't matter within this tier, from an academic perspective; one zoo teaches students just as much as another zoo. I am sad to say that classrooms like this do exist. It's a potential bright note that such teachers are very easy to identify in an objective way; their students have absolutely terrible results on standardized tests such as Advanced Placement (AP). There's no need for sophisticated statistics if all the students are scoring 1-2 out of 5.

At the middle tier, some classes involve the teacher walking the students through standardized textbooks and other material. Basically, the textbooks are software and the teachers are the hardware that runs it. It's not an inspiring kind of classroom, but at least it is inexpensive. Within this tier, I could see teacher skill not mattering much, because the students spend all their time glued to the course materials. However, you'd certainly like to find out who is in this tier versus in the zoo tier.

Worth a brief mention is that there's an upper tier as well. Maybe "style" is a better word in this case. Sometimes the teacher actually understands the course material, and so is able to respond to the questions with anecdotes and exercises that are tailored for that particular student. For this tier, teacher evaluation is especially important. Among other things, some teachers are fooling themselves, and would be better off staying closer to the book.