Monday, October 12, 2015

Initial input on the Scala collections redesign

It looks like Martin Odersky is considering an overhaul of Scala's collections:
A redesign of the standard library is on the roadmap for one of the next Scala versions (could be as early as 2.13). So I think now is the time to start thinking about what we want to change, in particular in what concerns collections!

Some more details are available on the Dotty issue tracker. Paul Phillips has weighed in on this issue with a very interesting slide deck.

I think it will be a very positive change if done carefully. It's painful to change such a fundamental library, but the current state is a real pain point for practical Scala code. If no other contender can get full support, I would even go so far as to suggest going back to Matthias Zenger's original version and working forward from there more conservatively this time. It was really a gem of useful functionality, in particular the "persistent" collections, which were controversial back then. Since that initial version, there have been several waves of changes that added complexity while only supporting minority use cases: lazily evaluated views, concurrent collections, and the CanBuildFrom magic. These are all valuable, but they belong in a separate library that you opt into when you need it.

I cannot put together a complete straw man implementation given my professional duties right now. I can make a few high-level suggestions, though, based on my experience in developer tools at Google and LogicBlox. Mind you, these are just initial reactions. I may well be overlooking important implementation details. Also, I haven't had the pleasure of using Scala for the new "big data" growth area, so I may be missing concerns that don't arise in the kinds of code I am used to.

At a high level, I'd want to see the following in an updated collection library:

  • A focus on just the core collection types: maps, sets, linked lists, and some form of array-like sequences (maybe Array itself). Other collection types do come up in practice, but in a minority of contexts, and for those cases it should be fine to use a concrete collection type that does not necessarily inherit from any standard-library collection traits.
  • Simple old-fashioned classes and methods, without any implicit parameters or type members. The status quo is bad in numerous ways, including problems with all of the following: IDE auto-completion, single-step debugging, and compiler error messages.

Non-goals include the following:

  • User extension of the library with new collection types. This is a small enough use case that it doesn't merit giving up on any other design goals of the library, e.g. performance.
  • Supporting the majority of the collection types that are available now. I wrote a 1-2 page description of each of them for the latest Programming in Scala, and I found it quite the slog. The library has simply grown very large over time. It would be better if the less common collection types were more obviously in a side library. They're all useful sometimes, but the grand total is just overwhelming.
  • Infinite collection types, in particular the lazy Stream type.

Some other things are desirable but might be unreasonable:

  • High performance in common use cases involving a for() loop. This is a complicated problem, but it is a persistent problem area that drives people away from writing idiomatic Scala code. If nothing else, perhaps for() should translate to iterators instead of to higher-order functions; a sketch of the difference follows this list. Another approach would be to have the compiler recognize and optimize standard types like List, Set, and Map; this will work better if those types are sealed rather than user extensible.
  • Give the three versions of Set and Map distinct names, for example AbstractSet, HashSet, and Set. There are numerous reasons for this; for example, automatic IDE import doesn't work as well when multiple classes share the same name. I put this in the "maybe" list, though, because it would require a lot of code to be rewritten. Even taking care to phase the change in over a period of months, it's a lot of code that would need to be turned over.
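
To make the for() concern concrete, here is a rough sketch of the two translation strategies, using a made-up snippet rather than anything from an actual proposal:

    // Today, a for() loop without yield desugars to a higher-order call,
    // so every iteration goes through a closure:
    val xs = List(1, 2, 3)
    for (x <- xs) println(x)          // becomes: xs.foreach(x => println(x))

    // An iterator-based translation would instead produce a plain while
    // loop, which tends to be friendlier to the JIT:
    val it = xs.iterator
    while (it.hasNext) {
      val x = it.next()
      println(x)
    }

Whether the iterator version actually wins would of course need benchmarking on real workloads; the point is only that the translation strategy is something a redesign could revisit.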

On a minor note, the Seq type seems pretty useless. I know that Java has an interface for this, but maybe it's time to take a stand. For well-written code, it doesn't help anything to know that a type is a Seq and not just an Iterable. So delete the useless intermediate trait.

Wednesday, August 5, 2015

Ken Clark on shame mobs

Ken Clark has posted the top seven things he likes about shame mobs. Here's a taste:

5) Internet shame mobs weigh the evidence carefully and deliberately before attacking, so they only happen to people who deserve them. [...] 3) Internet shame mobs always make sure that the punishment is proportional to the crime.

There's a larger phenomenon here where problematic information spreads faster than the correction to it. If it spreads fast enough, then it can even pass a tipping point where it becomes very hard to get a hearing to object to any part of it. Everyone has already heard the idea from, well, everyone else, so they quickly dismiss anyone who objects without even really considering it.

The key to stopping such memetic chain reactions is to apply some filtering before propagating information that you read. It's still early days for the Internet, though, and we are all still learning to inoculate ourselves against being the wrong kind of carrier.

There is some reason to have hope. Chain emails used to flourish, but are now mostly stamped out. In their heyday, 15-20 years ago, it was fairly common to open your mail program and see numerous messages saying something very exciting, and furthermore that the message should be passed on to everyone you know as fast as possible. Nowadays, the people I interact with just delete any such emails. If an email explicitly says that it should be forwarded to everyone you know, then it triggers something like an antibody response: the email starts looking very bogus, and it gets deleted quickly and possibly even flagged for follow-up by the email provider.

Intriguingly, people would likely not have developed that response had they not gone through the misery of dealing with chain emails earlier on. There are clear parallels to viruses and to building up antibodies!

Shame mobs are one place where this kind of propagation still goes on, though. I'm not entirely sure why. In part, people just want to defend an idea, and they are happy to use real people as an example no matter the consequences. In part, people just enjoy being part of a mob. I hope that shame mobs go the same way as the chain email. We shall see.

Wednesday, July 29, 2015

Finding and fixing complicated code

It's hard to beat lines of code as a complexity metric: the subjective complexity of a piece of code seems to correlate very well with its length. Over the years, many other metrics have been devised and explored, but they tend to be more complicated yet less effective than the simple old lines-of-code.

One exception to the rule is McCabe Cyclomatic Complexity, a metric almost exactly as old as I am. McCabe can often find methods that aren't that many lines long, but that a typical maintainer would nonetheless identify as complicated and difficult to maintain. Small wonder that GNU Complexity puts so much emphasis on it.

What McCabe does is count how many independent paths are needed to cover a method from beginning to end. The way McCabe described it, in his delightfully readable paper on the subject, is that it's the number of test cases you need to get statement coverage for a given method. It sounds like something that would be impractically hard to calculate--you are minimizing over arbitrary test cases!--but it's not complicated at all. He proved that, with a few extra assumptions, all you really have to do is count the number of branching options in a given body of code.
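
To show how simple the count really is, here is a sketch over a toy statement type; the AST is made up purely for illustration, and a real tool would of course walk the compiler's own trees:

    // Toy AST with just enough structure to count decision points.
    sealed trait Stmt
    case class If(thenBranch: List[Stmt], elseBranch: List[Stmt]) extends Stmt
    case class While(body: List[Stmt]) extends Stmt
    case class Switch(cases: List[List[Stmt]]) extends Stmt
    case class Simple(text: String) extends Stmt

    // Count the branching options in a body of code.
    def decisions(stmts: List[Stmt]): Int = stmts.map {
      case If(t, e)   => 1 + decisions(t) + decisions(e)
      case While(b)   => 1 + decisions(b)
      case Switch(cs) => cs.size + cs.map(decisions).sum  // each case is a branch
      case Simple(_)  => 0
    }.sum

    // McCabe complexity of a method body: decision points plus one.
    def mccabe(body: List[Stmt]): Int = decisions(body) + 1

Counting Switch as 1 instead of cs.size would give the "number of branching statements" variant discussed at the end of this post.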

McCabe complexity is very effective in my experience. A method with complexity 10 is usually a dense, gnarly tangle of code, even if it's only 20 lines long. Yet 20 lines is not, by itself, enough to reliably flag code as complicated: if it's a simple linear sequence of non-branching code, it's usually reasonable to leave it alone. McCabe complexity is therefore a real coup.

Even better, there's usually a straightforward way to act on code that has a poor McCabe complexity measure. If you see a method with a lot of branches in it, you can usually make the code more maintainable by factoring out some of the clusters of branches to their own smaller methods. The main method gets easier to understand, and if you do it tastefully, the smaller factored-out methods will also be easier to understand.

One place McCabe complexity falls down, though, is large switch statements. If you write network code that receives a message and dispatches it, it's perfectly fine style to write a switch statement for the dispatch with one case for each kind of message. This is especially true if each case simply calls a method that does the real work. Such a switch statement will easily have a dozen cases, one for each message type, but what alternative would be any more maintainable? Factoring out methods for subsets of the cases doesn't seem obviously better, and reducing the number of cases is often impossible.
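
As a hypothetical illustration in Scala terms (the message types here are invented), such a dispatch looks like the following, and splitting it up would not make it any clearer:

    sealed trait Message
    case class Connect(peer: String) extends Message
    case class Disconnect(peer: String) extends Message
    case class Data(peer: String, payload: Array[Byte]) extends Message
    case class Ping(peer: String) extends Message

    // One case per message type, each delegating to a method that does
    // the real work. McCabe counts every case, so this scores high even
    // though it reads perfectly clearly.
    def handle(msg: Message): Unit = msg match {
      case Connect(p)     => handleConnect(p)
      case Disconnect(p)  => handleDisconnect(p)
      case Data(p, bytes) => handleData(p, bytes)
      case Ping(p)        => handlePing(p)
    }

    def handleConnect(peer: String): Unit = ???
    def handleDisconnect(peer: String): Unit = ???
    def handleData(peer: String, payload: Array[Byte]): Unit = ???
    def handlePing(peer: String): Unit = ???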

McCabe is even worse for switch statements that consider pairs of types that are each drawn from some enum. To see some typical examples, look at Cast.scala in the Apache Spark source code. All the big switch statements in there are enumerating pairs of types, and they strike me as a very reasonable way to manage a problem that is just fundamentally complicated. This organizational style reminds me of how people organize definitions on paper when they are studying and explaining language semantics. Such people will typically define their relations and functions using a dozen or so cases, each of which has a number of assumptions (x is a class type, y is a class type, and x inherits from y), and a conclusion you can draw if all of the assumptions hold (x is a subtype of y). If this style is reasonable for theory papers, where people are expending enormous time and energy to bring a complicated subject within the reach of the human mind, it seems only reasonable to write such definitions in code when possible.
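
Here is a tiny sketch of that style, with a made-up and drastically simplified set of casting rules rather than Spark's actual ones:

    sealed trait Ty
    case object IntTy extends Ty
    case object LongTy extends Ty
    case object DoubleTy extends Ty
    case object StringTy extends Ty

    // Each case reads like a rule from a semantics paper: if the
    // assumptions about the pair of types hold, the conclusion follows.
    def canCast(from: Ty, to: Ty): Boolean = (from, to) match {
      case (a, b) if a == b   => true   // any type casts to itself
      case (IntTy, LongTy)    => true   // widening numeric casts
      case (IntTy, DoubleTy)  => true
      case (LongTy, DoubleTy) => true
      case (_, StringTy)      => true   // anything can be rendered as a string
      case _                  => false
    }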

Overall, McCabe complexity is excellent and has stood the test of time. I wonder, though, if "number of branching statements" might be even more useful for anyone trying to simplify their code. "Number of branching statements" is the same as McCabe for if statements and while loops, but for switch statements, it only adds 1. Such a measure seems like it would better focus on code that is not merely complicated, but also possible to simplify.

Wednesday, March 18, 2015

Every build is special

Since Semmle's business involves integrating with our customers' builds, I'm very sympathetic to Tim Boudreau's perspective on simple plain-jane builds that just work:
But with time and enough consulting customers, you grow a real appreciation for projects you can simply check out from version control and build.

It's always a relief when someone wants to, say, build a stock iPhone app using xbuild. It's often fine when they want to build a C program using msbuild. Third place is Maven, for a simple Java program; I say third place, because hardly anyone uses it without loading it with a bunch of plugins that fundamentally change how it works. But still, at least the overall project structure is somewhat standardized.

In my modest experience, I only really see the above for relatively small and simple projects. As "Scott" puts it in the first comment at the above link:

The idea that your build is "not as special as you think" is more often false than it is true.

I just skimmed a code base I have handy, and here are some of the things I see in there that break the mold of any off-the-shelf build tool:

  • It uses parser generators, such as ANTLR.
  • It's mainly a Java code base, but there is a little bit of code in all of: C, Objective C, JavaScript, Scala, C#, and Python.
  • It uses a home-grown stack trace obfuscator and de-obfuscator that involves bytecode rewrites on the deployed jars.
  • It includes its own version of zipmerge, with options not available in the open-source version.
  • It uses the JIT support in mini-Lua to generate machine code that is then compiled into string literals in a C program.
  • It includes a proprietary query language and numerous queries in that language. Those queries get compiled during the build, using a query engine that is also compiled during the same build.

Wednesday, January 14, 2015

Surveillance states are possible

While clamping down on private encryption is bad policy, both for the economy and for privacy, I don't think it's technically impossible to implement. Let me draw a couple of comparisons to show why.

As background, here is Cory Doctorow explaining, like many other commenters, that the Internet is too wild and wooly for major governments to possibly implement widespread surveillance:

For David Cameron's proposal to work, he will need to stop Britons from installing software that comes from software creators who are out of his jurisdiction. The very best in secure communications are already free/open source projects, maintained by thousands of independent programmers around the world. They are widely available, and thanks to things like cryptographic signing, it is possible to download these packages from any server in the world (not just big ones like Github) and verify, with a very high degree of confidence, that the software you've downloaded hasn't been tampered with.

With cellular phones, any phone that uses the relevant chunks of bandwidth is legally required to use certain protocols that are registered with the government. This has been bad economically, in that the telephone network has developed much more slowly than the relatively unregulated Internet. However, being bad economically has never exactly stopped rules from being put into place.

Yes, you can rig up a wireless network in your garage that breaks the rules. However, as soon as you try to use it over a wide geographic region, you're going to be relatively easy to catch. You will have to either broadcast a strong signal, or make use of the existing telephone backbone, or both.

To draw another comparison, consider the income tax. Income tax is easy to avoid with small operations, because you can just pay cash under the table. However, larger operations have to file a variety of paperwork, and the interlocking paperwork is what will get you. The more you take part in the above-ground economy, the harder it is to spin a big enough web of lies to get out of your taxes.

To get back to Internet protocols, it will certainly always be possible to break the rules on an isolated darknet you assemble in your garage. However, as soon as you send packets across the Internet backbone, any use of unregistered protocols is going to be very easy to detect.

To rub the point in further, don't forget that the authorities have no requirement to go after everyone who they detect doing something fishy. If they are anything like the American tax service, they'll randomly (or politically....) select people to target, and those people will then be required to undergo an audit at their own expense. If they survive the audit, the tax service just says "I'm sorry" and moves on to the next victim. Because of selective enforcement, law enforcement has no requirement to go after everyone using illegal encryption.

Of course all this is bad for the economy and for humanity's development at large. Don't oppose a cryptography clampdown because it's technically impossible, or you will look just as silly as the people that say DNS takedowns are technically impossible. Rather, oppose a cryptography clampdown because we don't want to live like that. We want to have private communications, and we want to allow innovation on the Internet. It's brand new, and if we clamp down on it, it will ossify in its current state the same way that the telephone network did.

Friday, December 5, 2014

Goodbye, Spark Plug

Don't take me wrong. I know a dog is just a dog, and a pet is just a pet. There are people reading this who have cancer, and there are some who have outlived their human children. On the scale of life challenges, I've just had maybe a 3/10.

Still, I would like to write a few words. It's a way to organize my thoughts, and a way to say goodbye. I promise the next post will be about programming or law or identity or the web, but that all seems rather dry to me today.

As all you pet owners know, you get a little Pavlovian jolt each time you help out your little ward and they reward you for it. For example, when they greet you at the door and run in circles. Or, when they learn your gait well enough to jump on each leg in time and then jump to the other one before you pick that foot up. When they're cold, and you blanket them up, and they let out a deep sigh of contentment. When there's a burr in their foot, and they plaintively hold it out so that someone with thumbs can do something about it.

Over time it really adds up. You become attuned to where you expect them to be when you walk into a room. You gain a second sense about which ruffles in a couch blanket have a dog under them. You expect if you plink through a few measures of a waltz, that you'll see two brown eyes peek around the corner to see what you're doing. After 18 years of that and then nothing, you are left with a lot of little voids that add up to one great big void.

Some animals go and hide when they become deathly sick, but this one did not. In his last hours he came to me to fix it. Dog or no, it was crushing to see such hope and confusion, yet so little ability to do anything about it.

To anyone out there facing this kind of question, let me tell you that I feel no unease at all about the decision to eschew blood samples, fluid IVs, antibiotics, and I didn't even ask what else to try and give him a little more time. I keep thinking, he was 18, and kidneys don't get better, and he had multiple other problems, anyway. Indeed, what I really wish I could go back in time on is delaying euthanasia for too long. I even had my mind made up, and I went to a 24-hour vet to do it, but I backed down when confronted with a thorough vet that wanted to rehash the dog's entire medical history. I thought I could just take him to our regular vet the next day, but the sun never rose for him again. Yes, hindsight is 20/20, but I wish I had insisted.

Goodbye, Spark Plug. I hope we did well for you.

P.S. -- Mom, you are very optimistic to think we can get this plant to bloom every December. We'll give it a try!

Sunday, November 23, 2014

Is this the right server?

It's nice to see someone else reach the following conclusion:

"For those familiar with SSH, you should realize that public key pinning is nearly identical to SSH's StrictHostKeyChecking option. SSH had it right the entire time, and the rest of the world is beginning to realize the virtues of directly identifying a host or service by its public key."

Verifying a TLS certificate via the CA hierarchy is better than nothing, but it's not really all that reassuring. Approximately, what it tells you is that there is a chain of certification leading back to one or more root authorities, which--for some reason we all try not to think about too much--are granted the ultimate authority on the legitimacy of web sites. I say "approximately", because fancier TLS verifiers can and do incorporate additional information.

The root authorities are too numerous to really have faith in, and they have been compromised in the past. In general, they and their delegates have little incentive to be careful about what they are certifying, because the entities they certify are also their source of income.

You can get better reliability in key verification if you use information that is based on the interactions of the actual participants, rather than on any form of third-party security databases. Let me describe three examples of that.

Pin the key

For many applications, a remotely installed application only needs to communicate with a handful of servers back at a central site you control. In such a case, it works well to pin the public keys of those servers.

The page advocates embedding the public key directly in the application. This is an extremely reliable way of obtaining the correct key. You can embed the key in the app's binary as part of your build system, and then ship the whole bundle over the web, the app store, or however else you are transmitting it to the platform it will run on. Given such a high level of reliability, there is little benefit from pulling in the CA hierarchy.

As linked above, you can implement pinning today. It appears to be tricky manual work, though, rather than something that is built into the framework. As well, you don't get to ignore the CA hierarchy by doing this sort of thing. So long as you use standard SSL libraries, you still have to make sure that your key validates in the standard ways required by SSL.
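
To give a flavor of that manual work, here is a minimal sketch of pinning in Scala on top of the standard Java SSL classes; the pin constant and method name are made up, and real code would add hostname verification and proper error handling:

    import java.security.MessageDigest
    import java.util.Base64
    import javax.net.ssl.{SSLSocket, SSLSocketFactory}

    // Hypothetical pin: the SHA-256 of the server's public key, computed
    // at build time and embedded in the shipped binary.
    val PinnedKeySha256 = "REPLACE-WITH-BASE64-SHA256-OF-EXPECTED-KEY"

    def connectPinned(host: String, port: Int): SSLSocket = {
      val socket = SSLSocketFactory.getDefault
        .createSocket(host, port).asInstanceOf[SSLSocket]
      socket.startHandshake()   // the standard SSL validation still runs here

      // Hash the leaf certificate's public key and compare it to the pin.
      val leafCert = socket.getSession.getPeerCertificates.head
      val digest = MessageDigest.getInstance("SHA-256")
        .digest(leafCert.getPublicKey.getEncoded)
      val actual = Base64.getEncoder.encodeToString(digest)

      if (actual != PinnedKeySha256) {
        socket.close()
        throw new SecurityException(s"Server key for $host does not match the pin")
      }
      socket
    }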

Associate keys with links

The Y property deserves wider recognition, given how important hyperlinks are in today's world. Put simply, if someone gives you a hyperlink, and you follow that hyperlink, you want to reliably arrive at the same destination that the sender wanted you to get to. That is not what today's URLs give you.

The key to achieving this property is that whenever you transmit a URL, you also transmit a hash of the expected host key. There are many ways to do this, including the ones described at the above hyperlink (assuming you see the same site I am looking at as I write this!). Just to give a very simple example, it could be as simple as using URLs of the following form:
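
One purely illustrative possibility, with the hash carried in the URL fragment and a made-up hash value:

    https://example.com/account#keyhash=sha256:9f86d081884c7d65...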

This particular example is interesting for being backward compatible with software that doesn't know what the hash means.

I don't fully know why this problem is left languishing. Part of it is probably that people are resting too easy on the bad assumption that the CA hierarchy has us covered. There's a funny mental bias where if we know nothing about a subject, and we see smart people working on it, the more optimistic of us just assume that it works well. Another part of the answer is that the core protocols of the world-wide web are implemented in many disparate code bases; SSH benefits from having an authoritative version of both the client and the server, especially in its early days.

As things stand, you can implement "YURLs" for your own software, but they won't work as desired in standard web browsers. Even with custom software, they will only work among organizations that use the same YURL scheme. This approach looks workable to me, but it requires growing the protocols and adopting them in the major browsers.

Repeat visits

One last source of useful information is the user's own previous interactions with a given site. Whenever you visit a site, it's worth caching the key for future reference. If you visit the "same" site again but the key has changed, then you should be extremely suspicious. Either the previous site was wrong, or the new one is. You don't know which one is which, but you know something is wrong.

Think how nice it would be if you try to log into your bank account, and the browser said, "This is a site you've never seen before. Proceed?"

You can get that already if you use pet names, which have been implemented as an experimental browser extension. It would be great if web browsers incorporated functionality like this, for example turning the URL bar and browser frame yellow if they see that a site is new based on its certificate. Each browser can add this sort of functionality independently, as an issue of quality of implementation.

In your own software, you can implement key memory using the same techniques as for key pinning, as described above.
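
As a minimal sketch of that key memory, assuming a simple on-disk file mapping each host to the key hash seen on the first visit (the file name and object are made up):

    import java.nio.file.{Files, Paths, StandardOpenOption}
    import scala.jdk.CollectionConverters._

    // Remember the first key we ever saw for each host, and complain
    // if a later visit presents a different one.
    object KeyMemory {
      private val store = Paths.get("known_host_keys.txt")  // hypothetical location

      private def load(): Map[String, String] =
        if (Files.exists(store))
          Files.readAllLines(store).asScala.map { line =>
            val Array(host, hash) = line.split(" ", 2)
            host -> hash
          }.toMap
        else Map.empty

      // Returns true if the key matches what we remembered, or is brand new.
      def checkAndRemember(host: String, keyHash: String): Boolean =
        load().get(host) match {
          case Some(previous) if previous != keyHash =>
            false      // the key changed: either the old site or the new one is wrong
          case Some(_) =>
            true       // same key as before
          case None =>
            Files.write(store, s"$host $keyHash\n".getBytes,
              StandardOpenOption.CREATE, StandardOpenOption.APPEND)
            true       // first visit: trust it and remember it
        }
    }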

Key rotation

Any real cryptography system needs to deal with key revocation and with upgrading to new keys. I have intentionally left these out to keep the discussion simple, but I do believe they can be worked into the above systems. It's important to have a way to sign an official certificate upgrade, so that browsers can correlate new certificates with old ones during a graceful phase-in period. It's also important to have some kind of channel for revoking a certificate, in case one has been compromised.

For web applications and for mobile phone applications, you can implement key rotation by forcing the application to upgrade itself. Include the new keys in the newly upgraded version.
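
Here is a minimal sketch of the upgrade-signing idea, assuming RSA keys and leaving out the surrounding protocol entirely; the object and method names are made up:

    import java.security.{PrivateKey, PublicKey, Signature}

    // The new public key is signed with the old private key, so a client
    // that pinned the old key can verify the rotation before replacing
    // its pin with the new key.
    object KeyRotation {
      def signRotation(oldPrivate: PrivateKey, newPublic: PublicKey): Array[Byte] = {
        val sig = Signature.getInstance("SHA256withRSA")
        sig.initSign(oldPrivate)
        sig.update(newPublic.getEncoded)
        sig.sign()
      }

      def verifyRotation(oldPublic: PublicKey, newPublic: PublicKey,
                         signature: Array[Byte]): Boolean = {
        val sig = Signature.getInstance("SHA256withRSA")
        sig.initVerify(oldPublic)
        sig.update(newPublic.getEncoded)
        sig.verify(signature)
      }
    }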