Thursday, November 5, 2015

The mystics are coming out

Discussion on the Scala collections revamp is starting to get mystical. It really bugs me: good language design makes a huge difference, but it's hard to see unless you actually deal with thousands of developers and millions of lines of code. Casual observers of language design discussions don't see it themselves, so they don't realize what these problems look like to the other people in the discussion; they assume everyone is just goofing around, throwing out random ideas because they are possible.

I started to follow up on the issue itself, but I fear making the problem worse. So I'll post a couple of rebuttals here. I don't really think the people actually involved in the redesign will be distracted by the mystics, anyway. They are like gold sellers in MMOs or beggars in San Francisco; unless you are specifically trying to engage with them, you just learn to tune them out. Well, okay, they are like really nerdy gold sellers who like to talk about higher math. Okay, I'd better just drop this attempt at an analogy.

First, Python's indexing operator has a lot of practical problems. Here are a few concrete examples: https://plus.google.com/+MattMight/posts/BVSmNadKni4. Thinking backward from those puzzlers, I think the root of the problem is that the [a:b] slicing operator means something different depending on whether a and b are negative or non-negative, and on whether a is smaller or larger than b. This is foundational syntax used everywhere, and if you want code to be readable, people need to know whether it's doing a forward or reverse slice without having to do mental data flow analysis on the parameters. Java avoids this trap, and Scala should, too. The sublist operation should only work on non-negative arguments, and only when the first argument is no larger than the second.
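
To make the suggestion concrete, here is a minimal sketch of the rule I have in mind, written as a hypothetical strictSlice helper; the helper and its name are invented for illustration, not a proposal for the actual API:

    // Hypothetical helper: forward-only slicing on non-negative indices.
    def strictSlice[A](xs: Seq[A], from: Int, until: Int): Seq[A] = {
      require(from >= 0 && until >= 0, "indices must be non-negative")
      require(from <= until, "a slice must run forward")
      xs.slice(from, until)
    }

    strictSlice(Vector(1, 2, 3, 4, 5), 1, 3)    // Vector(2, 3)
    // strictSlice(Vector(1, 2, 3, 4, 5), 3, 1)  // fails fast instead of quietly returning empty
    // strictSlice(Vector(1, 2, 3, 4, 5), -2, 5) // fails fast instead of counting from the end

Whether the failure is an exception or something else matters less than a reader being able to tell at a glance that every slice runs forward.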

The other thing I will say is that exceptions, null, and -1 are all fine, too, when used by a tasteful library designer. At Semmle we tried to nudge people toward using more Options, and we found that it harmed our reputation as a quality lint tool. I can only speak publicly about open-source code, but to give an example, Apache Spark has over a thousand occurrences where they use null but could, with only local changes, use Option instead. That's too many; it suggests that the basic premise of the advice has some sort of problem.
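
To show the kind of "only local changes" I mean, here is an invented example in the same spirit; Row, emit, and the loop are made up rather than taken from Spark:

    case class Row(score: Int)
    def emit(r: Row): Unit = println(r)
    val rows = List(Row(3), Row(1), Row(2))

    // Before: a nullable local.
    //   var best: Row = null
    //   for (r <- rows) if (best == null || r.score > best.score) best = r
    //   if (best != null) emit(best)

    // After: the same logic with Option, and no public signature has to change.
    var best: Option[Row] = None
    for (r <- rows) if (best.forall(_.score < r.score)) best = Some(r)
    best.foreach(emit)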

As one stab at it--though it's really a huge topic--you have to think about what you want callers to do to defend against a missing value. If Scala's basic get/apply methods start returning an Option, then people will just litter their code with calls to .get, so the net result is that the code is more bloated but otherwise behaves exactly the same. Even in pure math, people write notation like f'(x) for the derivative of f at x, but the derivative isn't always defined. So should smart mathematicians instead write get(f')(x)? Or (f') match { ... }?
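
Here is a tiny made-up illustration of the .get littering, using a plain Map lookup:

    val config: Map[String, Int] = Map("port" -> 8080)

    // Today: apply throws if the key is missing.
    val port1 = config("port")

    // If apply returned Option instead, much calling code would simply tack on
    // .get, which crashes in exactly the same way, only with more noise.
    val port2 = config.get("port").get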

That's my try at it, but you don't even have to understand this if you are willing to ape mature libraries in the areas where they are working okay. It's not a practical problem in Java that the various .get() methods return null or throw exceptions; even if you say it's not perfect, it's certainly not bad. It is, however, a very big problem that Java collections are overly mutable. For example, see this Stack Overflow question: https://stackoverflow.com/questions/2842169/why-are-public-static-final-array-a-security-hole. Scala will be more attractive to more developers if it focuses on these widely recognized pain points. Which is why the mystics drive me crazy--if they get their way, they will find their playground gradually becoming a ghost town, but only after it's too late.
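
The linked question is about Java, but the same hole reproduces in Scala wherever a mutable Array sits behind a val; the object and field names below are invented:

    object Security {
      // Looks like a constant, but any caller can still mutate the elements.
      val AdminUsers: Array[String] = Array("alice", "bob")
    }

    Security.AdminUsers(0) = "mallory"   // compiles and succeeds

    object SaferSecurity {
      // An immutable collection closes the hole.
      val AdminUsers: List[String] = List("alice", "bob")
    }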

Monday, October 12, 2015

Initial input on the Scala collections redesign

It looks like Martin Odersky is considering an overhaul of Scala's collections:
A redesign of the standard library is on the roadmap for one of the next Scala versions (could be as early as 2.13). So I think now is the time to start thinking about what we want to change, in particular in what concerns collections!

Some more details are available on the Dotty issue tracker. Paul Phillips has weighed in on this issue with a very interesting slide deck.

I think it will be a very positive change if done carefully. It's painful to change such a fundamental library, but the current state is a real pain point for practical Scala code. If no other contender can get full support, I would even go so far as to suggest going back to Matthias Zenger's original version and working forward from there more conservatively this time. It was really a gem of useful functionality, in particular the "persistent" collections, which were controversial back then. Since that initial version, there have been several waves of changes that added complexity while only supporting minority use cases: lazily evaluated views, concurrent collections, and the CanBuildFrom magic. These are all valuable, but they should all live in a separate library that you opt into when you need it.

I cannot put together a complete straw man implementation given my professional duties right now. I can make a few high-level suggestions, though, based on my experience in developer tools at Google and LogicBlox. Mind you, these are just initial reactions. I may well be overlooking important implementation details. Also, I haven't had the pleasure of using Scala for the new "big data" growth area, so I may be overlooking some concerns compared to the kinds of code I am used to.

At a high level, I'd want to see the following in an updated collection library:

  • A focus on just the core collection types: maps, sets, linked lists, and some form of array-like sequences (maybe Array itself). Other collection types do come up in practice, but in a minority of contexts, and for those cases it should be fine to use a concrete collection type that does not necessarily inherit from any standard-library collection traits.
  • Simple old-fashioned classes and methods, without any implicit parameters or type members. The status quo is bad in numerous ways, including problems with all of the following: IDE auto-completion, single-step debugging, and compiler error messages. (See the sketch after this list.)
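
As a rough sketch of what "simple old-fashioned classes and methods" means here, compare the shape of today's map signature with the plain kind of signature I have in mind. SimpleList is an invented name for illustration, not a proposal:

    // Today, map on a collection looks roughly like this:
    //   def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That

    // The plain version:
    trait SimpleList[+A] {
      def map[B](f: A => B): SimpleList[B]
      def filter(p: A => Boolean): SimpleList[A]
      def foreach(f: A => Unit): Unit
    }

Every tool listed in the bullet copes better with the second form: auto-completion shows one parameter list, the debugger steps through ordinary calls, and error messages mention only types the user actually wrote.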

Non-goals include the following:

  • User extension of the library with new collection types. This is a small enough use case that it doesn't merit giving up on any other design goals of the library, e.g. performance.
  • Supporting the majority of the collection types that are available now. I wrote a 1-2 page description of each of them for the latest Programming in Scala, and I found it quite the slog. The library has simply grown very large over time. It would be better if the less common collection types were more obviously in a side library. They're all useful sometimes, but the grand total is just overwhelming.
  • Infinite collection types, in particular the lazy Stream type.

Some other things are desirable but might be unreasonable:

  • High performance in common use cases involving a for() loop. This is a complicated problem, but it is a persistent problem area that drives people away from writing idiomatic Scala code. If nothing else, perhaps for() should translate to iterators instead of to higher-order functions (see the sketch after this list). Another approach would be to have the compiler recognize and optimize standard types like List, Set, and Map; this will work better if those types are sealed rather than user extensible.
  • Give the three versions of Set and Map distinct names. For example, maybe AbstractSet, HashSet, and Set. There are numerous reasons for this; for one, automatic IDE import doesn't work as well when multiple classes share a name. I put this in the "maybe" list, though, because it would require a lot of code to be rewritten. Even taking care to phase the change in over a period of months, it's a lot of code that would need to be turned over.
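
To spell out the for() point from the first bullet above: today the compiler rewrites such a loop into a closure-based call, and an iterator-based translation would amount to the hand-written loop below (xs and f are just placeholders):

    val xs = List(1, 2, 3)
    def f(x: Int): Unit = println(x)

    // Today:  for (x <- xs) f(x)  is rewritten to  xs.foreach(x => f(x))
    // An iterator-based translation would amount to this, with no closure
    // allocated for the loop body:
    val it = xs.iterator
    while (it.hasNext) {
      f(it.next())
    }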

On a minor note, the Seq type seems pretty useless. I know that Java has an interface for this, but maybe it's time to take a stand. For well-written code, it doesn't help anything to know that a type is a Seq rather than just an Iterable. So delete the useless intermediate trait.
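
In practice the difference shows up in method signatures: asking for an Iterable instead of a Seq costs nothing and accepts more. (The render method below is an invented example.)

    def render(items: Iterable[String]): String = items.mkString(", ")

    render(List("a", "b"))   // fine
    render(Set("a", "b"))    // also fine, which a Seq parameter would reject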

Wednesday, August 5, 2015

Ken Clark on shame mobs

Ken Clark has posted the top seven things he likes about shame mobs. Here's a taste:

5) Internet shame mobs weigh the evidence carefully and deliberately before attacking, so they only happen to people who deserve them. [...] 3) Internet shame mobs always make sure that the punishment is proportional to the crime.

There's a larger phenomenon here where problematic information spreads faster than the correction to it. If it spreads fast enough, then it can even pass a tipping point where it becomes very hard to get a hearing to object to any part of it. Everyone has already heard the idea from, well, everyone else, so they quickly dismiss anyone who objects without even really considering it.

The key to stopping such memetic chain reactions is to apply some filtering before propagating information that you read. It's still early days for the Internet, though, and we are all still learning to inoculate ourselves against becoming the wrong kind of carrier.

There is some reason to have hope. Chain emails used to flourish, but are now mostly stamped out. In their heyday, 15-20 years ago, it was fairly common to open your mail program and see numerous messages that said something very exciting, and furthermore that the message should be passed on to everyone you know as fast as possible. Nowadays, the people I interact with just delete any such emails. If an email explicitly says that it should be forwarded to everyone you know, then it triggers something like an antibody response. Such an email starts looking very bogus, and it gets deleted quickly and possibly even flagged for follow-up by the email provider.

Intriguingly, people would likely not have developed that response had they not gone through the misery of dealing with chain emails earlier on. There are clear parallels to viruses and to building up antibodies!

Shame mobs are one place where it still goes on, though. I'm not entirely sure why. In part, people just want to defend an idea, and they are happy to use real people as examples no matter the consequences. In part, people simply enjoy being part of a mob. I hope that shame mobs go the same way as the chain email. We shall see.

Wednesday, July 29, 2015

Finding and fixing complicated code

It's hard to beat lines of code as a complexity metric: the subjective complexity of a piece of code seems to correlate very well with its length. Over the years, many other metrics have been devised and explored, but they tend to be more complicated yet less effective than the simple old lines-of-code.

One exception to the rule is McCabe Cyclomatic Complexity, a metric almost exactly as old as I am. McCabe can often find methods that aren't that many lines long, but that a typical maintainer would nonetheless identify as complicated and difficult to maintain. Small wonder that GNU Complexity puts so much emphasis on it.

What McCabe does is count the number of independent paths that are needed to cover a method from beginning to end. The way McCabe described it, in his delightfully readable paper on the subject, is that it's the number of test cases you need to get statement coverage for a given method. It sounds like something that would be impractically hard to calculate--you are minimizing over arbitrary test cases!--but it's not complicated at all. He proved that, with a few extra assumptions, all you really have to do is count the number of branching options in a given body of code.
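
To show how simple the count is, here is a rough text-scanning approximation in Scala. A real tool would walk the AST rather than scan tokens, and the token list here is only illustrative:

    // Roughly: complexity = 1 + number of branching decisions in the body.
    def approxMcCabe(body: String): Int = {
      val branchTokens = List("if ", "while ", "for ", "case ", "&&", "||")
      1 + branchTokens.map(t => body.sliding(t.length).count(_ == t)).sum
    }

    approxMcCabe("if (a && b) x else y")   // 3: one if, one &&, plus the base path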

McCabe complexity is very effective in my experience. A method with complexity 10 is usually a dense, gnarly tangle of code, even if it's only 20 lines long. Yet 20 lines is not enough, by line count alone, to mark a method as complicated: if it's a simple linear sequence of non-branching code, it's usually reasonable to leave it alone. McCabe complexity is therefore a real coup.

Even better, there's usually a straightforward way to act on code that has a poor McCabe complexity measure. If you see a method with a lot of branches in it, you can usually make the code more maintainable by factoring out some of the clusters of branches to their own smaller methods. The main method gets easier to understand, and if you do it tastefully, the smaller factored-out methods will also be easier to understand.

One place McCabe complexity falls down, though, is large switch statements. If you write network code that receives a message and dispatches it, it's perfectly fine style to write a switch statement for the dispatch with one case for each kind of message. This is especially true if each case simply calls a method that does the real work. Such a switch statement will easily have a dozen cases, one for each message type, but what alternative would be any more maintainable? Factoring out methods for subsets of the cases doesn't seem obviously better, and reducing the number of cases is often impossible.
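
Here is the shape I mean, sketched in Scala with invented message types and handlers:

    sealed trait Message
    case class Connect(host: String)      extends Message
    case class Send(payload: Array[Byte]) extends Message
    case object Disconnect                extends Message

    def handleConnect(host: String): Unit = ()
    def handleSend(payload: Array[Byte]): Unit = ()
    def handleDisconnect(): Unit = ()

    // One flat case per message kind, each just delegating to a handler.
    def dispatch(msg: Message): Unit = msg match {
      case Connect(host) => handleConnect(host)
      case Send(payload) => handleSend(payload)
      case Disconnect    => handleDisconnect()
    }

A real protocol would have a dozen such cases rather than three, but the structure stays just as flat.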

McCabe is even worse for switch statements that consider pairs of types that are each drawn from some enum. To see some typical examples, look at Cast.scala in the Apache Spark source code. All the big switch statements in there are enumerating pairs of types, and they strike me as a very reasonable way to manage a problem that is just fundamentally complicated. This organizational style reminds me of how people organize definitions on paper when they are studying and explaining language semantics. Such people will typically define their relations and functions using a dozen or so cases, each of which has a number of assumptions (x is a class type, y is a class type, and x inherits from y), and a conclusion you can draw if all of the assumptions hold (x is a subtype of y). If this style is reasonable for theory papers, where people are expending enormous time and energy to bring a complicated subject within the reach of the human mind, it seems only reasonable to write such definitions in code when possible.
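
Here is a small sketch of that style; the type tokens and casting rules are invented for illustration rather than taken from Cast.scala:

    sealed trait Ty
    case object IntTy    extends Ty
    case object LongTy   extends Ty
    case object StringTy extends Ty

    // Assumptions on the left of =>, conclusion on the right, one rule per case.
    def canCast(from: Ty, to: Ty): Boolean = (from, to) match {
      case (a, b) if a == b => true   // identity cast
      case (IntTy, LongTy)  => true   // widening is fine
      case (_, StringTy)    => true   // anything can be rendered as a string
      case _                => false
    }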

Overall, McCabe complexity is excellent and has stood the test of time. I wonder, though, if "number of branching statements" might be even more useful for anyone trying to simplify their code. "Number of branching statements" is the same as McCabe for if statements and while loops, but for switch statements, it only adds 1. Such a measure seems like it would better focus on code that is not merely complicated, but also possible to simplify.
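
Relative to the rough counter sketched earlier, the only change needed is to count a whole match/switch once instead of once per case; again, this token scan is only an approximation of what a real AST-based tool would do:

    def approxBranchStatements(body: String): Int = {
      val tokens = List("if ", "while ", "for ", "match", "&&", "||")
      1 + tokens.map(t => body.sliding(t.length).count(_ == t)).sum
    }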

Wednesday, March 18, 2015

Every build is special

Since Semmle's business involves integrating with our customers' builds, I'm very sympathetic to Tim Boudreau's perspective on simple plain-jane builds that just work:
But with time and enough consulting customers, you grow a real appreciation for projects you can simply check out from version control and build.

It's always a relief when someone wants to, say, build a stock iPhone app using xcodebuild. It's often fine when they want to build a C program using msbuild. Third place is Maven, for a simple Java program; I say third place, because hardly anyone uses it without loading it with a bunch of plugins that fundamentally change how it works. But still, at least the overall project structure is somewhat standardized.

In my modest experience, I only really see the above for relatively small and simple projects. As "Scott" puts it in the first comment at the above link:

The idea that your build is "not as special as you think" is more often false than it is true.

I just skimmed a code base I have handy, and here are some of the things I see in there that break the mold of any off-the-shelf build tool:

  • It uses parser generators, such as ANTLR.
  • It's mainly a Java code base, but there is a little bit of code in all of: C, Objective C, JavaScript, Scala, C#, and Python.
  • It uses a home-grown stack trace obfuscator and de-obfuscator that involves bytecode rewrites on the deployed jars.
  • It includes its own version of zipmerge, with options not available in the open-source version.
  • It uses the JIT support in mini-Lua to generate machine code that is then compiled into string literals in a C program.
  • It includes a proprietary query language and numerous queries in that language. Those queries get compiled during the build, using a query engine that is also compiled during the same build.

Wednesday, January 14, 2015

Surveillance states are possible

While clamping down on private encryption is bad policy, both for the economy and for privacy, I don't think it's technically impossible to implement. Let me draw a couple of comparisons to show why.


As background, here is Cory Doctorow explaining, like many other commenters, that the Internet is too wild and wooly for major governments to possibly implement widespread surveillance:

For David Cameron's proposal to work, he will need to stop Britons from installing software that comes from software creators who are out of his jurisdiction. The very best in secure communications are already free/open source projects, maintained by thousands of independent programmers around the world. They are widely available, and thanks to things like cryptographic signing, it is possible to download these packages from any server in the world (not just big ones like Github) and verify, with a very high degree of confidence, that the software you've downloaded hasn't been tampered with.

With cellular phones, any phone that uses the relevant chunks of bandwidth is legally required to use certain protocols that are registered with the government. This has been bad economically, in that the telephone network has developed much more slowly than the relatively unregulated Internet. However, being bad economically has never exactly stopped rules from being put into place.

Yes, you can rig up a wireless network in your garage that breaks the rules. However, as soon as you try to use it over a wide geographic region, you're going to be relatively easy to catch. You will have to either broadcast a strong signal, or make use of the existing telephone backbone, or both.

To draw another comparison, consider the income tax. Income tax is easy to avoid with small operations, because you can just pay cash under the table. However, larger operations have to file a variety of paperwork, and the interlocking paperwork is what will get you. The more you take part in the above-ground economy, the harder it is to spin a big enough web of lies to get out of your taxes.

To get back to Internet protocols, it will certainly always be possible to break the rules on an isolated darknet you assemble in your garage. However, as soon as you send packets across the Internet backbone, any use of unregistered protocols is going to be very easy to detect.

To rub the point in further, don't forget that the authorities have no requirement to go after everyone who they detect doing something fishy. If they are anything like the American tax service, they'll randomly (or politically...) select people to target, and those people will then be required to undergo an audit at their own expense. If they survive the audit, the tax service just says "I'm sorry" and moves on to the next victim. Because of selective enforcement, the authorities don't need to go after everyone using illegal encryption.

Of course all this is bad for the economy and for humanity's development at large. Don't oppose a cryptography clampdown because it's technically impossible, or you will look just as silly as the people who say DNS takedowns are technically impossible. Rather, oppose a cryptography clampdown because we don't want to live like that. We want to have private communications, and we want to allow innovation on the Internet. It's brand new, and if we clamp down on it, it will ossify in its current state the same way that the telephone network did.