Wednesday, July 29, 2015

Finding and fixing complicated code

It's hard to beat lines of code as a complexity metric: the subjective complexity of a piece of code seems to correlate very well with its length. Over the years, many other metrics have been devised and explored, but they tend to be more complicated yet less effective than the simple old lines-of-code.

One exception to the rule is McCabe Cyclomatic Complexity, a metric almost exactly as old as I am. McCabe can often find methods that aren't that many lines of long, but that nonetheless a typical maintainer would identify as being complicated and difficult to maintain. Small wonder that GNU Complexity puts so much emphasis on it.

What McCabe does is count the number of independent paths are needed to cover a method from beginning to end. The way McCabe described it, in his delightfully readable paper on the subject, is that it's the number of test cases you need to get statement coverage for a given method. It sounds like something that would be impractically hard to calculate--you are minimizing over arbitrary test cases!--but it's not complicated at all. He proved that, with a few extra assumptions, all you really have to do is count the number of branching options in a given body of code.

McCabe complexity is very effective in my experience. A method with complexity 10 is usually a dense, gnarly tangle of code, even if it's only 20 lines long. Yet, 20 lines of code is not enough to reliably be complicated just by its number of lines of code. If it's a simple linear sequence of non-branching code, it's usually reasonable to leave it alone. McCabe Complexity is therefore a real coup.

Even better, there's usually a straightforward way to act on code that has a poor McCabe complexity measure. If you see a method with a lot of branches in it, you can usually make the code more maintainable by factoring out some of the clusters of branches to their own smaller methods. The main method gets easier to understand, and if you do it tastefully, the smaller factored-out methods will also be easier to understand.

One place McCabe complexity falls down, though, is large switch statements. If you write network code that receives a message and dispatches it, it's perfectly fine style to write a switch statement for the dispatch with one case for each kind of message. This is especially true if each case simply calls a method that does the real work. Such a switch statement will easily have a dozen cases, one for each message type, but what alternative would be any more maintainable? Factoring out methods for subsets of the cases doesn't seem obviously better, and reducing the number of cases is often impossible.

McCabe is even worse for switch statements that consider pairs of types that are each drawn from some enum. To see some typical examples, look at Cast.scala in the Apache Spark source code. All the big switch statements in there are enumerating pairs of types, and they strike me as a very reasonable way to manage a problem that is just fundamentally complicated. This organizational style reminds me of how people organize definitions on paper when they are studying and explaining language semantics. Such people will typically define their relations and functions using a dozen or so cases, each of which has a number of assumptions (x is a class type, y is a class type, and x inherits from y), and a conclusion you can draw if all of the assumptions hold (x is a subtype of y). If this style is reasonable for theory papers, where people are expending enormous time and energy to bring a complicated subject within the reach of the human mind, it seems only reasonable to write such definitions in code when possible.

Overall, McCabe complexity is excellent and has stood the test of time. I wonder, though, if "number of branching statements" might be even more useful for anyone trying to simplify their code. "Number of branching statements" is the same as McCabe for if statements and while loops, but for switch statements, it only adds 1. Such a measure seems like it would better focus on code that is not merely complicated, but also possible to simplify.