Friday, December 11, 2009

Tail call optimization vs. growable stacks

Guy Steele has written an interesting post about tail calls being needed to implement certain object-oriented designs:

In this blog post we extend one of his examples in order to make a completely different point: object-oriented programming languages need tail calls correctly implemented, not just as a "trivial and optional" optimization of the use of stack space that could be achieved by using iteration statements, but in order to preserve object-oriented abstractions.

I agree as far as it goes, and it's a very interesting example. However, it doesn't go far enough. What is really needed to make Guy's example work well is stacks that can grow beyond their initial fixed size.

In the example, a set is built using a sequence of "adjoin" operations that add one element to the set at a time. The adjoined set is implemented as a linked list. Whether or not this particular API needs implementation as a linked list, I can certainly agree that or some problems the implementer would be pushed into using such a structure.

Given this implementation, the relevant implementation of the "contains" operation looks like this:

contains(y: ZZ) = if (y = x) then true else s.contains(x) end

First check the head of the list, then recurse into the tail of the list. The problem is that if the list is large, this recursion will overflow any fixed-size stack. The proposed solution is tail call optimization, which keeps the stack from growing.

I agree, but tail call optimization doesn't go far enough for this sort of problem. Look in the very same example at the "contains" operation for "union" objects:

contains(y: ZZ) = s1.contains(x) OR: s2.contains(x)

This operation first checks the first set and then checks the second set. This time neither call is a tail call. Through careful rewriting, one could be made tail recursive, but not the other. To make them both tail recursive would require really butchering the clean algorithm and data structure as they stand.

To sum up the situation, tail call optimization helps for recursing over deeply linked lists, but not for recursing over deeply linked trees. It looks rather narrow to me.

As an aside, I agree with Gilad Bracha that you may as well implement tail call optimization, but the motivation for me is not deeply nested data structures. Instead, one application I think about is to support state machines as implemented via recursion. A related application is continuation passing style. These examples need tail call optimization to avoid the stack needs growing in proportion to the program's run time, so I certainly have nothing against tail call optimization.

What, though, can be done about recursion on deeply nested data structures? Barring programmers from doing so, as Java does, is a real constraint. Both the Scala compiler and the GWT compiler require users to specify very large stack sizes so as to support recursing over large abstract syntax trees. Most of the time the trees aren't that deep and the large stacks are wasted memory. Once in a while even the very large stack sizes aren't big enough. It's a real mess.

A more thorough solution to the problem would be to allow stacks to grow and shrink at run time. Some languages implement this feature by putting their stacks on the heap, and others, such as Go, have a stack that is a linked list of traditional contiguous stacks:

To make the stacks small, Go's run-time uses segmented stacks. A newly minted goroutine is given a few kilobytes, which is almost always enough. When it isn't, the run-time allocates (and frees) extension segments automatically. The overhead averages about three cheap instructions per function call. It is practical to create hundreds of thousands of goroutines in the same address space. If goroutines were just threads, system resources would run out at a much smaller number.

By providing Go with growable stacks, Rob Pike stands in contrast to many other language implementers. The reluctance from most implementers I believe is due to the extra overhead it adds to subroutine invocation. With a traditional fixed-size stack, a subroutine call can simply allocate a stack frame and race ahead. If stacks grow and shrink, then the allocation code must check whether the stack really has space left. If it doesn't, it needs to grow the stack. Additionally, a better implementation should probably shrink stacks, too. That means that the subroutine return code needs some overhead as well.

It's not entirely clear that this overhead is non-zero, much less large. For example, the calling convention code could be written so as to write a stack frame even when the space is getting tight, and only to trap into the "grow the stack" subroutine once the stack frame has been written all the way out. With such an arrangement, a modern CPU should be able to do the stack check in parallel to its work to write out the stack frame, so long as the two operations are carefully written to avoid data dependence. To avoid overhead for returning from a subroutine, a frame could be periodically placed on the stack that indicates that it's time for the stack to shrink. The frames could be inserted using similar techniques to the parallelizable calling convention to avoid overhead on most subroutine calls. Given such frames, most subroutine returns would do nothing special. Only returns that fall into one of these frames would need to do anything, and the frames can be set up such that their saved instruction pointer points into the "shrink the stack" subroutine.

That's one theoretical implementation strategy, but it's enough to suggest that growable stacks can be very inexpensive indeed. I will be very interested in how the performance numbers work out for Go as the language matures.


infinite8s said...

Actually, in your example of the UnionSet the second "contains" is a tail call. So for a depth-first search in a binary tree, roughly half your calls would be tail calls.

Lex Spoon said...

If OR: is recognized and optimized by the compiler, then yes, I agree that half of the calls will be tail calls.

Even so, you can still run out of stack with this method. Whether you overflow depends on the size and shape of the tree being recursed through.

David Feuer said...

I believe growable stacks should be easy on systems supporting anonymous mmap (most systems), and shrinkable stacks should be easy on systems supporting something like the mremap system call available on Linux. The question that's been running around in my head is whether tail-call optimization could be simplified/improved by representing the call stack as two interleaved stacks, where a tail call switches stacks to avoid having to copy the callee over the caller.