Monday, October 12, 2015

Initial input on the Scala collections redesign

It looks like Martin Odersky is considering an overhaul of Scala's collections:
A redesign of the standard library is on the roadmap for one of the next Scala versions (could be as early as 2.13). So I think now is the time to start thinking about what we want to change, in particular in what concerns collections!

Some more details are available on the Dotty issue tracker. Paul Phillips has weighed in on this issue with a very interesting slide deck.

I think it will be a very positive change if done carefully. It's painful to change such a fundamental library, but the current state is a real pain point for practical Scala code. If no other contender can get full support, I would even go so far as to suggest going back to Matthias Zenger's original version and working forward from there, more conservatively this time. It was really a gem of useful functionality, in particular the "persistent" collections, which were controversial back then. Since that initial version, there have been several waves of changes that added complexity while only supporting minority use cases: lazily evaluated views, concurrent collections, and the CanBuildFrom magic. These are all valuable, but each should live in a separate library that you opt into when you need it.

I cannot put together a complete straw man implementation given my professional duties right now. I can make a few high-level suggestions, though, based on my experience in developer tools at Google and LogicBlox. Mind you, these are just initial reactions. I may well be overlooking important implementation details. Also, I haven't had the pleasure of using Scala for the new "big data" growth area, so I may be overlooking some concerns compared to the kinds of code I am used to.

At a high level, I'd want to see the following in an updated collection library:

  • A focus on just the core collection types: maps, sets, linked lists, and some form of array-like sequences (maybe Array itself). Other collection types do come up in practice, but in a minority of contexts, and for those cases it should be fine to use a concrete collection type that does not necessarily inherit from any standard-library collection traits.
  • Simple old-fashioned classes and methods, without any implicit parameters or type members. The status quo is bad in numerous ways, including problems with all of the following: IDE auto-completion, single-step debugging, and compiler error messages.
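
To make the second point concrete, here is a minimal sketch of what "simple old-fashioned classes and methods" might look like. The names SimpleList, SimpleCons, and SimpleNil are hypothetical and only for illustration; this is not a proposed design, and the recursive methods are not stack-safe for long lists.

```scala
// Hypothetical sketch: a sealed linked list whose methods have plain signatures.
// For comparison, the 2.8-2.12 library declares map on TraversableLike roughly as
//   def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That
// which is the signature that shows up in IDE completion and error messages.
sealed abstract class SimpleList[+A] {
  // One type parameter, no implicit parameter list, a concrete return type.
  def map[B](f: A => B): SimpleList[B] = this match {
    case SimpleNil        => SimpleNil
    case SimpleCons(h, t) => SimpleCons(f(h), t.map(f))
  }

  // Conversion to the standard List, used here only to inspect results.
  def toList: List[A] = this match {
    case SimpleNil        => Nil
    case SimpleCons(h, t) => h :: t.toList
  }
}

final case class SimpleCons[+A](head: A, tail: SimpleList[A]) extends SimpleList[A]
case object SimpleNil extends SimpleList[Nothing]
```

With a signature like this, stepping into map in a debugger lands directly in the method body, and a type error at a call site mentions only A and B.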

Non-goals include the following:

  • User extension of the library with new collection types. This is a small enough use case that it doesn't merit giving up on any other design goals of the library, e.g. performance.
  • Supporting the majority of the collection types that are available now. I wrote a 1-2 page description of each of them for the latest Programming in Scala, and I found it quite the slog. The library has simply grown very large over time. It would be better if the less common collection types were more obviously in a side library. They're all useful sometimes, but the grand total is just overwhelming.
  • Infinite collection types, in particular the lazy Stream type.

Some other things are desirable but might be unreasonable:

  • High performance in common use cases involving a for() loop. This is a complicated problem, but it is a persistent problem area that drives people away from writing idiomatic Scala code. If nothing else, perhaps for() should translate to iterators instead of to higher-order functions. Another approach would be to have the compiler recognize and optimize standard types like List, Set, and Map; this will work better if those types are sealed rather than user extensible.
  • Rename the three versions of Set and Map to have different names. For example, maybe AbstractSet, HashSet, and Set. There are numerous reasons for this; for example, automatic IDE import doesn't work as well if you have multiple classes with the same name. I put this in the "maybe" list, though, because it would require a lot of code to be rewritten. Even taking care to phase the change in over a period of months, it's a lot of code that would need to be turned over.
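
As a rough illustration of the for() point, the sketch below contrasts the current translation with a hand-written iterator loop. The object and method names are made up for this example, and the second method is only my guess at what an iterator-based translation could expand to.

```scala
object ForLoopSketch {
  // Today the compiler rewrites `for (x <- xs) total += x`
  // into `xs.foreach(x => total += x)`, a higher-order method call.
  def sumWithFor(xs: List[Int]): Int = {
    var total = 0
    for (x <- xs) total += x
    total
  }

  // Hand-written loop over an explicit iterator: no closure is passed
  // to the collection, only hasNext/next calls in a plain while loop.
  def sumWithIterator(xs: List[Int]): Int = {
    var total = 0
    val it = xs.iterator
    while (it.hasNext) total += it.next()
    total
  }
}
```

Both methods compute the same result; the difference is only in the shape of the code the optimizer has to work with.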

On a minor note, type Seq seems pretty useless. I know that Java has an interface for this, but maybe Scala should take a stand here. For well-written code, it doesn't help anything to know that a type is a Seq and not just an Iterable. So delete the useless intermediate trait.

1 comment:

ekrich said...

Hi Lex,

I agree with views and parallel collections being in their own package to be developed independently. Also, we need pure traits (interfaces) so that RDD for Spark, Slick, or other collection-oriented systems can share the traits. This way the Scala collections library can interoperate with third-party systems. Otherwise each new system is an island and can't share code easily. If that part gets done right, collections can flourish and make sense long into the future. I'm not crazy about Seq either.

Eric
p.s. After your work on ScalaGWT it must be interesting for you to see the ScalaJS progress.