Friday, December 17, 2010

Every paper and book on our laptops?

Dick Lipton speculates on that question:
Today there are applications like Citeseer that contain about one million papers. The total storage for this is beyond the ability of most of us to store on our laptops. But this should change in the near future. The issue is that the number of papers will continue to grow, but will unlikely grow as fast as memory increases. If this is the case then an implication is that in the future we could have all the technical papers from an area on our own devices. Just as realtime spelling is useful, realtime access to technical papers seems to be a potentially exciting development.[...]
Right now there are too many books, even restricted to a subfield like mathematics, to have all of them stored on a single cheap device. But this large—huge—amount of memory could easily become a small one in the future.

I agree for Citeseer, and I agree for the local library. Very soon, if not already, we will have laptops that can hold the entirety of Citeseer and the entirety of the local library's collection of books. I was impressed when I looked at the file sizes for Project Gutenberg. Shakespearean plays take a few tens of kilobytes, and the largest archive they supply is a dual-sided DVD with over 29,000 books. I still remember the shock when I looked at a directory listing on their web site and the file sizes looked so small I thought the software must be broken.
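The "storage grows faster than papers" argument can be made concrete with a back-of-the-envelope calculation. The sketch below is only illustrative: the function name `years_until_fits` is made up for this example, and the average paper size, laptop disk size, and both growth rates are placeholder assumptions, not measurements. Only the one-million-paper count comes from the quote above.

```python
import math


def years_until_fits(collection_bytes, disk_bytes, collection_growth, disk_growth):
    """Estimate the whole years until a disk can hold a growing collection.

    Both growth arguments are annual multiplicative factors (1.05 = 5% per year).
    Returns 0 if the collection already fits, and None if the disk never catches up.
    """
    if collection_bytes <= disk_bytes:
        return 0
    if disk_growth <= collection_growth:
        return None
    # Solve collection * cg**t <= disk * dg**t for the smallest whole t.
    gap = collection_bytes / disk_bytes
    return math.ceil(math.log(gap) / math.log(disk_growth / collection_growth))


# All figures below are assumptions for illustration, except the paper count.
PAPERS = 1_000_000               # "about one million papers" (from the quote)
AVG_PAPER_BYTES = 500 * 1024     # assumed average size of one paper
LAPTOP_DISK_BYTES = 250 * 10**9  # assumed laptop disk capacity

print(years_until_fits(PAPERS * AVG_PAPER_BYTES, LAPTOP_DISK_BYTES,
                       collection_growth=1.05, disk_growth=1.40))
```

Plugging in different assumptions changes the answer, but the shape of the argument stays the same: as long as disks grow faster than the collection, the gap closes in a number of years that is logarithmic in the current shortfall.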

As an aside, I wish I could say that the Association for Computing Machinery thought this way. Their current thinking is that they'll have an online digital library that they collect a toll on. If they really wanted to help science, they'd mail you a pre-indexed thumb drive you can load into your laptop and have all papers up until that date. I would bet that someone in physics works this out long before the ACM does. Who knows, though.

All this said, papers and books are backward looking. Nowadays, papers and books are developed as electronic content and then, only at delivery time, printed onto paper. An increasing amount of interesting material is simply never printed at all. Want a copy of the Scala Language Specification? It's essentially a book, but you won't find it at the local library. Over time, the printed word is becoming a niche application. You only need it for reading something in depth, or if you want to physically hand it to someone. For the former, print on demand is more and more often good enough, and for the latter, the number of times it happens is decreasing. As well, electronic ink just keeps getting better.

From the perspective of interesting words, as opposed to printed papers and books, it will take longer before personal computers can hold all the, ahem, material that is out there. It includes not just papers and books written by mathematicians, but also forum messages, blog posts, and even Facebook and Twitter messages written by all manner of people. Perhaps even then we are already at the point where our machines have enough storage, but it's certainly a lot more data than just for Citeseer and the library.

Of course, most people are only interested in a tiny fraction of all that information. Perhaps Dick Lipton really only cares about math papers from famous mathematicians. If the precise data interesting to someone can be identified, then the storage requirements for keeping a personal copy are much more reasonable, and in fact we probably are already there. However, identifying that subset of the data is, in general, entirely non-trivial.