Sunday, October 11, 2009

One place you need Java source code rather than Java byte code

For the most part, Java tools work with byte code, not source code. If you dynamically load code in Java, it's byte code you will load. If you write a debugger for Java, the unit of single stepping will be the byte code. When a web browser downloads Java code to run, it downloads a "jar" of byte code. If you optimize Java code, you optimize from jars of byte code to better jars of byte code. If you run FindBugs to find errors in your code, you'll be running it across byte code.

So why doesn't the Google Web Toolkit (GWT) work from byte code too? One simple reason: GWT emits optimized JavaScript code. For that job, byte code adds two challenges that aren't present for Java source code, yet it doesn't deliver the interop benefits that people hope for from byte code.

The first problem is that byte code has jumps rather than structured loops. If GWT read from byte code, it would have to find a way to translate these jumps to JavaScript, which has structured loops but no general jump instruction. This would be a game of trying to infer loops where possible, and then designing a general-case solution for the remaining cases. It would be a lot of work, and the output would suffer.
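To make the cost concrete, here is a minimal sketch of the same loop in two shapes. It is written in Java rather than JavaScript, but the control-flow constructs involved (for, while, switch) exist in both. The first method is the structured loop as the source has it. The second emulates jumps with a dispatch loop around a switch, which is one possible general-case fallback. The class name, the block numbering, and the `block` variable are all invented for illustration; this is not actual compiler output or anything GWT does.

```java
public class JumpLoopSketch {
    // Structured form: the loop exactly as the source code expresses it.
    static int sumStructured(int n) {
        int sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    // Jump-based form: each "case" stands in for a basic block, and
    // assigning to `block` stands in for a jump instruction.
    // The numbering is hypothetical, chosen only for this sketch.
    static int sumWithJumps(int n) {
        int sum = 0;
        int i = 1;
        int block = 0; // plays the role of a program counter
        while (true) {
            switch (block) {
                case 0: // loop test: jump to the body or to the exit
                    block = (i <= n) ? 1 : 2;
                    break;
                case 1: // loop body, then jump back to the test
                    sum += i;
                    i++;
                    block = 0;
                    break;
                case 2: // exit block
                    return sum;
                default:
                    throw new IllegalStateException("unknown block " + block);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(sumStructured(5)); // prints 15
        System.out.println(sumWithJumps(5));  // prints 15
    }
}
```

Both methods compute the same result, but the second is longer, harder to read, and harder for a JavaScript engine to optimize, which is the sense in which the output would suffer.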

The second problem is that byte code does not include nested expressions. Each byte code does one operation on one set of inputs, and then execution proceeds to the next operation. If this is translated directly to JavaScript, then the result would be long chains of tiny statements like "a = foo(); b = bar(); c = a + b". It would be a lot of work to translate these back to efficient, nested expressions like "c = foo() + bar()". Until that work reached a certain sophistication, GWT's output would suffer.
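The difference can be sketched side by side. The example below is again in Java for consistency with the rest of the post; the method bodies of `foo` and `bar` are placeholders invented for the sketch. The `flattened` method mirrors what a naive, one-statement-per-operation translation would produce, while `nested` is what the source already says.

```java
public class NestingSketch {
    // Placeholder implementations, assumed only for this illustration.
    static int foo() { return 2; }
    static int bar() { return 3; }

    // Source-level form: one nested expression.
    // javac compiles this to roughly:
    //   invokestatic foo, invokestatic bar, iadd, ireturn
    static int nested() {
        return foo() + bar();
    }

    // What a direct, operation-by-operation translation of those
    // byte codes would look like: one temporary per step.
    static int flattened() {
        int a = foo();
        int b = bar();
        int c = a + b;
        return c;
    }

    public static void main(String[] args) {
        System.out.println(nested());    // prints 5
        System.out.println(flattened()); // prints 5
    }
}
```

Going from `nested` to `flattened` is mechanical. Going the other way requires proving that each temporary is used exactly once and that reordering does not change side effects, which is the extra work a byte code front end would have to take on.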

Finally, one must ask what the benefit would be. Certainly it would not be possible to run non-Java languages in this way. For the most part, such languages have at least one construct that doesn't map directly to byte code. In most cases, that mapping uses reflection in some way, and GWT doesn't even support reflection. To support such languages via byte code, GWT would have to reverse engineer what those byte codes came from.

Once you reach the point of reverse engineering the low-level format to infer what was intended in the high-level format, you have to ask whether you couldn't simply use the higher-level format to begin with.

2 comments:

Bob said...

I think source is probably the right abstraction. The sarien.net folks solved the "goto" problem using switches: http://sarien.net/about

Rob Heittman said...

Middle ground?

Something keeps bringing me back to the fun (actual fun!) I've had lately playing with the Parrot VM and PIR. (http://bit.ly/GqaQW) Writing a language parser that produces PIR, with the provided Parrot tools, is so easy as to border on sheer geeky joy. To me, this validates the idea of an intermediate representation language that captures important information, but isn't necessarily something you want to actually code in.

Of course the GWT case is very different from Parrot VM in terms of needs, but PIR has convinced me that an intermediate "source-ish" representation can be a really exciting thing. And I wonder if producing a language-X-to-JavaScript compiler couldn't someday be made as easy as it is with the Parrot compiler toolkit. You've probably seen the 5-minute LOLCODE compiler demo: http://bit.ly/iLdyw Aaand I fully expect to watch you repeat it for GWT next May. ;-)

As a little warm-up exercise to studying this in more detail, I've started trying to make a conceptual map of Java and Scala language constructs that actually impact optimization vs. ones that solely exist for programmer convenience. This is probably knowledge you have in cache already, but it's an eye opener for me so far! Wow. We programmers are lazy. Especially we Scala programmers.