Tuesday, April 17, 2007

Problems with Generics in Java

I was substituting labs for a introductory programming class taught in Java a few months ago, and I thought it would be nice to teach Generics to the students because it illustrates safe, reusable coding practice. I usually program in ML where polymorphic functions are written and used almost everywhere. However, I soon ran into various dark corners of type variables in Java. Such complications are not present in ML due to ML's absence of subtyping and mutable values.

The complications of Generics in Java are well-documented in this paper, Parametric polymorphism for Java: is there any hope in sight?, and this IBM article, Java theory and practice: Generics gotchas, but I'm summarizing the problems a bit differently, in a way you can see how bad assumptions break the system.

Invariance   It is almost always assumed that if T extends T', then Class<T> extends Class<T'>; this is called covariance. This is true for read-only (immutable) collections, but not true if you're allowed to insert elements. The original collection assuming elements have type T now contains elements of type T'. For Java, subtyping of generic classes really is invariant.

Static type erasure   In order to maintain backward compatibility such that the newer generic collection classes can be used by older non-generic code, Java treats class Class<T extends S> the same as Class whose definition has T replaced by S throughout (if T does not extend anything, then Java assumes T extends Object). The idea is to reduce parametric polymorphism to subtyping polymorphism. It works if Class<T> extends Class<S> holds, but since we already know subtyping should be invariant in Java, this is unsound.

Dynamic type erasure   Again, for the sake of backward compatibility, existing "compiled-once" bytecode must be able to use compiled generic classes, including being able to reflect on it. Reflection is a language device to get information about meta-properties of an object such as its class and its method names. Because older code has no concept of generics, Class<T> is just Class in the run-time. After erasure, Class<T> cannot reflect on T, so it doesn't know T's constructor (cannot construct T) nor can it construct an array of T. In order to work around this, the construction site of T must be given a reflection of class T.

Another consequence is that you lose type information when you serialize objects. After writing Collection<T> to a disk file, you no longer know what T is after reading it back. If you want to manually keep track of this information, your serialization must be given a witness of T in the same manner as constructing T and an array of T.

Solution?   What if we provide backward compatibility by keep the old non-generic collection classes, and put the new generic-based collection classes in a new package? The same goes for reflection API, where an old version reflects only on the type erased part of a generic class, but a new reflection API can inquire about type variables.

There may still be complication when old code tries to construct an instantiation of generic class using reflection, but I think this contains the problem much better.

No comments: