Wednesday, March 26, 2014

Really Understanding Python @staticmethod and @classmethod

Nowadays people don't write blog posts explaining subtle programming concepts. They write surprisingly well-versed posts on StackOverflow which I personally use a lot. But in this case, I noticed the answers to the question "What is the difference between @staticmethod and @classmethod in Python?" are unsatisfactory. That gave me the excuse to write another blog post.

Python has three types of methods. Here is an example illustrating their use in Python.
class MyClass(object):

    def some_instance_method(self, *args, **kwds):
        pass

    @classmethod
    def some_class_method(cls, *args, **kwds):
        pass

    @staticmethod
    def some_static_method(*args, **kwds):
        pass
Python is unique in that its instance methods take an explicit instance argument. You can call it any name, but the convention is to call it self. Many languages like C++ and Java have an implicit this instead. Syntax wise, the class method would take a class (which is also an object in Python) as its first argument, and a static method takes no explicit class nor object references.

Many people understand that static methods do not use per-instance object states, but they implement functionality grouped as part of the class. That's why you can forego the explicit references. But nobody seems to explain well what class methods are good for. Class states? That's a synonym to the dirty word "global variables" which everyone knows you should avoid. But class methods have legitimate uses.

The first legitimate use is to use a class method like a static method, but better.

Imagine you have a static method that wants to use another static method in the same class, or a static method that wants to use constants defined in the class. You could refer to the class by its full name like this:
class Smoothie(object):

    YOGURT = 1
    STRAWBERRY = 2
    BANANA = 4
    MANGO = 8

    @staticmethod
    def blend(*mixes):
        return sum(mixes) / len(mixes)

    @staticmethod
    def eternal_sunshine():
        return Smoothie.blend(
            Smoothie.YOGURT, Smoothie.STRAWBERRY,
            Smoothie.BANANA)

    @staticmethod
    def mango_lassi():
        return Smoothie.blend(
            Smoothie.YOGURT, Smoothie.MANGO)
But with class method, you could write instead:
class Smoothie(object):

    YOGURT = 1
    STRAWBERRY = 2
    BANANA = 4
    MANGO = 8

    @staticmethod
    def blend(*mixes):
        return sum(mixes) / len(mixes)

    @classmethod
    def eternal_sunshine(cls):
        return cls.blend(cls.YOGURT, cls.STRAWBERRY, cls.BANANA)

    @classmethod
    def mango_lassi(cls):
        return cls.blend(cls.YOGURT, cls.MANGO)
This results in a much more succinct code. Imagine the headache you would have saved renaming the class if you later decide that SimpleSmoothie is a better name for just Smoothie. Indeed, making code refactoring easier is one of the first use case I discovered with Python class methods.

This is what you get if you want to see what the class methods return:
>>> Smoothie.eternal_sunshine()
2
>>> Smoothie.mango_lassi()
4
There is a subtle performance benefit. If you refer to the full class name in a static method, the Python interpreter has to first look up the local variables, not finding it there, and then look up the global variables. With a class method, cls would be found in the local variables, so there is no need to look up the globals.

With this in mind, the second legitimate use is to allow a subclass to override static and class methods.
class BetterSmoothie(Smoothie):

    YOGURT = 'yogurt'
    STRAWBERRY = 'strawberry'
    BANANA = 'banana'
    MANGO = 'mango'

    @staticmethod
    def blend(*mixes):
        return ', '.join(mixes)
Now, without touching the class methods eternal_sunshine() and mango_lassi(), they will use the new blend() as well as the new constants, and produce a different kind of smoothie mix.
>>> BetterSmoothie.eternal_sunshine()
'yogurt, strawberry, banana'
>>> BetterSmoothie.mango_lassi()
'yogurt, mango'
If you use a class method for your class factory, then it will be able to construct derived classes too. It may or may not make sense depending on your class design.
class Xyzzy(object):

    def __init__(self, which_factory):
        self.which_factory = which_factory

    @classmethod
    def factory_foo(cls):
        return cls('factory_foo')

    @classmethod
    def factory_bar(cls):
        return cls('factory_bar')

class Yappy(Xyzzy):

    def __init__(self, which_factory):
        super(Yappy, self).__init__(which_factory)
        print which_factory
Notice that only the Yappy class prints out the factory type, but not Xyzzy. And this is what you get:
>>> Xyzzy.factory_foo()
<__main__.Xyzzy object at 0x1027d1050>
>>> Xyzzy.factory_bar()
<__main__.Xyzzy object at 0x1027c7e10>
>>> Yappy.factory_foo()
factory_foo
<__main__.Yappy object at 0x1027d10d0>
>>> Yappy.factory_bar()
factory_bar
<__main__.Yappy object at 0x1027c7e10>
With class method factories, even though the factories are defined in the base class, they are able to construct the derived classes as well.

Class method in Python is really a unique feature. Some use cases shown above demonstrate how to use class methods to refer to other static or class methods in the same class and how to allow a subclass to change behavior of the base class. The closest analog in C++ would be the curiously recurring template pattern.

Friday, March 21, 2014

Programming Language as a Data Mining Problem

I've been playing with busybox for an embedded system lately. Busybox is a collection of Unix tools like ls, grep, cp, etc. all compiled into a single monolithic statically linked binary. The size of the single monolithic binary is smaller than the sum of individually compiled binaries.

Obviously, static linking would duplicate shared library code among the binaries, so the individually compiled binaries would contain a lot of duplicate code. But even if the binaries were to be dynamically linked against one shared library so we avoid code duplication, the shared library typically contains unused code as well. Whoever builds the shared library cannot predict in advance what code will eventually be used by busybox. Static linking the monolithic busybox binary allows the linker to throw away any unused code, thus achieving the size reduction. Of course, they also cut some corners in the functionality, but it's harder to tell exactly how much the savings would be.

The objective is to put together a complete program with only code that the program actually uses, even though the programmer might write code that is unused for the problem at hand. The unused code might be written for solving other problems, but only the relevant ones must go into the program.

The C language has other peculiarities that keeps us further away from this objective. A single *.c source file is a single compilation unit, and it is compiled to a single *.o object file. The linker only considers whether a *.o is used or unused. It is either entirely included or entirely omitted. If the *.c contains multiple functions, some are unused, then the unused functions are included as well. Note that dead-code elimination only removes intra-function dead code. A more sophisticated compiler and linker would put each function into its own section in the *.o so the linker can cherry pick them. This is explained more in Link time dead code and data elimination using GNU toolchain.

In a high-level view, the compiler takes in source code and produces key-value pairs of blobs, and each blob may reference other blobs by their keys. The linker traverses the blobs as a dependency tree. This is a data mining problem.

If you think C is bad, many modern languages are worse. Java classes are all stored in a *.jar which is a zip file, so the distributed binary is completist. Java partially mitigates the problem by only loading classes on-demand, but loading a class requires scanning the zip file's central directory which is \( O(n) \).

If we treat programming language design as a data mining problem, it is immediately obvious that we want code retrieval to be as efficient as possible, preferably \( O(1) \). But there is more. The compiler reads in a source file (as if it's a punched card) and outputs a structural representation that allows the code to be intelligently surveyed. Human can use this database to learn about the architecture of the codebase. The compiler can use it to compile other code. The linker can use it to eliminate dead and duplicate code. The debugger can use it to show diagnostic.

When people implement a programming language, they don't usually realize that they're designing a database as well. If we do, we could simplify the model and allow the code to be queried more efficiently and more expressively.