Friday, March 21, 2014

Programming Language as a Data Mining Problem

I've been playing with busybox for an embedded system lately. Busybox is a collection of Unix tools like ls, grep, cp, etc., all compiled into a single monolithic, statically linked binary. The single monolithic binary is smaller than the combined size of the individually compiled binaries.
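Roughly, the single binary decides which tool to behave as from the name it was invoked under (typically a symlink to the busybox binary). Here is a minimal sketch of that multi-call pattern in C; this is not BusyBox's actual code, and the applet names and dispatch table are made up for illustration:

    #include <stdio.h>
    #include <string.h>

    /* Each "applet" is an ordinary main-like function. */
    static int echo_main(int argc, char **argv) {
        for (int i = 1; i < argc; i++)
            printf(i + 1 < argc ? "%s " : "%s", argv[i]);
        printf("\n");
        return 0;
    }

    static int true_main(int argc, char **argv) {
        (void)argc; (void)argv;
        return 0;
    }

    struct applet {
        const char *name;
        int (*main_fn)(int, char **);
    };

    static const struct applet applets[] = {
        { "echo", echo_main },
        { "true", true_main },
    };

    int main(int argc, char **argv) {
        /* Dispatch on the basename we were invoked as, e.g. a symlink named "echo". */
        const char *name = strrchr(argv[0], '/');
        name = name ? name + 1 : argv[0];
        for (size_t i = 0; i < sizeof(applets) / sizeof(applets[0]); i++)
            if (strcmp(name, applets[i].name) == 0)
                return applets[i].main_fn(argc, argv);
        fprintf(stderr, "%s: applet not found\n", name);
        return 1;
    }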

Obviously, static linking would duplicate shared library code among the binaries, so the individually compiled binaries would contain a lot of duplicate code. But even if the binaries were dynamically linked against one shared library to avoid the duplication, the shared library would typically contain unused code as well: whoever builds the shared library cannot predict in advance which parts busybox will eventually use. Statically linking the monolithic busybox binary allows the linker to throw away any unused code, which is where the size reduction comes from. Of course, busybox also cuts some corners in functionality, so it's hard to tell exactly how much of the savings comes from dead-code removal alone.

The objective is to put together a complete program containing only code that the program actually uses, even though the programmer may have written code that is not needed for the problem at hand. That unused code might exist to solve other problems, but only the relevant parts should go into the program.

The C language has other peculiarities that keep us further from this objective. A single *.c source file is a single compilation unit, and it is compiled to a single *.o object file. The linker only decides whether a whole *.o is used or unused: it is either entirely included or entirely omitted. If the *.c contains multiple functions and only some are used, the unused ones are included anyway. Note that the compiler's dead-code elimination only removes dead code within a function. A more sophisticated compiler and linker can put each function into its own section in the *.o so the linker can cherry-pick them. This is explained in more detail in Link time dead code and data elimination using GNU toolchain.
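Here is a sketch of the problem and of the section-based workaround; the file and function names are made up, and the flags are GCC/GNU ld options for function-level sections and section garbage collection:

    /* util.c: one compilation unit with two functions. */
    int used_helper(int x)   { return x + 1; }
    int unused_helper(int x) { return x * 2; }   /* never called anywhere */

    /* main.c: only one of them is used. */
    int used_helper(int x);
    int main(void) { return used_helper(41) == 42 ? 0 : 1; }

    /*
     * A plain build still carries unused_helper, because util.o is taken
     * as a whole:
     *
     *     cc -c util.c main.c
     *     cc -o prog main.o util.o
     *
     * With each function placed in its own section, the linker can
     * discard the unused one:
     *
     *     cc -ffunction-sections -fdata-sections -c util.c main.c
     *     cc -Wl,--gc-sections -o prog main.o util.o
     */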

In a high-level view, the compiler takes in source code and produces key-value pairs of blobs, and each blob may reference other blobs by their keys. The linker traverses the blobs as a dependency graph, keeping only what is reachable. This is a data mining problem.
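A toy sketch of this view (not any real linker's data structures; the blob names reuse the hypothetical helpers from the earlier example): each blob is a key plus the keys it references, and only blobs reachable from the entry point are kept.

    #include <stdio.h>
    #include <string.h>

    #define MAX_DEPS 4

    struct blob {
        const char *key;                /* e.g. a symbol name */
        const char *deps[MAX_DEPS];     /* keys of blobs it references */
        int kept;                       /* set during the reachability walk */
    };

    static struct blob blobs[] = {
        { "main",          { "used_helper", NULL }, 0 },
        { "used_helper",   { NULL },                0 },
        { "unused_helper", { NULL },                0 },
    };
    static const int nblobs = sizeof(blobs) / sizeof(blobs[0]);

    static struct blob *lookup(const char *key) {
        for (int i = 0; i < nblobs; i++)
            if (strcmp(blobs[i].key, key) == 0)
                return &blobs[i];
        return NULL;
    }

    static void keep(const char *key) {
        struct blob *b = lookup(key);
        if (!b || b->kept)
            return;                     /* unknown key, or already visited */
        b->kept = 1;
        for (int i = 0; i < MAX_DEPS && b->deps[i]; i++)
            keep(b->deps[i]);           /* pull in everything it references */
    }

    int main(void) {
        keep("main");                   /* start from the entry point */
        for (int i = 0; i < nblobs; i++)
            printf("%-14s %s\n", blobs[i].key, blobs[i].kept ? "kept" : "dropped");
        return 0;
    }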

If you think C is bad, many modern languages are worse. Java classes are all stored in a *.jar, which is a zip file, so the distributed binary contains every class whether or not it is used. Java partially mitigates the problem by loading classes only on demand, but locating a class requires scanning the zip file's central directory, which is \( O(n) \) in the number of entries.

If we treat programming language design as a data mining problem, it is immediately obvious that we want code retrieval to be as efficient as possible, preferably \( O(1) \). But there is more. The compiler reads in a source file (as if it were a punched card) and outputs a structural representation that allows the code to be intelligently surveyed. Humans can use this database to learn about the architecture of the codebase. The compiler can use it to compile other code. The linker can use it to eliminate dead and duplicate code. The debugger can use it to show diagnostics.
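As a rough sketch of what \( O(1) \) retrieval over such a database could look like, here is a toy hash-indexed table of definitions; the record fields, names, and locations are invented for illustration:

    #include <stdio.h>
    #include <string.h>

    #define NSLOTS 64                   /* small fixed table, for illustration only */

    struct def {
        const char *name;               /* key: the symbol name */
        const char *file;               /* where it is defined */
        int line;
    };

    static struct def table[NSLOTS];

    static unsigned hash(const char *s) {
        unsigned h = 5381;              /* djb2-style string hash */
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NSLOTS;
    }

    static void insert(struct def d) {
        unsigned i = hash(d.name);
        while (table[i].name)           /* linear probing on collision */
            i = (i + 1) % NSLOTS;
        table[i] = d;
    }

    static const struct def *query(const char *name) {
        unsigned i = hash(name);
        while (table[i].name) {
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
            i = (i + 1) % NSLOTS;
        }
        return NULL;                    /* not in the database */
    }

    int main(void) {
        insert((struct def){ "used_helper", "util.c", 2 });
        insert((struct def){ "main", "main.c", 3 });
        const struct def *d = query("used_helper");
        if (d)
            printf("%s is defined at %s:%d\n", d->name, d->file, d->line);
        return 0;
    }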

When people implement a programming language, they don't usually realize that they are designing a database as well. If we did, we could simplify the model and allow the code to be queried more efficiently and more expressively.

1 comment:

Kunle Oyedele said...

So, are you working on any such language or proof-of-concept that will have these properties? The idea sounds quite interesting to me! I'd be curious to see what your results are. I'd be even more interested in volunteering if you need help on such a project, if you don't mind the fact that I don't know much about data mining or compilers.