Tuesday, May 11, 2010

Cost estimate for large project

A programming project of mine keeps growing in lines of code. Although I write documentation as inline comments, I've reached the point where it would be nice to have a reference. I've always known that open source projects tend to use Doxygen, which automatically parses your code and extracts the documentation from it for use as a reference. But this is not what I'm going to talk about today, though I'm using Doxygen as an example.

As I read about Doxygen, I noticed that the author offers some form of commercial support: implementing features, fixing bugs, and providing help to answer questions. I think it would be more useful if he could offer the service of retrofitting a Doxygen configuration onto an existing project. In doing so, he would learn about large-project requirements that he could build into Doxygen in the future. Another reason is that some people might not want to divert their attention to figuring out how to configure Doxygen for their project, and they would be willing to pay for the service (this is another plausible open source business model).

Most projects follow a particular convention, so setting up Doxygen is only a one-time effort. The nice thing about a computer is that, once the initial setup is done, the cost of regenerating the reference from source code is minuscule, a cost that most project teams should be able to absorb. The expensive part is the setup, which requires human attention. This applies to any automation for a project, so I'm going to stop talking about Doxygen from now on.

Not all projects are created equal. It is simple to set up automation for a project that spans only a few thousand lines of code. A large project with millions of lines of code has a lot of hidden complexity. One source of the complexity is that collaborative effort accumulates bits of inconsistency, because everyone does things somewhat differently. Even the same person changes the way he or she does things over time. In big-O notation, the amount of inconsistency is O(n), where n is the number of lines of code. In order to resolve these inconsistencies, we need to compare each of the k inconsistencies against the remaining (k − 1) of them (so we can devise a workaround or a fix), which is k(k − 1)/2 comparisons; since k grows like n, the effort required is O(n²). This is the cost of setting up automation for a project.
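To make the counting argument concrete, here is a minimal sketch in Python (the sample sizes are just illustrative):

    def pairwise_comparisons(k):
        # each of the k inconsistencies is compared against the remaining
        # k - 1; counting every pair once gives k * (k - 1) / 2, i.e. O(k^2)
        return k * (k - 1) // 2

    # if inconsistency grows linearly with lines of code, a project that is
    # 10x bigger needs roughly 100x the reconciliation effort:
    print(pairwise_comparisons(1000))   # 499500
    print(pairwise_comparisons(10000))  # 49995000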

It's easy to see that it is most cost-effective to set up automation while the project is still small, since for a large project the cost estimate is proportional to the square of its size. A well-maintained project, however, is tree-structured, and (without proof) a divide-and-conquer strategy could be used to resolve the inconsistencies that hinder automation. The cost would then be closer to O(n log n).
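Here is a rough sketch of why divide and conquer lands near O(n log n), assuming reconciling two already-consistent halves can be done in linear time (this is just the merge-sort recurrence T(n) = 2T(n/2) + n):

    def tree_cost(n):
        # resolve inconsistencies within each half separately, then
        # reconcile the two halves against each other in linear time;
        # this recurrence solves to roughly n * log2(n)
        if n <= 1:
            return 0
        half = n // 2
        return tree_cost(half) + tree_cost(n - half) + n

    print(tree_cost(10000))    # about 10000 * log2(10000), i.e. ~133,000
    print(10000 * 9999 // 2)   # versus 49,995,000 pairwise comparisons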

I think this very simple asymptotic analysis explains why money spent on multi-million-dollar projects tends to vanish. A project involving n workers cannot allow each one of them to collaborate with every other directly, since the communication complexity alone would be O(n²). An ideal, hierarchical, tree-structured organization still incurs O(n log n) communication complexity in the time spent on meetings and planning. The cost of the actual work done is still O(n), so the total cost estimate is O(n log n) + O(n) = O(n log n). The cost estimate for a large project never scales linearly, and most directors underestimate the cost if they use a linear model.
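As a back-of-the-envelope check (illustrative cost units, not dollars), here is how far a linear estimate drifts from a linearithmic one as headcount grows:

    import math

    for n in (10, 100, 1000, 10000):
        work = n                        # the actual work: O(n)
        total = n * math.log2(n) + n    # tree-structured coordination + work
        print(n, work, round(total), round(total / work, 1))

The last column, the ratio between the two estimates, keeps growing with n, and that growing gap is exactly what a linear model fails to budget for.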

1 comment:

Likai Liu said...

Some clarification...

The linear model predicts that productivity and cost are related by a linear equation a*n + b, where n is the productivity, e.g. with the cost model "n * 2 + 20" we have:

productivity: 10, 20, 30, 40, 50
cost: 40, 60, 80, 100, 120
cost per unit: 4, 3, 2.67, 2.5, 2.4

Cost per unit simply divides the total cost by the number of units of productivity. This gives the illusion of economies of scale, where more productivity decreases the cost per unit. But I'm arguing that the linear model is overly naive. The model I propose is linear-logarithmic, so a cost model would be a*n*log(n) + b*n + c, e.g. with the cost model "2*n*log10(n) + 20" we have:

productivity: 10, 20, 30, 40, 50
cost: 40, 72, 109, 148, 190
cost per unit of productivity: 4, 3.6, 3.63, 3.7, 3.8
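Both tables can be reproduced with a few lines of Python, assuming the logarithm is base 10 and costs are rounded to the nearest whole number (which is what matches the figures above):

    import math

    def linear(n):
        return 2 * n + 20                   # cost model "n * 2 + 20"

    def linearithmic(n):
        return 2 * n * math.log10(n) + 20   # cost model "2*n*log10(n) + 20"

    for n in range(10, 51, 10):
        c = round(linearithmic(n))
        print(n, linear(n), round(linear(n) / n, 2), c, round(c / n, 2))

Analytically, the per-unit cost 2*log10(n) + 20/n reaches its minimum at n = 10*ln(10), about 23, so the sweet spot sits a little past 20.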

You can see that the sweet spot is when the productivity is around 20. The more productivity you try to achieve beyond that, the more the cost per unit goes up again.

I postulated this theory after observing how a lot of large projects seem to lose cost-effectiveness at scale, using things we learn in computer science. Computer science is less about computers and more about models of computation and how to predict performance. That's why it's kind of hard to explain what we do. But if you treat human beings as computers and society as a model of computation, then computer science can explain economics very well.