Thursday, May 20, 2010

Secrets (everyone should know) About Running Web Servers

Datacenters are expanding, and a lot of them host web servers. It is only imperative that web servers run fast and lean. An ineffective web server implementation can cost you money in many ways.
  • You will need beefier hardware; they are more expensive, draw more power, and generate a lot of heat (requires high cooling cost).
  • You will need more servers. In addition to all the above disadvantages, they take more room, which costs you more in rent.
Although network bandwidth charges can be expensive, they scale nicely with the amount of traffic you receive. There are still things you can do to halve the bandwidth use, like compressing HTTP content. Other things like combining packet output can shave a few bytes from packet headers, and it improves latency, but these are not the main focus for this article. I want to talk about is, what can make you run a web server that is unnecessarily costly, purely due to the software that runs on the server.

First of all, PHP is expensive to run (i.e. Mark Zuckerberg is an idiot). In the very beginning, there was a web server that can serve static files. Then there was a web server that can serve dynamic content by spawning child processes and delegating requests to them using the CGI protocol. You can write CGI in any language, including C, C++, Perl, and even shell script. This satisfies applications where Server Side Include does not suffice. But for certain applications, the ability to embed code directly inside HTML code makes the program look more elegant because it saves you all the print statements. This is where PHP came along. It was first invoked by Apache over CGI. Each time you make a request to a PHP page, the PHP interpreter has to parse and compile your PHP code from head to toe into bytecode before it can run.

As PHP gains more popularity because of its ease of rapid prototyping, the application size increases. It became impractical to do head to toe parsing each time someone requests a PHP page, which nowadays can easily involve tens of megabytes worth of code. MediaWiki is 44MB. Joomla is 33MB. Wordpress is 9MB. PhpWiki is 1.4MB. This is where mod_php comes along. It avoids having to recompile a massive amount of code by doing it only once, and caching the bytecode in memory. Putting all the bytecode in memory seems to be a small deal, but Apache distributes its workload over several child processes, each of them has an instance of mod_php that caches its own copy of the bytecode. If you increase the number of child processes, you run out of memory. If you limit memory usage, you cannot fully utilize CPU and I/O. This is explained better in Why mod_php is bad for performance. Unfortunately, memory is still the scarce resource. There is a limited amount of memory you can put into each server. You will have to run more servers; they take up more space and cost more money.

Facebook recently developed a version of PHP called HipHop that precompiles your code. This will alleviate the need for mod_php hack, which trades memory usage for performance, a poor tradeoff (maybe Mark Zuckerberg isn't as dumb as I thought).

Now, back to CGI. I did mention that you could write CGI in any language you want. However, some languages have a pretty bad run-time startup cost (Java is particularly notorious due to its Just In Time compilation). A solution is to keep those CGI child processes running, hence the FastCGI protocol is born. This means that your application runs more or less persistently along with the web server. Some languages like Java or Python simply has a web server written in that language, which requires no bridging. FastCGI is just a clumsy way to bridge together a particular server software with dynamic site logic written in a different language, due to administrative reason, e.g. at a web hosting provider. If you can run your own web server, don't bother with FastCGI. For the rest of the article, assume you write a persistent web server from scratch.

The fact that your web application runs persistently on the server has several implications. If you write your application in C or C++, and it has a memory leak, it will keep accumulating memory usage until it runs out of memory and dies. You can limit the effect of memory leak by restarting the server process after a predetermined number of requests, and keep several processes active so that some servers are available to handle requests at any given time.

If you use a garbage collected language like C# or Java, you don't have to worry about memory leak, but multi-threaded request handling is known to interfere with generational garbage collection by prematurely putting would-be garbage in the tenure heap, which causes costly full-collection to run more frequently. There are two solutions, either to increase nursery size, or separate the application into several processes, each running only a few threads. You can mix the two approaches.

This pretty much covers all I want to say. Here is a summary of secrets that I want everyone to know, about running web servers.
  • Stay away from mod_php (or mod_perl, mod_python, mod_ruby).
  • Stay away from web hosting providers that insist everyone should use mod_php (or other mod_*s) because they are encouraging their customers to be more expensive, and they are often more pricey. That applies to 99.9% of web hosting companies out there.
  • If you run your own web server, don't bother with FastCGI. If you can help it, use a web server that can be embedded inside your web application, written in your application's native language.
  • It still pays to write a web application in C/C++, especially where scalability is concerned.
Oh, and I forgot. If you can use client-side scripting, use it. Each individual user's computer might compare feebly to your server, but collectively their computing power is more than you can rent from a data center.

2 comments:

Unknown said...

To avoid memory leaks in C++ I advise to use Deleaker ( http://deleaker.com/ )

jpl said...

I think you got it wrong about mod_php. When you use APC, the bytecode cache is shared.

Apart from that, there is always the cost-of-development versus cost-of-operation argument to consider, which goes in favor of portable scripted languages with automatic memory management.