Even in the simplest form, bin packing is an NP-hard problem. An approximation algorithm that runs in polynomial time may not produce the optimal solution, as you can see in the following illustration.

The first fit algorithm sorts objects by size, and then place each object in the leftmost bin that has space that fits. Heuristic algorithm can try harder to find a better solution at greater computation cost. You can imagine that vector packing problem is even more complex, now that the algorithm has to consider several dimensions of constraints.

Researchers understand that vector packing problem has a very useful application in cloud computing, and this has inspired a wealth of research literature. In this application, bins represent machines in a computation cluster, and each bin has a dimension for CPU, RAM, disk space, and network bandwidth. The objects represent jobs to be scheduled in the computing cluster. A job has specifications for the amount of CPU, RAM, disk, and network it uses. We can pack several jobs into a machine as long as the sum of the resources used is under what the machine could provide.

Better packing means that unused machines can be powered off to save energy, while the online machines have higher utilization which helps them achieve better energy efficiency. But as the number of jobs and the number of machines grow, even a heuristic algorithm can take a long time trying to find a better packing solution. This can grind cluster scheduling to a halt.

But I want to call this a fallacious understanding of the real problem we want to solve. In practice, cluster scheduling differs from the vector packing problem in the following regards:

- Bin or vector packing is a static problem where, once the objects are placed, they are not meant to be moved or displaced with other objects. Scheduling is inherently a dynamic allocation problem where objects come and go (i.e. they have a lifetime). An optimal packing does not imply optimal scheduling.
- The scheduling as a dynamic allocation problem is usually tackled as an offline problem where all the objects and their lifetimes are known in advance. In practice, we do not know the lifetime of jobs, which may stay running for a long time, but a human operator can always restart jobs or start new ones.
- A cluster typically have machines of a variety of specifications that differ in CPU speed, number of cores, amount of RAM, etc. Bin or vector packing assumes that all bins are of the same size.

Furthermore, in real life, job specification may not accurately reflect its actual usage. A job that says it uses a certain amount of CPU or RAM might end up using more, or using less. Or the usage may be sporadic. If it leaves garbage files around, then its disk space usage would grow indefinitely.

To define the cluster scheduling problem better, we need to understand that some of the dimensions are soft constraints, while others are hard. In general, temporal constraints are soft, while spatial constraints are hard, but there are exceptions.

- CPU is a soft constraint. If it is over-subscribed, jobs will just take longer to compute.
- Network bandwidth is also a soft constraint. If over-subscribed, then jobs will just take longer to send or receive data.
- RAM is a spatial constraint, except that can actually be over-subscribed because of virtual memory. When memory pressure is high, rarely used memory pages are swapped to disk to meet new memory demand. This swapping can cause performance issues, but it now allows a temporal trade-off, which makes RAM a soft constraint.
- In some deployment scenario, disk seek time is a scarce resource, but it is still a soft constraint. If over-subscribed, then jobs will simply spend more time waiting for disk I/O.
- The only remaining hard constraint is really the disk space.

These observations simplify cluster scheduling to become more like a one-dimensional bin packing problem, where jobs are scheduled only based on the amount of disk space it expects to use. To meet the soft constraints, we simply monitor each machine to see if the CPU, RAM, network, or I/O is oversubscribed. If so, the machine monitor will try to move contending jobs to different machines. But over-subscription will not prevent us from tentatively scheduling new jobs to run on this machine since we don't know about their performance characteristics yet; they may not cause any more contention.

It's not always a good idea trying to solve a practical problem with a theoretical one, since a theoretical solution is likely a lot harder, and will still miss the mark. Vector packing problem is not a practical representation of the cluster scheduling problem.

## No comments:

Post a Comment