Then there is the CPU's power-saving mode, which reduces the CPU frequency when the machine is underutilized. Older CPUs increment the TSC once per clock cycle, so when the frequency drops, the TSC ticks more slowly; each increment does not represent a uniform amount of time (though the counter remains monotonic). On these CPUs we can get a reliable count of instruction cycles, but we cannot use the TSC to track wall-clock time.
Later CPUs increment the TSC at a constant rate tied to the maximum CPU frequency, so each TSC increment represents a uniform time interval. However, when benchmarking code, the TSC no longer measures actual clock cycles: code running at a reduced frequency takes more TSC counts to execute. A machine used for benchmarking should therefore have power-saving mode turned off. In particular, a laptop (with its more stringent power requirements) is probably not well suited for benchmarking.
There is also the issue of multi-processor machines. Each CPU may run at a slightly different frequency, and the CPUs are started at slightly different times during boot, so the TSCs of different CPUs cannot be easily correlated. Even if a benchmark involves only a single process, the operating system might migrate the process to another CPU, making the TSC readout non-deterministic. On a multi-processor machine, a benchmark should be run with CPU affinity fixed to one CPU, using sched_setaffinity(). Fixing the CPU affinity is also desirable for cache and TLB reasons.
I do not currently know of a reliable way to benchmark a multi-threaded program down to nanosecond precision. Here are some problems I can think of.
- sched_setaffinity() works on a pid_t (i.e. the whole process). In general, pthread_t and pid_t are not interchangeable.
- If we want to use one CPU as the master clock source, the inter-CPU communication and synchronization overhead will not be negligible. This should be verified later, but my guess is that the resolution would be on the order of hundreds of nanoseconds.