Sunday, July 10, 2016

C preprocessor hygienic macros

A hygienic macro is a macro that defines variables for its own use without accidentally confusing them with the variables defined by the user of the macro. By design, a C/C++ preprocessor macro performs only simple token substitution, which is not hygienic. That means anyone using a macro has to be mindful of the macro's definition and avoid variable names it already uses. It also means that such a macro cannot be arbitrarily nested in scope, since nesting can cause variable conflicts.
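To make the capture problem concrete, here is a minimal sketch. REPEAT is a hypothetical macro invented for this illustration, not part of the technique described in this post; it declares a loop counter named i, which silently shadows a variable of the same name in the calling code.

```cpp
#include <cassert>

// Hypothetical, deliberately unhygienic macro: it declares its own counter i.
#define REPEAT(n, stmt) for (int i = 0; i < (n); ++i) { stmt; }

int capture_demo() {
  int i = 10;  // The caller's variable happens to share the macro's name.
  int total = 0;
  // Intended: 10 + 10 + 10. Inside the loop, i is the macro's counter instead.
  REPEAT(3, total += i);
  assert(i == 10);  // The caller's i is untouched; it was only shadowed.
  return total;     // 0 + 1 + 2
}
```

The code compiles without complaint, which is what makes this kind of bug hard to spot.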

But hygienic macros are really useful for defining language extensions in the form of syntactic sugar. My motivating example is a foreach loop, which was not added to the C++ language until C++11. Although C++ now has a range-based for syntax, a macro like this is still useful for a plain C library that implements containers. There may be other use cases for macro hygiene, so I want to explain the technique I have been using for a number of years.

The range-based for syntax in C++11 iterates over all items in a container sequentially:
std::vector<int> xs;
for (int& x : xs) {
  // User code.
}
I defined a macro to accomplish something similar:
std::vector<int> xs;
foreach(int& x, xs) {
  // User code.
}
The idea is that this macro is syntactic sugar that expands to a for loop using an iterator:
std::vector<int> xs;
for (std::vector<int>::iterator it = xs.begin(); it != xs.end(); ++it) {
  int& x = *it;
  // User code.
}
But this syntactic sugar does not quite work: we need to define the variable int& x on the user's behalf. The user shouldn't be expected to add int& x = *it; to their block statement. For that, I used another for loop to introduce the variable, but run it only once.
std::vector<int> xs;
for (std::vector<int>::iterator it = xs.begin(); it != xs.end(); ++it)
  for (int& x = *it; true; break) {  // Conceptual: a bare break is not a valid expression here.
    // User code.
  }
What if the user code also contains a break statement? It will only break the inner for loop that introduces the variable binding but the outer for loop will continue. Unfortunately C/C++ has no labeled break. One solution is to use a variable to keep track of how the break happened.
std::vector<int> xs;
for (bool next = true; next; (next = false))
  for (std::vector<int>::iterator it = xs.begin(); next && it != xs.end(); ++it)
    for (int& x = *it; !(next = false); ({ next = true; break; })) {
      // User code.
    }
We guard the user code with a sentinel variable: it is set to false just before the user code runs and reset to true afterwards. If the user code breaks, the sentinel stays false, so the next iteration does not run. The ({ ... }) in the increment clause is a GCC statement expression, which lets the break statement appear where an expression is required. This works, but now the code introduces two implicit variables, it and next.
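Statement expressions are a GCC extension, and which loop a break inside one leaves when it sits in an increment clause may vary between compilers and versions. As a hedged alternative, here is a sketch of the same bookkeeping in standard C++11 only: the flag is cleared in the middle loop's condition and set again in the innermost increment, which a break in the user code skips. The names it and next come from the text above; sum_before_negative and check_sum_before_negative are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <vector>

// Sums the elements up to, but not including, the first negative one.
int sum_before_negative(const std::vector<int>& xs) {
  int sum = 0;
  for (bool next = true; next; next = false)
    // !(next = false) clears the flag just before the user code runs.
    for (auto it = xs.begin();
         next && it != xs.end() && !(next = false); ++it)
      // The increment sets the flag again; a break in the body skips it.
      for (int x = *it; !next; next = true) {
        // User code.
        if (x < 0) break;
        sum += x;
      }
  return sum;
}

// Usage: the break at -1 ends the whole traversal, so 4 is never added.
void check_sum_before_negative() {
  assert(sum_before_negative({1, 2, -1, 4}) == 3);
}
```

A continue in the user code still behaves as expected, because it jumps to the innermost increment and sets the flag before the inner loop ends.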

The simplest way to construct a hygienic macro is by giving it explicit variable names, so we can define the foreach macro like this:
#define FOR_EACH_NAMED(decl_var, container, it, next)                   \
  for (bool next = true; next; (next = false))                          \
    for (autovar(it, container.begin()); next && it != container.end(); ++it) \
      for (decl_var = *it; !(next = false); ({ next = true; break; }))
Instead of writing the explicit iterator type std::vector<int>::iterator, we use another macro, autovar(), to declare a variable whose type is inferred from an expression, using the GCC extension __typeof__(). In C++11 and later, auto or decltype serves the same purpose and should be preferred for C++ code.
#define autovar(var, exp) __typeof__(exp) var = (exp)
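For C++11 and later, a standard counterpart might look like the sketch below. AUTOVAR and first_element are hypothetical names for this illustration; the sketch uses auto rather than decltype because decltype of an lvalue expression yields a reference type, which is not always what is wanted.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Assumption: C++11 or later. auto deduces the type from the initializer.
#define AUTOVAR(var, exp) auto var = (exp)

int first_element(const std::vector<int>& xs) {
  AUTOVAR(it, xs.begin());
  // The iterator type is deduced without being spelled out at the call site.
  static_assert(
      std::is_same<decltype(it), std::vector<int>::const_iterator>::value,
      "it should have the container's iterator type");
  assert(it != xs.end());  // The caller must pass a non-empty vector.
  return *it;
}
```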
Now, the hygienic macro simply generates unique variable names and invokes the explicit macro.
#define foreach FOR_EACH

#define FOR_EACH(decl_var, container) \
  FOR_EACH_NAMED(decl_var, container, VAR(__it_), VAR(__next_))
Now the interesting part is the VAR() macro, which creates a unique variable name from a given prefix. Using a human-readable prefix makes the generated code somewhat easier to debug. We define a UNIQUE() macro to generate a unique token, and it must be concatenated with the prefix only after the unique token has been expanded, not before. The ## operator does not expand its operands, so we use a JOIN() macro that forwards to a second macro, JOIN_2(), and lets the arguments expand on the way.
// Creates a hygienic identifier with a given prefix name.
#define VAR(name) JOIN(name, UNIQUE())

// Concatenates expanded macro.
#define JOIN(a, b) JOIN_2(a, b)
#define JOIN_2(a, b) a ## b
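The effect of the extra level can be observed by turning the pasted identifier into a string. In this sketch, STR, STR_2, ANSWER, and check_join are hypothetical helpers added only for the demonstration; ANSWER stands in for UNIQUE().

```cpp
#include <cassert>
#include <cstring>

#define JOIN(a, b) JOIN_2(a, b)
#define JOIN_2(a, b) a ## b

// Hypothetical helpers for this illustration only.
#define STR(x) STR_2(x)  // Expands x first, then stringifies it.
#define STR_2(x) #x
#define ANSWER 42        // Stands in for UNIQUE().

// Pasting directly: ## sees the unexpanded token ANSWER.
const char* const pasted_directly = STR(JOIN_2(id_, ANSWER));
// Pasting through JOIN(): ANSWER is expanded to 42 before ## runs.
const char* const pasted_after_expansion = STR(JOIN(id_, ANSWER));

void check_join() {
  assert(std::strcmp(pasted_directly, "id_ANSWER") == 0);
  assert(std::strcmp(pasted_after_expansion, "id_42") == 0);
}
```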
As for the implementation of UNIQUE(), we use __COUNTER__ if it is present. It is a non-standard extension, available in GCC and also in Clang and MSVC. Otherwise the current line number __LINE__ suffices, as long as no two expansions share a line.
#ifdef __COUNTER__
#  define UNIQUE() __COUNTER__
#else
#  define UNIQUE() __LINE__
#endif
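One caveat of the __LINE__ fallback is that two expansions on the same source line receive the same name. The sketch below makes that visible by stringifying the generated identifiers. VAR_BY_LINE, VAR_BY_COUNTER, STR, and the two functions are hypothetical names for this illustration, and the sketch assumes the compiler supports __COUNTER__.

```cpp
#include <cassert>
#include <cstring>

#define JOIN(a, b) JOIN_2(a, b)
#define JOIN_2(a, b) a ## b
#define STR(x) STR_2(x)
#define STR_2(x) #x

// Hypothetical variants of VAR(), one per source of uniqueness.
#define VAR_BY_LINE(name) JOIN(name, __LINE__)
#define VAR_BY_COUNTER(name) JOIN(name, __COUNTER__)

// Each pair of expansions sits on a single source line.
static const char* const by_line[] = {STR(VAR_BY_LINE(v_)), STR(VAR_BY_LINE(v_))};
static const char* const by_counter[] = {STR(VAR_BY_COUNTER(v_)), STR(VAR_BY_COUNTER(v_))};

bool line_names_collide() {
  return std::strcmp(by_line[0], by_line[1]) == 0;
}

bool counter_names_collide() {
  return std::strcmp(by_counter[0], by_counter[1]) == 0;
}

void check_unique_sources() {
  assert(line_names_collide());      // Same line, same generated name.
  assert(!counter_names_collide());  // The counter advances on every use.
}
```

In practice the collision matters mostly when a hygienic macro is nested inside another macro's argument, since the whole invocation then counts as one line.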
In summary, a good way to construct a hygienic C preprocessor macro is to first define an explicit version, then an implicit hygienic version that uses a unique variable generator.
#define FOO_NAMED(arg0, arg1, /* ... */, var0, var1, /* ... */) /* ... */
#define FOO(arg0, arg1, /* ... */) \
  FOO_NAMED(arg0, arg1, /* ... */, VAR(__var0_), VAR(__var1_), /* ... */)
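Putting the pieces together, here is a self-contained sketch of the pattern applied to the foreach loop, with two loops nested and a break in the inner one. It makes several assumptions that differ from the snippets above: it requires C++11 (auto), it assumes __COUNTER__ is available, it tracks break with a flag cleared in the loop condition instead of a GCC statement expression, and it uses the prefixes fe_it_ and fe_next_ because identifiers containing a double underscore are reserved for the implementation. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Two-level paste so that __COUNTER__ expands before ## is applied.
#define JOIN(a, b) JOIN_2(a, b)
#define JOIN_2(a, b) a ## b

// Assumption: the compiler provides __COUNTER__.
#define VAR(name) JOIN(name, __COUNTER__)

// Explicit version: the caller supplies names for the iterator and the flag.
// The flag is cleared in the middle loop's condition and set again in the
// innermost increment, which is skipped when the user code breaks.
#define FOR_EACH_NAMED(decl_var, container, it, next)               \
  for (bool next = true; next; next = false)                        \
    for (auto it = (container).begin();                             \
         next && it != (container).end() && !(next = false); ++it)  \
      for (decl_var = *it; !next; next = true)

// Hygienic version: generated names cannot clash with user variables.
#define FOR_EACH(decl_var, container) \
  FOR_EACH_NAMED(decl_var, container, VAR(fe_it_), VAR(fe_next_))

// For each x in xs, counts the leading elements of ys that are smaller than x.
int count_leading_smaller(const std::vector<int>& xs,
                          const std::vector<int>& ys) {
  int count = 0;
  FOR_EACH(int x, xs) {
    FOR_EACH(int y, ys) {
      if (y >= x) break;  // Leaves only the inner FOR_EACH.
      ++count;
    }
  }
  return count;
}

// Usage: for every x, only the leading 0 in ys is counted before 5 stops the scan.
void check_for_each() {
  assert(count_leading_smaller({1, 2, 3}, {0, 5, 0}) == 3);
}
```

Nesting works here because each FOR_EACH expansion receives its own iterator and flag names, so the inner loop never touches the outer loop's bookkeeping.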