Optimizing Software in C++ by Agner Fog - HTML preview

PLEASE NOTE: This is an HTML preview only and some elements such as links or page numbers may be incorrect.
Download the book in PDF, ePub, Kindle for a complete version.

14 Specific optimization topics

 

14.1 Use lookup tables

Reading a value from a table of constants is very fast if the table is cached. Usually it takes only a few clock cycles to read from a table in the level-1 cache. We can take advantage of this fact by replacing a function call with a table lookup if the function has only a limited number of possible inputs.

Let's take the integer factorial function (n!) as an example. The only allowed inputs are the integers from 0 to 12. Higher inputs give overflow and negative inputs give infinity. A typical implementation of the factorial function looks like this:

// Example 14.1a

int factorial (int n) {          // n!

   int i, f = 1;

   for (i = 2; i <= n; i++) f *= i;

   return f;

}

This calculation requires n-1 multiplications, which can take quite a long time. It is more efficient to use a lookup table:

// Example 14.1b

int factorial (int n) {          // n!

   // Table of factorials:

   static const int FactorialTable[13] = {1, 1, 2, 6, 24, 120, 720,

      5040, 40320, 362880, 3628800, 39916800, 479001600};

   if ((unsigned int)n < 13) {   // Bounds checking (see page 134)

      return FactorialTable[n];  // Table lookup

   }

   else {

      return 0;                  // return 0 if out of range

   }

}

This implementation uses a lookup table instead of calculating the value each time the function is called. I have added a bounds check on n here because the consequence of n being out of range is possibly more serious when n is an array index than when n is a loop count. The method of bounds checking is explained below on page 134.

The table should be declared const in order to enable constant propagation and other optimizations. In most cases it is also recommended to declare the table static. This makes sure that the table is initialized when the program is loaded rather than each time the function is called. You may declare the function inline.

Replacing a function with a lookup table is advantageous in most cases where the number of possible inputs is limited and there are no cache problems. It is not advantageous to use a lookup table if you expect the table to be evicted from the cache between each call, and the time it takes to calculate the function is less than the time it takes to reload the value from memory plus the costs to other parts of the program of occupying a cache line.

Table lookup cannot be vectorized with the current instruction set. Do not use lookup tables if this prevents a faster vectorized code.

Storing something in static memory can cause caching problems because static data are likely to be scattered around at different memory addresses. If caching is a problem then it may be useful to copy the table from static memory to stack memory outside the innermost loop. This is done by declaring the table inside a function but outside the innermost loop and without the static keyword:

// Example 14.1c

void CriticalInnerFunction () {

   // Table of factorials:

   const int FactorialTable[13] = {1, 1, 2, 6, 24, 120, 720,

      5040, 40320, 362880, 3628800, 39916800, 479001600};

   ...

   int i, a, b;

   // Critical innermost loop:

   for (i = 0; i < 1000; i++) {

      ...

      a = FactorialTable[b];

      ...

   }

}

The FactorialTable in example 14.1c is copied from static memory to the stack when CriticalInnerFunction is called. The compiler will store the table in static memory and insert a code that copies the table to stack memory at the start of the function. Copying the table takes extra time, of course, but this is permissible when it is outside the critical innermost loop. The loop will use the copy of the table that is stored in stack memory which is contiguous with other local variables and therefore likely to be cached more efficiently than static memory.

If you don't care to calculate the table values by hand and insert the values in the code then you may of course make the program do the calculations. The time it takes to calculate the table is not significant as long as it is done only once. One may argue that it is safer to calculate the table in the program than to type in the values because a typo in a hand- written table may go undetected.

The principle of table lookup can be used in any situation where a program chooses between two or more constants. For example, a branch that chooses between two constants can be replaced by a table with two entries. This may improve the performance if the branch is poorly predictable. For example:

// Example 14.2a

float a;  int b;

a = (b == 0) ? 1.0f : 2.5f;

If we assume that b is always 0 or 1 and that the value is poorly predictable, then it is advantageous to replace the branch by a table lookup:

// Example 14.2b

float a;  int b;

static const float OneOrTwo5[2] = {1.0f, 2.5f};

a = OneOrTwo5[b & 1];

Here, I have AND'ed b with 1 for the sake of security.  b & 1 is certain to have no other value than 0 or 1 (see page 135). This extra check on b can be omitted, of course, if the value of b is guaranteed to be 0 or 1. Writing a = OneOrTwo5[b!=0]; will also work, although slightly less efficiently. This method is inefficient, however, when b is a float or double because all the compilers I have tested implement OneOrTwo5[b!=0] as OneOrTwo5[(b!=0) ? 1 : 0] in this case so we don't get rid of the branch. It may seem illogical that the compiler uses a different implementation when b is floating point. The reason is, I guess, that compiler makers assume that floating point comparisons are more predictable than integer comparisons. The solution  a = 1.0f + b * 1.5f; is efficient when b is a float, but not if b is an integer because the integer-to-float conversion takes more time than the table lookup.

Lookup tables are particular advantageous as replacements for switch statements because switch statements often suffer from poor branch prediction. Example:

// Example 14.3a

int n;

switch (n) {

case 0:

   printf("Alpha");  break;

case 1:

   printf("Beta");   break;

case 2:

   printf("Gamma");  break;

case 3:

   printf("Delta");  break;

}

This can be improved by using a lookup table:

// Example 14.3b

int n;

static char const * const Greek[4] = {

   "Alpha", "Beta", "Gamma", "Delta"

};

if ((unsigned int)n < 4) { // Check that index is not out of range

   printf(Greek[n]);

}

The declaration of the table has const twice because both the pointers and the texts they point to are constant.

14.2 Bounds checking

In C++, it is often necessary to check if an array index is out of range. This may typically look like this:

// Example 14.4a

const int size = 16; int i;

float list[size];

...

if (i < 0 || i >= size) {

   cout << "Error: Index out of range";

}

else {

   list[i] += 1.0f;

}

The two comparisons  i < 0 and  i >= size can be replaced by a single comparison:

// Example 14.4b

if ((unsigned int)i >= (unsigned int)size) {

   cout << "Error: Index out of range";

}

else {

   list[i] += 1.0f;

}

A possible negative value of  i will appear as a large positive number when  i is interpreted as an unsigned integer and this will trigger the error condition. Replacing two comparisons by one makes the code faster because testing a condition is relatively expensive, while the type conversion generates no extra code at all.

This method can be extended to the general case where you want to check whether an integer is within a certain interval:

// Example 14.5a

const int min = 100, max = 110;  int i;

...

if (i >= min && i <= max) { ...

can be changed to:

// Example 14.5b

if ((unsigned int)(i - min) <= (unsigned int)(max - min)) { ...

There is an even faster way to limit the range of an integer if the length of the desired interval is a power of 2. Example:

// Example 14.6

float list[16]; int i;

...

list[i & 15] += 1.0f;

This needs a little explanation. The value of i&15 is guaranteed to be in the interval from 0 to 15. If i is outside this interval, for example i = 18, then the & operator (bitwise and) will cut off the binary value of i to four bits, and the result will be 2. The result is the same as I modulo 16. This method is useful for preventing program errors in case the array index is out of range and we don't need an error message if it is. It is important to note that this method works only for powers of 2 (i.e. 2, 4, 8, 16, 32, 64, ...). We can make sure that a value is less than 2n and not negative by AND'ing it with 2n -1. The bitwise AND operation isolates the least significant n bits of the number and sets all other bits to zero.

14.3 Use bitwise operators for checking multiple values at once

The bitwise operators &, |, ^, ~, <<, >> can test or manipulate all the bits of an integer in one operation. For example, if each bit of a 32-bit integer has a particular meaning, then you can set multiple bits in a single operation using the | operator; you can clear or mask out multiple bits with the & operator; and you can toggle multiple bits with the ^ operator.

The & operator is also useful for testing multiple conditions in a single operation. Example:

// Example 14.7a. Testing multiple conditions

enum Weekdays {

   Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday

};

Weekdays Day;

if (Day == Tuesday || Day == Wednesday || Day == Friday) {

   DoThisThreeTimesAWeek();

}

The if statement in this example has three conditions which are implemented as three branches. They can be joined into a single branch if the constants Sunday, Monday, etc. are defined as powers of 2:

// Example 14.7b. Testing multiple conditions using &

enum Weekdays {

   Sunday = 1, Monday = 2, Tuesday = 4, Wednesday = 8,

   Thursday = 0x10, Friday = 0x20, Saturday = 0x40

};

Weekdays Day;

if (Day & (Tuesday | Wednesday | Friday)) {

   DoThisThreeTimesAWeek();

}

By giving each constant a value that is a power of 2 in example 14.7b, we are in fact using each bit in Day for signifying one of the weekdays. The maximum number of constants we can define in this way is equal to the number of bits in an integer, usually 32. In 64-bit systems we can use 64-bit integers with hardly any loss of efficiency.

The expression (Tuesday | Wednesday | Friday) in example 14.7b is converted by the compiler to the value 0x2C so that the if condition can be calculated by a single & operation, which is very fast. The result of the & operation will be non-zero, and therefore count as true, if any of the bits for Tuesday, Wednesday or