
8 Optimizations in the compiler


8.1 How compilers optimize

Modern compilers can make many modifications to the code in order to improve performance. It is useful for the programmer to know what the compiler can and cannot do. The following sections describe some of the compiler optimizations that are relevant for the programmer to know about.

Function inlining

The compiler can replace a function call by the body of the called function. Example:

// Example 8.1a
float square (float a) {
   return a * a;}

float parabola (float x) {
   return square(x) + 1.0f;}

The compiler may replace the call to square by the code inside square:

// Example 8.1b
float parabola (float x) {
   return x * x + 1.0f;}

The advantages of function inlining are:

The overhead of call, return, and parameter transfer is eliminated.

Code caching will be better because the code becomes contiguous.

The code becomes smaller if there is only one call to the inlined function.

Function inlining can open the possibility for other optimizations, as explained below.

The disadvantage of function inlining is that the code becomes bigger if there is more than one call to the inlined function and the function is big. The compiler is more likely to inline a function if it is small or if it is called from only one or a few places.

Constant folding and constant propagation

An expression or subexpression containing only constants will be replaced by the calculated result. Example:

// Example 8.2a
double a, b;
a = b + 2.0 / 3.0;

The compiler will replace this by

// Example 8.2b
a = b + 0.666666666666666666667;

This is actually quite convenient. It is easier to write 2.0/3.0 than to calculate the value and write it with many decimals. It is recommended to put parentheses around such a subexpression to make sure the compiler recognizes it as a subexpression. For example, b*2.0/3.0 will be calculated as (b*2.0)/3.0 rather than as b*(2.0/3.0) unless you put parentheses around the constant subexpression.
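
A minimal sketch of the difference (the function names are invented here for illustration):

// b * 2.0 / 3.0 parses as (b * 2.0) / 3.0, so the compiler must keep
// both the multiplication and the division: reassociating them could
// change the rounding of the result.
double without_parentheses (double b) {
   return b * 2.0 / 3.0;}

// With explicit parentheses, 2.0 / 3.0 is a constant subexpression
// that can be folded at compile time, leaving one multiplication.
double with_parentheses (double b) {
   return b * (2.0 / 3.0);}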

A constant can be propagated through a series of calculations:

// Example 8.3a
float parabola (float x) {
   return x * x + 1.0f;}

float a, b;
a = parabola (2.0f);
b = a + 1.0f;

The compiler may replace this by

// Example 8.3b
a = 5.0f;
b = 6.0f;

Constant folding and constant propagation are not possible if the expression contains a function which cannot be inlined or cannot be calculated at compile time. For example:

// Example 8.4
double a = sin(0.8);

The sin function is defined in a separate function library, and you cannot expect the compiler to be able to inline this function and calculate it at compile time. Some compilers are able to calculate the most common math functions such as sqrt and pow at compile time, but not the more complicated functions like sin.
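
If such a value is needed as a constant, a common workaround (not from the book; the variable names are invented here) is to compute it once at program startup, or to write the precomputed value as a literal that the compiler can fold like any other constant:

#include <cmath>

// Computed once when the program starts instead of at every use
// (still a runtime computation, but only a single one).
const double sin08 = std::sin(0.8);

// Alternatively, hard-code the precomputed value as a literal:
const double sin08_literal = 0.7173560908995228;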

Pointer elimination

A pointer or reference can be eliminated if the target pointed to is known. Example:

// Example 8.5a
void Plus2 (int * p) {
   *p = *p + 2;}

int a;
Plus2 (&a);

The compiler may replace this by

// Example 8.5b
a += 2;

Common subexpression elimination

If the same subexpression occurs more than once then the compiler may calculate it only once. Example:

// Example 8.6a
int a, b, c;
b = (a+1) * (a+1);
c = (a+1) / 4;

The compiler may replace this by

// Example 8.6b
int a, b, c, temp;
temp = a+1;
b = temp * temp;
c = temp / 4;

Register variables

The most commonly used variables are stored in registers (see page 27).

The maximum number of integer register variables is approximately six in 32-bit systems and fourteen in 64-bit systems.

The maximum number of floating point register variables is eight in 32-bit systems and sixteen in 64-bit systems. Some compilers have difficulties making floating point register variables in 32-bit systems unless the SSE2 (or later) instruction set is enabled.

The compiler will choose the variables that are used most for register variables. This includes pointers and references, which can be stored in integer registers. Typical candidates for register variables are temporary intermediates, loop counters, function parameters, pointers, references, 'this' pointer, common subexpressions, and induction variables (see below).

A variable cannot be stored in a register if its address is taken, i.e. if there is a pointer or reference to it. Therefore, you should avoid making any pointer or reference to a variable that could benefit from register storage.
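
A hedged sketch of the effect (the external logging function is hypothetical):

void LogProgress (const int * p);   // defined elsewhere; opaque to the compiler

int SumInRegister (const int * data, int n) {
   int sum = 0;                     // no address taken: sum can stay in a register
   for (int i = 0; i < n; i++) sum += data[i];
   return sum;}

int SumInMemory (const int * data, int n) {
   int sum = 0;
   for (int i = 0; i < n; i++) {
      sum += data[i];
      LogProgress(&sum);            // address escapes in every iteration, so sum
   }                                // must be stored to and reloaded from memory
   return sum;}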

Live range analysis

The live range of a variable is the range of code in which the variable is used. An optimizing compiler can use the same register for more than one variable if their live ranges do not overlap or if they are sure to have the same value. This is useful when the number of available registers is limited. Example:

// Example 8.7
int SomeFunction (int a, int x[]) {
   int b, c;
   x[0] = a;
   b = a + 1;
   x[1] = b;
   c = b + 1;
   return c;
}

In this example, a, b and c can share the same register because their live ranges do not overlap. If c = b + 1 is changed to c = a + 2 then a and b cannot use the same register because their live ranges now overlap.

Compilers do not normally use this principle for objects stored in memory. They will not use the same memory area for different objects even when their live ranges do not overlap. See page 89 for an example of how to make different objects share the same memory area.
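
As a brief preview, one way to do this is a union (the buffer names are invented here):

// Two large buffers with non-overlapping live ranges can be made to
// share the same memory area explicitly by declaring them in a union:
union {
   char ReadBuffer[4096];     // used only while reading input
   char WriteBuffer[4096];    // used only afterwards, while writing output
} Buffers;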

Join identical branches

The code can be made more compact by joining identical pieces of code. Example:

// Example 8.8a
double x, y, z;  bool b;
if (b) {
   y = sin(x);
   z = y + 1.;
}
else {
   y = cos(x);
   z = y + 1.;
}

The compiler may replace this by

// Example 8.8b
double x, y, z;  bool b;
if (b) {
   y = sin(x);
}
else {
   y = cos(x);
}
z = y + 1.;

Eliminate jumps

Jumps can be avoided by copying the code that is jumped to. Example:

// Example 8.9a
int SomeFunction (int a, bool b) {
   if (b) {
      a = a * 2;
   }
   else {
      a = a * 3;
   }
   return a + 1;
}

This code has a jump from a=a*2; to return a+1;. The compiler can eliminate this jump by copying the return statement:

// Example 8.9b
int SomeFunction (int a, bool b) {
   if (b) {
      a = a * 2;
      return a + 1;
   }
   else {
      a = a * 3;
      return a + 1;
   }
}

A branch can be eliminated if the condition can be reduced to always true or always false:

// Example 8.10a
if (true) {
   a = b;
}
else {
   a = c;
}

Can be reduced to:

// Example 8.10b
a = b;

A branch can also be eliminated if the condition is known from a previous branch. Example:

// Example 8.11a
int SomeFunction (int a, bool b) {
   if (b) {
      a = a * 2;
   }
   else {
      a = a * 3;
   }
   if (b) {
      return a + 1;
   }
   else {
      return a - 1;
   }
}

The compiler may reduce this to:

// Example 8.11b
int SomeFunction (int a, bool b) {
   if (b) {
      a = a * 2;
      return a + 1;
   }
   else {
      a = a * 3;
      return a - 1;
   }
}

Loop unrolling

Some compilers will unroll loops if a high degree of optimization is requested. See page 45. This may be advantageous if the loop body is very small or if it opens the possibility for further optimizations. Loops with a very low repeat count may be completely unrolled to avoid the loop overhead. Example:

// Example 8.12a
int i, a[2];
for (i = 0; i < 2; i++) a[i] = i+1;

The compiler may reduce this to:

// Example 8.12b
int a[2];
a[0] = 1; a[1] = 2;

Unfortunately, some compilers unroll too much. Excessive loop unrolling is not optimal because it takes too much space in the code cache and it fills up the loop buffer that some microprocessors have. In some cases it can be useful to turn off the loop unroll option in the compiler.
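
How to turn it off is compiler-specific; as a hedged sketch, GCC 8 and later accept a per-loop pragma (Clang has a similar #pragma clang loop unroll(disable)):

void Scale (float * a, const float * b, int n) {
   // An unroll factor of 1 tells GCC not to unroll the following loop.
   #pragma GCC unroll 1
   for (int i = 0; i < n; i++) {
      a[i] = b[i] * 0.5f;
   }
}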

Loop invariant code motion

A calculation may be moved out of a loop if it is independent of the loop counter. Example:

// Example 8.13a
int i, a[100], b;
for (i = 0; i < 100; i++) {
   a[i] = b * b + 1;
}

The compiler may change this to:

// Example 8.13b
int i, a[100], b, temp;
temp = b * b + 1;
for (i = 0; i < 100; i++) {
   a[i] = temp;
}

Induction variables

An expression that is a linear function of a loop counter can be calculated by adding a constant to the previous value. Example:

// Example 8.14a
int i, a[100];
for (i = 0; i < 100; i++) {
   a[i] = i * 9 + 3;
}

The compiler may avoid the multiplication by changing this to:

// Example 8.14b
int i, a[100], temp;
temp = 3;
for (i = 0; i < 100; i++) {
   a[i] = temp;
   temp += 9;
}

Induction variables are often used for calculating the addresses of array elements. Example:

// Example 8.15a
struct S1 {double a; double b;};
S1 list[100];  int i;
for (i = 0; i < 100; i++) {
   list[i].a = 1.0;
   list[i].b = 2.0;
}

In order to access an element in list, the compiler must calculate its address. The address of list[i] is equal to the address of the beginning of list plus i*sizeof(S1). This is a linear function of i which can be calculated by an induction variable. The compiler can use the same induction variable for accessing list[i].a and list[i].b. It can also eliminate i and use the induction variable as loop counter when the final value of the induction variable can be calculated in advance. This reduces the code to:

// Example 8.15b
struct S1 {double a; double b;};
S1 list[100], *temp;
for (temp = &list[0]; temp < &list[100]; temp++) {
   temp->a = 1.0;
   temp->b = 2.0;
}

The factor sizeof(S1) = 16 is actually hidden behind the C++ syntax in example 8.15b. The integer representation of &list[100] is (int)(&list[100]) = (int)(&list[0]) + 100*16, and temp++ actually adds 16 to the integer value of temp.

The compiler does not need induction variables to calculate the addresses of array elements of simple types, because the CPU has hardware support for calculating the address of an array element if it can be expressed as a base address plus a constant plus an index multiplied by a factor of 1, 2, 4 or 8, but no other factor. If a and b in example 8.15a were float instead of double, then sizeof(S1) would be 8 and no induction variable would be needed, because the CPU has hardware support for multiplying the index by 8.
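
For illustration, here is the float variant (a modification of example 8.15a, not code from the book), where the loop counter can be used directly:

struct S2 {float a; float b;};   // sizeof(S2) == 8
S2 list[100];  int i;
for (i = 0; i < 100; i++) {
   list[i].a = 1.0f;             // address = base + i*8
   list[i].b = 2.0f;             // address = base + 4 + i*8
}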

The compilers I have studied do not make induction variables for floating point expressions or more complex integer expressions. See page 80 for an example of how to use induction variables for calculating a polynomial.
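
As a preview of the method on page 80: a polynomial of degree two can be updated with two additions per step, because its first difference is a linear function of the counter and its second difference is a constant. A sketch with invented names, evaluating y = 2*x*x + 3*x + 1 with no multiplication inside the loop:

const int A = 2, B = 3, C = 1;
int y = C;                       // y(0)
int dy = A + B;                  // first difference y(1) - y(0)
const int ddy = 2 * A;           // constant second difference
int table[100];
for (int x = 0; x < 100; x++) {
   table[x] = y;
   y  += dy;                     // advance the polynomial
   dy += ddy;                    // advance its first difference
}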

Scheduling

A compiler may reorder instructions for the sake of parallel execution. Example:

// Example 8.16
float a, b, c, d, e, f, x, y;
x = a + b + c;
y = d + e + f;

The compiler may interleave the two formulas in this example so that a+b is calculated first, then d+e, then c is added to the first sum, then f is added to the second sum, then the first result is stored in x, and last the second result is stored in y. The purpose of this is to help the CPU do multiple calculations in parallel. Modern CPUs are actually able to reorder instructions without the help of the compiler (see page 103), but the compiler can make this reordering easier for the CPU.
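
Written out as source code, the interleaved order would look like this (the compiler does this at the instruction level; the temporaries are invented here for illustration):

float t1 = a + b;   // start the first sum
float t2 = d + e;   // start the second sum while a+b is still in flight
x = t1 + c;         // finish the first sum
y = t2 + f;         // finish the second sum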

Algebraic reductions

Most compilers can reduce simple algebraic expressions using the fundamental laws of algebra. For example, a compiler may change the expression -(-a) to a.