
9 Optimizing memory access


9.1 Caching of code and data

A cache is a proxy for the main memory in a computer. The proxy is smaller and closer to the CPU than the main memory and therefore it is accessed much faster. There may be two or three levels of cache for the sake of fastest possible access to the most used data.

The speed of CPUs is increasing faster than the speed of RAM memory. Efficient caching is therefore becoming more and more important.

9.2 Cache organization

It is useful to know how a cache is organized if you are making programs that have big data structures with non-sequential access and you want to prevent cache contention. You may skip this section if you are satisfied with more heuristic guidelines.

Most caches are organized into lines and sets. Let me explain this with an example. My example is a cache of 8 kB size with a line size of 64 bytes. Each line covers 64 consecutive bytes of memory. One kilobyte is 1024 bytes, so we can calculate that the number of lines is 8*1024/64 = 128. These lines are organized as 32 sets x 4 ways. This means that a particular memory address cannot be loaded into an arbitrary cache line. Only one of the 32 sets can be used, but any of the 4 lines in the set can be used.

We can calculate which set of cache lines to use for a particular memory address by the formula: (set) = ((memory address) / (line size)) % (number of sets). Here, / means integer division with truncation, and % means modulo. For example, if we want to read from memory address a = 10000, then we have (set) = (10000 / 64) % 32 = 28. This means that a must be read into one of the four cache lines in set number 28.

The calculation becomes easier if we use hexadecimal numbers because all the numbers are powers of 2. Using hexadecimal numbers, we have a = 0x2710 and (set) = (0x2710 / 0x40) % 0x20 = 0x1C. Reading or writing a variable from address 0x2710 will cause the cache to load the entire 64 or 0x40 bytes from address 0x2700 to 0x273F into one of the four cache lines in set 0x1C. If the program afterwards reads or writes to any other address in this range then the value is already in the cache so we don't have to wait for another memory access.

Assume that a program reads from address 0x2710 and later reads from addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. These addresses all belong to set number 0x1C. There are only four cache lines in each set. If the cache always chooses the least recently used cache line then the line that covered the address range from 0x2700 to 0x273F will be evicted when we read from 0x4700. Reading again from address 0x2710 will cause a cache miss. But if the program had read from different addresses with different set values then the line containing the address range from 0x2700 to 0x273F would still be in the cache.

The problem only occurs because the addresses are spaced a multiple of 0x800 apart. I will call this distance the critical stride. Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines. The critical stride can be calculated as (critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways).
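
These formulas are easy to check in code. The following is a minimal sketch that computes the set number and the critical stride for the example cache above; the constants describe the 8 kB example cache, not any particular CPU:

// Sketch: set number and critical stride for the example cache
#include <stdio.h>

int main() {
   const unsigned int lineSize  = 64;  // bytes per cache line
   const unsigned int numSets   = 32;  // number of sets
   const unsigned int numWays   = 4;   // lines per set
   const unsigned int cacheSize = lineSize * numSets * numWays; // 8192

   unsigned int address = 0x2710;      // example address (10000)
   unsigned int set = (address / lineSize) % numSets;  // = 0x1C = 28
   unsigned int criticalStride = cacheSize / numWays;  // = 0x800 = 2048

   printf("set = %u, critical stride = 0x%X\n", set, criticalStride);
   return 0;
}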

If a program contains many variables and objects that are scattered around in memory then there is a risk that several variables happen to be spaced by a multiple of the critical stride and cause contentions in the data cache. The same can happen in the code cache if there are many functions scattered around in program memory. If several functions that are used in the same part of the program happen to be spaced by a multiple of the critical stride then this can cause contentions in the code cache. The subsequent sections describe various ways to avoid these problems.

More details about how caches work can be found in Wikipedia under CPU cache (en.wikipedia.org/wiki/L2_cache).

The details of cache organization for different processors are covered in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

9.3 Functions that are used together should be stored together

The code cache works most efficiently if functions that are used near each other are also stored near each other in the code memory. The functions are usually stored in the order in which they appear in the source code. It is therefore a good idea to collect the functions that are used in the most critical part of the code together near each other in the same source file. Keep often used functions separate from seldom used functions, and put seldom used branches such as error handling at the end of a function or in a separate function.

Sometimes, functions are kept in different source files for the sake of modularity. For example, it may be convenient to have the member functions of a parent class in one source file and the derived class in another source file. If the member functions of parent class and derived class are called from the same critical part of the program then it can be advantageous to keep the two modules contiguous in program memory. This can be done by controlling the order in which the modules are linked together. The link order is usually the order in which the modules appear in the project window or makefile. You can check the order of functions in memory by requesting a map file from the linker. The map file tells the address of each function relative to the beginning of the program. The map file includes the addresses of library functions linked from static libraries (.lib or .a), but not dynamic libraries (.dll or .so). There is no easy way to control the addresses of dynamically linked library functions.

9.4 Variables that are used together should be stored together

Cache misses are very expensive. A variable can be fetched from the cache in just a few clock cycles, but it can take more than a hundred clock cycles to fetch the variable from RAM memory if it is not in the cache.

The cache works most efficiently if pieces of data that are used together are stored near each other in memory. Variables and objects should preferably be declared in the function in which they are used. Such variables and objects will be stored on the stack, which is very likely to be in the level-1 cache. The different kinds of variable storage are explained on page 26. Avoid global and static variables if possible, and avoid dynamic memory allocation (new and delete).

Object oriented programming can be an efficient way of keeping data together. Data members of a class (also called properties) are always stored together in an object of the class. Data members of a parent class and a derived class are stored together in an object of the derived class (see page 51).

The order in which data are stored can be important if you have big data structures. For example, if a program has two arrays, a and b, and the elements are accessed in the order a[0], b[0], a[1], b[1], ... then you may improve the performance by organizing the data as an array of structures:

// Example 9.1a
int Func(int);
const int size = 1024;
int a[size], b[size], i;
...
for (i = 0; i < size; i++) {
   b[i] = Func(a[i]);
}

The data in this example can be accessed sequentially in memory if organized as follows:

// Example 9.1b
int Func(int);
const int size = 1024;
struct Sab {int a; int b;};
Sab ab[size];
int i;
...
for (i = 0; i < size; i++) {
   ab[i].b = Func(ab[i].a);
}

There will be no extra overhead in the program code for making the structure in example 9.1b. On the contrary, the code becomes simpler because it needs to calculate element addresses for only one array rather than two.

Some compilers will use different memory spaces for different arrays even if they are never used at the same time. Example:

// Example 9.2a
void F1(int   x[]);
void F2(float x[]);

void F3(bool y) {
   if (y) {
      int a[1000];
      F1(a);
   }
   else {
      float b[1000];
      F2(b);
   }
}

Here it is possible to use the same memory area for a and b because their live ranges do not overlap. You can save a lot of cache space by joining a and b in a union:

// Example 9.2b
void F3(bool y) {
   union {
      int   a[1000];
      float b[1000];
   };
   if (y) {
      F1(a);
   }
   else {
      F2(b);
   }
}

Using a union is not a safe programming practice, of course, because you will get no warning from the compiler if the uses of a and b overlap. You should use this method only for big objects that take a lot of cache space. Putting simple variables into a union is not optimal because it prevents the use of register variables.

9.5 Alignment of data

A variable is accessed most efficiently if it is stored at a memory address which is divisible by the size of the variable. For example, a double takes 8 bytes of storage space. It should therefore preferably be stored at an address divisible by 8. The size should always be a power of 2. Objects bigger than 16 bytes should be stored at an address divisible by 16. You can generally assume that the compiler takes care of this alignment automatically.

The alignment of structure and class members may cause a waste of cache space, as explained in example 7.35 page 52.

You may choose to align large objects and arrays by the cache line size, which is typically 64 bytes. This makes sure that the beginning of the object or array coincides with the beginning of a cache line. Some compilers will align large static arrays automatically but you may as well specify the alignment explicitly by writing:

__declspec(align(64)) int BigArray[1024]; // Windows syntax

or

int BigArray[1024] __attribute__((aligned(64))); // Linux syntax
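
If your compiler supports C++11 or later, the standard alignas specifier can be used as a portable alternative to the two compiler-specific declarations above:

alignas(64) int BigArray[1024]; // Standard C++11 syntax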

See pages 95 and 120 for a discussion of aligning dynamically allocated memory.

9.6 Dynamic memory allocation

Objects and arrays can be allocated dynamically with new and delete, or malloc and free. This can be useful when the amount of memory required is not known at compile time. Four typical uses of dynamic memory allocation can be mentioned here:

  • A large array can be allocated dynamically when the size of the array is not known at compile time.
  • A variable number of objects can be allocated dynamically when the total number of objects is not known at compile time.
  • Text strings and similar objects of variable size can be allocated dynamically.
  • Arrays that are too large for the stack can be allocated dynamically.

The advantages of dynamic memory allocation are:

  • Gives a clearer program structure in some cases.
  • Does not allocate more space than needed. This makes data caching more efficient than when a fixed-size array is made very big in order to cover the worst case situation of the maximum possible memory requirement.
  • Useful when no reasonable upper limit to the required amount of memory space can be given in advance.

The disadvantages of dynamic memory allocation are:

  • The process of dynamic allocation and deallocation of memory takes much more time than other kinds of storage. See page 26.
  • The heap space becomes fragmented when objects of different sizes are allocated and deallocated in random order. This makes data caching inefficient.
  • An allocated array may need to be resized in the event that it becomes full. This may require that a new bigger memory block is allocated and the entire contents copied to the new block. Any pointers to data in the old block then become invalid.
  • The heap manager will start garbage collection when the heap space has become too fragmented. This garbage collection may start at unpredictable times and cause delays in the program flow at inconvenient times when a user is waiting for response.
  • It is the responsibility of the programmer to make sure that everything that has been allocated is also deallocated. Failure to do so will cause the heap to be filled up. This is a common programming error known as a memory leak.
  • It is the responsibility of the programmer to make sure that no object is accessed after it has been deallocated. Failure to do so is also a common programming error.
  • The allocated memory may not be optimally aligned. See page 120 for how to align dynamically allocated memory.
  • It is difficult for the compiler to optimize code that uses pointers because it cannot rule out aliasing (see page 78).
  • A matrix or multidimensional array is less efficient when the row length is not known at compile time because of the extra work needed for calculating row addresses at each access. The compiler may not be able to optimize this with induction variables.

It is important to weigh the advantages against the disadvantages when deciding whether to use dynamic memory allocation. There is no reason to use dynamic memory allocation when the size of an array or the number of objects is known at compile time or a reasonable upper limit can be defined.

The cost of dynamic memory allocation is negligible when the number of allocations is limited. Dynamic memory allocation can therefore be advantageous when a program has one or a few arrays of variable size. The alternative solution of making the arrays very big to cover the worst case situation is a waste of cache space. A situation where a program has several large arrays and where the size of each array is a multiple of the critical stride (see above, page 87) is likely to cause contentions in the data cache.

If the number of elements in an array grows during program execution then it is preferable to allocate the final array size right from the beginning rather than allocating more space step by step. In most systems, you cannot increase the size of a memory block that has already been allocated. If the final size cannot be predicted or if the prediction turns out to be too small, then it is necessary to allocate a new bigger memory block and copy the contents of the old memory block into the beginning of the new bigger memory block. This is inefficient, of course, and causes the heap space to become fragmented. An alternative is to keep multiple memory blocks, either in the form of a linked list or with an index of memory blocks. A method with multiple memory blocks makes the access to individual array elements more complicated and time consuming.
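
With a container such as std::vector, this advice amounts to reserving the expected final size before adding elements, so that no intermediate reallocation and copying takes place. A minimal sketch, where the expected count and the fill values are arbitrary examples:

// Sketch: reserve the final size up front instead of growing step by step
#include <vector>

void FillData(int expectedCount) {
   std::vector<float> data;
   data.reserve(expectedCount);       // one allocation of the final size
   for (int i = 0; i < expectedCount; i++) {
      data.push_back(i * 0.5f);       // no reallocation while growing
   }
}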

A collection of a variable number of objects is often implemented as a linked list. Each element in a linked list has its own memory block and a pointer to the next block. A linked list is less efficient than a linear array for the following reasons:

  • Each object is allocated separately. The allocation, deallocation and garbage collection take a considerable amount of time.
  • The objects are not stored contiguously in the memory. This makes data caching less efficient.
  • Extra memory space is used for the link pointers and for information stored by the heap manager for each allocated block.
  • Walking through a linked list takes more time than looping through a linear array. No link pointer can be loaded until the previous link pointer has been loaded. This makes a critical dependency chain which prevents out-of-order execution.

It is often more efficient to allocate one big block of memory for all the objects (memory pooling) than to allocate a small block for each object.
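
A minimal sketch of such a memory pool, assuming an upper limit on the number of objects is known in advance; the Node type and the capacity parameter are hypothetical examples:

// Sketch: one contiguous allocation for all objects instead of one per object
#include <cstddef>
#include <vector>

struct Node {                         // hypothetical example object
   int value;
   Node * next;
};

class NodePool {
public:
   explicit NodePool(std::size_t capacity) : storage(capacity), used(0) {}
   Node * Allocate() {                // returns 0 when the pool is exhausted
      return (used < storage.size()) ? &storage[used++] : 0;
   }
private:
   std::vector<Node> storage;         // all objects live in one block
   std::size_t used;
};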

A little-known alternative to using new and delete is to allocate variable-size arrays with alloca. This is a function that allocates memory on the stack rather than the heap. The space is automatically deallocated when returning from the function in which alloca was called. There is no need to deallocate the space explicitly when alloca is used. The advantages of alloca over new and delete or malloc and free are:

  • There is very little overhead to the allocation process because the microprocessor has hardware support for the stack.
  • The memory space never becomes fragmented thanks to the first-in-last-out nature of the stack.
  • Deallocation has no cost because it goes automatically when the function returns. There is no need for garbage collection.
  • The allocated memory is contiguous with other objects on the stack, which makes data caching very efficient.

The following example shows how to make a variable-size array with alloca:

// Example 9.3
#include <malloc.h>

void SomeFunction (int n) {
   if (n > 0) {
      // Make dynamic array of n floats:
      float * DynamicArray = (float *)alloca(n * sizeof(float));
      // (Some compilers use the name _alloca)
      // ... use DynamicArray here. The space is deallocated
      // automatically when SomeFunction returns.
   }
}