Optimizing Software in C++ by Agner Fog

10 Multithreading

 

The clock frequency of the CPU is limited by physical factors. When the clock frequency cannot be increased further, the way to improve the throughput of CPU-intensive programs is to do multiple things at the same time. There are three ways to do things in parallel:

  • Using multiple CPUs or multi-core CPUs, as described in this chapter.
  • Using the out-of-order capabilities of modern CPUs, as described in chapter 11.
  • Using the vector operations of modern CPUs, as described in chapter 12.

Most modern CPUs have two or more cores, and it can be expected that the number of cores will grow in the future. To use multiple CPUs or CPU cores, we need to divide the work into multiple threads. There are two main principles here: functional decomposition and data decomposition. Functional decomposition means that different threads are doing different kinds of jobs. For example, one thread can take care of the user interface, another thread can take care of communication with a remote database, and a third thread can do mathematical calculations. It is important that the user interface is not in the same thread as very time-consuming tasks because this would give annoyingly long and irregular response times. It is often useful to put time-consuming tasks into separate threads with low priority.
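A rough sketch of functional decomposition with standard C++ threads could look like this (the two task functions are placeholders invented for the illustration):

#include <thread>
#include <chrono>
#include <cstdio>

// Placeholder tasks, invented for this illustration
void longCalculation() {                       // time-consuming background work
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::puts("calculation finished");
}

void runUserInterface() {                      // stands in for the user-interface loop
    std::puts("the user interface stays responsive while the calculation runs");
}

int main() {
    // Functional decomposition: the calculation runs in its own thread,
    // so the thread handling the user interface is not blocked by it.
    std::thread worker(longCalculation);
    runUserInterface();
    worker.join();                             // wait for the worker before exiting
}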

In many cases, however, there is a single task that consumes most of the resources. In this case we need to split up the data into multiple blocks in order to utilize the multiple processor cores. Each thread should then handle its own block of data. This is data decomposition.
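A minimal sketch of data decomposition with standard C++ threads, where each thread sums its own block of an array and writes a single partial result (the data values and thread count are arbitrary):

#include <thread>
#include <vector>
#include <cstddef>
#include <cstdio>

int main() {
    std::vector<double> data(1000000, 1.0);              // arbitrary input data
    unsigned numThreads = std::thread::hardware_concurrency();
    if (numThreads == 0) numThreads = 4;                 // fallback if the count is unknown

    std::vector<double> partial(numThreads, 0.0);        // one result slot per thread
    std::vector<std::thread> threads;
    std::size_t blockSize = data.size() / numThreads;

    for (unsigned t = 0; t < numThreads; t++) {
        std::size_t begin = t * blockSize;
        std::size_t end = (t == numThreads - 1) ? data.size() : begin + blockSize;
        threads.emplace_back([&, t, begin, end]() {
            double sum = 0.0;                            // accumulate in a local variable,
            for (std::size_t i = begin; i < end; i++) sum += data[i];
            partial[t] = sum;                            // write shared memory only once
        });
    }
    for (auto & th : threads) th.join();

    double total = 0.0;
    for (double s : partial) total += s;                 // combine the partial results
    std::printf("total = %g\n", total);
}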

It is important to distinguish between coarse-grained parallelism and fine-grained parallelism when deciding whether it is advantageous to do things in parallel. Coarse-grained parallelism refers to the situation where a long sequence of operations can be carried out independently of other tasks that are running in parallel. Fine-grained parallelism is the situation where a task is divided into many small subtasks, but it is impossible to work for very long on a particular subtask before coordination with other subtasks is necessary.

Multithreading works more efficiently with coarse-grained parallelism than with fine-grained parallelism because communication and synchronization between the different cores is slow. If the granularity is too fine then it is not advantageous to split the tasks into multiple threads. Out-of-order execution (chapter 11) and vector operations (chapter 12) are more useful methods for exploiting fine-grained parallelism.

The way to use multiple CPU cores is to divide the work into multiple threads. The use of threads is discussed on page 60. In the case of data decomposition, the number of threads with the same priority should preferably not exceed the number of cores or logical processors available in the system. The number of logical processors available can be determined by a system call (e.g. GetProcessAffinityMask in Windows).
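Under C++11 the count can also be obtained portably with std::thread::hardware_concurrency. A small sketch that shows both this and the Windows call mentioned above (the Windows part is guarded so that the code compiles elsewhere as well):

#include <thread>
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

int main() {
    // Portable C++11 query; returns 0 if the value cannot be determined
    unsigned n = std::thread::hardware_concurrency();
    std::printf("logical processors (std::thread): %u\n", n);

#ifdef _WIN32
    // Windows: count the bits set in the process affinity mask
    DWORD_PTR processMask = 0, systemMask = 0;
    if (GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
        unsigned count = 0;
        for (; processMask != 0; processMask >>= 1) count += unsigned(processMask & 1);
        std::printf("logical processors (affinity mask): %u\n", count);
    }
#endif
}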

There are several ways to divide the workload between multiple CPU cores:

  • Define multiple threads and put an equal amount of work into each thread. This method works with all compilers.
  • Use automatic parallelization. The Gnu, Intel and PathScale compilers can automatically detect opportunities for parallelization in the code and divide it into multiple threads, but the compiler may not be able to find the optimal decomposition of the data.
  • Use OpenMP directives. OpenMP is a standard for specifying parallel processing in C++ and Fortran. These directives are supported by Microsoft, Intel, PathScale and Gnu compilers. See www.openmp.org and the compiler manual for details. A minimal example is shown after this list.
  • Use function libraries with internal multi-threading, e.g. Intel Math Kernel Library.
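As an illustration of the OpenMP approach, the single directive below divides the loop iterations between the available threads (compile with the compiler's OpenMP switch, e.g. -fopenmp for the Gnu compiler; the array contents are arbitrary):

#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> a(n), b(n, 1.0);

    // The directive asks the compiler to divide the iterations between
    // the available threads; each iteration is independent of the others.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * 2.0 + 1.0;
    }

    std::printf("maximum number of OpenMP threads: %d\n", omp_get_max_threads());
}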

The multiple CPU cores or logical processors usually share the same cache, at least at the last cache level, and in some cases even the same level-1 cache. The advantage of sharing the same cache is that communication between threads becomes faster and that threads can share the same code and read-only data. The disadvantage is that the cache will be filled up if the threads use different memory areas, and there will be cache contentions if the threads write to the same memory areas.

Data that are read-only can be shared between multiple threads, while data that are modified should be separate for each thread. It is not good to have two or more threads writing to the same cache line, because the threads will invalidate each other's caches and cause large delays. The easiest way to make thread-specific data is to declare it locally in the thread function so that it is stored on the stack. Each thread has its own stack. Alternatively, you may define a structure or class for containing thread-specific data and make one instance for each thread. This structure or class should be aligned by at least the cache line size in order to avoid multiple threads writing to the same cache line. The cache line size is typically 64 bytes on contemporary processors. The cache line size may possibly be more (128 or 256 bytes) on future processors.
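A sketch of such a thread-specific structure, assuming a 64-byte cache line and C++17 (which guarantees correctly aligned allocation for over-aligned types):

#include <thread>
#include <vector>
#include <cstdio>

// Each instance is aligned to, and padded to, 64 bytes so that two threads
// never write to the same cache line (no false sharing).
struct alignas(64) ThreadData {
    double sum;                                  // modified only by the owning thread
    long   count;
};

int main() {
    const unsigned numThreads = 4;               // arbitrary for the illustration
    std::vector<ThreadData> data(numThreads);    // one instance per thread
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < numThreads; t++) {
        threads.emplace_back([&data, t]() {
            for (int i = 0; i < 1000000; i++) {  // each thread touches only its own element
                data[t].sum += i;
                data[t].count++;
            }
        });
    }
    for (auto & th : threads) th.join();

    std::printf("sizeof(ThreadData) = %zu bytes\n", sizeof(ThreadData));
}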

There are various methods for communication and synchronization between threads, such as semaphores, mutexes and message systems. All of these methods are time-consuming. Therefore, the data and resources should be organized so that the amount of necessary communication between threads is minimized. For example, if multiple threads are sharing the same queue, list, database, or other data structure then you may consider whether it is possible to give each thread its own data structure and then merge the multiple data structures in the end when all threads have finished the time-consuming data processing.
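For example, a sketch where each thread appends results to its own container without any locking, and the containers are merged only after all threads have joined (the filtering condition is arbitrary):

#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const unsigned numThreads = 4;
    std::vector<std::vector<int> > results(numThreads);  // one container per thread
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < numThreads; t++) {
        threads.emplace_back([&results, t]() {
            for (int i = 0; i < 1000; i++) {
                if (i % int(t + 1) == 0)                  // arbitrary per-thread work
                    results[t].push_back(i);              // no synchronization needed
            }
        });
    }
    for (auto & th : threads) th.join();

    // Merge only after the time-consuming processing has finished
    std::vector<int> merged;
    for (auto & r : results)
        merged.insert(merged.end(), r.begin(), r.end());
    std::printf("merged %zu elements\n", merged.size());
}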

Running multiple threads on a system with only one logical processor is not an advantage if the threads are competing for the same resources. But it can be a good idea to put time-consuming calculations into a separate thread with lower priority than the user interface. It is also useful to put file access and network access in separate threads so that one thread can do calculations while another thread is waiting for response from a hard disk or network.

Various development tools for supporting multi-threaded software are available from Intel. See Intel Technology Journal Vol. 11, Iss. 4, 2007 (www.intel.com/technology/itj/).

10.1 Hyperthreading

Some versions of Intel microprocessors are able to run two threads in each core. For example, a Core i7 processor with four cores can run eight threads simultaneously. This processor has four physical processors but eight logical processors.

Hyperthreading is Intel's term for running multiple threads in the same processor core. Two threads running in the same core will always compete for the same resources, such as cache and execution units. If any of the shared resources are limiting factors for the performance then there is no advantage to using hyperthreading. On the contrary, each thread may run at less than half speed because of cache evictions and other resource conflicts. But if a large fraction of the time goes to cache misses, branch misprediction, or long dependency chains then each thread will run at more than half the single-thread speed. In this case there is an advantage to using hyperthreading, but the performance is not doubled. A thread that shares the resources of the core with another thread will always run slower than a thread that runs alone in the core.

It is often necessary to do experiments in order to determine whether it is advantageous to use hyperthreading or not in a particular application.

If hyperthreading turns out not to be advantageous, you can query certain operating system functions (e.g. GetLogicalProcessorInformation in Windows) to determine whether the processor has hyperthreading. If it does, you can avoid hyperthreading by using only the even-numbered logical processors (0, 2, 4, etc.). Older operating systems lack the necessary functions for distinguishing between the number of physical processors and the number of logical processors.
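A Windows-only sketch of restricting a process to the even-numbered logical processors, assuming that the two logical processors of each core are numbered consecutively (0/1, 2/3, ...):

#include <windows.h>
#include <cstdio>

int main() {
    DWORD_PTR processMask = 0, systemMask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask))
        return 1;

    // Keep only the even-numbered logical processors (0, 2, 4, ...)
    DWORD_PTR evenMask = 0;
    for (unsigned i = 0; i < sizeof(DWORD_PTR) * 8; i += 2)
        evenMask |= DWORD_PTR(1) << i;

    DWORD_PTR newMask = processMask & evenMask;
    if (newMask != 0 && SetProcessAffinityMask(GetCurrentProcess(), newMask))
        std::printf("process restricted to even-numbered logical processors\n");
    return 0;
}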

There is no way to tell a hyperthreading processor to give higher priority to one thread than another. Therefore, it can often happen that a low-priority thread steals resources from a higher-priority thread running in the same core. It is the responsibility of the operating system to avoid running two threads with widely different priorities in the same processor core. Unfortunately, contemporary operating systems do not always avoid this.

The Intel compiler is capable of making two threads where one thread is used for prefetching data for the other thread. However, in most cases you can rely on automatic prefetching so this feature is rarely needed.