Optimizing Subroutines in Assembly Language by Agner Fog

14 Multithreading


There is a limit to how much processing power you can get out of a single CPU. Therefore, many modern computer systems have multiple CPU cores. The way to make use of multiple CPU cores is to divide the computing job between multiple threads. The optimal number of threads is usually equal to the number of CPU cores. The workload should ideally be evenly divided between the threads.

Multithreading is useful where the code has an inherent parallelism that is coarse-grained. Multithreading cannot be used for fine-grained parallelism because there is a considerable overhead cost of starting and stopping threads and synchronizing the threads. Communication between threads can be quite costly, although these costs are reduced on newer processors. The computing job should preferably be divided into threads at the highest possible level. If the outermost loop can be parallelized, then it should be divided into one loop for each thread, each doing its share of the whole job.
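The division of an outermost loop into one chunk per thread can be sketched as follows in C++. This is a minimal illustration, not code from the manual; `process_range` and the doubling operation are hypothetical placeholders for the real job:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker: processes one contiguous slice of the data.
void process_range(std::vector<double>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        data[i] *= 2.0;  // placeholder for the real computation
    }
}

// Divide the outermost loop evenly among one thread per logical processor.
void process_parallel(std::vector<double>& data) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                              // fallback if the count is unknown
    std::size_t chunk = (data.size() + n - 1) / n;  // even division, rounded up
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, data.size());
        if (begin >= end) break;
        threads.emplace_back(process_range, std::ref(data), begin, end);
    }
    for (auto& th : threads) th.join();  // threads start and stop once: coarse-grained
}
```

Because each thread is created once, does a large slice of the work, and is joined once, the start/stop and synchronization overhead is paid only once per thread rather than once per iteration.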

Thread-local storage should preferably use the stack. Static thread-local memory is inefficient and should be avoided.
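The difference can be sketched in C++, where a `thread_local` variable corresponds to static thread-local memory and an ordinary local variable lives on the stack. The buffer sizes and function names here are illustrative, not from the manual:

```cpp
#include <cstddef>

// Static thread-local scratch buffer: each thread gets its own copy, but every
// access goes through the TLS mechanism (typically an extra indirection, e.g.
// an address computed from a segment register on x86 systems).
thread_local double scratch_static[64];

double sum_with_static_tls(const double* x, std::size_t n) {
    double s = 0;
    for (std::size_t i = 0; i < n && i < 64; ++i) {
        scratch_static[i] = x[i];
        s += scratch_static[i];
    }
    return s;
}

// Stack scratch buffer: thread-private by construction, addressed directly
// relative to the stack pointer -- the form the text recommends.
double sum_with_stack(const double* x, std::size_t n) {
    double scratch[64];
    double s = 0;
    for (std::size_t i = 0; i < n && i < 64; ++i) {
        scratch[i] = x[i];
        s += scratch[i];
    }
    return s;
}
```

Both functions are thread-safe because neither writes to memory shared between threads, but the stack version avoids the TLS addressing overhead entirely.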

14.1 Hyperthreading

Some Intel processors can run two threads in the same core. The P4E has one core capable of running two threads, the Atom has two cores capable of running two threads each, and the Nehalem and Sandy Bridge have several cores capable of running two threads each. Other processors, including processors from AMD and VIA, are able to run multiple threads as well, but only one thread in each core.

Hyperthreading is Intel's term for running multiple threads in the same processor core. Two threads running in the same core will always compete for the same resources, such as cache, instruction decoder and execution units. If any of the shared resources are limiting factors for the performance then there is no advantage to using hyperthreading. On the contrary, each thread may run at less than half speed because of cache evictions and other resource conflicts. But if a large fraction of the time goes to cache misses, branch misprediction, or long dependency chains then each thread will run at more than half the single-thread speed. In this case there is an advantage to using hyperthreading, but the performance is not doubled. A thread that shares the resources of the core with another thread will always run slower than a thread that runs alone in the core.

It may be necessary to do experiments in order to determine whether it is advantageous to use hyperthreading or not in a particular application.
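Such an experiment can be sketched as timing the same job with different thread counts. The workload below is a hypothetical stand-in for the real computation; a rigorous test would additionally pin threads to specific logical processors (e.g. with `pthread_setaffinity_np` on Linux or `SetThreadAffinityMask` on Windows) so that two threads demonstrably share one core:

```cpp
#include <chrono>
#include <thread>
#include <vector>

volatile double sink;  // prevents the compiler from optimizing the work away

// Hypothetical workload; substitute the computation under test.
void workload() {
    double s = 0;
    for (int i = 1; i < 2'000'000; ++i) s += 1.0 / i;
    sink = s;
}

// Run the workload on k threads at once and return the wall-clock time in
// seconds. Comparing the time at k = number of physical cores with the time
// at k = number of logical processors indicates whether hyperthreading helps
// or hurts for this particular workload.
double time_with_threads(unsigned k) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < k; ++i) threads.emplace_back(workload);
    for (auto& t : threads) t.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

If doubling the thread count beyond the number of physical cores gives little or no speedup, or slows the job down, the resource conflicts described above dominate and hyperthreading is not worthwhile for that workload.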

See manual 1: "Optimizing software in C++" for more details on multithreading and hyperthreading.