Optimizing Subroutines in Assembly Language by Agner Fog

18 Measuring performance

 

18.1 Testing speed

Many compilers have a profiler which makes it possible to measure how many times each function in a program is called and how much time it takes. This is very useful for finding any hot spot in the program. If a particular hot spot is taking a high proportion of the total execution time then this hot spot should be the target for your optimization efforts.

Many profilers are not very accurate, and certainly not accurate enough for fine-tuning a small part of the code. The most accurate way of testing the speed of a piece of code is to use the so-called time stamp counter. This is an internal 64-bit clock counter which can be read into EDX:EAX using the instruction RDTSC (read time stamp counter). The time stamp counter counts at the CPU clock frequency so that one count equals one clock cycle, which is the smallest relevant time unit. Some overclocked processors count at a slightly different frequency.

The time stamp counter is very useful because it can measure how many clock cycles a piece of code takes. Some processors are able to change the clock frequency in response to changing workloads, so the relationship between clock counts and real time can vary. However, when searching for bottlenecks in small pieces of code, it is more informative to measure clock cycles than to measure microseconds.

The resolution of the time measurement is equal to the value of the clock multiplier (e.g. 11) on Intel processors with SpeedStep technology. The resolution is one clock cycle on other processors. The processors with SpeedStep technology have a performance counter for unhalted core cycles which gives more accurate measurements than the time stamp counter. Privileged access is necessary to enable this counter.

On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.

The biggest problem when counting clock ticks is to avoid interrupts. Protected operating systems do not allow you to clear the interrupt flag, so you cannot avoid interrupts and task switches during the test. This makes test results inaccurate and irreproducible. There are several alternative ways to overcome this problem:

1.  Run the test code with a high priority to minimize the risk of interrupts and task switches.

2.  If the piece of code you are testing is not too long then you may repeat the test several times and assume that the lowest of the clock counts measured represents a situation where no interrupt has occurred.

3.  If the piece of code you are testing takes so long that interrupts are unavoidable then you may repeat the test many times and take the average of the clock count measurements.

4.  Make a virtual device driver to clear the interrupt flag.

5.  Use an operating system that allows clearing the interrupt flag (e.g. Windows 98 without network, in console mode).

6.  Start the test program in real mode using the old DOS operating system.

I have made a series of test programs that use methods 1, 2 and possibly 6. These programs are available at www.agner.org/optimize/testp.zip.

A further complication occurs on processors with multiple cores if a thread can jump from one core to another. The time stamp counters on different cores are not necessarily synchronized. This is not a problem when testing small pieces of code if the above precautions are taken to minimize interrupts. But it can be a problem when measuring longer time intervals. You may need to lock the process to a single CPU core, for example with the function SetProcessAffinityMask in Windows. This is discussed in the document "Game Timing and Multicore Processors", Microsoft 2005 http://msdn.microsoft.com/en-us/library/ee417693.aspx.

You will soon observe when measuring clock cycles that a piece of code always takes longer the first time it is executed, when it is not in the cache. Furthermore, it may take two or three iterations before the branch predictor has adapted to the code. The first measurement gives the execution time when code and data are not in the cache. The subsequent measurements give the execution time with the best possible caching.

The alignment effects on the PPro, P2 and P3 processors make time measurements very difficult on these processors. Assume that you have a piece of code and you want to make a change which you expect to make the code a few clocks faster. The modified code does not have exactly the same size as the original. This means that the code below the modification will be aligned differently and the instruction fetch blocks will be different. If instruction fetch and decoding is a bottleneck, which is often the case on these processors, then the change in the alignment may make the code several clock cycles faster or slower. The change in the alignment may actually have a larger effect on the clock count than the modification you have made. So you may be unable to verify whether the modification in itself makes the code faster or slower. It can be quite difficult to predict where each instruction fetch block begins, as explained in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Other processors do not have serious alignment problems. The P4 does, however, have a somewhat similar, though less severe, effect. This effect is caused by changes in the alignment of µops in the trace cache. The time it takes to jump to the least common (but predicted) branch after a conditional jump instruction may differ by up to two clock cycles on different alignments if trace cache delivery is the bottleneck. The alignment of µops in the trace cache lines is difficult to predict.

Most x86 processors also have a set of so-called performance monitor counters which can count events such as cache misses, misalignments, branch mispredictions, etc. These are very useful for diagnosing performance problems. The performance monitor counters are processor-specific. You need a different test setup for each type of CPU.

Details about the performance monitor counters can be found in Intel's "IA-32 Intel Architecture Software Developer’s Manual", vol. 3 and in AMD's "BIOS and Kernel Developer's Guide".

You need privileged access to set up the performance monitor counters. This is done most conveniently with a device driver. The test programs at www.agner.org/optimize/testp.zip give access to the performance monitor counters under 32-bit and 64-bit Windows and 16-bit real mode DOS. These test programs support the different kinds of performance monitor counters in most Intel, AMD and VIA processors.

Intel and AMD provide profilers that use the performance monitor counters of their respective processors. Intel's profiler is called VTune and AMD's profiler is called CodeAnalyst.

18.2 The pitfalls of unit-testing

If you want to find out which version of a function performs best then it is not sufficient to measure clock cycles in a small test program that calls the function many times. Such a test is unlikely to give a realistic measure of cache misses because the test program may use less memory than the cache size. For example, an isolated test may show that it is advantageous to unroll a loop, while a test with the function inserted in the final program shows a large number of cache misses when the loop is unrolled.

Therefore, it is important not only to count clock cycles when evaluating the performance of a function, but also to consider how much space it uses in the code cache, data cache and branch target buffer.

See the section named "The pitfalls of unit-testing" in manual 1: "Optimizing software in C++" for further discussion of this problem.