16 Testing speed

 

Testing the speed of a program is an important part of the optimization job. You have to check if your modifications actually increase the speed or not.

There are various profilers available which are useful for finding the hot spots and measuring the overall performance of a program. The profilers are not always accurate, however, and it may be difficult to measure exactly what you want when the program spends most of its time waiting for user input or reading disk files. See page 16 for a discussion of profiling.

When a hot spot has been identified, then it may be useful to isolate the hot spot and make measurements on this part of the code only. This can be done with the resolution of the CPU clock by using the so-called time stamp counter. This is a counter that measures the number of clock pulses since the CPU was started. The length of a clock cycle is the reciprocal of the clock frequency, as explained on page 16. If you read the value of the time stamp counter before and after executing a critical piece of code then you can get the exact time consumption as the difference between the two clock counts.

The value of the time stamp counter can be obtained with the function ReadTSC listed below in example 16.1. This code works only for compilers that support intrinsic functions. Alternatively, you can use the header file timingtest.h from www.agner.org/optimize/testp.zip or get ReadTSC as a library function from www.agner.org/optimize/asmlib.zip.

// Example 16.1

#include <intrin.h>        // Or #include <ia32intrin.h> etc.

long long ReadTSC() {      // Returns time stamp counter
   int dummy[4];           // For unused returns
   volatile int DontSkip;  // Volatile to prevent optimizing
   long long clock;        // Time
   __cpuid(dummy, 0);      // Serialize
   DontSkip = dummy[0];    // Prevent optimizing away cpuid
   clock = __rdtsc();      // Read time
   return clock;
}

You can use this function to measure the clock count before and after executing the critical code. A test setup may look like this:

// Example 16.2

#include <stdio.h>
#include <asmlib.h>        // Use ReadTSC() from library asmlib
                           // or from example 16.1
void CriticalFunction();   // This is the function we want to measure

...

const int NumberOfTests = 10;         // Number of times to test
int i; long long time1;
long long timediff[NumberOfTests];    // Time difference for each test

for (i = 0; i < NumberOfTests; i++) { // Repeat NumberOfTests times
   time1 = ReadTSC();                 // Time before test
   CriticalFunction();                // Critical function to test
   timediff[i] = ReadTSC() - time1;   // (time after) - (time before)
}

printf("\nResults:");                 // Print heading
for (i = 0; i < NumberOfTests; i++) { // Loop to print out results
   printf("\n%2i  %10I64i", i, timediff[i]);
}

The code in example 16.2 calls the critical function ten times and stores the time consumption of each run in an array. The values are then output after the test loop. The time that is measured in this way includes the time it takes to call the ReadTSC function. You can measure this overhead simply by removing the call to CriticalFunction in example 16.2 and then subtract it from the counts.
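
A possible sketch of this overhead measurement, following the same pattern as example 16.2 (the names overhead and t0 are illustrative):

long long t0, overhead[NumberOfTests];
for (int j = 0; j < NumberOfTests; j++) {
   t0 = ReadTSC();                    // Time before
   // (no critical function is called here)
   overhead[j] = ReadTSC() - t0;      // Overhead of ReadTSC alone
}
// Take the smallest of the overhead[] values and subtract it from
// each entry in timediff[] before interpreting the results.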

The measured time is interpreted in the following way. The first count is usually higher than the subsequent counts. This is the time it takes to execute CriticalFunction when code and data are not cached. The subsequent counts give the execution time when code and data are cached as well as possible. The first count and the subsequent counts represent the "worst case" and "best case" values. Which of these two values is closest to the truth depends on whether CriticalFunction is called once or multiple times in the final program and whether there is other code that uses the cache in between the calls to CriticalFunction. If your optimization effort is concentrated on CPU efficiency then it is the "best case" counts that you should look at to see if a certain modification is profitable. On the other hand, if your optimization effort is concentrated on arranging data in order to improve cache efficiency, then you may also look at the "worst case" counts. In any event, the clock counts should be multiplied by the clock period and by the number of times CriticalFunction is called in a typical application to calculate the time delay that the end user is likely to experience.
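
As an illustration of this calculation, the following sketch uses made-up numbers for the clock frequency, the measured count, and the number of calls:

// Illustration only: the clock frequency, the measured count, and the
// number of calls are made-up numbers.
double clockFrequency   = 3.0e9;               // A 3 GHz CPU core
double clockPeriod      = 1.0 / clockFrequency;  // Seconds per clock cycle
long long bestCaseCount = 1200;                // Measured "best case" clock count
long long callsPerRun   = 1000000;             // Calls in a typical program run
double userDelay = bestCaseCount * clockPeriod * callsPerRun;
// 1200 * (1/3e9) * 1e6 = 0.4 seconds experienced by the end user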

Occasionally, the clock counts that you measure are much higher than normal. This happens when a task switch occurs during execution of CriticalFunction. You cannot avoid this in a protected operating system, but you can reduce the problem by increasing the thread priority before the test and setting the priority back to normal afterwards.
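
Under Windows, for example, this can be done with the SetThreadPriority function, as in the following sketch; other operating systems have similar functions:

// Sketch for Windows: raise the thread priority around the test,
// then restore it afterwards.
#include <windows.h>

SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
// ... run the timing loop from example 16.2 here ...
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);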

The clock counts often fluctuate, and it may be difficult to get reproducible results. This is because modern CPUs can change their clock frequency dynamically depending on the workload. The clock frequency is increased when the workload is high and decreased when the workload is low in order to save power. There are various ways to get more reproducible time measurements:

  • Warm up the CPU by giving it some heavy work to do immediately before the code to test (a minimal warm-up sketch follows this list).
  • Disable power-save options in the BIOS setup.
  • On Intel CPUs: use the core clock cycle counter (see below).
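
A minimal warm-up sketch might look like this; the amount of dummy work is arbitrary and may need adjustment:

// Sketch: keep the CPU busy for a moment so that it reaches its maximum
// clock frequency before the timing loop starts. The iteration count is
// arbitrary.
volatile double warmup = 1.0;
for (int j = 0; j < 100000000; j++) {
   warmup *= 1.0000001;               // Dummy floating point work
}
// ... now run the timing loop from example 16.2 ...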

16.1 Using performance monitor counters

Many CPUs have a built-in test feature called performance monitor counters. A performance monitor counter is a counter inside the CPU which can be set up to count certain events, such as the number of machine instructions executed, cache misses, branch mispredictions, etc. These counters can be very useful for investigating performance problems. The performance monitor counters are CPU-specific and each CPU model has its own set of performance monitoring options.

CPU vendors offer profiling tools that fit their CPUs. Intel's profiler is called VTune; AMD's profiler is called CodeAnalyst. These profilers are useful for identifying hot spots in the code.

For my own research, I have developed a test tool for using the performance monitor counters. My test tool supports Intel, AMD, and VIA processors, and it is available from www.agner.org/optimize/testp.zip. This tool is not a profiler. It is not intended for finding hot spots, but for studying a piece of code once the hot spots have been identified.

My test tool can be used in two ways. The first way is to insert the piece of code to test in the test program itself and recompile it. I am using this for testing single assembly instructions or small sequences of code. The second way is to set up the performance monitor counters before running a program you want to optimize, and reading the performance counters inside your program before and after the piece of code you want to test. You can use the same principle as in example 16.2 above, but read one or more performance monitor counters instead of (or in addition to) the time stamp counter. The test tool can set up and enable one or more performance monitor counters in all the CPU cores and leave them enabled (there is one set of counters in each CPU core). The counters will stay on until you turn them off or until the computer is reset or goes into sleep mode. See the manual for my test tool for details (www.agner.org/optimize/testp.zip).

A particularly useful performance monitor counter in Intel processors is called core clock cycles. The core clock cycles counter counts clock cycles at the actual clock frequency that the CPU core is running at, rather than at the external clock frequency. This gives a measure that is almost independent of changes in the clock frequency. The core clock cycle counter is very useful when testing which version of a piece of code is fastest because you can avoid the problem that the clock frequency goes up and down.

Remember to insert a switch in your program to turn off the reading of the counters when you are not testing. Trying to read the performance monitor counters when they are disabled will crash the program.
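
On Intel processors, for example, the core clock cycle counter can be read with the RDPMC instruction once the counters have been enabled. The following sketch assumes that the fixed-function counters have already been enabled, e.g. by the test tool mentioned above; the index 0x40000001 selects the core clock cycle fixed-function counter, and the switch CountersEnabled is illustrative:

// Sketch: read the core clock cycle counter on an Intel CPU via RDPMC.
// This assumes that the performance monitor counters have already been
// enabled; otherwise the RDPMC instruction will crash the program.
#include <intrin.h>

bool CountersEnabled = false;          // Switch: set to true only while testing

long long ReadCoreClock() {
   if (!CountersEnabled) return 0;     // Do nothing when counters are disabled
   // 0x40000001 = fixed-function counter 1 = core clock cycles (Intel)
   return (long long)__rdpmc(0x40000001);
}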

16.2 The pitfalls of unit-testing

It is common practice to test each function or class separately in software development. This unit-testing is necessary for verifying the functionality of an optimized function, but unfortunately the unit-test does not give the full information about the performance of the function in terms of speed.

Assume that you have two different versions of a critical function and you want to find out which one is fastest. The typical way to test this is to make a small test program that calls the critical function many times with a suitable set of test data and measure how long it takes. The version that performs best under this unit-test may have a larger memory footprint than the alternative version. The penalty of cache misses is not seen in the unit-test because the total amount of code and data memory used by the test program is likely to be less than the cache size.

When the critical function is inserted in the final program, it is very likely that code cache and data cache are critical resources. Modern CPUs are so fast that the clock cycles spent on executing instructions are less likely to be a bottleneck than memory access and cache size. If this is the case then the optimal version of the critical function may be the one that takes longer time in the unit-test but has a smaller memory footprint.

If, for example, you want to find out whether it is advantageous to unroll a big loop, then you cannot rely on a unit-test without taking cache effects into account.
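
To illustrate the tradeoff, the following sketch shows a rolled and a four-fold unrolled version of the same summation loop; the unrolled version may win a unit-test while occupying more code cache:

// Illustrative sketch: the same summation, rolled and unrolled by four.
double SumRolled(const double * a, int n) {
   double s = 0;
   for (int i = 0; i < n; i++) s += a[i];  // One element per iteration
   return s;
}

double SumUnrolled4(const double * a, int n) {
   double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
   int i;
   for (i = 0; i + 4 <= n; i += 4) {       // Four elements per iteration
      s0 += a[i]; s1 += a[i+1]; s2 += a[i+2]; s3 += a[i+3];
   }
   for (; i < n; i++) s0 += a[i];          // Remaining elements
   return s0 + s1 + s2 + s3;
}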

You can calculate how much memory a function uses by looking at a link map or an assembly listing. Use the "generate map file" option for the linker. Both code cache use and data cache use can be critical. The branch target buffer is also a cache that can be critical. Therefore, the number of jumps, calls and branches in a function should also be considered.

A realistic performance test should include not only a single function or hot spot but also the innermost loop that includes the critical functions and hot spots. The test should be performed with a realistic set of data in order to get reliable results for branch mispredictions. The performance measurement should not include any part of the program that waits for user input. The time used for file input and output should be measured separately.

The fallacy of measuring performance by unit-testing is unfortunately very common. Even some of the best optimized function libraries available use excessive loop unrolling so that the memory footprint is unreasonably large.

16.3 Worst-case testing

Most performance tests are done under the best-case conditions. All disturbing influences are removed, all resources are sufficient, and the caching conditions are optimal. Best-case testing is useful because it gives more reliable and reproducible results. If you want to compare the performance of two different implementations of the same algorithm, then you need to remove all disturbing influences in order to make the measurements as accurate and reproducible as possible.

However, there are cases where it is more relevant to test the performance under the worst-case conditions. For example, if you want to make sure that the response time to user input never exceeds an acceptable limit, then you should test the response time under worst-case conditions.

Programs that produce streaming audio or video should also be tested under worst-case conditions in order to make sure that they always keep up with the expected real-time speed. Delays or glitches in the output are unacceptable.

Each of the following methods could possibly be relevant when testing worst-case performance:

  • The first time you activate a particular part of the program, it is likely to be slower than the subsequent times because of lazy loading of the code, cache misses and branch mispredictions.
  • Test the whole software package, including all runtime libraries and frameworks, rather than isolating a single function. Switch between different parts of the software package in order to increase the likelihood that certain parts of the program code are uncached or even swapped to disk.
  • Software that relies on network resources and servers should be tested on a network with heavy traffic and a server in full use rather than a dedicated test server.
  • Use large data files and databases with lots of data.
  • Use an old computer with a slow CPU, an insufficient amount of RAM, a lot of irrelevant software installed, a lot of background processes running, and a slow and fragmented hard disk.
  • Test with different brands of CPUs, different types of graphics cards, etc.
  • Use an antivirus program that scans all files on access.
  • Run multiple processes or threads simultaneously. If the microprocessor has hyperthreading, then try to run two threads in the same processor core.
  • Try to allocate more RAM than there is, in order to force the swapping of memory to disk.
  • Provoke cache misses by making the code size or data used in the innermost loop bigger than the cache size. Alternatively, you may actively invalidate the cache. The operating system may have a function for this purpose, or you may use the _mm_clflush intrinsic function (see the sketch after this list).
  • Provoke branch mispredictions by making the data more random than normal.
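
For the active cache invalidation mentioned above, a sketch using _mm_clflush might look like this, assuming a cache line size of 64 bytes, which is typical for current x86 processors:

// Sketch: flush a buffer out of all cache levels before a worst-case test.
// A cache line size of 64 bytes is assumed.
#include <stddef.h>
#include <emmintrin.h>                  // _mm_clflush, _mm_mfence (SSE2)

void FlushFromCache(const char * p, size_t size) {
   const size_t lineSize = 64;          // Assumed cache line size
   for (size_t offset = 0; offset < size; offset += lineSize) {
      _mm_clflush(p + offset);          // Evict one cache line
   }
   _mm_mfence();                        // Make sure the flushes are done
}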