
12 Loops

 

The critical hot spot of a CPU-intensive program is almost always a loop. The clock frequency of modern computers is so high that even the most time-consuming instructions, cache misses and inefficient exceptions are finished in a fraction of a microsecond. The delay caused by inefficient code is only noticeable when repeated millions of times. Such high repeat counts are likely to be seen only in the innermost level of a series of nested loops. The things that can be done to improve the performance of loops are discussed in this chapter.

12.1 Minimize loop overhead

The loop overhead is the instructions needed for jumping back to the beginning of the loop and for determining when to exit the loop. Optimizing these instructions is a fairly general technique that can be applied in many situations. Optimizing the loop overhead is not needed, however, if some other bottleneck is limiting the speed. See page 93ff for a description of possible bottlenecks in a loop.

A typical loop in C++ may look like this:

// Example 12.1a. Typical for-loop in C++

for (int i = 0; i < n; i++) {

   // (loop body)

}

Without optimization, the assembly implementation will look like this:

; Example 12.1b. For-loop, not optimized

    mov  ecx, n        ; Load n

    xor  eax, eax      ; i = 0

LoopTop:

    cmp  eax, ecx      ; i < n

    jge  LoopEnd       ; Exit when i >= n

    ; (loop body)      ; Loop body goes here

    add  eax, 1        ; i++

    jmp  LoopTop       ; Jump back

LoopEnd:

It may be unwise to use the inc instruction for adding 1 to the loop counter. The inc instruction has a problem with writing to only part of the flags register, which makes it less efficient than the add instruction on Intel P4 processors and may cause false dependences on other processors.

The most important problem with the loop in example 12.1b is that there are two jump instructions. We can eliminate one jump from the loop by putting the branch instruction in the end:

; Example 12.1c. For-loop with branch in the end

    mov  ecx, n        ; Load n

    test ecx, ecx      ; Test n

    jng  LoopEnd       ; Skip if n <= 0

    xor  eax, eax      ; i = 0

LoopTop:

    ; (loop body)      ; Loop body goes here

    add  eax, 1        ; i++

    cmp  eax, ecx      ; i < n

    jl   LoopTop       ; Loop back if i < n

LoopEnd:

Now we have got rid of the unconditional jump instruction in the loop by putting the loop exit branch in the end. We have to put an extra check before the loop to cover the case where the loop should run zero times. Without this check, the loop would run one time when n <= 0.
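
The same transformation can be sketched at the C++ level. The following is only an illustration of the principle, not one of the book's numbered examples:

// Sketch: C++ equivalent of example 12.1c, with the loop branch at the end
// and a separate check for the case where the loop should run zero times
if (n > 0) {
   int i = 0;
   do {
      // (loop body)
      i++;
   } while (i < n);
}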

The method of putting the loop exit branch in the end can even be used for complicated loop structures that have the exit condition in the middle. Consider a C++ loop with the exit condition in the middle:

// Example 12.2a. C++ loop with exit in the middle

int i = 0;

while (true) {

   FuncA();                   // Upper loop body

   if (++i >= n) break;       // Exit condition here

   FuncB();                   // Lower loop body

}

This can be implemented in assembly by reorganizing the loop so that the exit comes in the end and the entry comes in the middle:

; Example 12.2b. Assembly loop with entry in the middle

    xor  eax, eax            ; i = 0

    jmp  LoopEntry           ; Jump into middle of loop

LoopTop:

    call FuncB               ; Lower loop body comes first

LoopEntry:

    call FuncA               ; Upper loop body comes last

    add  eax, 1

    cmp  eax, n

    jl   LoopTop             ; Loop back if i < n (exit condition in the end)
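
In C++ terms, the rotation corresponds to peeling the first execution of the upper loop body off before the loop so that the test comes at the end. This is only an illustrative sketch, not one of the book's numbered examples:

// Sketch: C++ equivalent of the reorganized loop in example 12.2b
int i = 0;
FuncA();                      // Upper loop body, first iteration
while (++i < n) {
   FuncB();                   // Lower loop body
   FuncA();                   // Upper loop body of the next iteration
}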

The cmp instruction in examples 12.1c and 12.2b can be eliminated if the counter ends at zero, because we can rely on the sub instruction that modifies the counter for setting the zero flag. This can be done by counting down from n to zero rather than counting up from zero to n:

; Example 12.3. Loop with counting down

    mov  ecx, n        ; Load n

    test ecx, ecx      ; Test n

    jng  LoopEnd       ; Skip if n <= 0

LoopTop:

    ; (loop body)      ; Loop body goes here

    sub  ecx, 1        ; n--

    jnz  LoopTop       ; Loop back if not zero

LoopEnd:

Now the loop overhead is reduced to just two instructions, which is the best possible. The jecxz and loop instructions should be avoided because they are less efficient.

The solution in example 12.3 is not good if i is needed inside the loop, for example as an array index. The following example adds 1 to all elements in an integer array:

; Example 12.4a. For-loop with array

    mov  ecx, n        ; Load n

    test ecx, ecx      ; Test n

    jng  LoopEnd       ; Skip if n <= 0

    xor  eax, eax      ; i = 0

    lea  esi, Array    ; Pointer to an array

LoopTop:

    ; Loop body: Add 1 to all elements in Array:

    add  dword ptr [esi+4*eax], 1

    add  eax, 1        ; i++

    cmp  eax, ecx      ; i < n

    jl   LoopTop       ; Loop back if i < n

LoopEnd:

The address of the start of the array is in esi and the index in eax. The index is multiplied by 4 in the address calculation because the size of each array element is 4 bytes.

It is possible to modify example 12.4a to make it count down rather than up, but the data cache is optimized for accessing data forwards, not backwards. Therefore it is better to count up through negative values from -n to zero. This is possible by making a pointer to the end of the array and using a negative offset from the end of the array:

; Example 12.4b. For-loop with negative index from end of array

    mov  ecx, n            ; Load n

    lea  esi, Array[4*ecx] ; Point to end of array

    neg  ecx               ; i = -n

    jnl  LoopEnd           ; Skip if (-n) >= 0

LoopTop:

    ; Loop body: Add 1 to all elements in Array:

    add  dword ptr [esi+4*ecx], 1

    add  ecx, 1            ; i++

    js   LoopTop           ; Loop back if i < 0

LoopEnd:

A slightly different solution is to multiply n by 4 and count from -4*n to zero:

; Example 12.4c. For-loop with neg. index multiplied by element size

    mov  ecx, n            ; Load n

    shl  ecx, 2            ; n * 4

    jng  LoopEnd           ; Skip if (4*n) <= 0

    lea  esi, Array[ecx]   ; Point to end of array

    neg  ecx               ; i = -4*n

LoopTop:

    ; Loop body: Add 1 to all elements in Array:

    add  dword ptr [esi+ecx], 1

    add  ecx, 4            ; i += 4

    js   LoopTop           ; Loop back if i < 0

LoopEnd:

There is no difference in speed between examples 12.4b and 12.4c, but the latter method is useful if the size of the array elements is not 1, 2, 4 or 8 so that we cannot use scaled index addressing.
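
To illustrate in C++ why the byte-offset method is useful, consider an array whose elements are 12 bytes each, a size that scaled index addressing cannot encode. The following sketch is not from the book; the names Element and AddOneToA are invented for illustration:

// Sketch: negative byte offset counted up to zero, as in example 12.4c
struct Element {int a; int b; int c;};        // 12 bytes, not 1, 2, 4 or 8
void AddOneToA(Element * Array, int n) {
   char * End = (char*)(Array + n);           // pointer to end of array
   for (int i = -n * (int)sizeof(Element); i < 0; i += (int)sizeof(Element)) {
      ((Element*)(End + i))->a += 1;          // element at negative offset i
   }
}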

The loop counter should always be an integer because floating point compare instructions are less efficient than integer compare instructions, even with the SSE2 instruction set. Some loops have a floating point exit condition by nature. A well-known example is a Taylor expansion which is ended when the terms become sufficiently small. It may be useful in such cases to always use the worst-case maximum repeat count (see page 107). The cost of repeating the loop more times than necessary is often less than what is saved by avoiding the calculation of the exit condition in the loop and using an integer counter as loop control. A further advantage of this method is that the loop exit branch becomes more predictable. Even when the loop exit branch is mispredicted, the cost of the misprediction is smaller with an integer counter because the integer instructions are likely to be executed way ahead of the slower floating point instructions so that the misprediction can be resolved much earlier.
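
As an illustration of using a worst-case repeat count, the following sketch evaluates a Taylor series for e^x with a fixed integer count instead of testing the size of the terms. It is not from the book; the count of 20 terms is an assumption that is sufficient for |x| <= 1:

// Sketch: fixed worst-case repeat count with an integer loop counter
double Exp(double x) {              // Taylor series for e^x, assuming |x| <= 1
   double term = 1.0, sum = 1.0;    // term number 0 of the series
   for (int i = 1; i <= 20; i++) {  // 20 is an assumed worst-case count
      term *= x / i;                // term number i = x^i / i!
      sum  += term;
   }
   return sum;                      // no floating point compare in the loop exit
}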

12.2 Induction variables

If the floating point value of the loop counter is needed for some other purpose then it is better to have both an integer counter and a floating point counter. Consider the example of a loop that makes a sine table:

// Example 12.5a. C++ loop to make sine table

double Table[100];  int i;

for (i = 0; i < 100; i++) Table[i] = sin(0.01 * i);

This can be changed to:

// Example 12.5b. C++ loop to make sine table

double Table[100], x;  int i;

for (i = 0, x = 0.; i < 100; i++, x += 0.01) Table[i] = sin(x);

Here we have an integer counter i for the loop control and array index, and a floating point counter x for replacing 0.01*i. The calculation of x by adding 0.01 to the previous value is much faster than converting i to floating point and multiplying by 0.01. The assembly implementation looks like this:

; Example 12.5c. Assembly loop to make sine table

.data

align 8

M0_01   dq   0.01          ; Define constant 0.01

_Table  dq   100 dup (?)   ; Define Table

 

.code

    xor  eax, eax          ; i = 0

    fld  M0_01             ; Load constant 0.01

    fldz                   ; x = 0.

LoopTop:

    fld  st(0)             ; Copy x

    fsin                   ; sin(x)

    fstp _Table[eax*8]     ; Table[i] = sin(x)

    fadd st(0), st(1)      ; x += 0.01

    add  eax, 1            ; i++

    cmp  eax, 100          ; i < n

    jb   LoopTop           ; Loop

    fcompp                 ; Discard st(0) and st(1)

There is no need to optimize the loop overhead in this case because the speed is limited by the floating point calculations. Another possible optimization is to use a library function that calculates two or four sine values at a time in an XMM register. Such functions can be found in libraries from Intel and AMD, see manual 1: "Optimizing software in C++".

The method of calculating x in example 12.5c by adding 0.01 to the previous value rather than multiplying i by 0.01 is commonly known as using an induction variable. Induction variables are useful whenever it is easier to calculate some value in a loop from the previous value than to calculate it from the loop counter. An induction variable can be integer or floating point. The most common use of induction variables is for calculating array addresses, as in example 12.4c, but induction variables can also be used for more complex expressions. Any function that is an n'th degree polynomial of the loop counter can be calculated with just n additions and no multiplications by the use of n induction variables. See manual 1: "Optimizing software in C++" for an example.
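
As an illustration (not one of the book's examples), the 2nd degree polynomial i*i can be calculated with two induction variables and two additions per iteration, using the identity (i+1)*(i+1) = i*i + 2*i + 1:

// Sketch: 2nd degree polynomial i*i computed with two induction variables
double Table[100];
double y  = 0.;                     // y  = i*i,      starting at i = 0
double dy = 1.;                     // dy = 2*i + 1,  starting at i = 0
for (int i = 0; i < 100; i++) {
   Table[i] = y;                    // Table[i] = i*i, without any multiplication
   y  += dy;                        // y  becomes (i+1)*(i+1)
   dy += 2.;                        // dy becomes 2*(i+1) + 1
}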

The calculation of a function of the loop counter by the induction variable method makes a loop-carried dependency chain. If this chain is too long then it may be advantageous to calculate each value from a value that is two or more iterations back, as in the Taylor expansion example on page 110.
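
For example, if the chain x += 0.01 in example 12.5b were a bottleneck, each value of x could be calculated from the value two iterations back, giving two shorter interleaved dependency chains. This is only a sketch of the principle:

// Sketch: induction variable split into two interleaved chains
double Table[100];
double x0 = 0.00, x1 = 0.01;        // x values for even and odd i
for (int i = 0; i < 100; i += 2) {
   Table[i]   = sin(x0);
   Table[i+1] = sin(x1);
   x0 += 0.02;                      // depends only on x0, two iterations back
   x1 += 0.02;                      // depends only on x1, two iterations back
}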

12.3 Move loop-invariant code

The calculation of any expression that doesn't change inside the loop should be moved out of the loop.
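
For example (an illustrative sketch, not from the book; the names scale and halfScale are invented):

// Before: scale * 0.5 is recalculated in every iteration
for (int i = 0; i < n; i++) Table[i] *= scale * 0.5;

// After: the loop-invariant expression is moved out of the loop
double halfScale = scale * 0.5;
for (int i = 0; i < n; i++) Table[i] *= halfScale;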

The same applies to if-else branches with a condition that doesn't change inside the loop. Such a branch can be avoided by making two loops, one for each branch, and making a branch that chooses between the two loops.
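
A sketch of this transformation (not from the book; the name addOffset is invented):

// Before: the condition addOffset never changes inside the loop
for (int i = 0; i < n; i++) {
   if (addOffset) Table[i] += offset; else Table[i] *= 2;
}

// After: one branch outside the loop chooses between two branch-free loops
if (addOffset) {
   for (int i = 0; i < n; i++) Table[i] += offset;
}
else {
   for (int i = 0; i < n; i++) Table[i] *= 2;
}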

12.4 Find the bottlenecks

There are a number of possible bottlenecks that can limit the performance of a loop. The most likely bottlenecks are:

  • Cache misses and cache contentions
  • Loop-carried dependency chains
  • Instruction fetching
  • Instruction decoding
  • Instruction retirement
  • Register read stalls
  • Execution port throughput
  • Execution unit throughput
  • Suboptimal reordering and scheduling of µops
  • Branch mispredictions
  • Floating point exceptions and subnormal operands

If one particular bottleneck is limiting the performance then it doesn't help to optimize anything else. It is therefore very important to analyze the loop carefully in order to identify which bottleneck is the limiting factor. Only when the narrowest bottleneck has successfully been removed does it make sense to look at the next bottleneck. The various bottlenecks are discussed in the following sections. All these details are processor-specific. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for explanation of the processor-specific details mentioned below.

Sometimes a lot of experimentation is needed in order to find and fix the limiting bottleneck. It is important to remember that a solution found by experimentation is CPU-specific and unlikely to be optimal on CPUs with a different microarchitecture.

12.5 Instruction fetch, decoding and retirement in a loop

The details about how to optimize instruction fetching, decoding, retirement, etc. are processor-specific, as mentioned on page 63.

If code fetching is a bottleneck then it is necessary to align the loop entry by 16 and reduce instruction sizes in order to minimize the number of 16-byte boundaries in the loop.

If instruction decoding is a bottleneck then it is necessary to observe the CPU-specific rules about decoding patterns. Avoid complex instructions such as LOOP, JECXZ, LODS, STOS, etc.

Register read stalls can occur in some Intel processors. If register read stalls are likely to occur then it can be necessary to reorder instructions or to refresh registers that are read multiple times but not written to inside the loop.

Jumps and calls inside the loop should be avoided because they delay code fetching. Subroutines that are called inside the loop should be inlined if possible.

Branches inside the loop should be avoided if possible because they interfere with the prediction of the loop exit branch. However, branches should not be replaced by conditional moves if this increases the length of a loop-carried dependency chain.
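
As an illustration of the last point (a sketch only, not from the book): when searching for the maximum of an array, a compiler may turn the branchless form into a conditional move, which puts the update of the maximum on a loop-carried dependency chain, whereas a well-predicted branch lets that chain be broken:

// Sketch: branch versus branchless update of a running maximum
int maxA = Array[0];
for (int i = 1; i < n; i++) {
   if (Array[i] > maxA) maxA = Array[i];        // branch; cheap if well predicted
}
int maxB = Array[0];
for (int i = 1; i < n; i++) {
   maxB = Array[i] > maxB ? Array[i] : maxB;    // likely a conditional move;
}                                               // maxB becomes a loop-carried dependency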

If instruction retirement is a bottleneck then it may be preferred to make the total number of µops inside the loop a multiple of the retirement rate (4 for Core2, 3 for all other processors). Some experimentation is needed in this case.

12.6 Distribute µops evenly between execution units

Manual 4: "Instruction tables" contains tables of how many µops each instruction generates and which execution port each µop goes to. This information is CPU-specific, of course. It is necessary to calculate how many µops the loop generates in total and how many of these µops go to each execution port and each execution unit.