Optimizing Subroutines in Assembly Language by Agner Fog - HTML preview

PLEASE NOTE: This is an HTML preview only and some elements such as links or page numbers may be incorrect.
Download the book in PDF, ePub, Kindle for a complete version.

16 Problematic Instructions

 

16.1 LEA instruction (all processors)

The LEA instruction is useful for many purposes because it can do a shift operation, two additions, and a move in just one instruction. Example:

; Example 16.1a, LEA instruction

lea eax, [ebx+8*ecx-1000]

is much faster than

; Example 16.1b

mov  eax, ecx

shl  eax, 3

add  eax, ebx

sub  eax, 1000

A typical use of LEA is as a three-register addition: lea eax,[ebx+ecx]. The LEA instruction can also be used for doing an addition or shift without changing the flags.

The processors have no documented addressing mode with a scaled index register and nothing else. Therefore, an instruction like lea eax,[ebx*2] is actually coded as lea eax,[ebx*2+00000000H] with an immediate displacement of 4 bytes. The size of this instruction can be reduced by writing  lea eax,[ebx+ebx]. If you happen to have a register that is zero (like a loop counter after a loop) then you may use it as a base register to reduce the code size:

; Example 16.2, LEA instruction without base pointer

lea  eax, [ebx*4]       ; 7 bytes

lea  eax, [ecx+ebx*4]   ; 3 bytes

The size of the base and index registers can be changed with an address size prefix. The size of the destination register can be changed with an operand size prefix (See prefixes, page 26). If the operand size is less than the address size then the result is truncated. If the operand size is more than the address size then the result is zero-extended.

The shortest version of LEA in 64-bit mode has 32-bit operand size and 64-bit address size, e.g. LEA EAX,[RBX+RCX], see page 77. Use this version when the result is sure to be less than 232. Use the version with a 64-bit destination register for address calculation in 64-bit mode when the address may be bigger than 232.

LEA is slower than addition on some processors. The more complex forms of LEA with scale factor and offset are slower than the simple form on some processors. See manual 4: "Instruction tables" for details on each processor.

The preferred version in 32-bit mode has 32-bit operand size and 32-bit address size.  LEA with a 16-bit operand size is slow on AMD processors. LEA with a 16-bit address size in 32- bit mode should be avoided because the decoding of the address size prefix is slow on many processors.

LEA can also be used in 64-bit mode for loading a RIP-relative address. A RIP-relative address cannot be combined with base or index registers.

16.2 INC and DEC

The INC and DEC instructions do not modify the carry flag but they do modify the other arithmetic flags. Writing to only part of the flags register costs an extra µop on P4 and P4E. It can cause a partial flags stalls on some Intel processors if a subsequent instruction reads the carry flag or all the flag bits. On all processors, it can cause a false dependence on the carry flag from a previous instruction.

Use ADD and SUB when optimizing for speed. Use INC and DEC when optimizing for size or when no penalty is expected.

16.3 XCHG (all processors)

The XCHG register,[memory] instruction is dangerous. This instruction always has an implicit LOCK prefix which forces synchronization with other processors or cores. This instruction is therefore very time consuming, and should always be avoided unless the lock is intended.

The XCHG instruction with register operands may be useful when optimizing for size as explained on page 73.

16.4 Shifts and rotates (P4)

Shifts and rotates on general purpose registers are slow on the P4. You may consider using MMX or XMM registers instead or replacing left shifts by additions.

16.5 Rotates through carry (all processors)

RCR and RCL with CL or with a count different from one are slow on all processors and should be avoided.

16.6 Bit test (all processors)

BT, BTC, BTR, and BTS instructions should preferably be replaced by instructions like TEST, AND, OR, XOR, or shifts on older processors. Bit tests with a memory operand should be avoided on Intel processors. BTC, BTR, and BTS use 2 µops on AMD processors. Bit test instructions are useful when optimizing for size.

16.7 LAHF and SAHF (all processors)

LAHF is slow on P4 and P4E. Use SETcc instead for storing the value of a flag.

SAHF is slow on P4E and AMD processors. Use TEST instead for testing a bit in AH. Use FCOMI if available as a replacement for the sequence  FCOM / FNSTSW AX / SAHF.

LAHF and SAHF are not available in 64 bit mode on some early 64-bit Intel processors.

16.8 Integer multiplication (all processors)

An integer multiplication takes from 3 to 14 clock cycles, depending on the processor. It is therefore often advantageous to replace a multiplication by a constant with a combination of other instructions such as SHL, ADD, SUB, and LEA. For example IMUL EAX,5 can be replaced by LEA EAX,[EAX+4*EAX].

16.9 Division (all processors)

Both integer division and floating point division are quite time consuming on all processors. Various methods for reducing the number of divisions are explained in manual 1: "Optimizing software in C++". Several methods to improve code that contains division are discussed below.

Integer division by a power of 2 (all processors)

Integer division by a power of two can be done by shifting right. Dividing an unsigned integer by 2N:

; Example 16.3. Divide unsigned integer by 2^N

shr  eax, N

Dividing a signed integer by 2N:

; Example 16.4. Divide signed integer by 2^N

cdq

and  edx, (1 shl N) - 1    ; (Or:  shr edx,32-N)

add  eax, edx

sar  eax, N

Obviously, you should prefer the unsigned version if the dividend is certain to be non- negative.

Integer division by a constant (all processors)

A floating point number can be divided by a constant by multiplying with the reciprocal. If we want to do the same with integers, we have to scale the reciprocal by 2n and then shift the product to the right by n. There are various algorithms for finding a suitable value of n and compensating for rounding errors. The algorithm described below was invented by Terje Mathisen, Norway, and not published elsewhere. Other methods can be found in the book Hacker's Delight, by Henry S. Warren, Addison-Wesley 2003, and the paper: T. Granlund and P. L. Montgomery: Division by Invariant Integers Using Multiplication, Proceedings of the SIGPLAN 1994 Conference on Programming Language Design and Implementation.

The following algorithm will give you the correct result for unsigned integer division with truncation, i.e. the same result as the DIV instruction:

b = (the number of significant bits in d) - 1

r = w + b

f = 2r / d

If f is an integer then d is a power of 2: go to case A.

If f is not an integer, then check if the fractional part of f is < 0.5

If the fractional part of f < 0.5: go to case B.

If the fractional part of f > 0.5: go to case C.

case A (d = 2b):

result = x SHR b

case B (fractional part of f < 0.5):

round f down to nearest integer

result = ((x+1) * f) SHR r

case C (fractional part of f > 0.5):

round f up to nearest integer

result = (x * f) SHR r

Example:

Assume that you want to divide by 5.

5 = 101B.

w = 32.

b = (number of significant binary digits) - 1 = 2

r = 32+2 = 34

f = 234 / 5 = 3435973836.8 = 0CCCCCCCC.CCC...(hexadecimal)

The fractional part is greater than a half: use case C.

Round f up to 0CCCCCCCDH.

The following code divides EAX by 5 and returns the result in EDX:

; Example 16.5a. Divide unsigned integer eax by 5

mov   edx, 0CCCCCCCDH

mul   edx

shr   edx, 2

After the multiplication, EDX contains the product shifted right 32 places. Since r = 34 you have to shift 2 more places to get the result. To divide by 10, just change the last line to SHR EDX,3.

In case B we would have:

; Example 16.5b. Divide unsigned integer eax, case B

add   eax, 1

mov   edx, f

mul   edx

shr   edx, b

This code works for all values of x except 0FFFFFFFFH which gives zero because of overflow in the ADD EAX,1 instruction. If x = 0FFFFFFFFH is possible, then change the code to:

; Example 16.5c. Divide unsigned integer, case B, check for overflow

         mov     edx, f

         add     eax, 1

         jc      DOVERFL

         mul     edx

DOVERFL: shr     edx, b

If the value of x is limited, then you may use a lower value of r, i.e. fewer digits. There can be several reasons for using a lower value of r:

  • You may set r = w = 32 to avoid the SHR EDX,b in the end.
  • You may set r = 16+b and use a multiplication instruction that gives a 32-bit result rather than 64 bits. This will free the EDX register:

    ; Example 16.5d. Divide unsigned integer by 5, limited range

    imul eax,0CCCDH

    shr eax,18

  • You may choose a value of r that gives case C rather than case B in order to avoid the ADD EAX,1 instruction

The maximum value for