Optimizing Subroutines in Assembly Language by Agner Fog

17 Special topics

 

17.1 XMM versus floating point registers

Processors with the SSE instruction set can do single precision floating point calculations in XMM registers. Processors with the SSE2 instruction set can also do double precision calculations in XMM registers. Floating point calculations are approximately equally fast in XMM registers and the old floating point stack registers. The decision of whether to use the floating point stack registers ST(0) - ST(7) or XMM registers depends on the following factors.

Advantages of using ST() registers:

  • Compatible with old processors without SSE or SSE2.
  • Compatible with old operating systems without XMM support.
  • Supports long double precision.
  • Intermediate results are calculated with long double precision.
  • Precision conversions are free in the sense that they require no extra instructions and take no extra time. You may use ST() registers for expressions where operands have mixed precision.
  • Mathematical functions such as logarithms and trigonometric functions are supported by hardware instructions. These functions are useful when optimizing for size, but not necessarily faster than library functions using XMM registers.
  • Conversions to and from decimal numbers can use the FBLD and FBSTP instructions when optimizing for size.
  • Floating point instructions using ST() registers are smaller than the corresponding instructions using XMM registers. For example, FADD ST(0),ST(1) is 2 bytes, while ADDSD XMM0,XMM1 is 4 bytes.

Advantages of using XMM or YMM registers:

  • Can do multiple operations with a single vector instruction.
  • Avoids the need to use FXCH for getting the desired register to the top of the stack.
  • No need to clean up the register stack after use.
  • Can be used together with MMX instructions.
  • No need for memory intermediates when converting between integers and floating point numbers.
  • 64-bit systems have 16 XMM/YMM registers, but only 8 ST() registers.
  • ST() registers cannot be used in device drivers in 64-bit Windows.
  • The instruction set for ST() registers is no longer developed. The instructions will probably still be supported for many years for the sake of backwards compatibility, but the instructions may work less efficiently in future processors.

17.2 MMX versus XMM registers

Integer vector instructions can use either the 64-bit MMX registers or the 128-bit XMM registers in processors with SSE2.

Advantages of using MMX registers:

  • Compatible with older microprocessors since the PMMX.
  • Compatible with old operating systems without XMM support.
  • No need for data alignment.

Advantages of using XMM registers:

  • The number of elements per vector is doubled in XMM registers as compared to MMX registers.
  • MMX registers cannot be used together with ST() registers.
  • A series of MMX instructions must end with EMMS.
  • 64-bit systems have 16 XMM registers, but only 8 MMX registers.
  • MMX registers cannot be used in device drivers in 64-bit Windows.
  • The instruction set for MMX registers is no longer developed and is going out of use. The MMX registers will probably still be supported for many years for the sake of backwards compatibility.

17.3 XMM versus YMM registers

Floating point vector instructions can use the 128-bit XMM registers or their 256-bit extension named YMM registers when the AVX instruction set is available. See page 131 for details.

Advantages of using the AVX instruction set and YMM registers:

  • Double vector size for floating point operations
  • Non-destructive 3-operand version of all XMM and YMM instructions

Advantages of using XMM registers:

  • Compatible with older processors
  • There is a penalty for switching between VEX instructions and XMM instructions without VEX prefix, see page 132. The programmer may inadvertently mix VEX and non-VEX instructions.
  • YMM registers cannot be used in device drivers without saving everything with XSAVE / XRSTOR, see page 134.

17.4 Freeing floating point registers (all processors)

You have to free all used floating point stack registers before exiting a subroutine, except for any register used for the result.

The fastest way of freeing one register is FSTP ST. To free two registers you may use either FCOMPP or FSTP ST twice, whichever fits best into the decoding sequence or port load.

It is not recommended to use FFREE.

17.5 Transitions between floating point and MMX instructions

It is not possible to use 64-bit MMX registers and 80-bit floating point ST() registers in the same part of the code. You must issue an EMMS instruction after the last instruction that uses MMX registers if there is a possibility that later code uses floating point registers. You may avoid this problem by using 128-bit XMM registers instead.

On PMMX there is a high penalty for switching between floating point and MMX instructions. The first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra.

On processors with out-of-order execution there is no such penalty.

17.6 Converting from floating point to integer (all processors)

All conversions between floating point registers and integer registers must go via a memory location:

; Example 17.1.

fistp dword ptr [TEMP]

mov eax, [TEMP]

On older processors, and especially the P4, this code is likely to have a penalty for attempting to read from [TEMP] before the write to [TEMP] is finished. It doesn't help to put in a WAIT. It is recommended that you put in other instructions between the write to [TEMP] and the read from [TEMP] if possible in order to avoid this penalty. This applies to all the examples that follow.

The specifications for the C and C++ languages require that conversion from floating point numbers to integers use truncation rather than rounding. The method used by most C libraries is to change the floating point control word to indicate truncation before using an FISTP instruction, and to change it back again afterwards. This method is very slow on all processors. On many processors, the floating point control word cannot be renamed, so all subsequent floating point instructions must wait for the FLDCW instruction to retire. See page 152.

On processors with SSE or SSE2 instructions you can avoid all these problems by using XMM registers instead of floating point registers and use the CVT.. instructions to avoid the memory intermediate.
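As a hedged illustration, the same conversions can be reached from C through the SSE2 intrinsics in <emmintrin.h>. The sketch below assumes an x86 target with SSE2 and the default MXCSR rounding mode. The function names round_sse2 and truncate_sse2 are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* CVTSD2SI: rounds according to MXCSR, which is
   round-to-nearest-or-even by default. */
int round_sse2(double x)
{
    return _mm_cvtsd_si32(_mm_set_sd(x));
}

/* CVTTSD2SI: always truncates towards zero, independently of
   the current rounding mode. */
int truncate_sse2(double x)
{
    return _mm_cvttsd_si32(_mm_set_sd(x));
}
```

Neither function needs a memory intermediate or a change of the control word. Like the assembly examples in this section, they do not check for overflow or NaN.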

Whenever you have a conversion from a floating point register to an integer register, you should consider whether you can use rounding to nearest integer instead of truncation.

If you need truncation inside a loop then you should change the control word only outside the loop if the rest of the floating point instructions in the loop can work correctly in truncation mode.

You may use various tricks for truncating without changing the control word, as illustrated in the examples below. These examples presume that the control word is set to default, i.e. rounding to nearest or even.

; Example 17.2a. Rounding to nearest or even:

; extern "C" int round (double x);

_round  PROC    NEAR   ; (32 bit mode)

PUBLIC  _round

        fld     qword ptr [esp+4]

        fistp   dword ptr [esp+4]

        mov     eax, dword ptr [esp+4]

        ret

_round  ENDP

; Example 17.2b. Truncation towards zero:

; extern "C" int truncate (double x);

_truncate PROC    NEAR   ; (32 bit mode)

PUBLIC  _truncate

        fld     qword ptr [esp+4]   ; x

        sub     esp, 12             ; space for local variables

        fist    dword ptr [esp]     ; rounded value

        fst     dword ptr [esp+4]   ; float value

        fisub   dword ptr [esp]     ; subtract rounded value

        fstp    dword ptr [esp+8]   ; difference

        pop     eax                 ; rounded value

        pop     ecx                 ; float value

        pop     edx                 ; difference (float)

        test    ecx, ecx            ; test sign of x

        js      short NEGATIVE

        add     edx, 7FFFFFFFH      ; produce carry if difference < -0

        sbb     eax, 0              ; subtract 1 if x-round(x) < -0

        ret

NEGATIVE:

        xor     ecx, ecx

        test    edx, edx

        setg    cl                  ; 1 if difference > 0

        add     eax, ecx            ; add 1 if x-round(x) > 0

        ret

_truncate ENDP

; Example 17.2c. Truncation towards minus infinity:

; extern "C" int ifloor (double x);

_ifloor PROC    NEAR   ; (32 bit mode)

PUBLIC  _ifloor

        fld     qword ptr [esp+4]   ; x

        sub     esp, 8              ; space for local variables

        fist    dword ptr [esp]     ; rounded value

        fisub   dword ptr [esp]     ; subtract rounded value

        fstp    dword ptr [esp+4]   ; difference

        pop     eax                 ; rounded value

        pop     edx                 ; difference (float)

        add     edx, 7FFFFFFFH      ; produce carry if difference < -0

        sbb     eax, 0              ; subtract 1 if x-round(x) < -0

        ret

_ifloor ENDP

These procedures work for -2^31 < x < 2^31-1. They do not check for overflow or NaNs.
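The correction step used in examples 17.2b and 17.2c can be sketched in C as follows. This is only an illustration under stated assumptions: an x86 target with SSE2 and the default rounding mode, with _mm_cvtsd_si32 standing in for the FIST instruction. The function names are hypothetical, and the difference x - round(x) is compared as a double rather than through the bit tests of the assembly versions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, assumed available */

/* Round to nearest or even; stands in for FIST with the default control word. */
static int round_nearest(double x)
{
    return _mm_cvtsd_si32(_mm_set_sd(x));
}

/* Truncation towards zero, as in example 17.2b. */
int truncate_toward_zero(double x)
{
    int r = round_nearest(x);
    double diff = x - (double)r;       /* x - round(x) */
    if (x >= 0) {
        if (diff < 0) r--;             /* rounded up: subtract 1 */
    } else {
        if (diff > 0) r++;             /* rounded down: add 1 */
    }
    return r;
}

/* Truncation towards minus infinity, as in example 17.2c. */
int floor_to_int(double x)
{
    int r = round_nearest(x);
    if (x - (double)r < 0) r--;        /* rounded up: subtract 1 */
    return r;
}
```

For instance, 3.5 rounds to 4 and the negative difference brings the result back to 3, while -3.5 rounds to -4 and the positive difference brings the truncated result to -3.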

17.7 Using integer instructions for floating point operations

Integer instructions are generally faster than floating point instructions, so it is often advantageous to use integer instructions for doing simple floating point operations. The most obvious example is moving data. For example:

; Example 17.3a. Moving floating point data

fld qword ptr [esi]

fstp qword ptr [edi]

can be replaced by:

; Example 17.3b

mov eax,[esi]

mov ebx,[esi+4]

mov [edi],eax

mov [edi+4],ebx

or:

; Example 17.3c

movq mm0,[esi]

movq [edi],mm0

In 64-bit mode, use:

; Example 17.3d

mov rax,[rsi]

mov [rdi],rax

Many other manipulations are possible if you know how floating point numbers are represented in binary format. See the chapter "Using integer operations for manipulating floating point variables" in manual 1: "Optimizing software in C++".

The bit positions are shown in this table:

Table 17.1. Floating point formats

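precision               mantissa      always 1   exponent       sign
single (32 bits)        bit 0 - 22    -          bit 23 - 30    bit 31
double (64 bits)        bit 0 - 51    -          bit 52 - 62    bit 63
long double (80 bits)   bit 0 - 62    bit 63     bit 64 - 78    bit 79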

From this table we can find that the value 1.0 is represented as 3F80,0000H in single precision format, 3FF0,0000,0000,0000H in double precision, and 3FFF,8000,0000,0000,0000H in long double precision.
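The single and double precision constants above can be checked with a small C sketch. It assumes that float and double use the IEEE 754 binary32 and binary64 layouts, with the exponent in bits 23-30 and bits 52-62 respectively. The helper names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float; memcpy avoids aliasing problems. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Raw bit pattern of a double. */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Biased exponent field of a float (bits 23-30). */
unsigned float_exponent(float f)
{
    return (unsigned)((float_bits(f) >> 23) & 0xFFu);
}

/* Biased exponent field of a double (bits 52-62). */
unsigned double_exponent(double d)
{
    return (unsigned)((double_bits(d) >> 52) & 0x7FFu);
}
```

With these helpers, float_bits(1.0f) gives 3F800000H and double_bits(1.0) gives 3FF0000000000000H. The biased exponents of 1.0 are 127 and 1023.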

It is possible to generate simple floating point constants without using data in memory as explained on page 125.

Testing if a floating point value is zero

To test if a floating point number is zero, we have to test all bits except the sign bit, which may be either 0 or 1. For example:

; Example 17.4a. Testing floating point value for zero

fld    dword ptr [ebx]

ftst

fnstsw ax

and    ah, 40h

jnz    IsZero

can be replaced by

; Example 17.4b. Testing floating point value for zero

mov    eax, [ebx]

add    eax, eax

jz     IsZero

where the ADD EAX,EAX shifts out the sign bit. Double precision floats have 63 bits to test, but if subnormal numbers can be ruled out, then you can be certain that the value is zero if the exponent bits are all zero. Example:

; Example 17.4c. Testing double value for zero

fld    qword ptr [ebx]

ftst

fnstsw ax

and    ah, 40h

jnz    IsZero

can be replaced by

; Example 17.4d. Testing double value for zero

mov    eax, [ebx+4]

add    eax,eax

jz     IsZero
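The two integer tests can be written in C roughly as below. This is a sketch that assumes IEEE 754 float and double. The function names are hypothetical, and the double version is only valid when subnormal numbers can be ruled out, as stated above.

```c
#include <stdint.h>
#include <string.h>

/* Single precision: zero if all bits except the sign bit are zero. */
int float_is_zero(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint32_t)(bits << 1) == 0;      /* shift out the sign bit */
}

/* Double precision: tests only the upper 32 bits (sign, exponent and
   the top of the mantissa). Assumes the input is not subnormal. */
int double_is_zero_no_subnormals(double d)
{
    uint64_t bits;
    uint32_t hi;
    memcpy(&bits, &d, sizeof bits);
    hi = (uint32_t)(bits >> 32);            /* same dword as [ebx+4] */
    return (uint32_t)(hi << 1) == 0;        /* shift out the sign bit */
}
```

Both functions treat +0.0 and -0.0 as zero. A subnormal double with a nonzero lower dword would be misreported as zero by the second function.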

Manipulating the sign bit

A floating point number is negative if the sign bit is set and at least one other bit is set. Example (single precision):

; Example 17.5. Testing floating point value for negative

mov    eax, [NumberToTest]

cmp    eax, 80000000H

ja     IsNegative
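The same unsigned comparison can be expressed in C. The sketch assumes an IEEE 754 float, and the function name is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Negative if the sign bit is set and at least one other bit is set,
   i.e. the bit pattern is above 80000000H as an unsigned number. */
int float_is_negative(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits > 0x80000000u;
}
```

The comparison is unsigned, matching the JA instruction, so -0.0 (exactly 80000000H) is not reported as negative.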

You can change the sign of a floating point number simply by flipping the sign bit. This is useful when XMM registers are used, because there is no XMM change sign instruction. Example:

; Example 17.6. Change sign of four single-precision floats in xmm0

pcmpeqd xmm1, xmm1   ; generate all 1's

pslld  xmm1, 31      ; 1 in the leftmost bit of each dword only

xorps  xmm0, xmm1    ; change sign of xmm0

You can get the absolute value of a floating point number by AND'ing out the sign bit:

; Example 17.7. Absolute value of four single-precision floats in xmm0

pcmpeqd xmm1, xmm1   ; generate all 1's

psrld  xmm1, 1       ; 1 in all but the leftmost bit of each dword

andps  xmm0, xmm1    ; set sign bits to 0

You can extract the sign bit of a floating point number:

; Example 17.8. Generate a bit-mask if single-precision floats in

; xmm0 are negative or -0.0

psrad   xmm0,31      ; copy sign bit into all bit positions
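A scalar C sketch of the three sign-bit operations in examples 17.6 to 17.8 is given below, working on one float instead of a vector of four. It assumes an IEEE 754 float, and all function names are hypothetical. The masks are built by shifting an all-ones value, as in the assembly code.

```c
#include <stdint.h>
#include <string.h>

static uint32_t to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Change sign: XOR with a mask that has only the leftmost bit set. */
float negate_float(float x)
{
    uint32_t mask = 0xFFFFFFFFu << 31;      /* 80000000H */
    return from_bits(to_bits(x) ^ mask);
}

/* Absolute value: AND with a mask that has all but the leftmost bit set. */
float abs_float(float x)
{
    uint32_t mask = 0xFFFFFFFFu >> 1;       /* 7FFFFFFFH */
    return from_bits(to_bits(x) & mask);
}

/* All 1's if the sign bit is set (negative or -0.0), otherwise 0. */
uint32_t sign_mask(float x)
{
    return 0u - (to_bits(x) >> 31);
}
```

The sign mask is computed as 0 minus the sign bit rather than with a signed right shift, because right-shifting a negative signed integer is implementation-defined in C.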

Manipulating the exponent

You can multiply a non-zero number by a power of 2 by simply adding to the exponent:

; Example 17.9. Multiply vector by power of 2

movaps  xmm0, [x]   ; four single-precision floats

movdqa  xmm1, [n]   ; four 32-bit positive integers

pslld   xmm1, 23    ; shift integers into exponent field

paddd   xmm0, xmm1  ; x * 2^n
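For a single value, the exponent adjustment of example 17.9 might look like this in C. The sketch assumes an IEEE 754 float, a nonzero normal input, and a result that stays within the normal range. No check is made for any of these conditions, and the function name is invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Multiply x by 2^n by adding n to the exponent field (bits 23-30).
   Assumes x is a nonzero normal number and the result does not overflow. */
float mul_pow2(float x, uint32_t n)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += n << 23;                /* shift n into the exponent field */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, mul_pow2(1.5f, 3) adds 3 to the exponent of 1.5 and returns 12.0.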

Likewise, you can divide by a power of 2 by subtracting from the exponent. Note that this code does not work