Optimizing Subroutines in Assembly Language by Agner Fog - HTML preview


15 CPU dispatching

 

If you are using instructions that are not supported by all microprocessors, then you must first check if the program is running on a microprocessor that supports these instructions. If your program can benefit significantly from using a particular instruction set, then you may make one version of a critical part of the program that uses this instruction set, and another version which is compatible with old microprocessors.

Manual 1, "Optimizing software in C++", chapter 13, has important advice on CPU dispatching.

CPU dispatching can be implemented with branches or with a code pointer, as shown in the following example.

Example 15.1. Function with CPU dispatching

MyFunction proc near
    ; Jump through pointer. The code pointer initially points to
    ; MyFunctionDispatch. MyFunctionDispatch changes the pointer
    ; so that it points to the appropriate version of MyFunction.
    ; The next time MyFunction is called, it jumps directly to
    ; the right version of the function
    jmp     [MyFunctionPoint]

    ; Code for each version. Put the most probable version first:

MyFunctionAVX:
    ; AVX version of MyFunction
    ret

MyFunctionSSE2:
    ; SSE2 version of MyFunction
    ret

MyFunction386:
    ; Generic/80386 version of MyFunction
    ret

MyFunctionDispatch:
    ; Detect which instruction set is supported.
    ; Function InstructionSet is in asmlib
    call    InstructionSet              ; eax indicates instruction set
    mov     edx, offset MyFunction386
    cmp     eax, 4                      ; eax >= 4 if SSE2
    jb      DispEnd
    mov     edx, offset MyFunctionSSE2
    cmp     eax, 11                     ; eax >= 11 if AVX
    jb      DispEnd
    mov     edx, offset MyFunctionAVX
DispEnd:
    ; Save pointer to appropriate version of MyFunction
    mov     [MyFunctionPoint], edx
    jmp     edx                         ; Jump to this version

.data
MyFunctionPoint DD MyFunctionDispatch  ; Code pointer

.code
MyFunction endp

The function InstructionSet, which detects which instruction set is supported, is provided in the library that can be downloaded from www.agner.org/optimize/asmlib.zip. Most operating systems also have functions for this purpose. Obviously, it is recommended to store the output from InstructionSet rather than calling it again each time the information is needed. See also www.agner.org/optimize/asmexamples.zip for detailed examples of functions with CPU dispatching.
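
For readers working from a high-level language, the code-pointer technique of Example 15.1 can be sketched in C roughly as follows. This is only an illustration under stated assumptions: instruction_set_stub is a hypothetical stand-in for InstructionSet from asmlib (hard-wired here to report SSE2), the three version functions are placeholders with identical bodies, and all names are invented for this sketch. The thresholds 4 and 11 are the ones used in Example 15.1.

```c
#include <stdio.h>

/* Counts how often detection runs; used only to show that it runs once. */
int detect_calls = 0;

/* HYPOTHETICAL stand-in for asmlib's InstructionSet().
   Assumption: it reports level 4 (SSE2) on this machine. */
static int instruction_set_stub(void) {
    detect_calls++;
    return 4;
}

/* Placeholder versions of the critical function. In real code each one
   would be built with a different instruction set. */
static int my_function_386(int x)  { return x * 2; }
static int my_function_sse2(int x) { return x * 2; }
static int my_function_avx(int x)  { return x * 2; }

static int my_function_dispatch(int x);

/* Code pointer. It initially points to the dispatcher. */
int (*my_function_point)(int) = my_function_dispatch;

/* Runs on the first call only: picks a version, stores it in the
   pointer, and then continues in the chosen version. */
static int my_function_dispatch(int x) {
    int level = instruction_set_stub();
    int (*chosen)(int) = my_function_386;
    if (level >= 4)  chosen = my_function_sse2;   /* level >= 4: SSE2 */
    if (level >= 11) chosen = my_function_avx;    /* level >= 11: AVX */
    my_function_point = chosen;
    return chosen(x);
}

/* Public entry point: always calls through the pointer. */
int my_function(int x) {
    return my_function_point(x);
}
```

The first call to my_function goes through the dispatcher. Every later call goes directly to the selected version, so the detection cost is paid only once.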

15.1 Checking for operating system support for XMM and YMM registers

Unfortunately, the information that can be obtained from the CPUID instruction is not sufficient for determining whether it is possible to use XMM registers. The operating system has to save these registers during a task switch and restore them when the task is resumed. The microprocessor can disable the use of the XMM registers in order to prevent their use under old operating systems that do not save these registers. An operating system that supports the use of XMM registers must set bit 9 of the control register CR4 (the OSFXSR bit) to enable the use of XMM registers and to indicate its ability to save and restore these registers during task switches. (Saving and restoring registers is actually faster when XMM registers are enabled.)

Unfortunately, the CR4 register can only be read in privileged mode. Application programs therefore have a problem determining whether they are allowed to use the XMM registers or not. According to official Intel documents, the only way for an application program to determine whether the operating system supports the use of XMM registers is to try to execute an XMM instruction and see if you get an invalid opcode exception. This is ridiculous, because not all operating systems, compilers and programming languages provide facilities for application programs to catch invalid opcode exceptions. The advantage of using XMM registers evaporates completely if you have no way of knowing whether you can use these registers without crashing your software.

These serious problems led me to search for an alternative way of checking if the operating system supports the use of XMM registers, and fortunately I have found a way that works reliably. If XMM registers are enabled, then the FXSAVE and FXRSTOR instructions can read and modify the XMM registers. If XMM registers are disabled, then FXSAVE and FXRSTOR cannot access these registers. It is therefore possible to check if XMM registers are enabled, by trying to read and write these registers with FXSAVE and FXRSTOR. The subroutines in www.agner.org/optimize/asmlib.zip use this method. These subroutines can be called from assembly as well as from high-level languages, and provide an easy way of detecting whether XMM registers can be used.
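
The idea can be sketched in C with GNU-style inline assembly (GCC or Clang assumed). This is not the asmlib code, only a minimal illustration of the principle. The function name xmm_enabled_by_os, the return value -1 for non-x86 targets, and the choice of probing the first four bytes of the XMM0 field (byte offset 160 of the FXSAVE image) are assumptions made for this sketch. The probe first checks the FXSR and SSE feature flags from CPUID, because FXSAVE must not be executed on a processor that lacks it.

```c
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>

/* 512-byte FXSAVE image. The instruction requires 16-byte alignment. */
struct fxsave_image {
    unsigned char bytes[512];
} __attribute__((aligned(16)));

/* Returns 1 if FXSAVE/FXRSTOR can read and write XMM0, otherwise 0. */
int xmm_enabled_by_os(void) {
    unsigned int eax, ebx, ecx, edx;
    struct fxsave_image img;
    uint32_t original, readback;

    /* CPUID leaf 1: EDX bit 24 = FXSR, EDX bit 25 = SSE. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    if ((edx & (1u << 24)) == 0 || (edx & (1u << 25)) == 0) return 0;

    memset(&img, 0, sizeof img);
    __asm__ volatile(
        "fxsave  (%[p])\n\t"             /* save current state           */
        "movl    160(%[p]), %[orig]\n\t" /* remember low dword of XMM0   */
        "notl    160(%[p])\n\t"          /* invert it in the image       */
        "fxrstor (%[p])\n\t"             /* try to load the changed XMM0 */
        "movl    %[orig], 160(%[p])\n\t" /* put old value in the image   */
        "fxsave  (%[p])\n\t"             /* overwritten only if enabled  */
        "movl    160(%[p]), %[back]\n\t" /* read what FXSAVE stored      */
        "movl    %[orig], 160(%[p])\n\t" /* repair the image             */
        "fxrstor (%[p])\n\t"             /* restore the original state   */
        : [orig] "=&r"(original), [back] "=&r"(readback)
        : [p] "r"(img.bytes)
        : "memory", "cc");

    /* If XMM access is enabled, the inverted value came back. */
    return original != readback;
}
#else
/* Not an x86 target or not a GNU-compatible compiler: result unknown. */
int xmm_enabled_by_os(void) { return -1; }
#endif
```

The whole probe is kept in one assembly block so that the compiler cannot place its own XMM code between the save and the final restore. XMM0 holds its original value again when the block ends.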

In order to verify that this detection method works correctly with all microprocessors, I first checked various manuals. The 1999 version of Intel's software developer's manual says about the FXRSTOR instruction: "The Streaming SIMD Extension fields in the save image (XMM0-XMM7 and MXCSR) will not be loaded into the processor if the CR4.OSFXSR bit is not set." AMD's Programmer’s Manual says effectively the same. However, the 2003 version of Intel's manual says that this behavior is implementation dependent. In order to clarify this, I contacted Intel Technical Support and got the reply, "If the OSFXSR bit in CR4 in not set, then XMMx registers are not restored when FXRSTOR is executed". They further confirmed that this is true for all versions of Intel microprocessors and all microcode updates. I regard this as a guarantee from Intel that my detection method will work on all Intel microprocessors. We can rely on the method working correctly on AMD processors as well since the AMD manual is unambiguous on this question. It appears to be safe to rely on this method working correctly on future microprocessors as well, because any microprocessor that deviates from the above specification would introduce a security problem as well as failing to run existing programs. Compatibility with existing programs is of great concern to microprocessor producers.

The detection method recommended in Intel manuals has the drawback that it relies on the ability of the compiler and the operating system to catch invalid opcode exceptions. A Windows application, for example, using Intel's detection method would therefore have to be tested in all compatible operating systems, including various Windows emulators running under a number of other operating systems. My detection method does not have this problem because it is independent of compiler and operating system. My method has the further advantage that it makes modular programming easier, because a module, subroutine library, or DLL using XMM instructions can include the detection procedure so that the problem of XMM support is of no concern to the calling program, which may even be written in a different programming language. Some operating systems provide system functions that tell which instruction set is supported, but the method mentioned above is independent of the operating system.

It is easier to check for operating system support for YMM registers. The following method is described in "Intel Advanced Vector Extensions Programming Reference": Execute CPUID with eax = 1. Check that bits 27 and 28 in ecx are both 1 (OSXSAVE and AVX feature flags). If so, then execute XGETBV with ecx = 0 to get the XFEATURE_ENABLED_MASK register (XCR0). Check that bits 1 and 2 in eax are both set (XMM and YMM state support). If so, then it is safe to use YMM registers.
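
The YMM check can be sketched in C as follows, again assuming GCC or Clang for the cpuid.h header and the inline XGETBV instruction. The function names ymm_bits_ok and os_supports_ymm are invented for this sketch. The bit test is kept in a separate pure function so that it can be checked on any machine. The hardware query returns 0 on targets that are not x86.

```c
#include <stdint.h>

/* Pure bit test on the two values described in the text:
   cpuid1_ecx = ECX from CPUID with EAX = 1,
   xcr0_low   = EAX from XGETBV with ECX = 0. */
int ymm_bits_ok(uint32_t cpuid1_ecx, uint32_t xcr0_low) {
    const uint32_t cpu_need = (1u << 27) | (1u << 28); /* OSXSAVE, AVX   */
    const uint32_t os_need  = (1u << 1)  | (1u << 2);  /* XMM, YMM state */
    return (cpuid1_ecx & cpu_need) == cpu_need &&
           (xcr0_low   & os_need)  == os_need;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>

/* Returns 1 if it is safe to use YMM registers, otherwise 0. */
int os_supports_ymm(void) {
    unsigned int eax, ebx, ecx, edx;
    uint32_t xcr0_low, xcr0_high;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;

    /* XGETBV is only valid when OSXSAVE (bit 27) is set, so test the
       CPUID flags before executing it. */
    if ((ecx & ((1u << 27) | (1u << 28))) != ((1u << 27) | (1u << 28)))
        return 0;

    __asm__ volatile("xgetbv"
                     : "=a"(xcr0_low), "=d"(xcr0_high)
                     : "c"(0));
    (void)xcr0_high;   /* upper half is not needed for this check */

    return ymm_bits_ok(ecx, xcr0_low);
}
#else
/* Not an x86 target or not a GNU-compatible compiler. */
int os_supports_ymm(void) { return 0; }
#endif
```

As with InstructionSet, the result is best stored once at startup rather than queried each time it is needed.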

The above discussion has relied on the following documents:

Intel application note AP-900: "Identifying support for Streaming SIMD Extensions in the Processor and Operating System". 1999.

Intel application note AP-485: "Intel Processor Identification and the CPUID Instruction". 2002.

"Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference", 1999.

"IA-32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference", 2003.

"AMD64 Architecture Programmer’s Manual, Volume 4: 128-Bit Media Instructions", 2003.

"Intel Advanced Vector Extensions Programming Reference", 2008, 2010.