Optimizing Subroutines in Assembly Language by Agner Fog

5 Using intrinsic functions in C++

As already mentioned, there are three different ways of making assembly code: using intrinsic functions and vector classes in C++, using inline assembly in C++, and making separate assembly modules. Intrinsic functions are described in this chapter. The other two methods are described in the following chapters.

Intrinsic functions and vector classes are highly recommended because they are much easier and safer to use than assembly language syntax.

The Microsoft, Intel, Gnu and Clang C++ compilers have support for intrinsic functions. Most of the intrinsic functions generate one machine instruction each. An intrinsic function is therefore equivalent to an assembly instruction.

Coding with intrinsic functions is a kind of high-level assembly. It can easily be combined with C++ language constructs such as if-statements, loops, functions, classes and operator overloading. Using intrinsic functions is an easier way of doing high-level assembly coding than using .if constructs etc. in an assembler or using the so-called High Level Assembler (HLA).

The invention of intrinsic functions has made it much easier to do programming tasks that previously required coding with assembly syntax. The advantages of using intrinsic functions are:

No need to learn assembly language syntax.
Seamless integration into C++ code.
Branches, loops, functions, classes, etc. are easily made with C++ syntax.
The compiler takes care of calling conventions, register usage conventions, etc.
The code is portable to almost all x86 platforms: 32-bit and 64-bit Windows, Linux, Mac OS, etc. Some intrinsic functions can even be used on Itanium and other non-x86 platforms.
The code is compatible with Microsoft, Gnu, Clang and Intel compilers.
The compiler takes care of register variables, register allocation and register spilling. The programmer doesn't have to care about which register is used for which variable.
Different instances of the same inlined function or operator can use different registers for their parameters. This eliminates the need for register-to-register moves. The same function coded with assembly syntax would typically use a specific register for each parameter, and a move instruction would be required if the value happens to be in a different register.
It is possible to define overloaded operators for the intrinsic functions. For example, the instruction that adds two 4-element vectors of floats is coded as ADDPS in assembly language, and as _mm_add_ps when intrinsic functions are used. But an overloaded operator can be defined for the latter so that it is simply coded as a + when using the so-called vector classes. This makes the code look like plain old C++.
The compiler can optimize the code further, for example by common subexpression elimination, loop-invariant code motion, scheduling and reordering, etc. This would have to be done manually if assembly syntax were used. The Gnu and Intel compilers provide the best optimization.
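
The operator overloading mentioned above can be sketched as follows. This is a minimal illustration, not the code of any actual vector class library: the class name Vec4f, its members, and the helper add4 are hypothetical names chosen for this example. The intrinsic is wrapped in a small class because some compilers do not allow operators to be overloaded directly on the __m128 type.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps

// Hypothetical minimal wrapper class (the name Vec4f is an assumption for this sketch).
struct Vec4f {
    __m128 xmm;  // four packed single-precision floats
    Vec4f(__m128 x) : xmm(x) {}
    Vec4f(float a, float b, float c, float d) : xmm(_mm_setr_ps(a, b, c, d)) {}
    void store(float* p) const { _mm_storeu_ps(p, xmm); }  // unaligned store of 4 floats
};

// a + b forwards to _mm_add_ps, which normally becomes one ADDPS instruction.
static inline Vec4f operator+(Vec4f const& a, Vec4f const& b) {
    return Vec4f(_mm_add_ps(a.xmm, b.xmm));
}

// Hypothetical helper: adds two 4-element float arrays using the overloaded operator.
static inline void add4(const float* x, const float* y, float* result) {
    Vec4f a(_mm_loadu_ps(x));
    Vec4f b(_mm_loadu_ps(y));
    (a + b).store(result);
}
```

With such a wrapper, the calling code reads like ordinary C++ arithmetic, while the compiler remains free to choose registers and to inline the operator.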

The disadvantages of using intrinsic functions are:

Not all assembly instructions have intrinsic function equivalents.
The function names are sometimes long and difficult to remember.
An expression with many intrinsic functions looks kludgy and is difficult to read.
Requires a good understanding of the underlying mechanisms.
The compilers may not be able to optimize code containing intrinsic functions as much as they optimize other code, especially when constant propagation is needed.
Unskilled use of intrinsic functions can make the code less efficient than simple C++ code.
The compiler can modify the code or implement it in a less efficient way than the programmer intended. It may be necessary to look at the code generated by the compiler to see if it is optimized in the way the programmer intended.
Mixture of __m128 and __m256 types can cause severe delays if the programmer doesn't follow certain rules. Call _mm256_zeroupper() before any transition from modules compiled with AVX enabled to modules or library functions compiled without AVX.

5.1 Using intrinsic functions for system code

Intrinsic functions are useful for making system code and for accessing system registers that are not accessible with standard C++. Some of these functions are listed below.

Functions for accessing system registers:

__rdtsc, __readpmc, __readmsr, __readcr0, __readcr2, __readcr3, __readcr4, __readcr8, __writecr0, __writecr3, __writecr4, __writecr8, __writemsr, _mm_getcsr, _mm_setcsr, __getcallerseflags.
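
Two of the functions listed above can be tried from ordinary user-mode code, as in the following sketch: __rdtsc reads the time stamp counter, and _mm_getcsr / _mm_setcsr read and write the MXCSR register. Functions such as __readmsr and __readcr0 map to privileged instructions and are therefore not shown. The helper names are hypothetical, and the use of x86intrin.h on non-Microsoft compilers is an assumption of this example.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // Microsoft: __rdtsc, _mm_getcsr, _mm_setcsr
#else
#include <x86intrin.h>   // assumed umbrella header for Gnu and Clang
#endif

// Read the time stamp counter (hypothetical helper name).
static inline std::uint64_t read_tsc() {
    return __rdtsc();
}

// Bit 15 of MXCSR is the flush-to-zero control bit.
const unsigned int MXCSR_FLUSH_TO_ZERO = 0x8000u;

// Turn on flush-to-zero mode and return the previous MXCSR value,
// so that the caller can restore it later with _mm_setcsr.
static inline unsigned int enable_flush_to_zero() {
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_FLUSH_TO_ZERO);
    return old;
}

// Check whether flush-to-zero mode is currently enabled.
static inline bool flush_to_zero_enabled() {
    return (_mm_getcsr() & MXCSR_FLUSH_TO_ZERO) != 0;
}
```

A caller would typically save the value returned by enable_flush_to_zero and pass it back to _mm_setcsr when the special mode is no longer needed.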

Functions for input and output:

__inbyte, __inword, __indword, __outbyte, __outword, __outdword.

Functions for atomic memory read/write operations:

_InterlockedExchange, etc.

Functions for accessing FS and GS segments:

__readfsbyte, __writefsbyte, etc.

Cache control instructions (require the SSE or SSE2 instruction set):

_mm_prefetch, _mm_stream_si32, _mm_stream_pi, _mm_stream_si128, _ReadBarrier, _WriteBarrier, _ReadWriteBarrier, _mm_sfence.
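
As an illustration of the cache control functions, the following sketch fills an array with non-temporal stores using _mm_stream_si32 followed by _mm_sfence, and sums an array while issuing _mm_prefetch hints. The function names and the prefetch distance of 64 elements are assumptions made for this example, not tuned values. Whether either technique improves speed depends on the data size and the processor.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_stream_si32; also provides _mm_sfence and _mm_prefetch

// Fill dest[0..n-1] with value using non-temporal stores that bypass the cache.
static inline void fill_streaming(int* dest, std::size_t n, int value) {
    for (std::size_t i = 0; i < n; i++) {
        _mm_stream_si32(dest + i, value);
    }
    _mm_sfence();  // order the streamed stores before any later stores
}

// Sum src[0..n-1], hinting the cache about data needed a little later.
// The distance of 64 ints ahead is an arbitrary assumption.
static inline long long sum_with_prefetch(const int* src, std::size_t n) {
    const std::size_t distance = 64;
    long long sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        // Issue one hint per 16 ints (64 bytes), and only inside the array.
        if ((i & 15) == 0 && i + distance < n) {
            _mm_prefetch(reinterpret_cast<const char*>(src + i + distance), _MM_HINT_T0);
        }
        sum += src[i];
    }
    return sum;
}
```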

Other system functions:

__cpuid, __debugbreak, _disable, _enable.

5.2 Using intrinsic functions for instructions not available in standard C++

Some simple instructions that are not available in standard C++ can be coded with intrinsic functions, for example functions for bit-rotate, bit-scan, etc.:

_rotl8, _rotr8, _rotl16, _rotr16, _rotl, _rotr, _rotl64, _rotr64, _BitScanForward, _BitScanReverse.
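
The sketch below shows how the rotate and bit-scan functions might be used. The wrapper names are hypothetical. _BitScanForward is used only on Microsoft compilers here; on other compilers the example substitutes the Gnu/Clang builtin __builtin_ctz, which is an assumption of this example and not one of the functions listed above.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // Microsoft: _rotl, _rotr, _BitScanForward
#else
#include <x86intrin.h>   // assumed header providing _rotl and _rotr on Gnu and Clang
#endif

// Rotate a 32-bit value left or right by n bit positions.
static inline std::uint32_t rotate_left(std::uint32_t x, int n) {
    return _rotl(x, n);
}
static inline std::uint32_t rotate_right(std::uint32_t x, int n) {
    return _rotr(x, n);
}

// Index of the lowest set bit, or -1 if x is zero.
static inline int lowest_set_bit(std::uint32_t x) {
    if (x == 0) return -1;  // the bit scan result is undefined for zero input
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, x);
    return static_cast<int>(index);
#else
    return __builtin_ctz(x);  // stand-in for _BitScanForward on Gnu/Clang
#endif
}
```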

5.3 Using intrinsic functions for vector operations

Vector instructions are very useful for improving the speed of code with inherent parallelism. There are intrinsic functions for almost all instructions on vector registers.

The use of these intrinsic functions for vector operations is thoroughly described in manual 1: "Optimizing software in C++".
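
As a small taste of vector intrinsics, the following sketch clamps every element of a float array to a given interval, handling four elements at a time with _mm_max_ps and _mm_min_ps and finishing any remaining elements with scalar code. The function name clamp_array is hypothetical. Unaligned loads and stores are used so that no alignment assumption is placed on the array.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_loadu_ps, _mm_max_ps, _mm_min_ps, _mm_storeu_ps

// Clamp data[0..n-1] to the interval [lo, hi], assuming lo <= hi.
static inline void clamp_array(float* data, std::size_t n, float lo, float hi) {
    const __m128 vlo = _mm_set1_ps(lo);  // broadcast lo to all four elements
    const __m128 vhi = _mm_set1_ps(hi);  // broadcast hi to all four elements
    std::size_t i = 0;
    // Main loop: four floats per iteration, no branches on the data.
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_max_ps(v, vlo);
        v = _mm_min_ps(v, vhi);
        _mm_storeu_ps(data + i, v);
    }
    // Scalar tail for the last n % 4 elements.
    for (; i < n; i++) {
        float x = data[i];
        data[i] = x < lo ? lo : (x > hi ? hi : x);
    }
}
```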

5.4 Availability of intrinsic functions

The intrinsic functions are available on newer versions of Microsoft, Gnu, Clang and Intel compilers. Most intrinsic functions have the same names in all of these compilers. You have to include a header file named intrin.h or emmintrin.h to get access to the intrinsic functions. The Codeplay compiler has limited support for intrinsic vector functions, but the function names are not compatible with the other compilers.

The intrinsic functions are listed in the help documentation for each compiler, in the appropriate header files, on msdn.microsoft.com, in "Intel 64 and IA-32 Architectures Software Developer’s Manual" (developer.intel.com) and in "Intel Intrinsics Guide" (softwareprojects.intel.com/avx/).