
12 Using vector operations

 

Today's microprocessors have vector instructions that make it possible to do operations on all elements of a vector simultaneously. This is also called Single-Instruction-Multiple-Data (SIMD) operations. The total size of each vector can be 64 bits (MMX), 128 bits (XMM), 256 bits (YMM), and soon also 512 bits (ZMM).

Vector operations are useful when doing calculations on large data sets where the same operation is performed on multiple data elements and the program logic allows parallel calculations. Examples are image processing, sound processing, and mathematical operations on vectors and matrixes. Algorithms that are inherently serial, such as most sorting algorithms, are not suited for vector operations. Algorithms that rely heavily on table lookup or require a lot of data shuffling, such as many encryption algorithms, are perhaps less suited for vector operations.

The vector operations use a set of special vector registers. The maximum size of each vector register is 128 bits (XMM) if the SSE2 instruction set is available, 256 bits (YMM) if the AVX instruction set is supported by the microprocessor and the operating system, and 512 bits when the AVX512 instruction set is available. The number of elements in each vector depends on the size and type of data elements, as follows:

Data type            128 bits (XMM)   256 bits (YMM)   512 bits (ZMM)
8-bit integer              16               32               64
16-bit integer              8               16               32
32-bit integer              4                8               16
64-bit integer              2                4                8
float (32 bits)             4                8               16
double (64 bits)            2                4                8

For example, a 128-bit XMM register can be organized as a vector of eight 16-bit integers or four floats when the SSE2 instruction set is available. The older MMX registers, which are 64 bits wide, should be avoided because they cannot be mixed with x87-style floating point code.

The 128-bit XMM vectors must be aligned by 16, i.e. stored at a memory address that is divisible by 16 (see below). The 256-bit YMM vectors are preferably aligned by 32 and the 512-bit ZMM registers by 64, but the alignment requirements are less strict when compiling for the AVX and later instruction sets.
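
If you declare the arrays yourself, you can request the required alignment explicitly. The following is a minimal sketch (not from the book, with hypothetical array names) using the C++11 alignas specifier; the Microsoft-style alternative __declspec(align(16)) is shown in example 12.2 below.

alignas(16) float xmmData[1024];   // aligned by 16 for 128-bit XMM access
alignas(32) float ymmData[1024];   // aligned by 32 for 256-bit YMM access
alignas(64) float zmmData[1024];   // aligned by 64 for 512-bit ZMM access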

Vector operations are particularly fast on newer processors. Many processors can calculate a whole vector just as fast as a single scalar value (a scalar is a single number, as opposed to a vector). The first generation of processors that supports a new vector size often has execution units, memory ports, etc. of only half the size of the largest vector; these units are used twice for handling a full-size vector.

The use of vector operations is more advantageous the smaller the data elements are. For example, you get four float additions in the same time as it takes to do two additions with doubles. It is almost always advantageous to use vector operations on contemporary CPUs if the data fit nicely into the vector registers. It may not be advantageous if a lot of data manipulation is required for putting the right data into the right vector elements.

12.1 AVX instruction set and YMM registers

The 128-bit XMM registers are extended to 256-bit registers named YMM in the AVX instruction set. The main advantage of the AVX instruction set is that it allows larger floating point vectors. There are also other advantages that may improve the performance somewhat. The AVX2 instruction set also allows 256-bit integer vectors.

Code that is compiled for the AVX instruction set can run only if AVX is supported by both the CPU and the operating system. AVX is supported in Windows 7 and Windows Server 2008 R2 as well as in Linux kernel version 2.6.30 and later. The AVX instruction set is supported in the latest compilers from Microsoft, Intel, Gnu and Clang.

There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when going from AVX code to non-AVX code because of a change in the YMM register state. This penalty should be avoided by calling the intrinsic function _mm256_zeroupper() before any transition from AVX code to non-AVX code. This can be necessary in the following cases:

  • If part of a program is compiled with AVX support and another part of the program is compiled without AVX support then call _mm256_zeroupper() before leaving the AVX part.
  • If a function is compiled in multiple versions with and without AVX using CPU dispatching then call _mm256_zeroupper() before leaving the AVX part.
  • If a piece of code compiled with AVX support calls a function in a library other than the library that comes with the compiler, and the library has no AVX support, then call _mm256_zeroupper() before calling the library function.
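
The following sketch (not from the book; the function names are hypothetical) illustrates the third case: a piece of code compiled with AVX support calls a function in a library compiled without AVX support.

#include <immintrin.h>                        // Define AVX intrinsic functions

// Hypothetical function in a library compiled without AVX support:
extern void OldLibraryFunction(float * p, int n);

void ProcessData(float * data, int n) {       // compiled with AVX enabled
   __m256 two = _mm256_set1_ps(2.0f);         // vector of eight values of 2.0
   for (int i = 0; i + 8 <= n; i += 8) {
      __m256 d = _mm256_loadu_ps(data + i);   // load eight floats
      _mm256_storeu_ps(data + i, _mm256_add_ps(d, two));  // add 2 and store
   }
   _mm256_zeroupper();                        // clear upper YMM state before non-AVX code
   OldLibraryFunction(data, n);               // the non-AVX library function
}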

12.2 AVX-512 instruction set and ZMM registers

The first processors with the AVX-512 instruction set are expected to be available in 2016 or 2017. The same coding rules apply as to the AVX instructions. A further extension of vector register sizes to 1024 bits is likely in a more distant future.

12.3 Automatic vectorization

Good compilers such as the Gnu, Clang and Intel compilers can use vector operations automatically in cases where the parallelism is obvious. See the compiler documentation for detailed instructions. Example:

// Example 12.1a. Automatic vectorization
const int size = 1024;
int a[size], b[size];
// ...
for (int i = 0; i < size; i++) {
   a[i] = b[i] + 2;
}

A good compiler will optimize this loop by using vector operations when the SSE2 or later instruction set is specified. The code will read four elements of b into a 128-bit vector register, do an addition with another vector register containing (2,2,2,2), and store the four results in a. This operation will then be repeated 1024/4 = 256 times, and the speed will be improved by a factor of 4 in the best cases. Vectorization works best when the loop count is divisible by the number of elements per vector. You may even add dummy elements at the end of the array to make the array size a multiple of the vector size, as in the sketch below.
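
A sketch (not from the book, with a hypothetical function name) of the dummy-element idea, assuming four 32-bit integers per 128-bit vector:

const int n = 1001;                         // number of elements actually needed
const int paddedSize = (n + 3) & ~3;        // rounded up to a multiple of 4, here 1004
int a[paddedSize], b[paddedSize];           // the last three elements are dummies

void AddTwoPadded() {
   for (int i = 0; i < paddedSize; i++) {   // loop count is divisible by 4
      a[i] = b[i] + 2;
   }
}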

There is a disadvantage when the arrays are accessed through pointers, e.g.:

// Example 12.1b. Vectorization with alignment problem
void AddTwo(int * __restrict aa, int * __restrict bb) {
   for (int i = 0; i < size; i++) {
      aa[i] = bb[i] + 2;
   }
}

The most efficient vector operations require that the arrays are aligned by 16, i.e. stored at a memory address that is divisible by 16. In example 12.1a, the compiler can align the arrays as required, but in example 12.1b, the compiler cannot know for sure whether the arrays are properly aligned or not. The loop can still be vectorized, but the code will be less efficient because the compiler has to take extra precautions to account for unaligned arrays. There are various things you can do to make the code more efficient when arrays are accessed through pointers or references:

  • If the Intel compiler is used, then use #pragma vector aligned or the __assume_aligned directive to tell the compiler that the arrays are aligned, and make sure that they are (see the sketch after this list).
  • Declare the function inline. This may enable the compiler to reduce example 12.1b to 12.1a.
  • Enable the AVX or later instruction set if possible. The AVX instructions have very few restrictions on alignment and the resultant code will be efficient whether the arrays are aligned or not. See page 107 for how to use the AVX instructions.
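
As a rough sketch (Intel compiler only, not from the book), the first point could look like this for the function in example 12.1b; the promise must of course be kept by every caller:

void AddTwo(int * __restrict aa, int * __restrict bb) {
   __assume_aligned(aa, 16);      // promise to the compiler: aa is aligned by 16
   __assume_aligned(bb, 16);      // promise to the compiler: bb is aligned by 16
   for (int i = 0; i < size; i++) {
      aa[i] = bb[i] + 2;
   }
}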

The automatic vectorization works best if the following conditions are satisfied:

1.  Use a compiler that supports automatic vectorization, such as Gnu, Clang, Intel or PathScale.

2.  Use the latest version of the compiler. The compilers are becoming better and better at vectorization.

3.  Use appropriate compiler options to enable the desired instruction set (/arch:SSE2, /arch:AVX, etc. for Windows; -msse2, -mavx, etc. for Linux).

4.  Align arrays and big structures by 16 for SSE2, preferably 32 for AVX.

5.  The loop count should preferably be a constant that is divisible by the number of elements in a vector.

6.  If arrays are accessed through pointers so that the alignment is not visible in the scope of the function where you want vectorization then follow the advice given above.

7.  If the arrays or structures are accessed through pointers or references then tell the compiler explicitly that pointers do not alias, if appropriate.

8.  Avoid branches at the vector element level.

9.  Avoid table lookup at the vector element level.

You may look at the assembly output listing to see if the code is indeed vectorized as intended (see page 84).

The compiler can also use vector operations where there is no loop if the same operation is performed on a sequence of consecutive variables. Example:

// Example 12.2
__declspec(align(16))     // Make all instances of S1 aligned
struct S1 {               // Structure of 4 floats
   float a, b, c, d;
};

void Func() {
   S1 x, y;
   ...
   x.a = y.a + 1.;
   x.b = y.b + 2.;
   x.c = y.c + 3.;
   x.d = y.d + 4.;
}

A structure of four floats fits into a 128-bit XMM register. In example 12.2, the optimized code will load the structure y into a vector register, add the constant vector (1,2,3,4), and store the result in x.
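
As an illustration (a sketch, not from the book), the four scalar assignments inside Func correspond roughly to the following SSE intrinsics, which are declared in <xmmintrin.h>:

__m128 vy = _mm_load_ps(&y.a);                  // load the four floats of y (aligned by 16)
__m128 vc = _mm_setr_ps(1.f, 2.f, 3.f, 4.f);    // the constant vector (1,2,3,4)
_mm_store_ps(&x.a, _mm_add_ps(vy, vc));         // add and store the four results in x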

The compiler is not always able to predict correctly whether vectorization will be advantageous or not. The Intel compiler allows you to use #pragma vector always to tell the compiler to vectorize a loop, or #pragma novector to tell it not to. The pragmas must be placed immediately before the loop or the series of statements that you want them to apply to.
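
For example, the pragma can be placed like this (a sketch for the Intel compiler, using the loop from example 12.1a):

#pragma vector always              // Intel compiler: vectorize the following loop even if
                                   // the compiler's heuristics predict no gain
for (int i = 0; i < size; i++) {
   a[i] = b[i] + 2;
}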

It is advantageous to use the smallest data size that fits the application. In example 12.3a, for example, you can double the speed by using short int instead of int. A short int is 16 bits wide, while an int is 32 bits, so you can have eight numbers of type short int in one vector, but only four numbers of type int. Therefore, it is advantageous to use the smallest integer size that is big enough to hold the numbers in question without overflow. Likewise, it is advantageous to use float rather than double if the code can be vectorized, because a float uses 32 bits while a double uses 64 bits.
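
A sketch (not from the book, with a hypothetical function name) of the loop from example 12.1a with 16-bit elements, assuming all values fit in a short int:

const int size = 1024;
short int a[size], b[size];            // eight 16-bit elements fit in one 128-bit vector,
                                       // against only four elements of type int

void AddTwoShort() {
   for (int i = 0; i < size; i++) {
      a[i] = (short int)(b[i] + 2);    // SSE2 does eight of these additions per vector operation
   }
}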

The SSE2 vector instruction set cannot multiply integers of any size other than short int (16 bits). There are no instructions for integer division in vectors, but the asmlib function library and the vector class library have functions for integer vector division.

12.4 Using intrinsic functions

It is difficult to predict whether the compiler will vectorize a loop or not. The following example shows code that some compilers may not vectorize automatically. The code has a branch that chooses between two expressions for every element in the arrays:

// Example 12.4a. Loop with branch
void SelectAddMul(short int aa[], short int bb[], short int cc[]) {
   for (int i = 0; i < 256; i++) {
      aa[i] = (bb[i] > 0) ? (cc[i] + 2) : (bb[i] * cc[i]);
   }
}

It is possible to vectorize code explicitly by using the so-called intrinsic functions. This is useful in situations like example 12.4a where current compilers don't vectorize the code automatically. It is also useful in situations where automatic vectorization leads to suboptimal code.

Intrinsic functions are primitive operations in the sense that each intrinsic function call is translated to just one or a few machine instructions. Intrinsic functions are supported by the Gnu, Clang, Intel, Microsoft and PathScale compilers. (The PGI compiler supports intrinsic functions, but in a very inefficient way. The Codeplay compiler has some support for intrinsic functions, but the function names are not compatible with the other compilers). The best performance is obtained with the Gnu, Clang and Intel compilers.

We want to vectorize the loop in example 12.4a so that we can handle eight elements at a time in vectors of eight 16-bit integers. The branch inside the loop can be implemented in various ways depending on the available instruction set. The most compatible way is to make a bit-mask which is all 1's when bb[i] > 0 is true, and all 0's when false. The value of cc[i]+2 is AND'ed with this mask, and bb[i]*cc[i] is AND'ed with the inverted mask. The expression that is AND'ed with all 1's is unchanged, while the expression that is AND'ed with all 0's gives zero. An OR combination of these two gives the chosen expression.

Example 12.4b shows how this can be implemented with intrinsic functions for the SSE2 instruction set:

// Example 12.4b. Vectorized with SSE2
#include <emmintrin.h>       // Define SSE2 intrinsic functions

// Function to load unaligned integer vector from array
static inline __m128i LoadVector(void const * p) {
   return _mm_loadu_si128((__m128i const*)p);
}

// Function to store unaligned integer vector into array
static inline void StoreVector(void * d, __m128i const & x) {
   _mm_storeu_si128((__m128i *)d, x);
}

// Branch/loop function vectorized:
void SelectAddMul(short int aa[], short int bb[], short int cc[]) {

   // Make a vector of (0,0,0,0,0,0,0,0)
   __m128i zero = _mm_set1_epi16(0);
   // Make a vector of (2,2,2,2,2,2,2,2)
   __m128i two  = _mm_set1_epi16(2);

   // Roll out loop by eight to fit the eight-element vectors:
   for (int i = 0; i < 256; i += 8) {
      // Load eight consecutive elements from bb into vector b:
      __m128i b = LoadVector(bb + i);
      // Load eight consecutive elements from cc into vector c:
      __m128i c = LoadVector(cc + i);
      // Add 2 to each element in vector c
      __m128i c2 = _mm_add_epi16(c, two);
      // Multiply b and c
      __m128i bc = _mm_mullo_epi16(b, c);
      // Compare each element in b to 0 and generate a bit-mask:
      __m128i mask = _mm_cmpgt_epi16(b, zero);
      // AND each element in vector c2 with the bit-mask:
      c2 = _mm_and_si128(c2, mask);
      // AND each element in vector bc with the inverted bit-mask:
      bc = _mm_andnot_si128(mask, bc);
      // OR the results of the two AND operations:
      __m128i a = _mm_or_si128(c2, bc);
      // Store the result vector in eight consecutive elements in aa:
      StoreVector(aa + i, a);
   }
}

The resulting code will be very efficient because it handles eight elements at a time and it avoids the branch inside the loop. Example 12.4b executes three to seven times faster than example 12.4a, depending on how predictable the branch inside the loop is.

The type __m128i defines a 128-bit vector containing integers. It can contain either sixteen integers of 8 bits each, eight integers of 16 bits each, four integers of 32 bits each, or two integers of 64 bits each. The type __m128 defines a 128-bit vector of four floats. The type __m128d defines a 128-bit vector of two doubles.
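
For illustration (a minimal sketch, not from the book), the three types can be declared and initialized like this:

#include <emmintrin.h>                      // Define SSE2 intrinsic functions

__m128i eightShorts = _mm_set1_epi16(1);    // eight 16-bit integers, all set to 1
__m128  fourFloats  = _mm_set1_ps(1.0f);    // four floats, all set to 1.0
__m128d twoDoubles  = _mm_set1_pd(1.0);     // two doubles, all set to 1.0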

The intrinsic vector functions have names that begin with _mm. These functions are listed in the compiler manual or in the programming manuals from Intel: "IA-32 Intel Architecture Software Developer’s Manual", Volume 2A and 2B. There are hundreds of different intrinsic functions and it can be difficult to find the right function for a particular purpose.

The clumsy AND-OR construction in example 12.4b can be replaced by a blend instruction if the SSE4.1 instruction set is available:

// Example 12.4c. Same example, vectorized with SSE4.1
#include <smmintrin.h>       // Define SSE4.1 intrinsic functions

// Function to load unaligned integer vector from array
static inline __m128i LoadVector(void const * p) {
   return _mm_loadu_si128((__m128i const*)p);
}