Optimizing Subroutines in Assembly Language by Agner Fog


6 Using inline assembly

Inline assembly is another way of putting assembly code into a C++ file. The keyword asm or _asm or __asm or __asm__ tells the compiler that the code is assembly. Different compilers have different syntaxes for inline assembly. The different syntaxes are explained below.

The advantages of using inline assembly are:

It is easy to combine with C++.
Variables and other symbols defined in C++ code can be accessed from the assembly code.
Only the part of the code that cannot be coded in C++ is coded in assembly.
All assembly instructions are available.
The code generated is exactly what you write.
It is possible to optimize in detail.
The compiler takes care of calling conventions, name mangling and saving registers.
The compiler can inline a function containing inline assembly.
Portable to different x86 platforms when using the Intel compiler.

The disadvantages of using inline assembly are:

Different compilers use different syntax.
Requires knowledge of assembly language.
Requires a good understanding of how the compiler works. It is easy to make errors.
The allocation of registers is mostly done manually. The compiler may allocate different registers for the same variables.
The compiler cannot optimize well across the inline assembly code.
It may be difficult to control function prolog and epilog code.
It may not be possible to define data.
It may not be possible to use macros and other directives.
It may not be possible to make functions with multiple entries.
You may inadvertently mix VEX and non-VEX instructions, whereby large penalties are incurred (see chapter 13.6).
The Microsoft compiler does not support inline assembly on 64-bit platforms.
The Borland compiler is poor on inline assembly.

The following sections illustrate how to make inline assembly with different compilers.

6.1 MASM style inline assembly

The most common syntax for inline assembly is a MASM-style syntax. This is the easiest way of making inline assembly and it is supported by most compilers, but not the Gnu compiler. Unfortunately, the syntax for inline assembly is poorly documented or not documented at all in the compiler manuals. I will therefore briefly describe the syntax here.

The following examples show a function that raises a floating point number x to an integer power n. The algorithm multiplies together those of the factors x^1, x^2, x^4, x^8, etc. that correspond to a 1 bit in the binary representation of n. Actually, it is not necessary to code this in assembly because a good compiler will optimize it almost as much when you just write pow(x,n). My purpose here is just to illustrate the syntax of inline assembly.

First the code in C++ to illustrate the algorithm:

// Example 6.1a. Raise double x to the power of int n.
#include <cstdlib>                     // abs
double ipow (double x, int n) {
   unsigned int nn = abs(n);           // absolute value of n
   double y = 1.0;                     // used for multiplication
   while (nn != 0) {                   // loop for each bit in nn
      if (nn & 1) y *= x;              // multiply if bit = 1
      x *= x;                          // square x
      nn >>= 1;                        // get next bit of nn
   }
   if (n < 0) y = 1.0 / y;             // reciprocal if n is negative
   return y;                           // return y = pow(x,n)
}
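
As a worked case, n = 13 is 1101 in binary, so the loop forms x^13 as x^1 * x^4 * x^8. Before rewriting a routine like this in assembly it can be useful to compare it against std::pow on a few inputs. The following is only a sketch of such a check: it repeats Example 6.1a so that it is self-contained, and the helper name ipow_matches_pow and the tolerance value are assumptions made for this illustration, not part of the original example.

```cpp
#include <cmath>    // std::pow, std::fabs
#include <cstdlib>  // std::abs

// Example 6.1a, repeated so that this sketch is self-contained.
double ipow(double x, int n) {
    unsigned int nn = std::abs(n);   // absolute value of n
    double y = 1.0;                  // running product
    while (nn != 0) {                // one iteration per bit of nn
        if (nn & 1) y *= x;          // multiply if the current bit is 1
        x *= x;                      // square x
        nn >>= 1;                    // next bit
    }
    if (n < 0) y = 1.0 / y;          // reciprocal for negative n
    return y;
}

// Hypothetical helper: true if ipow agrees with std::pow
// within a relative tolerance (the default is an arbitrary choice).
bool ipow_matches_pow(double x, int n, double rel_tol = 1e-12) {
    double expected = std::pow(x, n);
    double actual = ipow(x, n);
    return std::fabs(actual - expected) <= rel_tol * std::fabs(expected);
}
```

A call such as ipow_matches_pow(1.5, 13) exercises the three multiplications described above.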

And then the optimized code using inline assembly with MASM style syntax:

// Example 6.1b. MASM style inline assembly, 32 bit mode
double ipow (double x, int n) {
   __asm {
      mov eax, n              // Move n to eax
      // abs(n) is calculated by inverting all bits and adding 1 if n < 0:
      cdq                     // Get sign bit into all bits of edx
      xor eax, edx            // Invert bits if negative
      sub eax, edx            // Add 1 if negative. Now eax = abs(n)
      fld1                    // st(0) = 1.0
      jz L9                   // End if n = 0
      fld qword ptr x         // st(0) = x, st(1) = 1.0
      jmp L2                  // Jump into loop
L1:                           // Top of loop
      fmul st(0), st(0)       // Square x
L2:                           // Loop entered here
      shr eax, 1              // Get each bit of n into carry flag
      jnc L1                  // No carry. Skip multiplication, goto next
      fmul st(1), st(0)       // Multiply by x squared i times for bit # i
      jnz L1                  // End of loop. Stop when nn = 0
      fstp st(0)              // Discard st(0)
      test edx, edx           // Test if n was negative
      jns L9                  // Finish if n was not negative
      fld1                    // st(0) = 1.0, st(1) = x^abs(n)
      fdivr                   // Reciprocal
L9:                           // Finish
   }                          // Result is in st(0)
#pragma warning(disable:1011) // Don't warn for missing return value
}
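
The loop in Example 6.1b is entered in the middle and exits only right after a multiplication, which saves the final, unused squaring of x that Example 6.1a performs. The x87 instructions fld1 and fmul do not change the integer flags, so jz and jnz still see the flags set by the preceding sub and shr. The control flow can be sketched in C++ roughly as follows. This is an illustrative rendering, not compiler output, and the function name ipow_flow is an assumption made here.

```cpp
#include <cstdint>

// Hypothetical C++ rendering of the control flow in Example 6.1b.
double ipow_flow(double x, int n) {
    // abs(n) as an unsigned value (also well defined for the most negative int)
    std::uint32_t nn = (n < 0) ? 0u - static_cast<std::uint32_t>(n)
                               : static_cast<std::uint32_t>(n);
    double y = 1.0;                       // fld1
    if (nn != 0) {                        // jz L9 skips everything when n = 0
        for (;;) {                        // the loop is entered at L2
            bool carry = (nn & 1u) != 0;  // shr eax, 1 moves bit 0 into the carry flag
            nn >>= 1;
            if (carry) {                  // jnc L1 skips the multiplication
                y *= x;                   // fmul st(1), st(0)
                if (nn == 0) break;       // jnz L1 falls through when no bits remain
            }
            x *= x;                       // L1: fmul st(0), st(0)
        }
    }
    if (n < 0) y = 1.0 / y;               // test edx, edx / fld1 / fdivr
    return y;
}
```

The loop cannot run forever: nn is nonzero on entry, and whenever a 0 bit is shifted out at least one 1 bit must remain, so the exit test is only needed after a 1 bit.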

Note that the function entry and parameters are declared with C++ syntax. The function body, or part of it, can then be coded with inline assembly. The parameters x and n, which are declared with C++ syntax, can be accessed directly in the assembly code using the same names. The compiler simply replaces x and n in the assembly code with the appropriate memory operands, probably [esp+4] and [esp+12]. If the inline assembly code needs to access a variable that happens to be in a register, then the compiler will store it to a memory variable on the stack and then insert the address of this memory variable in the inline assembly code.

The result is returned in st(0) according to the 32-bit calling convention. The compiler will normally issue a warning because there is no return y; statement at the end of the function. This statement is not needed if you know which register the value is returned in. The #pragma warning(disable:1011) removes the warning. If you want the code to work with different calling conventions (e.g. 64-bit systems) then it is necessary to store the result in a temporary variable inside the assembly block:

// Example 6.1c. MASM style, independent of calling convention
double ipow (double x, int n) {
   double result;             // Define temporary variable for result
   __asm {
      mov eax, n
      cdq
      xor eax, edx
      sub eax, edx
      fld1
      jz L9
      fld qword ptr x
      jmp L2
L1:   fmul st(0), st(0)
L2:   shr eax, 1
      jnc L1
      fmul st(1), st(0)
      jnz L1
      fstp st(0)
      test edx, edx
      jns L9
      fld1
      fdivr
L9:   fstp qword ptr result   // store result to temporary variable
   }
   return result;
}

Now the compiler takes care of all aspects of the calling convention and the code works on all x86 platforms.
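
The cdq / xor eax,edx / sub eax,edx sequence at the start of these examples computes abs(n) without a branch. A minimal C++ sketch of the same arithmetic is shown below. The function name abs_via_sign_mask is an assumption made for illustration, and the mask is built with a comparison rather than a shift so that the sketch stays within portable C++.

```cpp
#include <cstdint>

// Hypothetical helper mirroring cdq / xor eax,edx / sub eax,edx.
std::uint32_t abs_via_sign_mask(std::int32_t n) {
    // cdq: every bit of the mask is a copy of the sign bit of n
    std::uint32_t mask = (n < 0) ? 0xFFFFFFFFu : 0u;
    std::uint32_t value = static_cast<std::uint32_t>(n);
    value ^= mask;   // xor eax, edx: invert all bits if n is negative
    value -= mask;   // sub eax, edx: subtracting 0xFFFFFFFF adds 1 modulo 2^32
    return value;    // abs(n) as an unsigned value
}
```

For non-negative n the mask is zero and both operations leave the value unchanged. Because the result is unsigned, the most negative 32-bit integer maps to 2147483648 rather than overflowing.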

The compiler inspects the inline assembly code to see which registers are modified. The compiler will automatically save and restore these registers if required by the register usage convention. In some compilers it is not allowed to modify register ebp or ebx in the inline assembly code because these registers are needed for a stack frame. The compiler will generally issue a warning in this case.

It is possible to remove the automatically generated prolog and epilog code by adding __declspec(naked) to the function declaration. In this case it is the responsibility of the programmer to add any necessary prolog and epilog code and to save any modified registers if necessary. The only thing the compiler takes care of in a naked function is name mangling. Automatic variable name substitution may not work with naked functions because it depends on how the function prolog is made. A naked function cannot be inlined.

Accessing register variables

Register variables cannot be accessed directly by their symbolic names in MASM-style inline assembly. Accessing a variable by name in inline assembly code will force the compiler to store the variable to a temporary memory location.

If you know which register a variable is in then you can simply write the name of the register. This makes the code more efficient but less portable.

For example, if the code in the above example is used in 64-bit Windows, then x will be in register XMM0 and n will be in register EDX. Taking advantage of this knowledge, we can improve the code:

// Example 6.1d. MASM style, 64-bit Windows
double ipow (double x, int n) {
   const double one = 1.0;    // define constant 1.0
   __asm {                    // x is in xmm0
      mov eax, edx            // get n into eax
      cdq
      xor eax, edx
      sub eax, edx
      movsd xmm1, one         // load 1.0
      jz L8                   // n = 0: the result is 1.0, not x
      jmp L2
L1:   mulsd xmm0, xmm0        // square x
L2:   shr eax, 1
      jnc L1
      mulsd xmm1, xmm0        // Multiply by x squared i times
      jnz L1
L8:   movsd xmm0, xmm1        // Put result in xmm0
      test edx, edx
      jns L9
      movsd xmm0, one
      divsd xmm0, xmm1        // Reciprocal
L9:   }
#pragma warning(disable:1011) // Don't warn for missing return value
}

In 64-bit Linux we will have n in register EDI so the line mov eax,edx should be changed to mov eax,edi.

Accessing class members and structure members

Let's take as an example a C++ class containing a list of integers:

// Example 6.2a. Accessing class data members
// define C++ class
class MyList {
protected:
   int length;                      // Number of items in list
   int buffer[100];                 // Store items
public:
   MyList();                        // Constructor
   void AttItem(int item);          // Add item to list
   int Sum();                       // Compute sum of items
};

MyList::MyList() {                  // Constructor
   length = 0;
}

void MyList::AttItem(int item) {    // Add item to list
   if (length < 100) {
      buffer[length++] = item;
   }
}

int MyList::Sum() {                 // Member function Sum
   int i, sum = 0;
   for (i = 0; i < length; i++) sum += buffer[i];
   return sum;
}

Below, I will show how to code the member function MyList::Sum in inline assembly. I have not tried to optimize the code. My purpose here is simply to show the syntax.

Class members are accessed by loading 'this' into a pointer register and addressing class data members relative to the pointer with the dot operator (.).

// Example 6.2b. Accessing class members (32-bit)
int MyList::Sum() {
   __asm {
      mov ecx, this                    // 'this' pointer
      xor eax, eax                     // sum = 0
      xor edx, edx                     // loop index, i = 0
      cmp [ecx].length, 0              // if (this->length != 0)
      je L9
L1:   add eax, [ecx].buffer[edx*4]     // sum += buffer[i]
      add edx, 1                       // i++
      cmp edx, [ecx].length            // while (i < length)
      jb L1
L9:
   }                                   // Return value is in eax
#pragma warning(disable:1011)
}

Here the 'this' pointer is accessed by its name 'this', and all class data members are addressed relative to 'this'. The offset of the class member relative to 'this' is obtained by writing the member name preceded by the dot operator. The index into the array named buffer must be multiplied by the size of each element in buffer, hence the scaled index [edx*4].
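
The address arithmetic behind [ecx].length and [ecx].buffer[edx*4] can be spelled out in C++ with offsetof. The following is a sketch under stated assumptions: PlainList is a hypothetical stand-in for MyList with public members (so that offsetof can be applied from outside the class), and sum_by_offset is an illustrative name. It is not how the compiler implements the dot operator internally, only an equivalent calculation.

```cpp
#include <cstddef>  // offsetof, std::size_t
#include <cstring>  // std::memcpy

// Hypothetical stand-in for MyList with public data members.
struct PlainList {
    int length;       // Number of items in list
    int buffer[100];  // Store items
};

// Sum the items using explicit byte offsets from the object address.
int sum_by_offset(const PlainList* list) {
    // Corresponds to loading 'this' into ecx
    const unsigned char* base = reinterpret_cast<const unsigned char*>(list);

    int length;
    // [ecx].length: base address plus the offset of the member
    std::memcpy(&length, base + offsetof(PlainList, length), sizeof length);

    int sum = 0;
    for (int i = 0; i < length; i++) {
        int item;
        // [ecx].buffer[edx*4]: base + offset of buffer + index * element size
        std::memcpy(&item,
                    base + offsetof(PlainList, buffer)
                         + static_cast<std::size_t>(i) * sizeof(int),
                    sizeof item);
        sum += item;
    }
    return sum;
}
```

The factor sizeof(int) plays the role of the scale factor 4 in Example 6.2b, which assumes 4-byte integers.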

Some 32-bit compilers for Windows put 'this' in ecx, so the instruction mov ecx,this can be omitted. 64-bit systems require 64-bit pointers, so ecx should be replaced by rcx and edx by rdx. 64-bit Windows has 'this' in rcx, while 64-bit Linux has 'this' in rdi.

Structure members are accessed in the same way by loading a pointer to the structure into a register and using the dot operator. There is no syntax check against accessing private and protected members. There is no way to resolve the ambiguity if more than one structure or class has a member with the same name. The MASM assembler can resolve such ambiguities by using the assume directive or by putting the name of the structure before the dot, but this is not possible with inline assembly.

Calling functions

Functions are called by their name in inline assembly. Member functions can only be called from other member functions of the same class. Overloaded functions cannot be called because there is no way to resolve the ambiguity. It is not possible to use mangled function names. It is the responsibility of the programmer to put any function parameters on the stack or in the right registers before calling a function and to clean up the stack after the call. It is also the programmer's responsibility to save any registers that must be preserved across the function call, unless these registers have callee-save status.

Because of these complications, I recommend that you leave the assembly block and use C++ syntax when making function calls.

Syntax overview

The syntax for MASM-style inline assembly is not well described in any compiler manual I have seen. I will therefore summarize the most important rules here.

In most cases, the MASM-style inline assembly is interpreted by the compiler without in