Optimizing Subroutines in Assembly Language by Agner Fog

3 The basics of assembly coding

3.1 Assemblers available

There are several assemblers available for the x86 instruction set, but currently none of them is good enough for universal recommendation. Assembly programmers are in the unfortunate situation that there is no universally agreed syntax for x86 assembly. Different assemblers use different syntax variants. The most common assemblers are listed below.

MASM

The Microsoft assembler is included with Microsoft C++ compilers. Free versions can sometimes be obtained by downloading the Microsoft Windows Driver Kit (WDK) or the Platform Software Development Kit (SDK), or as an add-on to the free Visual C++ Express Edition. MASM has been a de facto standard in the Windows world for many years, and the assembly output of most Windows compilers uses MASM syntax. MASM has many advanced language features. The syntax is somewhat messy and inconsistent due to a heritage that dates back to the very first assemblers for the 8086 processor. Microsoft still maintains MASM in order to provide a complete set of development tools for Windows, but it is obviously not profitable and the maintenance of MASM is apparently kept at a minimum. New instruction sets are still added regularly, but the 64-bit version has several deficiencies. Newer versions can run only when the compiler is installed and only in Windows XP or later. Version 6 and earlier can run in any system, including Linux with a Windows emulator. Such versions are circulating on the web.

GAS

The Gnu assembler is part of the Gnu Binutils package that is included with most distributions of Linux, BSD and Mac OS X. The Gnu compilers produce assembly output that goes through the Gnu assembler before it is linked. The Gnu assembler traditionally uses the AT&T syntax that works well for machine-generated code, but it is very inconvenient for human-generated assembly code. The AT&T syntax has the operands in an order that differs from all other x86 assemblers and from the instruction documentation published by Intel and AMD. It uses various prefixes like % and $ for specifying operand types. The Gnu assembler is available for all x86 platforms.

Fortunately, newer versions of the Gnu assembler have an option for using Intel syntax instead. The Gnu-Intel syntax is almost identical to MASM syntax. It defines only the syntax for instruction codes, not for directives, functions, macros, etc. The directives still use the old Gnu-AT&T syntax. Specify .intel_syntax noprefix to use the Intel syntax. Specify .att_syntax prefix to return to the AT&T syntax before leaving inline assembly in C or C++ code.

NASM

NASM is a free open source assembler with support for several platforms and object file formats. The syntax is clearer and more consistent than MASM syntax. NASM is updated regularly with new instruction sets. NASM has fewer high-level features than MASM, but it is sufficient for most purposes. I recommend NASM as a very good multi-platform assembler.

YASM

YASM is very similar to NASM and uses exactly the same syntax. At times YASM has been the first to support new instruction sets, at other times NASM has. YASM and NASM may be used interchangeably.

FASM

The Flat assembler is another open source assembler for multiple platforms. The syntax is not compatible with other assemblers. FASM is itself written in assembly language - an enticing idea, but unfortunately this makes the development and maintenance of it less efficient.

WASM

The WASM assembler is included with the Open Watcom C++ compiler. The syntax resembles MASM but is somewhat different. It is not fully up to date.

JWASM

JWASM is a further development of WASM. It is fully compatible with MASM syntax, including advanced macro and high level directives. JWASM is a good choice if MASM syntax is desired.

TASM

Borland Turbo Assembler is included with CodeGear C++ Builder. It is compatible with MASM syntax except for some newer syntax additions. TASM is no longer maintained but is still available. It is obsolete and does not support current instruction sets.

GOASM

GoAsm is a free assembler for 32- and 64-bit Windows, including a resource compiler, linker and debugger. The syntax is similar to MASM but not fully compatible. It is currently not up to date with the latest instruction sets. An integrated development environment (IDE) named Easy Code is also available.

HLA

High Level Assembler is actually a high level language compiler that allows assembly-like statements and produces assembly output. This was probably a good idea at the time it was invented, but today, when the best C++ compilers support intrinsic functions, I believe that HLA is no longer needed.

Inline assembly

Microsoft and Intel C++ compilers support inline assembly using a subset of the MASM syntax. It is possible to access C++ variables, functions and labels simply by inserting their names in the assembly code. This is easy, but does not support C++ register variables. See page 36.

The Gnu compiler supports inline assembly with access to the full range of instructions and directives of the Gnu assembler in both Intel and AT&T syntax. The access to C++ variables from assembly uses a quite complicated method.

The Intel compilers for Linux and Mac systems support both the Microsoft style and the Gnu style of inline assembly.

Intrinsic functions in C++

This is the newest and most convenient way of combining low-level and high-level code. Intrinsic functions are high-level language representatives of machine instructions. For example, you can do a vector addition in C++ by calling the intrinsic function that is equivalent to an assembly instruction for vector addition. Furthermore, it is possible to define a vector class with an overloaded + operator so that a vector addition is obtained simply by writing +. Intrinsic functions are supported by Microsoft, Intel, Gnu and Clang compilers. See page 34 and manual 1: "Optimizing software in C++".
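The idea of wrapping an intrinsic in a vector class can be sketched as follows. This is a minimal illustration, not the vector class referred to in manual 1: the names Vec4f and add_arrays are hypothetical, and the sketch assumes the SSE header xmmintrin.h is available on x86 targets. A plain scalar stand-in is used on other targets so that the code still compiles.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...

// Hypothetical wrapper around one XMM register holding four floats.
struct Vec4f {
    __m128 v;
    static Vec4f load(const float* p) { return Vec4f{_mm_loadu_ps(p)}; }  // unaligned load
    void store(float* p) const { _mm_storeu_ps(p, v); }                   // unaligned store
};

// The overloaded + calls _mm_add_ps, which corresponds to the ADDPS instruction.
inline Vec4f operator+(const Vec4f& a, const Vec4f& b) {
    return Vec4f{_mm_add_ps(a.v, b.v)};
}

#else

// Portable stand-in (assumption: used only where SSE is not available).
struct Vec4f {
    float v[4];
    static Vec4f load(const float* p) {
        Vec4f r;
        for (int i = 0; i < 4; i++) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const {
        for (int i = 0; i < 4; i++) p[i] = v[i];
    }
};

inline Vec4f operator+(const Vec4f& a, const Vec4f& b) {
    Vec4f r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

#endif

// c[i] = a[i] + b[i] for n floats, four elements at a time.
inline void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        (Vec4f::load(a + i) + Vec4f::load(b + i)).store(c + i);
    }
    for (; i < n; i++) {   // scalar loop for the remaining 0-3 elements
        c[i] = a[i] + b[i];
    }
}
```

Calling add_arrays(a, b, c, n) adds two float arrays element by element. The caller writes only +, and the compiler is free to schedule and allocate registers for the resulting vector instructions.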

Which assembler to choose?

In most cases, the easiest solution is to use intrinsic functions in C++ code. The compiler can take care of most of the optimization so that the programmer can concentrate on choosing the best algorithm and organizing the data into vectors. System programmers can access system instructions by using intrinsic functions without having to use assembly language.

Where real low-level programming is needed, such as in highly optimized function libraries or device drivers, you may use an assembler.

It may be preferable to use an assembler that is compatible with the C++ compiler you are using. This allows you to use the compiler for translating C++ to assembly, optimize the assembly code further, and then assemble it. If the assembler is not compatible with the syntax generated by the compiler then you may generate an object file with the compiler and disassemble the object file to the assembly syntax you need. The objconv disassembler supports several different syntax dialects.

The NASM assembler is a good choice for many purposes because it supports many platforms and object file formats, it is well maintained, and usually up to date with the latest instruction sets.

The examples in this manual use MASM syntax, unless otherwise noted. The MASM syntax is described in Microsoft Macro Assembler Reference at msdn.microsoft.com.

See www.agner.org/optimize for links to various syntax manuals, coding manuals and discussion forums.

3.2 Register set and basic instructions

Registers in 16 bit mode

General purpose and integer registers

The 32-bit registers are also available in 16-bit mode if supported by the microprocessor and operating system. The high word of ESP should not be used because it is not saved during interrupts.

Floating point registers

MMX registers may be available if supported by the microprocessor. XMM registers may be available if supported by microprocessor and operating system.

Segment registers

Registers in 32 bit mode

General purpose and integer registers

The MMX registers are only available if supported by the microprocessor. The ST and MMX registers cannot be used in the same part of the code. A section of code using MMX registers must be separated from any subsequent section using ST registers by executing an EMMS instruction.

The XMM registers are only available if supported both by the microprocessor and the operating system. Scalar floating point instructions use only 32 or 64 bits of the XMM registers for single or double precision, respectively. The YMM registers are available only if the processor and the operating system support the AVX instruction set. The ZMM registers are available if the processor supports the AVX-512 instruction set.

Registers in 64 bit mode

The high 8-bit registers AH, BH, CH, DH can only be used in instructions that have no REX prefix.

Note that modifying a 32-bit partial register will set the rest of the register (bits 32-63) to zero, but modifying an 8-bit or 16-bit partial register does not affect the rest of the register. This can be illustrated by the following sequence:

; Example 3.1. 8, 16, 32 and 64 bit registers
mov rax, 1111111111111111H   ; rax = 1111111111111111H
mov eax, 22222222H           ; rax = 0000000022222222H
mov ax,  3333H               ; rax = 0000000022223333H
mov al,  44H                 ; rax = 0000000022223344H

There is a good reason for this inconsistency. Setting the unused part of a register to zero is more efficient than leaving it unchanged because this removes a false dependence on previous values. But the principle of resetting the unused part of a register cannot be extended to 16 bit and 8 bit partial registers because this would break the backwards compatibility with 32-bit and 16-bit modes.
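The write rules can also be modeled in a few lines of C++. This is only an arithmetic model of the register contents, not code that touches the real registers. The function names write32, write16 and write8 are hypothetical.

```cpp
#include <cstdint>

// Model of a 64-bit register as an unsigned integer (illustration only).

// Writing the 32-bit partial register clears bits 32-63.
constexpr std::uint64_t write32(std::uint64_t /*old_value*/, std::uint32_t v) {
    return v;
}

// Writing the 16-bit partial register leaves bits 16-63 unchanged.
constexpr std::uint64_t write16(std::uint64_t old_value, std::uint16_t v) {
    return (old_value & ~std::uint64_t{0xFFFF}) | v;
}

// Writing the low 8-bit partial register leaves bits 8-63 unchanged.
constexpr std::uint64_t write8(std::uint64_t old_value, std::uint8_t v) {
    return (old_value & ~std::uint64_t{0xFF}) | v;
}
```

Only write32 ignores the old value. The other two depend on it, which is the dependence on previous contents that the 32-bit rule avoids.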

The only instruction that can have a 64-bit immediate data operand is MOV. Other integer instructions can only have a 32-bit sign extended operand. Examples:

; Example 3.2. Immediate operands, full and sign extended
mov rax, 1111111111111111H   ; Full 64 bit immediate operand
mov rax, -1                  ; 32 bit sign-extended operand
mov eax, 0ffffffffH          ; 32 bit zero-extended operand
add rax, 1                   ; 8 bit sign-extended operand
add rax, 100H                ; 32 bit sign-extended operand
add eax, 100H                ; 32 bit operand. result is zero-extended
mov rbx, 100000000H          ; 64 bit immediate operand
add rax, rbx                 ; Use an extra register if big operand

It is not possible to use a 16-bit sign-extended operand. If you need to add an immediate value to a 64-bit register and the value is too big to fit into a 32-bit sign-extended operand, then it is necessary to first move the value into another register.
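Whether a constant needs the extra register can be checked with a simple range test. The following sketch uses the hypothetical helper names fits_simm8 and fits_simm32, and it describes only the value ranges, not how a particular assembler selects an encoding.

```cpp
#include <cstdint>

// True if v can be encoded as an 8-bit immediate that is sign-extended.
constexpr bool fits_simm8(std::int64_t v) {
    return v >= -128 && v <= 127;
}

// True if v can be encoded as a 32-bit immediate that is sign-extended to 64 bits.
constexpr bool fits_simm32(std::int64_t v) {
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

Note that 0FFFFFFFFH does not pass fits_simm32 when the destination is a 64-bit register, because a 32-bit immediate with all bits set is sign-extended to -1 rather than to 00000000FFFFFFFFH.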

The ST and MMX registers cannot be used in the same part of the code. A section of code using MMX registers must be separated from any subsequent section using ST registers by executing an EMMS instruction. The ST and MMX registers cannot be used in device drivers for 64-bit Windows.

Scalar floating point instructions use only 32 or 64 bits of the XMM registers for single or double precision, respectively. The YMM registers are available only if the processor and the operating system support the AVX instruction set. The ZMM registers are available only if the processor supports the AVX-512 instruction set. It may be possible to use XMM16-31 and YMM16-31 when AVX-512 is supported by the processor and the assembler.

Segment registers

Segment registers are only used for special purposes.

3.3 Addressing modes

Addressing in 16-bit mode

16-bit code uses a segmented memory model. A memory operand can have any of these components:

A segment specification. This can be any segment register or a segment or group name associated with a segment register. (The default segment is DS, except if BP is used as base register). The segment can be implied from a label defined inside a segment.
A label defining a relocatable offset. The offset relative to the start of the segment is calculated by the linker.
An immediate offset. This is a constant. If there is also a relocatable offset then the values are added.
A base register. This can only be BX or BP.
An index register. This can only be SI or DI. There can be no scale factor.

A memory operand can have all of these components. An operand containing only an immediate offset is not interpreted as a memory operand by the MASM assembler, even if it has a []. Examples:

; Example 3.3. Memory operands in 16-bit mode
MOV AX, DS:[100H]   ; Address has segment and immediate offset
ADD AX, MEM[SI]+4   ; Has relocatable offset and index and immediate

Data structures bigger than 64 kB are handled in the following ways:

In real mode and virtual mode (DOS): Adding 1 to the segment register corresponds to adding 10H to the offset.
In protected mode (Windows 3.x): Adding 8 to the segment register corresponds to adding 10000H to the offset. The value added to the segment must be a multiple of 8.
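The real-mode rule follows from the way the processor forms an address: the segment value is multiplied by 16 and added to the offset. A minimal model is shown below. The function name real_mode_linear is hypothetical, and the sketch ignores address wraparound above 1 MB. The protected-mode case is not modeled because it depends on how the operating system lays out its descriptor tables.

```cpp
#include <cstdint>

// Real-mode linear address = segment * 16 + offset.
constexpr std::uint32_t real_mode_linear(std::uint16_t segment, std::uint16_t offset) {
    return (std::uint32_t{segment} << 4) + offset;
}
```

With this model, segment 1001H with offset 0 addresses the same byte as segment 1000H with offset 10H, so stepping through a large array in 64 kB blocks corresponds to adding 1000H to the segment register.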

Addressing in 32-bit mode

32-bit code uses a flat memory model in most cases. Segmentation is possible but only used for special purposes (e.g. thread environment block in FS).

A memory operand can have any of these components:

A segment specification. Not used in flat mode.
A label defining a relocatable offset. The offset relative to the FLAT segment group is calculated by the linker.
An immediate offset. This is a constant. If there is also a relocatable offset then the values are added.
A base register. This can be any 32 bit register.
An index register. This can be any 32 bit register except ESP.
A scale factor applied to the index register. Allowed values are 1, 2, 4, 8.

A memory operand can have all of these components. Examples:

; Example 3.4. Memory operands in 32-bit mode
mov eax, fs:[10H]        ; Address has segment and immediate offset
add eax, mem[esi]        ; Has relocatable offset and index
add eax, [esp+ecx*4+8]   ; Base, index, scale and immediate offset
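The address used by an operand such as [esp+ecx*4+8] is the sum of its components. The following sketch computes that sum for a flat memory model, ignoring segment bases. The names effective_address32 and valid_scale are hypothetical.

```cpp
#include <cstdint>

// Only these scale factors can be encoded.
constexpr bool valid_scale(std::uint32_t scale) {
    return scale == 1 || scale == 2 || scale == 4 || scale == 8;
}

// base + index * scale + offset, wrapping modulo 2^32 as 32-bit addresses do.
constexpr std::uint32_t effective_address32(std::uint32_t base, std::uint32_t index,
                                            std::uint32_t scale, std::int32_t offset) {
    return base + index * scale + static_cast<std::uint32_t>(offset);
}
```

For example, with esp = 1000H and ecx = 3 the operand [esp+ecx*4+8] refers to address 1014H.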

Position-independent code in 32-bit mode

Position-independent code is required for making shared objects (*.so) in 32-bit Unix-like systems. The method commonly used for making position-independent code in 32-bit Linux and BSD is to use a global offset table (GOT) containing the addresses of all static objects. The GOT method is quite inefficient because the code has to fetch an address from the GOT every time it reads or writes data in the data segment. A faster method is to use an arbitrary reference point, as shown in the following example:

; Example 3.5a. Position-independent code, 32 bit, YASM syntax
SECTION .data
alpha: dd 1
beta:  dd 2

SECTION .text
funca:                              ; This function returns alpha + beta
  call get_thunk_ecx                ; get ecx = eip
refpoint:                           ; ecx points here
  mov eax, [ecx+alpha-refpoint]     ; relative address
  add eax, [ecx+beta -refpoint]     ; relative address
  ret