Optimizing Subroutines in Assembly Language by Agner Fog


10 Optimizing for size

The code cache can hold from 8 to 32 kb of code, as explained in chapter 11. If it is difficult to keep the critical parts of the code within the code cache, then you may consider reducing the size of the code. Reducing the code size can also improve the decoding of instructions. Loops that have no more than 64 bytes of code perform particularly fast on the Core2 processor.

You may even want to reduce the size of the code at the cost of reduced speed if speed is not important.

32-bit code is usually bigger than 16-bit code because addresses and data constants take 4 bytes in 32-bit code and only 2 bytes in 16-bit code. However, 16-bit code has other penalties, especially because of segment prefixes. 64-bit code does not need more bytes for addresses than 32-bit code because it can use 32-bit RIP-relative addresses. 64-bit code may be slightly bigger than 32-bit code because of REX prefixes and other minor differences, but it may just as well be smaller than 32-bit code because the increased number of registers reduces the need for memory variables.

10.1 Choosing shorter instructions

Certain instructions have short forms. PUSH and POP instructions with an integer register take only one byte. XCHG EAX,reg32 is also a single-byte instruction and thus takes less space than a MOV instruction, but XCHG is slower than MOV. INC and DEC with a 32-bit register in 32-bit mode, or a 16-bit register in 16-bit mode, take only one byte. The short form of INC and DEC is not available in 64-bit mode.

The following instructions take one byte less when they use the accumulator than when they use any other register: ADD, ADC, SUB, SBB, AND, OR, XOR, CMP, TEST with an immediate operand without sign extension. This also applies to the MOV instruction with a memory operand and no pointer register in 16 and 32 bit mode, but not in 64 bit mode. Examples:

; Example 10.1. Instruction sizes

add eax,1000 is smaller than add ebx,1000

mov eax,[mem] is smaller than mov ebx,[mem], except in 64 bit mode.
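The size difference in example 10.1 can be checked by counting machine code bytes. The following Python sketch is only an illustration: the hex strings are hand-assembled 32-bit mode encodings, and the address 00403000h standing in for mem is an arbitrary assumption.

```python
# Hand-assembled 32-bit mode encodings.
# Assumption: mem is at the arbitrary address 00403000h (little-endian).
ENCODINGS = {
    "add eax,1000":  "05 E8 03 00 00",      # short accumulator form
    "add ebx,1000":  "81 C3 E8 03 00 00",   # general form with ModRM byte
    "mov eax,[mem]": "A1 00 30 40 00",      # short accumulator form
    "mov ebx,[mem]": "8B 1D 00 30 40 00",   # general form with ModRM byte
}

def size(instruction):
    """Return the length in bytes of one of the listed instructions."""
    return len(bytes.fromhex(ENCODINGS[instruction]))

for text in ENCODINGS:
    print(f"{text:15} {size(text)} bytes")
```

In both pairs, the accumulator form saves the ModRM byte.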

Instructions with pointers take one byte less when they have only a base pointer (except ESP, RSP or R12) and a displacement than when they have a scaled index register, or both base pointer and index register, or ESP, RSP or R12 as base pointer. Examples:

; Example 10.2. Instruction sizes

mov eax,array[ebx] is smaller than mov eax,array[ebx*4]

mov eax,[ebp+12] is smaller than mov eax,[esp+12]

Instructions with BP, EBP, RBP or R13 as base pointer and no displacement and no index take one byte more than with other registers:

; Example 10.3. Instruction sizes

mov eax,[ebx] is smaller than mov eax,[ebp], but

mov eax,[ebx+4] is same size as mov eax,[ebp+4].

Instructions with a scaled index pointer and no base pointer must have a four-byte displacement, even when it is 0:

; Example 10.4. Instruction sizes

lea eax,[ebx+ebx] is shorter than lea eax,[ebx*2].
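The rules behind examples 10.2 to 10.4 can be collected in one small function. This is a simplified model for 32-bit addressing, not an assembler: the function name address_bytes and its parameters are made up for this sketch, and it counts only the ModRM byte, the optional SIB byte and the displacement. The scale factor is left out because it does not change the size.

```python
def address_bytes(base=None, index=None, disp=0):
    """Bytes used by ModRM + optional SIB + displacement (32-bit addressing).

    base and index are register names such as "ebx", or None.
    Simplified model: 16-bit addressing and RIP-relative addressing
    in 64-bit mode are not covered.
    """
    if index == "esp":
        raise ValueError("ESP cannot be used as index register")
    if base is None and index is None:
        return 1 + 4                    # ModRM + absolute 32-bit address
    # A SIB byte is needed for any index register and for ESP as base.
    sib = 1 if (index is not None or base == "esp") else 0
    if base is None:
        disp_size = 4                   # index without base: always 4 bytes
    elif disp == 0 and base != "ebp":
        disp_size = 0                   # EBP has no form without displacement
    elif -128 <= disp <= 127:
        disp_size = 1                   # sign-extended 8-bit displacement
    else:
        disp_size = 4
    return 1 + sib + disp_size

print(address_bytes(base="ebx", index="ebx"))   # lea eax,[ebx+ebx]
print(address_bytes(index="ebx"))               # lea eax,[ebx*2]
```

Add one byte for the opcode to get the total size of a simple MOV or LEA instruction.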

Instructions in 64-bit mode need a REX prefix if at least one of the registers R8 - R15 or XMM8 - XMM15 is used. Instructions that use these registers are therefore one byte longer than instructions that use other registers, unless a REX prefix is needed anyway for other reasons:

; Example 10.5a. Instruction sizes (64 bit mode)

mov eax,[rbx] is smaller than mov eax,[r8].

; Example 10.5b. Instruction sizes (64 bit mode)

mov rax,[rbx] is same size as mov rax,[r8].

In example 10.5a, we can avoid a REX prefix by using register RBX instead of R8 as pointer. But in example 10.5b, we need a REX prefix anyway for the 64-bit operand size, and the instruction cannot have more than one REX prefix.
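The reason why one REX prefix can serve both purposes is that it is a single byte with separate bits for each purpose. The following Python sketch builds that byte. The function name rex_prefix is an assumption made for this illustration, and the sketch ignores the byte registers SPL, BPL, SIL and DIL, which also need a REX prefix.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Return the REX prefix byte, or None if no REX prefix is needed.

    w = 1: 64-bit operand size.
    r, x, b = 1: the register in the ModRM reg field, the index register,
    or the base (or r/m) register is one of R8 - R15.
    """
    if not (w or r or x or b):
        return None
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

# The four instructions of examples 10.5a and 10.5b:
print(rex_prefix())                 # mov eax,[rbx]: no prefix
print(hex(rex_prefix(b=1)))         # mov eax,[r8]
print(hex(rex_prefix(w=1)))         # mov rax,[rbx]
print(hex(rex_prefix(w=1, b=1)))    # mov rax,[r8]
```

Setting one more bit in a prefix that is already present does not make the instruction longer.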

Floating point calculations can be done either with the old x87 style instructions with floating point stack registers ST(0)-ST(7) or the new SSE style instructions with XMM registers. The x87 style instructions are more compact than the latter, for example:

; Example 10.6. Floating point instruction sizes

fadd st(0), st(1) ; 2 bytes

addsd xmm0, xmm1 ; 4 bytes

The use of x87 style code may be advantageous even if it requires extra FXCH instructions. There is no big difference in execution speed between the two types of floating point instructions on current processors. However, it is possible that the x87 style instructions will be considered obsolete and will be less efficient on future processors.

Processors supporting the AVX instruction set can code XMM instructions in two different ways, with a VEX prefix or with the old prefixes. Sometimes the VEX version is shorter and sometimes the old version is shorter. However, there is a severe performance penalty to mixing XMM instructions without VEX prefix with instructions using YMM registers on some processors.

The AVX-512 instruction set uses a new 4-byte prefix called EVEX. While the EVEX prefix is one or two bytes longer than the VEX prefix, it allows a more efficient coding of memory operands with pointer and offset. Memory operands with a 4-byte offset can sometimes be replaced by a 1-byte scaled offset when the EVEX prefix is used. The total instruction length thereby becomes smaller.
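The effect of the scaled offset can be sketched as follows. This is a simplified Python model that counts only the displacement bytes. The function name disp_bytes is made up, and the scale factor of 64 is an assumption that corresponds to a full 64-byte ZMM memory operand without broadcast; other operand types use other factors.

```python
def disp_bytes(offset, scale=1):
    """Displacement size in bytes for a memory operand with a base pointer.

    scale = 1 models instructions without EVEX prefix.
    With EVEX, the 1-byte offset is multiplied by a scale factor.
    Simplification: the base register is assumed not to be RBP or R13.
    """
    if offset == 0:
        return 0
    if offset % scale == 0 and -128 <= offset // scale <= 127:
        return 1                        # fits a (scaled) signed byte
    return 4

print(disp_bytes(256))                  # unscaled: 256 does not fit a byte
print(disp_bytes(256, scale=64))        # scaled: stored as 256 / 64 = 4
```

An offset that is not a multiple of the scale factor still needs four bytes, so the saving depends on how the data are laid out.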

10.2 Using shorter constants and addresses

Many jump addresses, data addresses, and data constants can be expressed as sign-extended 8-bit constants. This saves a lot of space. A sign-extended byte can only be used if the value is within the interval from -128 to +127.

For jump addresses, this means that short jumps take two bytes of code, whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes if conditional.
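As a minimal sketch of this rule in Python, assuming 32-bit or 64-bit mode and a distance measured from the end of the jump instruction to the target (the function name jump_size is made up for the illustration):

```python
def jump_size(distance, conditional=False):
    """Size in bytes of a direct jump in 32-bit or 64-bit mode.

    distance is the signed offset from the end of the jump
    instruction to the target.
    """
    if -128 <= distance <= 127:
        return 2                        # 1-byte opcode + 8-bit offset
    # Unconditional: 1-byte opcode, conditional: 2-byte opcode,
    # both followed by a 32-bit offset.
    return 6 if conditional else 5

print(jump_size(100), jump_size(200), jump_size(200, conditional=True))
```

Keeping the body of a loop or branch small enough for the short form thus saves 3 or 4 bytes per jump.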

Likewise, data addresses take less space if they can be expressed as a pointer and a displacement between -128 and +127. The following example assumes that [mem1] and [mem2] are static memory addresses in the data segment and that the distance between them is less than 128 bytes:

; Example 10.7a, Static memory operands

mov ebx, [mem1] ; 6 bytes

add ebx, [mem2] ; 6 bytes

Reduce to:

; Example 10.7b, Replace addresses by pointer

mov eax, offset mem1 ; 5 bytes

mov ebx, [eax] ; 2 bytes

add ebx, [eax] + (mem2 - mem1) ; 3 bytes

In 64-bit mode you need to replace mov eax,offset mem1 with lea rax,[mem1], which is one byte longer. The advantage of using a pointer obviously increases if you can use the same pointer many times. Storing data on the stack and using EBP or ESP as pointer will thus make the code smaller than if you use static memory locations and absolute addresses, provided of course that the data are within -128 to +127 bytes of the pointer. Using PUSH and POP to write and read temporary integer data is even shorter.
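How the advantage grows with the number of accesses can be worked out from the byte counts in examples 10.7a and 10.7b. The Python sketch below assumes 32-bit mode, a destination register other than EAX, one access at the pointer itself, and all other operands within reach of a 1-byte displacement. The function names are made up.

```python
def absolute_size(n):
    """n accesses with a 4-byte absolute address, 6 bytes each."""
    return 6 * n

def pointer_size(n):
    """mov eax,offset mem1 (5 bytes), one access at the pointer itself
    (2 bytes), and n - 1 accesses with a 1-byte displacement (3 bytes)."""
    return 5 + 2 + 3 * (n - 1)

for n in (1, 2, 10):
    print(n, absolute_size(n), pointer_size(n))
```

Under these assumptions the pointer version is one byte bigger for a single access and smaller from two accesses onward.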

Data constants may also take less space if they are between -128 and +127. Most instructions with immediate operands have a short form where the operand is a sign-extended single byte. Examples:

; Example 10.8, Sign-extended operands

push 200 ; 5 bytes

push 100 ; 2 bytes, sign extended

add ebx, 128 ; 6 bytes

sub ebx, -128 ; 3 bytes, sign extended
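The last two lines of example 10.8 work because the range of a sign-extended byte is asymmetric: +128 does not fit, but -128 does. A small Python sketch of the choice, assuming a 32-bit register other than EAX (the function names are made up for the illustration):

```python
def fits_simm8(value):
    """True if value can be stored as a sign-extended byte."""
    return -128 <= value <= 127

def shortest_add(reg, value):
    """Return (instruction, size in bytes) for adding a constant to a
    32-bit register other than EAX."""
    if fits_simm8(value):
        return f"add {reg}, {value}", 3
    if fits_simm8(-value):
        return f"sub {reg}, {-value}", 3    # only helps for value = 128
    return f"add {reg}, {value}", 6

print(shortest_add("ebx", 100))
print(shortest_add("ebx", 128))
print(shortest_add("ebx", 200))
```

Note that ADD and SUB do not set the carry flag in the same way, so the replacement is only safe if the carry flag is not used afterwards.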

The only instructions with an immediate operand that do not have a short form with a sign-extended 8-bit constant are MOV, TEST, CALL and RET. A TEST instruction with a 32-bit immediate operand can be replaced with various shorter alternatives, depending on the logic in question. Some examples:

; Example 10.9, Alternatives to test with 32-bit constant

test eax, 8 ; 5 bytes

test ebx, 8 ; 6 bytes

test al, 8 ; 2 bytes

test bl, 8 ; 3 bytes

and ebx, 8 ; 3 bytes

bt ebx, 3 ; 4 bytes (uses carry flag)

cmp ebx, 8 ; 3 bytes

It is not recommended to use the versions with 16-bit constants in 32-bit or 64-bit modes, such as TEST AX,800H because it will cause a penalty for decoding a length-changing prefix on some processors, as explained in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Shorter alternatives for MOV register,constant are often useful. Examples:

; Example 10.10, Loading constants into 32-bit registers

mov eax, 0 ; 5 bytes

sub eax, eax ; 2 bytes

mov eax, 1 ; 5 bytes

sub eax, eax / inc eax ; 3 bytes

push 1 / pop eax ; 3 bytes

mov eax, -1 ; 5 bytes

or eax, -1 ; 3 bytes
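The byte counts in example 10.10 can be verified from the machine code. The hex strings in this Python sketch are hand-assembled 32-bit mode encodings, given here as an illustration. The one-byte INC is not available in 64-bit mode.

```python
# Hand-assembled 32-bit mode encodings for example 10.10.
ENCODINGS = {
    "mov eax, 0":             "B8 00 00 00 00",
    "sub eax, eax":           "29 C0",
    "mov eax, 1":             "B8 01 00 00 00",
    "sub eax, eax / inc eax": "29 C0 40",
    "push 1 / pop eax":       "6A 01 58",
    "mov eax, -1":            "B8 FF FF FF FF",
    "or eax, -1":             "83 C8 FF",
}

def size(code):
    """Return the total length in bytes of a listed instruction sequence."""
    return len(bytes.fromhex(ENCODINGS[code]))

for text in ENCODINGS:
    print(f"{text:24} {size(text)} bytes")
```

The MOV versions always carry a full 4-byte constant, while the alternatives use no constant or a sign-extended byte. Unlike MOV, the SUB, INC and OR instructions modify the flags.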

You may also consider reducing the size of static data. Obviously, an array can be made smaller by using a smaller data size for the elements, for example 16-bit integers instead of 32-bit integers if the data are sure to fit into the smaller data size. The code for accessing 16-bit integers is slightly bigger than for accessing 32-bit integers, but the increase in code size is small compared to the decrease in data size for a large array. Instructions with 16-bit immediate data operands should be avoided in 32-bit and 64-bit mode because of the problem with decoding length-changing prefixes.

10.3 Reusing constants

If the same address or constant is used more than once then you may load it into a register. A MOV with a 4-byte immediate operand may sometimes be replaced by an arithmetic instruction if the value of the register before the MOV is known. Example:

; Example 10.11a, Loading 32-bit constants

mov [mem1], 200 ; 10 bytes

mov [mem2], 201 ; 10 bytes

mov eax, 100 ; 5 bytes

mov ebx, 150 ; 5 bytes

Replace with:

; Example 10.11b, Reuse constants

mov eax, 200 ; 5 bytes

mov [mem1], eax ; 5 bytes

inc eax ; 1 byte

mov [mem2], eax ; 5 bytes

sub eax, 101 ; 3 bytes

lea ebx, [eax+50] ; 3 bytes
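Rewrites of this kind are easy to get wrong, so it is worth tracing the register values. The following Python sketch steps through example 10.11b and adds up the byte counts listed in the two examples. The function name and the dictionary used for memory are made up for the illustration.

```python
MASK32 = 0xFFFFFFFF

def run_example_10_11b():
    """Track register and memory values through example 10.11b."""
    mem = {}
    eax = 200                       # mov eax, 200
    mem["mem1"] = eax               # mov [mem1], eax
    eax = (eax + 1) & MASK32        # inc eax
    mem["mem2"] = eax               # mov [mem2], eax
    eax = (eax - 101) & MASK32      # sub eax, 101
    ebx = (eax + 50) & MASK32       # lea ebx, [eax+50]
    return mem, eax, ebx

SIZE_A = 10 + 10 + 5 + 5            # byte counts listed in example 10.11a
SIZE_B = 5 + 5 + 1 + 5 + 3 + 3      # byte counts listed in example 10.11b

print(run_example_10_11b(), SIZE_A, SIZE_B)
```

The final values are the same as in example 10.11a, while the code shrinks from 30 to 22 bytes. The price is that each instruction now depends on the one before it.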

10.4 Constants in 64-bit mode

In 64-bit mode, there are three ways to move a constant into a 64-bit register: with a 64-bit constant, with a 32-bit sign-extended constant, and with a 32-bit zero-extended constant:

; Example 10.12, Loading constants into 64-bit registers

mov rax, 123456789abcdef0h ; 10 bytes (64-bit constant)

mov rax, -100 ; 7 bytes (32-bit sign-extended)

mov eax, 100 ; 5 bytes (32-bit zero-extended)

Some assemblers use the sign-extended version rather than the shorter zero-extended version, even when the constant is within the range that fits into a zero-extended constant. You can force the assembler to use the zero-extended version by specifying a 32-bit destination register. Writes to a 32-bit register are always zero-extended into the 64-bit register.
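The choice between the three forms depends only on the value of the constant. A Python sketch of the selection, assuming a destination among RAX, RCX, RDX, RBX, RSP, RBP, RSI and RDI (the registers R8 - R15 need a REX prefix in the 32-bit form as well). The function name and the descriptive strings are made up.

```python
def mov_r64_imm(value):
    """Return (form, size in bytes) of the shortest MOV that loads a
    constant into one of the 64-bit registers RAX - RDI."""
    if not -2**63 <= value < 2**64:
        raise ValueError("constant does not fit in 64 bits")
    if value >= 2**63:
        value -= 2**64              # interpret as signed 64-bit number
    if 0 <= value <= 0xFFFFFFFF:
        return "mov r32, imm32 (zero-extended)", 5
    if -2**31 <= value < 0:
        return "mov r64, imm32 (sign-extended)", 7
    return "mov r64, imm64", 10

for constant in (100, -100, 0x123456789ABCDEF0):
    print(constant, mov_r64_imm(constant))
```

Positive constants between 2^31 and 2^32 - 1 can only use the zero-extended form or the full 64-bit form, because sign extension would set the upper 32 bits.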

10.5 Addresses and pointers in 64-bit mode

64-bit code should preferably use 64-bit register size for base and index in addresses, and 32-bit register size for everything else. Example:

; Example 10.13, 64-bit versus 32-bit registers

mov eax, [rbx + 4*rcx]

inc rcx

Here, you can save one byte by changing inc rcx to inc ecx. This will work because the value of the index register is certain to be less than 2^32. The base pointer however, may be bigger than 2^32 in some systems so you can't replace add rbx,4 by add ebx,4. Never use 32-bit registers as base or index inside the square brackets in 64-bit mode.
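The difference between the index and the base pointer can be illustrated numerically. A 32-bit operation keeps the low 32 bits of the result and sets the upper 32 bits of the register to zero. The pointer value in this Python sketch is a hypothetical address above 2^32, and the function name op32 is made up.

```python
MASK32 = 0xFFFFFFFF

def op32(result):
    """Value left in the full 64-bit register after a 32-bit operation:
    the low 32 bits of the result, zero-extended."""
    return result & MASK32

rcx = 1000                          # index, known to be far below 2^32
rbx = 0x00007FF612340000            # hypothetical pointer above 2^32

index_ok = op32(rcx + 1) == rcx + 1     # inc ecx gives the same as inc rcx
pointer_ok = op32(rbx + 4) == rbx + 4   # add ebx,4 loses the upper bits

print(index_ok, pointer_ok, hex(op32(rbx + 4)))
```

The index survives the 32-bit operation unchanged, while the pointer is truncated to an address that most likely does not belong to the program.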

The rule of using 64-bit registers inside the square brackets of an indirect address and 32-bit registers everywhere else also applies to the LEA instruction. Examples:

; Example 10.14. LEA in 64-bit mode

lea eax, [ebx + ecx] ; 4 bytes (needs address size prefix)

lea eax, [rbx + rcx] ; 3 bytes (no prefix)

lea rax, [ebx + ecx] ; 5 bytes (address size and REX prefix)

lea rax, [rbx + rcx] ; 4 bytes (needs REX prefix)

The form with 32-bit destination and 64-bit address is preferred unless a 64-bit result is needed. This version takes no more time to execute than the version with 64-bit destination. The forms with address size prefix should never be used.
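The prefixes in example 10.14 can be seen directly in the machine code. The hex strings in this Python sketch are hand-assembled 64-bit mode encodings, and the function name describe is made up. The scan for prefixes is a simplification that recognizes only the address size prefix (67h) and REX prefixes (40h - 4Fh).

```python
# Hand-assembled 64-bit mode encodings for example 10.14.
ENCODINGS = {
    "lea eax, [ebx + ecx]": "67 8D 04 0B",
    "lea eax, [rbx + rcx]": "8D 04 0B",
    "lea rax, [ebx + ecx]": "67 48 8D 04 0B",
    "lea rax, [rbx + rcx]": "48 8D 04 0B",
}

def describe(instruction):
    """Return (size in bytes, list of prefixes) for a listed instruction."""
    code = bytes.fromhex(ENCODINGS[instruction])
    prefixes = []
    for byte in code:
        if byte == 0x67:
            prefixes.append("address size")
        elif 0x40 <= byte <= 0x4F:
            prefixes.append("REX")
        else:
            break                   # first non-prefix byte is the opcode
    return len(code), prefixes

for text in ENCODINGS:
    print(text, describe(text))
```

All four versions share the same opcode, ModRM and SIB bytes (8D 04 0B), so the size differences come from the prefixes alone.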

An array of 64-bit pointers in a 64-bit program can be made smaller by using 32-bit pointers relative to the image base or to some reference point. This makes the array of pointers smaller at the cost of making the code that uses the pointers bigger since it needs to add the image base. Whether this gives a net advantage depends on the size of the array. Example:

; Example 10.15a. Jump-table in