Optimizing Subroutines in Assembly Language by Agner Fog - HTML preview


Contents

1 Introduction

1.1 Reasons for using assembly code

1.2 Reasons for not using assembly code

1.3 Operating systems covered by this manual

2 Before you start

2.1 Things to decide before you start programming

2.2 Make a test strategy

2.3 Common coding pitfalls

3 The basics of assembly coding

3.1 Assemblers available

3.2 Register set and basic instructions

3.3 Addressing modes

3.4 Instruction code format

3.5 Instruction prefixes

4 ABI standards

4.1 Register usage

4.2 Data storage

4.3 Function calling conventions

4.4 Name mangling and name decoration

4.5 Function examples

5 Using intrinsic functions in C++

5.1 Using intrinsic functions for system code

5.2 Using intrinsic functions for instructions not available in standard C++

5.3 Using intrinsic functions for vector operations

5.4 Availability of intrinsic functions

6 Using inline assembly

6.1 MASM style inline assembly

6.2 Gnu style inline assembly

6.3 Inline assembly in Delphi Pascal

7 Using an assembler

7.1 Static link libraries

7.2 Dynamic link libraries

7.3 Libraries in source code form

7.4 Making classes in assembly

7.5 Thread-safe functions

7.6 Makefiles

8 Making function libraries compatible with multiple compilers and platforms

8.1 Supporting multiple name mangling schemes

8.2 Supporting multiple calling conventions in 32 bit mode

8.3 Supporting multiple calling conventions in 64 bit mode

8.4 Supporting different object file formats

8.5 Supporting other high level languages

9 Optimizing for speed

9.1 Identify the most critical parts of your code

9.2 Out of order execution

9.3 Instruction fetch, decoding and retirement

9.4 Instruction latency and throughput

9.5 Break dependency chains

9.6 Jumps and calls

10 Optimizing for size

10.1 Choosing shorter instructions

10.2 Using shorter constants and addresses

10.3 Reusing constants

10.4 Constants in 64-bit mode

10.5 Addresses and pointers in 64-bit mode

10.6 Making instructions longer for the sake of alignment

10.7 Using multi-byte NOPs for alignment

11 Optimizing memory access

11.1 How caching works

11.2 Trace cache

11.3 µop cache

11.4 Alignment of data

11.5 Alignment of code

11.6 Organizing data for improved caching

11.7 Organizing code for improved caching

11.8 Cache control instructions

12 Loops

12.1 Minimize loop overhead

12.2 Induction variables

12.3 Move loop-invariant code

12.4 Find the bottlenecks

12.5 Instruction fetch, decoding and retirement in a loop

12.6 Distribute µops evenly between execution units

12.7 An example of analysis for bottlenecks in vector loops

12.8 Same example on Core2

12.9 Same example on Sandy Bridge

12.10 Same example with FMA4

12.11 Same example with FMA3

12.12 Loop unrolling

12.13 Vector loops using mask registers (AVX512)

12.14 Optimize caching

12.15 Parallelization

12.16 Analyzing dependences

12.17 Loops on processors without out-of-order execution

12.18 Macro loops

13 Vector programming

13.1 Conditional moves in SIMD registers

13.2 Using vector instructions with other types of data than they are intended for

13.3 Shuffling data

13.4 Generating constants

13.5 Accessing unaligned data and partial vectors

13.6 Using AVX instruction set and YMM registers

13.7 Vector operations in general purpose registers

14 Multithreading

14.1 Hyperthreading

15 CPU dispatching

15.1 Checking for operating system support for XMM and YMM registers

16 Problematic Instructions

16.1 LEA instruction (all processors)

16.2 INC and DEC

16.3 XCHG (all processors)

16.4 Shifts and rotates (P4)

16.5 Rotates through carry (all processors)

16.6 Bit test (all processors)

16.7 LAHF and SAHF (all processors)

16.8 Integer multiplication (all processors)

16.9 Division (all processors)

16.10 String instructions (all processors)

16.11 Vectorized string instructions (processors with SSE4.2)

16.12 WAIT instruction (all processors)

16.13 FCOM + FSTSW AX (all processors)

16.14 FPREM (all processors)

16.15 FRNDINT (all processors)

16.16 FSCALE and exponential function (all processors)

16.17 FPTAN (all processors)

16.18 FSQRT (SSE processors)

16.19 FLDCW (Most Intel processors)

16.20 MASKMOV instructions

17 Special topics

17.1 XMM versus floating point registers

17.2 MMX versus XMM registers

17.3 XMM versus YMM registers

17.4 Freeing floating point registers (all processors)

17.5 Transitions between floating point and MMX instructions

17.6 Converting from floating point to integer (All processors)

17.7 Using integer instructions for floating point operations

17.8 Using floating point instructions for integer operations

17.9 Moving blocks of data (All processors)

17.10 Self-modifying code (All processors)

18 Measuring performance

18.1 Testing speed

18.2 The pitfalls of unit-testing

19 Literature

20 Copyright notice