Link Time Optimization (part 1)

Part one of quick notes on Link Time Optimization

Link time optimization (LTO) conducts optimization from the perspective of the entire program, leveraging knowledge of how the entire program is supposed to be structured.

Btw, what does a linker do?

Say you have defined your executable as below:

add_executable(executable
  src/A.cpp
  src/B.cpp
)

Compiler takes A.cpp and B.cpp and generates object file of it in machine code, A.o and B.o respectively. At this point, if the object file uses a function defined in B, the compiler does not yet know the function's final memory address, but rather leaves a placeholder for it, and marks the symbol "external".

This is where a linker comes into play. The linker takes object files generated by the compiler and resolves symbols that are marked "external" and combines them into a single executable file, executable. If the above example used add_library(...), then the linker will create a library (.dll or .so).

Simple Demonstration

Code sample available here Let's get started with a super minimal demo. We will have one file that has helper functions defined, and a main file that calls functions from this file.

#ifndef HELPER_HPP
#define HELPER_HPP

int simple_compute(int i);

#endif // HELPER_HPP

#include "helper.hpp"

int simple_compute(int i) {
    return (i * 3) + 5;
}

#include "helper.hpp"
#include <iostream> // for std::cout

int main() {
    volatile long long sum = 0;
    /*
    Using volatile to prevent compiler from optimizing this loop. 
    Removing `volatile` triggers loop optimization and make it run 
    almost 1000 times faster!
    */

    const in iterations = 1000000000; // 1e9
    // There is also another bug here. Can you see it? Hint: overfl..

    for (int i=0; i < iterations; ++i) {
         sum += simple_compute(i);
    }

    std::cout << "final sum: " << sum << std::endl;

    return 0;
}

Here is a build file to build the above example:

cmake_minimum_required(VERSION 3.9)
project(LTO_Example_CPP CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

message(STATUS "Configuring 'without_lto' target...")
add_executable(without_lto main.cpp helper.cpp)
target_compile_options(without_lto PRIVATE -O2)

message(STATUS "Configuring 'with_lto' target...")
include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_available)

if(ipo_available)
    message(STATUS "LTO (Interprocedural Optimization) is supported by the compiler.")
    add_executable(with_lto main.cpp helper.cpp)
    target_compile_options(with_lto PRIVATE -O2)
    set_target_properties(with_lto PROPERTIES
        INTERPROCEDURAL_OPTIMIZATION TRUE
    )
    # you could also do add_compile_options(-flto)
else()
    message(WARNING "LTO (Interprocedural Optimization) is not supported by the compiler. 'with_lto' target will not be built with LTO.")
    add_executable(with_lto main.cpp helper.cpp)
    target_compile_options(with_lto PRIVATE -O2)
endif()

message(STATUS "\nBuild complete. To run the test, execute the following commands in your build directory:")
message(STATUS "  time ./without_lto")
message(STATUS "  time ./with_lto")

How do I know if it compiled with LTO?

You would need to look at assembly code (don't be scared!).

objdump -d -C ./without_lto > without_lto.asm
objdump -d -C ./with_lto > with_lto.asm

Look at inside without_lto.asm. On my file, it indeed shows a call to the simple_compute from the helper file. On the other hand, there is no call to simple_compute on wiht_lto.asm.

0000000000001140 <main>:
    1140:	f3 0f 1e fa          	endbr64 
    1144:	41 54                	push   %r12
    1146:	55                   	push   %rbp
    1147:	53                   	push   %rbx
    1148:	31 db                	xor    %ebx,%ebx
    114a:	48 83 ec 10          	sub    $0x10,%rsp
    114e:	48 c7 44 24 08 00 00 	movq   $0x0,0x8(%rsp)
    1155:	00 00 
    1157:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
    115e:	00 00 
    1160:	48 89 df             	mov    %rbx,%rdi
    1163:	48 83 c3 01          	add    $0x1,%rbx
    1167:	e8 e4 01 00 00       	call   1350 <simple_compute(long long)> ### HERE
    116c:	49 89 c0             	mov    %rax,%r8
    116f:	48 8b 44 24 08       	mov    0x8(%rsp),%rax

Let's run the executables!

Expectation: with_lto should run faster than without_lto! BUT!!! From my PC, without_lto actually ran faster than with_lto.

Benchmark 1: ./without_lto
  Time (mean ± σ):      1.449 s ±  0.054 s    [User: 1.448 s, System: 0.000 s]
  Range (min … max):    1.411 s …  1.559 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./with_lto
  Time (mean ± σ):      1.558 s ±  0.009 s    [User: 1.558 s, System: 0.001 s]
  Range (min … max):    1.546 s …  1.570 s    10 runs
 
Summary
  './without_lto' ran
    1.08 ± 0.04 times faster than './with_lto'

(btw, hyperfine is a great benchmarking tool to avoid CPU warm-ups, etc.) So... what's happening?
One theory on that: My static linker (ld) is performing some very strange optimization that is separate from the compiler's LTO mechanism (i.e. "Linker-level code patching")

What is "Linker-level code patching" (linker relaxation)?

Modern linkers have bird-eye's view on the entire program, whereas compilers only are aware of the program that they are compiling. Leveraging this knowledge, linkers can identify and rewrite machine code, replacing them with more efficient instructions. The linker patches (re-writes) pieces of machine code and potential benefits include:

instruction cache locality: multiple instructions can be fetched from the same cache line -> this makes the call instruction almost cost-free Note that linker relaxation is highly sensitive to the effects of address space layout randomization (ASLR), in which the OS loads the programs at different address space each time.

Let's run the executables! 2

Anyways, going back to running executables. Because of what happened with linker-level code patching, we need to make an example that overcomes linker's optimization barrier. Let's use the modified build file to make a shared library out of helper.cpp instead.

cmake_minimum_required(VERSION 3.13)

project(LTO_Definitive_Test CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Shared_library: helper_lib
add_library(helper_lib SHARED helper.cpp)
target_compile_options(helper_lib PRIVATE -O2)
# On Linux/macOS, it's good practice to enable Position Independent Code
set_target_properties(helper_lib PROPERTIES POSITION_INDEPENDENT_CODE ON)


# Executable: without_lto_so
# Executable without LTO and linked against shared libarry
add_executable(without_lto_so main.cpp)
target_compile_options(without_lto_so PRIVATE -O2)
target_link_libraries(without_lto_so PRIVATE helper_lib)

# Executable: with_lto
add_executable(with_lto main.cpp helper.cpp)
target_compile_options(with_lto PRIVATE -O2)
set_target_properties(with_lto PROPERTIES INTERPROCEDURAL_OPTIMIZATION TRUE)

message(STATUS "\nBuild complete. Run the definitive benchmark:")
message(STATUS "  hyperfine './without_lto_so' './with_lto'")

What exactly is a shared library?

In early days when every program was statically linked, causing the code size to bloat and making the maintenance very difficult. Thus, shared libraries were introduced. Instead of copying referenced code into the final executable, using a shared library allowed only putting a small placeholder. Referenced shared libraries get loaded into memory once, which different programs can share. (there is more detail on shared libraries, which separate post would be appropriate.)

So, how does using a shared library help us avoid linker-level optimization? At compile time, the external function call has only a placeholder, and at link time, the placeholder gets resolved, but it still cannot inline the function because it does not know its implementation (which is left for dynamic linker).

Now, when we run the executables with the benchmarking tool, we see the expected results:

Benchmark 1: ./without_lto_so
  Time (mean ± σ):      1.690 s ±  0.011 s    [User: 1.688 s, System: 0.001 s]
  Range (min … max):    1.679 s …  1.715 s    10 runs
 
Benchmark 2: ./with_lto
  Time (mean ± σ):      1.572 s ±  0.037 s    [User: 1.571 s, System: 0.000 s]
  Range (min … max):    1.534 s …  1.661 s    10 runs
 
Summary
  './with_lto' ran
    1.07 ± 0.03 times faster than './without_lto_so'

Using LTO makes the code run about 7% faster!

On part 2 of LTO, we will run with more complex example, and how LTO helps (or not) with code performance.