Link Time Optimization (part 2)
Part two of quick notes on Link Time Optimization
Link time optimization (LTO) conducts optimization from the perspective of the entire program, using knowledge of the whole program's structure that is not available when each translation unit is compiled in isolation.
Now that we have looked at a really simple demonstration, I also want to address an example that involves complex compute.
Let's replace the simple compute with a complex one
Follow-along code example available here
#include "helper.hpp"
#include <cmath> // for std::sin, std::cos

double complex_compute(long long i) {
  return std::sin(i) * std::cos(i + 3);
}
#include "helper.hpp"

#include <functional> // for std::function
#include <iostream>   // for std::cout, std::cerr
#include <string>     // for std::string

int main(int argc, char* argv[]) {
  // Make sure we are getting valid arguments
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <simple | complex>" << std::endl;
    return 1;
  }

  /*
    Using volatile to prevent the compiler from optimizing this loop.
    Removing `volatile` triggers loop optimization and makes it run
    almost 1000 times faster!
  */
  volatile double sum = 0;

  // Decide whether to use the simple or the complex compute function
  std::function<double(long long)> compute_func;
  std::string choice = argv[1];
  if (choice == "simple") {
    compute_func = simple_compute;
  } else if (choice == "complex") {
    compute_func = complex_compute;
  } else {
    std::cout << "Invalid choice " << choice << ". ";
    std::cout << "Choose either 'simple' or 'complex'" << std::endl;
    return 1; // bail out: calling an empty std::function would throw
  }

  const long long iterations = 100000000; // 1e8
  for (long long i = 0; i < iterations; ++i) {
    sum += compute_func(i);
  }

  std::cout << "(" << choice << ")" << " final sum: " << sum << std::endl;
  return 0;
}
What do you expect? Should it still run 7% faster with LTO?
Let's run the executables!
You can compile the code
mkdir build
cd build
cmake ..
cmake --build .
and run the executables with hyperfine
hyperfine "./without_lto_so complex" "./with_lto complex"
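For reference, LTO is typically enabled in CMake through the `INTERPROCEDURAL_OPTIMIZATION` target property. The snippet below is a hypothetical sketch; the repository's actual `CMakeLists.txt`, target names, and source file names may differ:

```cmake
cmake_minimum_required(VERSION 3.9) # 3.9+ for INTERPROCEDURAL_OPTIMIZATION
project(lto_demo CXX)

# Hypothetical target and source names, for illustration only.
add_executable(without_lto main.cpp compute.cpp)

add_executable(with_lto main.cpp compute.cpp)
# CMake's portable switch for LTO; maps to -flto on GCC/Clang.
set_property(TARGET with_lto PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
```

Using the target property rather than hand-written `-flto` flags lets CMake pick the right flags for each compiler and linker.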
From my machine, I get the following results
Benchmark 1: ./without_lto_so complex
  Time (mean ± σ):      2.756 s ±  0.059 s    [User: 2.755 s, System: 0.001 s]
  Range (min … max):    2.693 s …  2.869 s    10 runs

Benchmark 2: ./with_lto complex
  Time (mean ± σ):      2.771 s ±  0.079 s    [User: 2.770 s, System: 0.000 s]
  Range (min … max):    2.714 s …  2.914 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './without_lto_so complex' ran
    1.01 ± 0.04 times faster than './with_lto complex'
Hmmm... this time, the difference was only about 1%, well within the measurement noise; if anything, the LTO build was marginally slower. Why would this be? As you might have guessed already, it is because of Amdahl's Law.
Amdahl's Law
Amdahl's law states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".
What does it mean for our use case? Let's say our system is composed of
system = making function call + compute
By using LTO, we have optimized the first part, making the function call. But in our "complex compute" demo, the compute itself became much more expensive and now accounts for the vast majority of the runtime. No matter how much we optimize the function call, the overall runtime is dominated by the compute. So even though the simple compute example showed that LTO can speed up a program, that gain did not generalize to this use case.
Now we know: we have to KNOW the system we are optimizing!