Link Time Optimization (part 2)
Part two of quick notes on Link Time Optimization
Link time optimization (LTO) conducts optimization from the perspective of the entire program, using knowledge of the whole program's structure that is not available when each translation unit is compiled in isolation.
Now that we have looked at a really simple demonstration, I also want to address an example that involves complex compute.
Let's replace the simple compute with a complex one
Follow-along code example available here
#include "helper.hpp"
#include <cmath> // for std::sin, std::cos

double complex_compute(long long i) {
  return std::sin(i) * std::cos(i + 3);
}
#include "helper.hpp"

#include <functional> // for std::function
#include <iostream>   // for std::cout, std::cerr
#include <string>     // for std::string

int main(int argc, char* argv[]) {
  // Make sure we are getting valid arguments
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <simple | complex>" << std::endl;
    return 1;
  }

  /*
    Using volatile to prevent the compiler from optimizing this loop.
    Removing `volatile` triggers loop optimization and makes it run
    almost 1000 times faster!
  */
  volatile double sum = 0;

  // Decide whether to use the simple or the complex compute function
  std::function<double(long long)> compute_func;
  std::string choice = argv[1];
  if (choice == "simple") {
    compute_func = simple_compute;
  } else if (choice == "complex") {
    compute_func = complex_compute;
  } else {
    std::cout << "Invalid choice " << choice << ". ";
    std::cout << "Choose either 'simple' or 'complex'" << std::endl;
    return 1; // bail out: calling an empty std::function would throw
  }

  const long long iterations = 100000000; // 1e8
  for (long long i = 0; i < iterations; ++i) {
    sum += compute_func(i);
  }

  std::cout << "(" << choice << ")" << " final sum: " << sum << std::endl;
  return 0;
}
What do you expect? Should it still run 7% faster with LTO?
Let's run the executables!
You can compile the code
mkdir build
cd build
cmake ..
cmake --build .
and run the executables with hyperfine
hyperfine "./without_lto_so complex" "./with_lto complex"
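For reference, LTO is typically enabled in CMake through the `INTERPROCEDURAL_OPTIMIZATION` target property. The snippet below is a hypothetical sketch; the repository's actual `CMakeLists.txt`, target names, and source file names may differ:

```cmake
cmake_minimum_required(VERSION 3.9) # 3.9+ for INTERPROCEDURAL_OPTIMIZATION
project(lto_demo CXX)

# Hypothetical target and source names, for illustration only.
add_executable(without_lto main.cpp compute.cpp)

add_executable(with_lto main.cpp compute.cpp)
# CMake's portable switch for LTO; maps to -flto on GCC/Clang.
set_property(TARGET with_lto PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
```

Using the target property rather than hand-written `-flto` flags lets CMake pick the right flags for each compiler and linker.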
From my machine, I get the following results
Benchmark 1: ./without_lto_so complex
  Time (mean ± σ):      2.756 s ±  0.059 s    [User: 2.755 s, System: 0.001 s]
  Range (min … max):    2.693 s …  2.869 s    10 runs

Benchmark 2: ./with_lto complex
  Time (mean ± σ):      2.771 s ±  0.079 s    [User: 2.770 s, System: 0.000 s]
  Range (min … max):    2.714 s …  2.914 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './without_lto_so complex' ran
    1.01 ± 0.04 times faster than './with_lto complex'
Hmmm... this time, the difference was only about 1%, well within the measurement noise; if anything, the LTO build was marginally slower. Why would this be? As you might have guessed already, it is because of Amdahl's Law.
Amdahl's Law
Amdahl's law states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".
What does it mean for our use case? Let's say our system is composed of
system = making function call + compute
By using LTO, we have optimized the first part, making the function call. But in our "complex compute" demo, the compute itself became much more expensive and now accounts for the vast majority of the runtime. No matter how much we optimize the function call, the overall runtime is dominated by the compute. So even though the simple compute example showed that LTO can speed up a program, that gain did not generalize to this use case.
Now we know: we have to KNOW the system we are optimizing!