Life of a Program

System-level investigation on the life of a program



Have you ever wondered what really happens when you run an executable or a shell command such as bash run.sh? Even if you haven't, it's a useful concept to keep in mind, because it gives you the full picture of what happens to your program when it runs. Let's dive in!

Core Concepts: eBPF and bpftrace

To understand the life of a program, this post uses bpftrace extensively.

eBPF (extended Berkeley Packet Filter) is a technology within the Linux kernel that acts like a tiny, safe virtual machine. It allows programs to be run directly in the kernel's protected context, triggered by events like system calls or network activity. This provides a way to safely and efficiently extend the kernel's capabilities at runtime without changing its source code.

eBPF-based tracing is the use of this technology for observability. Tools like bpftrace compile high-level scripts into eBPF programs and attach them to specific kernel events. When these events happen, your eBPF program gathers detailed information (like process IDs, function arguments, and timings) with low overhead, which makes it a powerful and safe method for real-time debugging, performance analysis, and security monitoring.

How eBPF Programs are Triggered

You don't trigger eBPF programs directly. Instead, you attach your eBPF program to a specific kernel event (a "hook"), and the kernel executes your program automatically whenever that event occurs.

Here's how it works:

  1. Choose an Event (the "Hook"): You select the kernel event you care about. Common hooks include:

    • System Calls: Entry to and exit from any syscall (e.g., open()).
    • Tracepoints: Stable, well-defined hooks placed by kernel developers at interesting points (e.g., tracepoint:syscalls:sys_enter_execve).
    • Kernel Functions (kprobes): Entry and exit of almost any function in the kernel's own code.
    • Userspace Functions (uprobes): Entry and exit of functions in a user-space application or library.
  2. Write Your eBPF Program (e.g., a bpftrace script): You write a small program that specifies which hook to attach to and what to do when that hook is hit.

  3. The Kernel's Role: When you load your program (via bpftrace):

    • Verification: The kernel's Verifier inspects your eBPF program to ensure it's safe (won't crash the kernel, no infinite loops, etc.). This is crucial for the "protected context."
    • Attachment: If verified, the kernel attaches your eBPF program to the specified hook.
    • Execution: Your program then sits dormant. Any time the kernel hits that hook (i.e., the event occurs), your attached eBPF program is automatically executed. It gathers information and sends it back to user-space (e.g., bpftrace) for display.

You can read more about bpftrace and eBPF-based tracing in the resource below (a collection of very useful links):

Things we learned at school

We learned that when you run something, three main syscalls are made: fork(), execve() and wait4(). What are they, and what do they do?

fork()

fork() creates a new process by duplicating the calling process. The new process, referred to as the "child," is an almost exact copy of the original "parent" process. It gets its own process ID (PID) but inherits a copy of the parent's memory (on Linux the pages are shared copy-on-write until one of the processes writes to them), its open file descriptors, and other attributes.

The return value of fork() is crucial:

  • In the parent process, it returns the PID of the newly created child.
  • In the child process, it returns 0.
  • If the fork fails, it returns -1.

This two-way return value allows the program to split its execution path and know whether it's running as the parent or the child.

execve()

execve() is the syscall that executes a program. Unlike fork(), it does not create a new process. Instead, it transforms the current process by replacing its memory space with a new program.

When a process calls execve(), the operating system loads the executable file into memory and starts running it from its entry point. The process ID remains the same. If execve() is successful, it never returns, because the original program that called it has been completely overwritten.

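It takes three arguments: the path to the executable, a NULL-terminated array of command-line arguments (argv, whose first element is by convention the program name), and a NULL-terminated array of environment variables (envp, strings of the form NAME=value) for the new program.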

wait4()

A parent process uses wait4() (or a similar wait syscall) to wait for a child process to change state, most commonly to terminate. This call suspends the parent's execution until the child's status changes.

wait4() allows the parent to retrieve the child's exit status (e.g., whether it exited normally and with which exit code, or was killed by a signal). Waiting is also essential for system hygiene. When a child process terminates, it becomes a "zombie" process: it is dead, but its entry in the process table (holding its PID and exit status) is kept until the parent collects it. The parent calls wait4() to "reap" the child, which removes this entry. If a parent never reaps its children, zombies accumulate, and because each one still holds a PID, enough of them can prevent new processes from being created. (If the parent itself exits, its children are re-parented to init, which reaps them.)

Putting them all together

So how do these three syscalls work together? They form the fundamental mechanism for creating and running new processes in Unix-like systems. The typical sequence, often called the fork-exec-wait pattern, looks like this:

  1. A parent process (like a command-line shell) wanting to run another program first calls fork() to create a child process that is a copy of itself.
  2. The child process, which knows it's the child because fork() returned 0, then calls execve(). This crucial step replaces the child's own program with the new program it is intended to run.
  3. The parent process, which received the child's unique Process ID (PID) from the fork() call, typically calls wait4() to pause its own execution and wait for the child to finish.

This pattern is a cornerstone of multitasking on Linux, allowing a program like a shell to run other programs and regain control after they are done.

Here is a C code snippet that demonstrates this flow:

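#define _DEFAULT_SOURCE          // expose wait4() in glibc's <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

extern char **environ;           // environment of the current process

int main(void) {
    pid_t pid = fork();

    if (pid < 0) {
        // The fork failed; no child was created
        perror("fork");
        exit(1);
    } else if (pid == 0) {
        // This is the CHILD process: replace it with /bin/ls
        char *argv[] = {"/bin/ls", "-l", NULL};
        // Pass our environment on instead of NULL (a NULL envp is not portable)
        execve("/bin/ls", argv, environ);

        // execve() only returns if an error occurred.
        // _exit() skips flushing stdio buffers copied from the parent.
        perror("execve");
        _exit(127);
    } else {
        // This is the PARENT process
        int status;
        // Wait for the child to finish; this also reaps it
        if (wait4(pid, &status, 0, NULL) < 0) {
            perror("wait4");
            exit(1);
        }
        if (WIFEXITED(status))
            printf("Child process finished with exit code %d.\n", WEXITSTATUS(status));
        else
            printf("Child process terminated abnormally.\n");
    }
    return 0;
}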

Shell Scripts

(Aside: did you know that a simple command like echo often doesn't create a new process at all? Many shells, including bash, implement "built-in" commands such as echo that they execute directly inside the shell process, without calling fork() or execve().)

For a command to create a new process, it needs to be an external executable file. Let's use ls -l as an example and look at the shell script below:

#!/bin/bash
 
ls -l
 

From the previous section, we expect the following to happen:

  1. The bash process responsible for running the script calls fork() to create a copy of itself as a child process.
  2. The child process calls execve() to replace its own program image with the /bin/ls executable.
  3. The parent bash process calls wait4() to wait for the child process (ls) to finish.

Now, let's verify with bpftrace that this is what actually happens! The following script traces entry to the key system calls: clone (the syscall that glibc's fork() wrapper uses on Linux), execve, and wait4.

bpftrace -e '
tracepoint:syscalls:sys_enter_clone {
    time("%H:%M:%S ");
    printf("FORK/CLONE: parent %s(pid %d) creating child\n", comm, pid);
}
tracepoint:syscalls:sys_enter_execve {
    time("%H:%M:%S ");
    printf("EXECVE:     caller %s(pid %d) executing %s\n", comm, pid, str(args->filename));
}
tracepoint:syscalls:sys_enter_wait4 {
    time("%H:%M:%S ");
    printf("WAIT4:      %s(pid %d) waiting\n", comm, pid);
}
'

Example Output

If you run the bpftrace command (as root) in one terminal and execute your script in another, you will see output similar to this (PIDs will vary):

Attaching 3 probes...
05:20:26 FORK/CLONE: parent bash(pid 2280194) creating child
05:20:26 WAIT4:      bash(pid 2280194) waiting
05:20:26 EXECVE:     caller bash(pid 2280462) executing ./simple_bash.sh
05:20:26 FORK/CLONE: parent simple_bash.sh(pid 2280462) creating child
05:20:26 WAIT4:      simple_bash.sh(pid 2280462) waiting
05:20:26 EXECVE:     caller simple_bash.sh(pid 2280463) executing /usr/bin/ls
05:20:26 WAIT4:      simple_bash.sh(pid 2280462) waiting
05:20:26 WAIT4:      bash(pid 2280194) waiting

Interpretation

This output illustrates the two-level fork-exec-wait process:

  1. Your Shell Runs the Script: The first three lines show your interactive bash shell (PID 2280194) forking a child, waiting, and that child (PID 2280462) executing the script.

  2. The Script Runs ls: The next three lines show the script's process (PID 2280462) forking its own child, waiting, and that new grandchild process (PID 2280463) executing the /usr/bin/ls command.

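  3. Cleanup: Each parent reaps its child through wait4(). Because the script only traces syscall entry, the WAIT4 lines show when a wait4() call starts, not when it returns with a reaped child; the final entries are further wait4() calls the shells make as their children finish, all the way back to your original shell. Printing args->ret in a tracepoint:syscalls:sys_exit_wait4 probe would show exactly which call reaped which PID.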

You can clearly see the chain of processes: Your Shell -> Script's Shell -> ls.

Closing Remarks

Understanding the "life of a program" (how it is created, executed, and managed by the operating system through fundamental syscalls like fork(), execve(), and wait4()) is a useful and important software engineering skill, especially when performance, resource utilization, and system stability are critical.