Did you ever have a chance to use strace? It's a really powerful tool, I use it pretty often. It allows spying on the running software which is very useful when you don't have access to the source code and want to understand, at least on a high level, what is happening. strace show you the system calls of a software. What is a system call and how to harness the magic of strace?

Note

This blog post is based on this video. If you want to read and watch more of such cool stuff, head over to fasterthanlime.

How useful is strace?

In the last company I was working in, we had a client with a platform which heavily relied on its content. We also had few environments: Integration - for devs, deployment happening every few minutes; Staging - for User Acceptance Testing done by the client; Performance - for performance tests and Production. The content which was served to end users was being done on the Production instances.

We wanted to have all the environments as close as possible to the Production, because it minimizes differences which makes debugging a lot easier (good luck trying to fix a bug which cannot be reproduced on lower environments because of the differences between lower envs and production).

We set up a job on Jenkins which could synchronize all the content form production servers to lower environments. Initially, it worked without a problem. In few minutes we had all the environments synchronized. But after some time, when the amount of content grew immensely, the synchronization started to take more and more time, and eventually it was running for several hours without the clear view of the end.

I started to suspect that there is a deadlock somewhere in the synchronization tool. Usually, confirming such suspicion is as easy as hooking up a profiler which can show you all the threads, and it's state. In this case, the problem was that the tool and the content platform client worked with, was designed in such a way, that debugging content synchronization process was equal to debugging the production environment. It was obviously out of the question. The platform was also global, so debugging it at night was not an option (not to mention that it's not pleasant).

We had to prove that there is an issue with the synchronization tool, so I came up with an idea to use strace. It would show us all the system calls made by the synchronization tool in non-invasive way. I just wanted to know what is happening - maybe it's waiting for some network call and doesn't have a timeout, or maybe there is some huge file which it's trying to read into memory and has to constantly swap the memory pages to be able to fit it in, or maybe it's working just fine, but the amount of content it needs to synchronize is so huge, that it takes hours to do so.

Soon enough, I saw a bunch of calls to futex. futex is a Fast Userspace Mutex, it's a Linux Kernel mechanism for shared access synchronization. The listing of system calls showed a rapid flood of calls to futex and then nothing - nothing for hours. It was a clear indication that the synchronization process tried to manage access to some shared memory and had a bug which deadlocked all the threads.

That was enough for the team to decide that we need to change the synchronization tool (as it turned out the tool we used was not actively maintained anymore).

Ok, what's a system call?

For the purpose of this blog post, you can think of a system call as just a library function. And the library you are using is the Linux Kernel. System call is a way of handling some work to the Linux Kernel. Work that only kernel can do - like talking to the network card, managing processes etc.

When your code is using a system call (directly or indirectly), the execution is passed to the kernel. This requires the OS to save the state of your process to be able to restore it later, when the kernel is done with doing what your code asked for.

This is costly, and it's called a context switch (context switch also happens when there are multiple processes running in the OS and the kernel jumps between processes to make progress). The more system calls the bigger overhead related with context switch.

How strace does its spying?

strace uses a system call (yep, system calls everywhere) called ptrace. It allows inspecting and controlling processes. It's part of the Yama which is a Linux Security Module. Yama defines which process has rights to attach to which process.

Let's end this blabber and do some Rust.

Poor man's strace in Rust

Spawning a child process

Ok, first, strace spawns the command we pass as an argument as a child process. We can do that with Rust standard library.

use std::process::Command;
use anyhow::Result;

fn main() -> Result<()> {
  let mut cmd = Command::new("echo");
  cmd.arg("test");

  let child = cmd.spawn()?;

  Ok(())
}

Let's add anyhow to simplify error handling.

cargo add anyhow

(When we run this, we won't see any output, because we are not using the output of the command.)

Allow to be traced by the parent

Next, our child process needs to allow to be traced by the parent. I don't know if there is something in the standard library to do that, but there is a nix crate which has traceme function in the ptrace module.

cargo add nix

But how can we call that function before the actual process is spawned? Keep in mind that we need to do that before actual process, because if the process is already spawned, then it's too late.

Fortunately, there is a method called pre_exec in the CommandExt trait. It allows running our code just before the child process is run. It is unsafe, so we need to run it in unsafe block.

use anyhow::Result;
use nix::sys::ptrace::traceme;
use std::{os::unix::process::CommandExt, process::Command};

fn main() -> Result<()> {
    let mut cmd = Command::new("echo");
    cmd.arg("test");

    unsafe {
        cmd.pre_exec(|| {
            traceme()?;
            Ok(())
        });
    }

    Ok(())
}

Spawn and wait for the change

Now, we can spawn our command and wait for the process status change. We can wait for the child process to change state using waitpid.

use anyhow::Result;
use nix::sys::{ptrace::traceme, wait::waitpid};
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use std::{os::unix::process::CommandExt, process::Command};

fn main() -> Result<()> {
    let mut cmd = Command::new("echo");
    cmd.arg("test");

    unsafe {
        cmd.pre_exec(|| {
            traceme()?;
            Ok(())
        });
    }

    let child = cmd.spawn()?;
    let child_pid = Pid::from_raw(child.id() as i32);
    let res = waitpid(child_pid, None)?;

    eprintln!("{:?}", res.yellow());

    Ok(())
}

We are also using owo_colors to colorize the output easily.

cargo add owo-colors

Trace every syscall

Now, we tell ptrace to stop after every system call. To do that we can use syscall. Then we just wait for the process state change with waitpid. At this point we can read process registers and print them.

use anyhow::Result;
use nix::sys::ptrace::{getregs, syscall, traceme};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use std::os::unix::process::CommandExt;
use std::process::Command;

fn main() -> Result<()> {
    let mut cmd = Command::new("echo");
    cmd.arg("test");

    unsafe {
        cmd.pre_exec(|| {
            traceme()?;
            Ok(())
        });
    }

    let child = cmd.spawn()?;
    let child_pid = Pid::from_raw(child.id() as i32);
    let res = waitpid(child_pid, None)?;

    eprintln!("{:?}", res.yellow());

    loop {
        syscall(child_pid, None)?;
        _ = waitpid(child_pid, None)?;
        let regs = getregs(child_pid)?;
        eprintln!(
            "{}({:x}, {:x}, {:x}, ...) = {:x}",
            regs.orig_rax.green(),
            regs.rdi.blue(),
            regs.rsi.blue(),
            regs.rdx.blue(),
            regs.rax.yellow(),
        );
    }
}

Here, we use orig_rax instead of rax, because in x64 architecture rax register is used to store the result of a function call (that's why we are using it after =)

Skip syscall entry

There is one thing we need to take care to have proper output here. syscall sets up the tracking of the child process for every entry or exit of a system call as we can see in the documentation:

Arranges for the tracee to be stopped at the next entry to or exit from a system call (...)

Taking this into account, we need to skip one result of waitpid for each output.

use anyhow::Result;
use nix::sys::ptrace::{getregs, syscall, traceme};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use std::os::unix::process::CommandExt;
use std::process::Command;

fn main() -> Result<()> {
    let mut cmd = Command::new("echo");
    cmd.arg("test");

    unsafe {
        cmd.pre_exec(|| {
            traceme()?;
            Ok(())
        });
    }

    let child = cmd.spawn()?;
    let child_pid = Pid::from_raw(child.id() as i32);
    let res = waitpid(child_pid, None)?;

    eprintln!("{:?}", res.yellow());

    let mut is_sys_exit = false;
    loop {
        syscall(child_pid, None)?;
        _ = waitpid(child_pid, None)?;
        if is_sys_exit {
            let regs = getregs(child_pid)?;
            eprintln!(
                "{}({:x}, {:x}, {:x}, ...) = {:x}",
                regs.orig_rax.green(),
                regs.rdi.blue(),
                regs.rsi.blue(),
                regs.rdx.blue(),
                regs.rax.yellow(),
            );
        }
        is_sys_exit = !is_sys_exit;
    }
}

Improvements

At this point, we have working, primitive strace:

cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/rtrace`
Stopped(Pid(975304), SIGTRAP)
12(0, 7ffe47c1607c, 0, ...) = 560320436000
158(3001, 7ffe47c16030, 7f61fbc51800, ...) = ffffffffffffffea
9(0, 2000, 3, ...) = 7f61fbc35000
21(7f61fbc61130, 4, 0, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7f61fbc5ff8a, 80000, ...) = 3
262(3, 7f61fbc60b55, 7ffe47c15260, ...) = 0
9(0, 2e3a7, 1, ...) = 7f61fbc06000
3(3, 2e3a7, 1, ...) = 0
257(ffffff9c, 7f61fbc352f0, 80000, ...) = 3
0(3, 7ffe47c153c8, 340, ...) = 340
17(3, 7ffe47c14fe0, 310, ...) = 310
262(3, 7f61fbc60b55, 7ffe47c15260, ...) = 0
17(3, 7ffe47c14eb0, 310, ...) = 310
9(0, 1e6d70, 1, ...) = 7f61fba1f000
9(7f61fba41000, 15a000, 5, ...) = 7f61fba41000
9(7f61fbb9b000, 58000, 1, ...) = 7f61fbb9b000
9(7f61fbbf3000, 6000, 3, ...) = 7f61fbbf3000
9(7f61fbbf9000, cd70, 3, ...) = 7f61fbbf9000
3(3, 7f61fba1f378, 3, ...) = 0
3(3, 7f61fba1f378, 3, ...) = 0
9(0, 3000, 3, ...) = 7f61fba1c000
158(1002, 7f61fba1c740, ffff809e045e2f30, ...) = 0
218(7f61fba1ca10, 7f61fba1c740, 7f61fbc6b0b0, ...) = ee1c8
273(7f61fba1ca20, 18, 7f61fbc6b0b0, ...) = 0
334(7f61fba1d060, 20, 0, ...) = 0
10(7f61fbbf3000, 4000, 1, ...) = 0
10(56031f9ec000, 1000, 1, ...) = 0
10(7f61fbc68000, 2000, 1, ...) = 0
302(0, 3, 0, ...) = 0
11(7f61fbc06000, 2e3a7, 7f61fbc6a000, ...) = 0
318(7f61fbbfe2b8, 8, 1, ...) = 8
12(0, 7f61fbbf7aa0, 0, ...) = 560320436000
12(560320457000, 7f61fbbf7aa0, 7f61fbbff158, ...) = 560320457000
257(ffffff9c, 7f61fbbbf3f0, 80000, ...) = 3
262(3, 7f61fbbb5df3, 7f61fbbf89a0, ...) = 0
9(0, 561fa0, 1, ...) = 7f61fb400000
3(3, 561fa0, 2ad2, ...) = 0
262(1, 7f61fbbb5df3, 7ffe47c15e30, ...) = 0
test
1(1, 5603204384a0, 5, ...) = 5
3(1, 5603204384a0, 7f61fbbf3b20, ...) = 0
3(2, 1, 7f61fbbf3b20, ...) = 0
Error: ESRCH: No such process

But instead of names of the syscalls we have number representation. To improve that, we can use a json file which contains syscall number and syscall name.

mkdir res
curl https://raw.githubusercontent.com/fasterthanlime/rue/main/src/syscall.json > ./res/syscall.json

Now, we can extract important for us fields and print full syscall name:

use anyhow::Result;
use nix::sys::ptrace::{getregs, syscall, traceme};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use serde_json::Value;
use std::collections::HashMap;
use std::os::unix::process::CommandExt;
use std::process::Command;

fn main() -> Result<()> {
    let json: Value = serde_json::from_str(include_str!("../res/syscall.json"))?;
    let syscalls: HashMap<u64, String> = json["aaData"]
        .as_array()
        .unwrap()
        .iter()
        .map(|item| (item[0].as_u64().unwrap(), item[1].as_str().unwrap().into()))
        .collect();

    let mut cmd = Command::new("echo");
    cmd.arg("test");

    unsafe {
        cmd.pre_exec(|| {
            traceme()?;
            Ok(())
        });
    }

    let child = cmd.spawn()?;
    let child_pid = Pid::from_raw(child.id() as i32);
    let res = waitpid(child_pid, None)?;

    eprintln!("{:?}", res.yellow());

    let mut is_sys_exit = false;
    loop {
        syscall(child_pid, None)?;
        _ = waitpid(child_pid, None)?;
        if is_sys_exit {
            let regs = getregs(child_pid)?;
            eprintln!(
                "{}({:x}, {:x}, {:x}, ...) = {:x}",
                syscalls[&regs.orig_rax].green(),
                regs.rdi.blue(),
                regs.rsi.blue(),
                regs.rdx.blue(),
                regs.rax.yellow(),
            );
        }
        is_sys_exit = !is_sys_exit;
    }
}

And this is the result:

cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/rtrace`
Stopped(Pid(979371), SIGTRAP)
brk(0, 7fff4235a49c, 0, ...) = 5634fa3ef000
arch_prctl(3001, 7fff4235a450, 7fb22701c800, ...) = ffffffffffffffea
mmap(0, 2000, 3, ...) = 7fb227000000
access(7fb22702c130, 4, 0, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fb22702af8a, 80000, ...) = 3
newfstatat(3, 7fb22702bb55, 7fff42359680, ...) = 0
mmap(0, 2e3a7, 1, ...) = 7fb226fd1000
close(3, 2e3a7, 1, ...) = 0
openat(ffffff9c, 7fb2270002f0, 80000, ...) = 3
read(3, 7fff423597e8, 340, ...) = 340
pread64(3, 7fff42359400, 310, ...) = 310
newfstatat(3, 7fb22702bb55, 7fff42359680, ...) = 0
pread64(3, 7fff423592d0, 310, ...) = 310
mmap(0, 1e6d70, 1, ...) = 7fb226dea000
mmap(7fb226e0c000, 15a000, 5, ...) = 7fb226e0c000
mmap(7fb226f66000, 58000, 1, ...) = 7fb226f66000
mmap(7fb226fbe000, 6000, 3, ...) = 7fb226fbe000
mmap(7fb226fc4000, cd70, 3, ...) = 7fb226fc4000
close(3, 7fb226dea378, 3, ...) = 0
mmap(0, 3000, 3, ...) = 7fb226de7000
arch_prctl(1002, 7fb226de7740, ffff804dd9217f30, ...) = 0
set_tid_address(7fb226de7a10, 7fb226de7740, 7fb2270360b0, ...) = ef1ab
set_robust_list(7fb226de7a20, 18, 7fb2270360b0, ...) = 0
rseq(7fb226de8060, 20, 0, ...) = 0
mprotect(7fb226fbe000, 4000, 1, ...) = 0
mprotect(5634f98d1000, 1000, 1, ...) = 0
mprotect(7fb227033000, 2000, 1, ...) = 0
prlimit64(0, 3, 0, ...) = 0
munmap(7fb226fd1000, 2e3a7, 7fb227035000, ...) = 0
getrandom(7fb226fc92b8, 8, 1, ...) = 8
brk(0, 7fb226fc2aa0, 0, ...) = 5634fa3ef000
brk(5634fa410000, 7fb226fc2aa0, 7fb226fca158, ...) = 5634fa410000
openat(ffffff9c, 7fb226f8a3f0, 80000, ...) = 3
newfstatat(3, 7fb226f80df3, 7fb226fc39a0, ...) = 0
mmap(0, 561fa0, 1, ...) = 7fb226800000
close(3, 561fa0, 2ad2, ...) = 0
newfstatat(1, 7fb226f80df3, 7fff4235a250, ...) = 0
test
write(1, 5634fa3f14a0, 5, ...) = 5
close(1, 5634fa3f14a0, 7fb226fbeb20, ...) = 0
close(2, 1, 7fb226fbeb20, ...) = 0
Error: ESRCH: No such process

Summary

That's it. What we did is create a child process, instruct the OS that the child allows to be traced by the parent process using traceme from nix crate. Then, we instruct the OS that we want to trace every system call entry and exit using syscall function. Next we are waiting for process state change which occurs on system call entry and exit and when the change appears we read process registries which inform us about which system call was used and what were the arguments.