Implementing simple strace
Did you ever have a chance to use strace
? It's a really powerful tool, I use it
pretty often. It allows spying on the running software which is very useful when
you don't have access to the source code and want to understand, at least on a
high level, what is happening. strace
show you the system calls of a software.
What is a system call and how to harness the magic of strace
?
Note
This blog post is based on this video. If you want to read and watch more of such cool stuff, head over to fasterthanlime.
How useful is strace?
In the last company I was working in, we had a client with a platform which heavily relied on its content. We also had few environments: Integration - for devs, deployment happening every few minutes; Staging - for User Acceptance Testing done by the client; Performance - for performance tests and Production. The content which was served to end users was being done on the Production instances.
We wanted to have all the environments as close as possible to the Production, because it minimizes differences which makes debugging a lot easier (good luck trying to fix a bug which cannot be reproduced on lower environments because of the differences between lower envs and production).
We set up a job on Jenkins which could synchronize all the content form production servers to lower environments. Initially, it worked without a problem. In few minutes we had all the environments synchronized. But after some time, when the amount of content grew immensely, the synchronization started to take more and more time, and eventually it was running for several hours without the clear view of the end.
I started to suspect that there is a deadlock somewhere in the synchronization tool. Usually, confirming such suspicion is as easy as hooking up a profiler which can show you all the threads, and it's state. In this case, the problem was that the tool and the content platform client worked with, was designed in such a way, that debugging content synchronization process was equal to debugging the production environment. It was obviously out of the question. The platform was also global, so debugging it at night was not an option (not to mention that it's not pleasant).
We had to prove that there is an issue with the synchronization tool, so I came up
with an idea to use strace
. It would show us all the system calls made by the
synchronization tool in non-invasive way. I just wanted to know what is
happening - maybe it's waiting for some network call and doesn't have a timeout,
or maybe there is some huge file which it's trying to read into memory and has to
constantly swap the memory pages to be able to fit it in, or maybe it's working
just fine, but the amount of content it needs to synchronize is so huge, that it
takes hours to do so.
Soon enough, I saw a bunch of calls to futex
. futex
is a Fast Userspace Mutex,
it's a Linux Kernel mechanism for shared access synchronization. The listing of
system calls showed a rapid flood of calls to futex
and then nothing - nothing
for hours. It was a clear indication that the synchronization process tried to
manage access to some shared memory and had a bug which deadlocked all the
threads.
That was enough for the team to decide that we need to change the synchronization tool (as it turned out the tool we used was not actively maintained anymore).
Ok, what's a system call?
For the purpose of this blog post, you can think of a system call as just a library function. And the library you are using is the Linux Kernel. System call is a way of handling some work to the Linux Kernel. Work that only kernel can do - like talking to the network card, managing processes etc.
When your code is using a system call (directly or indirectly), the execution is passed to the kernel. This requires the OS to save the state of your process to be able to restore it later, when the kernel is done with doing what your code asked for.
This is costly, and it's called a context switch (context switch also happens when there are multiple processes running in the OS and the kernel jumps between processes to make progress). The more system calls the bigger overhead related with context switch.
How strace
does its spying?
strace
uses a system call (yep, system calls everywhere) called ptrace
. It
allows inspecting and controlling processes. It's part of the
Yama which is
a Linux Security Module. Yama defines which process has rights to attach to which
process.
Let's end this blabber and do some Rust.
Poor man's strace
in Rust
Spawning a child process
Ok, first, strace
spawns the command we pass as an argument as a child process.
We can do that with Rust standard library.
use std::process::Command;
use anyhow::Result;
fn main() -> Result<()> {
let mut cmd = Command::new("echo");
cmd.arg("test");
let child = cmd.spawn()?;
Ok(())
}
Let's add anyhow
to simplify error handling.
cargo add anyhow
(When we run this, we won't see any output, because we are not using the output of the command.)
Allow to be traced by the parent
Next, our child process needs to allow to be traced by the parent. I don't know if
there is something in the standard library to do that, but there is a
nix
crate which has
traceme
function in
the ptrace
module.
cargo add nix
But how can we call that function before the actual process is spawned? Keep in mind that we need to do that before actual process, because if the process is already spawned, then it's too late.
Fortunately, there is a method called
pre_exec
in the CommandExt
trait. It allows running our code just before the child
process is run. It is unsafe, so we need to run it in unsafe
block.
use anyhow::Result;
use nix::sys::ptrace::traceme;
use std::{os::unix::process::CommandExt, process::Command};
fn main() -> Result<()> {
let mut cmd = Command::new("echo");
cmd.arg("test");
unsafe {
cmd.pre_exec(|| {
traceme()?;
Ok(())
});
}
Ok(())
}
Spawn and wait for the change
Now, we can spawn our command and wait for the process status change. We can wait
for the child process to change state using
waitpid
.
use anyhow::Result;
use nix::sys::{ptrace::traceme, wait::waitpid};
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use std::{os::unix::process::CommandExt, process::Command};
fn main() -> Result<()> {
let mut cmd = Command::new("echo");
cmd.arg("test");
unsafe {
cmd.pre_exec(|| {
traceme()?;
Ok(())
});
}
let child = cmd.spawn()?;
let child_pid = Pid::from_raw(child.id() as i32);
let res = waitpid(child_pid, None)?;
eprintln!("{:?}", res.yellow());
Ok(())
}
We are also using owo_colors
to colorize
the output easily.
cargo add owo-colors
Trace every syscall
Now, we tell ptrace
to stop after every system call. To do that we can use
syscall
. Then we
just wait for the process state change with waitpid
. At this point we can read
process registers and print them.
use anyhow::Result;
use nix::sys::ptrace::{getregs, syscall, traceme};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use std::os::unix::process::CommandExt;
use std::process::Command;
fn main() -> Result<()> {
let mut cmd = Command::new("echo");
cmd.arg("test");
unsafe {
cmd.pre_exec(|| {
traceme()?;
Ok(())
});
}
let child = cmd.spawn()?;
let child_pid = Pid::from_raw(child.id() as i32);
let res = waitpid(child_pid, None)?;
eprintln!("{:?}", res.yellow());
loop {
syscall(child_pid, None)?;
_ = waitpid(child_pid, None)?;
let regs = getregs(child_pid)?;
eprintln!(
"{}({:x}, {:x}, {:x}, ...) = {:x}",
regs.orig_rax.green(),
regs.rdi.blue(),
regs.rsi.blue(),
regs.rdx.blue(),
regs.rax.yellow(),
);
}
}
Here, we use orig_rax
instead of rax
, because in x64 architecture rax
register is used to store the result of a function call (that's why we are using
it after =
)
Skip syscall entry
There is one thing we need to take care to have proper output here. syscall
sets
up the tracking of the child process for every entry or exit of a system call as
we can see in the
documentation:
Arranges for the tracee to be stopped at the next entry to or exit from a system call (...)
Taking this into account, we need to skip one result of waitpid
for each output.
use anyhow::Result;
use nix::sys::ptrace::{getregs, syscall, traceme};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use std::os::unix::process::CommandExt;
use std::process::Command;
fn main() -> Result<()> {
let mut cmd = Command::new("echo");
cmd.arg("test");
unsafe {
cmd.pre_exec(|| {
traceme()?;
Ok(())
});
}
let child = cmd.spawn()?;
let child_pid = Pid::from_raw(child.id() as i32);
let res = waitpid(child_pid, None)?;
eprintln!("{:?}", res.yellow());
let mut is_sys_exit = false;
loop {
syscall(child_pid, None)?;
_ = waitpid(child_pid, None)?;
if is_sys_exit {
let regs = getregs(child_pid)?;
eprintln!(
"{}({:x}, {:x}, {:x}, ...) = {:x}",
regs.orig_rax.green(),
regs.rdi.blue(),
regs.rsi.blue(),
regs.rdx.blue(),
regs.rax.yellow(),
);
}
is_sys_exit = !is_sys_exit;
}
}
Improvements
At this point, we have working, primitive strace
:
cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/rtrace`
Stopped(Pid(975304), SIGTRAP)
12(0, 7ffe47c1607c, 0, ...) = 560320436000
158(3001, 7ffe47c16030, 7f61fbc51800, ...) = ffffffffffffffea
9(0, 2000, 3, ...) = 7f61fbc35000
21(7f61fbc61130, 4, 0, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = fffffffffffffffe
257(ffffff9c, 7ffe47c15150, 80000, ...) = fffffffffffffffe
262(ffffff9c, 7ffe47c15150, 7ffe47c15260, ...) = 0
257(ffffff9c, 7f61fbc5ff8a, 80000, ...) = 3
262(3, 7f61fbc60b55, 7ffe47c15260, ...) = 0
9(0, 2e3a7, 1, ...) = 7f61fbc06000
3(3, 2e3a7, 1, ...) = 0
257(ffffff9c, 7f61fbc352f0, 80000, ...) = 3
0(3, 7ffe47c153c8, 340, ...) = 340
17(3, 7ffe47c14fe0, 310, ...) = 310
262(3, 7f61fbc60b55, 7ffe47c15260, ...) = 0
17(3, 7ffe47c14eb0, 310, ...) = 310
9(0, 1e6d70, 1, ...) = 7f61fba1f000
9(7f61fba41000, 15a000, 5, ...) = 7f61fba41000
9(7f61fbb9b000, 58000, 1, ...) = 7f61fbb9b000
9(7f61fbbf3000, 6000, 3, ...) = 7f61fbbf3000
9(7f61fbbf9000, cd70, 3, ...) = 7f61fbbf9000
3(3, 7f61fba1f378, 3, ...) = 0
3(3, 7f61fba1f378, 3, ...) = 0
9(0, 3000, 3, ...) = 7f61fba1c000
158(1002, 7f61fba1c740, ffff809e045e2f30, ...) = 0
218(7f61fba1ca10, 7f61fba1c740, 7f61fbc6b0b0, ...) = ee1c8
273(7f61fba1ca20, 18, 7f61fbc6b0b0, ...) = 0
334(7f61fba1d060, 20, 0, ...) = 0
10(7f61fbbf3000, 4000, 1, ...) = 0
10(56031f9ec000, 1000, 1, ...) = 0
10(7f61fbc68000, 2000, 1, ...) = 0
302(0, 3, 0, ...) = 0
11(7f61fbc06000, 2e3a7, 7f61fbc6a000, ...) = 0
318(7f61fbbfe2b8, 8, 1, ...) = 8
12(0, 7f61fbbf7aa0, 0, ...) = 560320436000
12(560320457000, 7f61fbbf7aa0, 7f61fbbff158, ...) = 560320457000
257(ffffff9c, 7f61fbbbf3f0, 80000, ...) = 3
262(3, 7f61fbbb5df3, 7f61fbbf89a0, ...) = 0
9(0, 561fa0, 1, ...) = 7f61fb400000
3(3, 561fa0, 2ad2, ...) = 0
262(1, 7f61fbbb5df3, 7ffe47c15e30, ...) = 0
test
1(1, 5603204384a0, 5, ...) = 5
3(1, 5603204384a0, 7f61fbbf3b20, ...) = 0
3(2, 1, 7f61fbbf3b20, ...) = 0
Error: ESRCH: No such process
But instead of names of the syscalls we have number representation. To improve that, we can use a json file which contains syscall number and syscall name.
mkdir res
curl https://raw.githubusercontent.com/fasterthanlime/rue/main/src/syscall.json > ./res/syscall.json
Now, we can extract important for us fields and print full syscall name:
use anyhow::Result;
use nix::sys::ptrace::{getregs, syscall, traceme};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use owo_colors::OwoColorize;
use serde_json::Value;
use std::collections::HashMap;
use std::os::unix::process::CommandExt;
use std::process::Command;
fn main() -> Result<()> {
let json: Value = serde_json::from_str(include_str!("../res/syscall.json"))?;
let syscalls: HashMap<u64, String> = json["aaData"]
.as_array()
.unwrap()
.iter()
.map(|item| (item[0].as_u64().unwrap(), item[1].as_str().unwrap().into()))
.collect();
let mut cmd = Command::new("echo");
cmd.arg("test");
unsafe {
cmd.pre_exec(|| {
traceme()?;
Ok(())
});
}
let child = cmd.spawn()?;
let child_pid = Pid::from_raw(child.id() as i32);
let res = waitpid(child_pid, None)?;
eprintln!("{:?}", res.yellow());
let mut is_sys_exit = false;
loop {
syscall(child_pid, None)?;
_ = waitpid(child_pid, None)?;
if is_sys_exit {
let regs = getregs(child_pid)?;
eprintln!(
"{}({:x}, {:x}, {:x}, ...) = {:x}",
syscalls[®s.orig_rax].green(),
regs.rdi.blue(),
regs.rsi.blue(),
regs.rdx.blue(),
regs.rax.yellow(),
);
}
is_sys_exit = !is_sys_exit;
}
}
And this is the result:
cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/rtrace`
Stopped(Pid(979371), SIGTRAP)
brk(0, 7fff4235a49c, 0, ...) = 5634fa3ef000
arch_prctl(3001, 7fff4235a450, 7fb22701c800, ...) = ffffffffffffffea
mmap(0, 2000, 3, ...) = 7fb227000000
access(7fb22702c130, 4, 0, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = fffffffffffffffe
openat(ffffff9c, 7fff42359570, 80000, ...) = fffffffffffffffe
newfstatat(ffffff9c, 7fff42359570, 7fff42359680, ...) = 0
openat(ffffff9c, 7fb22702af8a, 80000, ...) = 3
newfstatat(3, 7fb22702bb55, 7fff42359680, ...) = 0
mmap(0, 2e3a7, 1, ...) = 7fb226fd1000
close(3, 2e3a7, 1, ...) = 0
openat(ffffff9c, 7fb2270002f0, 80000, ...) = 3
read(3, 7fff423597e8, 340, ...) = 340
pread64(3, 7fff42359400, 310, ...) = 310
newfstatat(3, 7fb22702bb55, 7fff42359680, ...) = 0
pread64(3, 7fff423592d0, 310, ...) = 310
mmap(0, 1e6d70, 1, ...) = 7fb226dea000
mmap(7fb226e0c000, 15a000, 5, ...) = 7fb226e0c000
mmap(7fb226f66000, 58000, 1, ...) = 7fb226f66000
mmap(7fb226fbe000, 6000, 3, ...) = 7fb226fbe000
mmap(7fb226fc4000, cd70, 3, ...) = 7fb226fc4000
close(3, 7fb226dea378, 3, ...) = 0
mmap(0, 3000, 3, ...) = 7fb226de7000
arch_prctl(1002, 7fb226de7740, ffff804dd9217f30, ...) = 0
set_tid_address(7fb226de7a10, 7fb226de7740, 7fb2270360b0, ...) = ef1ab
set_robust_list(7fb226de7a20, 18, 7fb2270360b0, ...) = 0
rseq(7fb226de8060, 20, 0, ...) = 0
mprotect(7fb226fbe000, 4000, 1, ...) = 0
mprotect(5634f98d1000, 1000, 1, ...) = 0
mprotect(7fb227033000, 2000, 1, ...) = 0
prlimit64(0, 3, 0, ...) = 0
munmap(7fb226fd1000, 2e3a7, 7fb227035000, ...) = 0
getrandom(7fb226fc92b8, 8, 1, ...) = 8
brk(0, 7fb226fc2aa0, 0, ...) = 5634fa3ef000
brk(5634fa410000, 7fb226fc2aa0, 7fb226fca158, ...) = 5634fa410000
openat(ffffff9c, 7fb226f8a3f0, 80000, ...) = 3
newfstatat(3, 7fb226f80df3, 7fb226fc39a0, ...) = 0
mmap(0, 561fa0, 1, ...) = 7fb226800000
close(3, 561fa0, 2ad2, ...) = 0
newfstatat(1, 7fb226f80df3, 7fff4235a250, ...) = 0
test
write(1, 5634fa3f14a0, 5, ...) = 5
close(1, 5634fa3f14a0, 7fb226fbeb20, ...) = 0
close(2, 1, 7fb226fbeb20, ...) = 0
Error: ESRCH: No such process
Summary
That's it. What we did is create a child process, instruct the OS that the child
allows to be traced by the parent process using traceme
from nix
crate. Then,
we instruct the OS that we want to trace every system call entry and exit using
syscall
function. Next we are waiting for process state change which occurs on
system call entry and exit and when the change appears we read process registries
which inform us about which system call was used and what were the arguments.