I'm working on a very small (less than 1k LoC, including tests) command-line application. I'm satisfied with the implemented feature set, and I'm preparing to ship it to potential users.

As an exercise, I wanted to play with binary optimization techniques in Rust. As a result, I was able to go from a huge 75 MB binary in debug mode to 3.9 MB after all optimizations, without changing a single line of Rust code.

I won't go into details of the application (maybe in a separate post), but it's intended to work as an External Tool for IntelliJ IDEA to allow quick content synchronization between an AEM (Adobe Experience Manager) instance and the local file system. It will make my, and others', day-to-day work easier.

Because the target audience is technical people, I decided to ship the app as a binary which can be downloaded from GitHub.

The system I'm working on is 64-bit Ubuntu 20.04 with a quad-core Intel Core i7-7700HQ at 3.8 GHz. I'm using rustc version 1.53.0.

Tricks

By default, Rust optimizes for execution speed. In order to optimize for size, we need to apply a few tricks.

Do note that we already save some kilobytes just by using Rust >= 1.32. Since Rust 1.32, jemalloc, a custom memory allocator, is no longer included by default. This allocator often outperforms the system allocator, but it's not a must-have for every binary, and we can still switch allocators if we need to.

Release mode

First, we need to compile in release mode. It's not really a trick, but a must-have when you plan to release an app. By default, cargo compiles our code in dev mode, and the resulting binary is huge. On my system it's 75 MB in size! That's crazy, but there's a reason for it. In dev mode, the resulting binary contains a lot of debug information and metadata that is useful while debugging the program. Additionally, in release mode, rustc performs a lot of optimizations which are not present in dev mode.

Using release mode we are going down from 75 MB to 9.9 MB.
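
For reference, here is a minimal sketch of the setup. Release mode itself needs no configuration, just the --release flag on the build; all of the tweaks below go into the [profile.release] section of Cargo.toml.

    # Cargo.toml
    # Build with: cargo build --release  (the binary lands in target/release/)
    [profile.release]
    # nothing here yet; the release defaults are opt-level = 3, lto = false,
    # codegen-units = 16, panic = "unwind"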

Changing opt-level

This option controls how many optimizations are performed on our code. There is a compromise between optimization level and compile time: a higher optimization level results in faster code, but costs us time while compiling. That's why, by default in dev mode, it's set to 0, which means no optimizations are done. These are the levels we can set:

  • 0 — no optimizations — default for dev mode
  • 1 — basic optimization
  • 2 — more optimizations
  • 3 — all optimizations
  • "s" — optimize for size
  • "z" — optimize for size but also disables loop vectorization

Loop vectorization is a technique which leverages SIMD instructions on modern CPUs. SIMD stands for Single Instruction Multiple Data: it basically allows one instruction to be executed on multiple pieces of data at the same time. For loops, the compiler tries to unroll the loop so it can emit SIMD instructions, which also means that the resulting binary is bigger. The surprising result here was that option "s" actually gave a smaller binary than "z" on my machine, so let's use that.

After using opt-level = "s", we are going down from 9.9 MB to 9.8 MB.
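
In Cargo.toml this is a single line in the release profile (a sketch of the setting described above):

    [profile.release]
    opt-level = "s"  # optimize for size; "z" would also disable loop vectorization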

Changing LTO

LTO stands for Link Time Optimization: the linker is able to optimize the code during linking, and this setting tells it how large a part of the code it should take into account while performing those optimizations. Possible settings:

  • false — only the local crate is taken into account (the default)
  • true — search for optimizations across all crates within the dependency graph
  • "thin" — similar to the above, but takes less time
  • "off" — disables LTO

After using lto = true, we are going down from 9.8 MB to 6.7 MB.

Just out of curiosity lto = "thin" resulted in 7.9 MB binary.
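
The profile so far, as a sketch:

    [profile.release]
    opt-level = "s"
    lto = true       # "fat" LTO across all crates in the dependency graph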

Changing codegen-units

This setting controls splitting the code into multiple units which are then compiled in parallel. That makes compilation faster, but the optimizations performed are worse, because they cannot reach across unit boundaries. The option takes an integer greater than 0 as a value. By setting it to 1 we make compilation slower, but the resulting code is smaller.

After setting codegen-units = 1, we are going down from 6.7 MB to 6.1 MB.
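
A sketch of the profile with this setting added:

    [profile.release]
    opt-level = "s"
    lto = true
    codegen-units = 1  # one codegen unit: slower compilation, better optimization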

Changing panic

This setting controls what happens when our program panics. There are two options:

  • "unwind" — the default behavior for both dev and release mode
  • "abort" — tells rustc to stop execution without unwinding the stack

Thanks to the "abort" value, we are getting rid of additional code which unwinds the stack, thus making our binary smaller.

After setting panic = "abort", we are going from 6.1 MB to 5.6 MB.
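
This brings the release profile to its final form (sketch):

    [profile.release]
    opt-level = "s"
    lto = true
    codegen-units = 1
    panic = "abort"    # abort on panic instead of unwinding the stack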

Using strip

The last trick we'll use is the strip command-line tool, which comes with binutils. It turns out that the binary still contains debug symbols which are not needed for it to run. We can run strip <binary> to get rid of those symbols.

After using strip we are going down from 5.6 MB to 3.9 MB.

Summary

Using a few tricks, I was able to reduce the binary size from 75 MB down to 3.9 MB. None of those tricks required a code change. Tricks used (the complete release profile is shown after the list):

  • compile in release mode to get rid of debug information and other metadata that is only useful for debugging
  • change the optimization level, via the opt-level setting, to "s", which optimizes the binary for size rather than speed
  • the lto option enables optimization during linking; we set it to true to optimize across all crates in the dependency graph
  • codegen-units set to 1 prevents splitting the code into smaller parts, which allows smarter optimizations
  • panic set to "abort" removes the code responsible for stack unwinding, which makes our binary smaller
  • strip removes symbols which are not required to run our application correctly, making the binary smaller
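
For convenience, here is the whole recipe in one place; a sketch, with <binary> standing in for your application's name as in the strip example above:

    # Cargo.toml
    [profile.release]
    opt-level = "s"    # optimize for size
    lto = true         # link-time optimization across the whole dependency graph
    codegen-units = 1  # single codegen unit for better optimization
    panic = "abort"    # drop the stack-unwinding code
    # Then, after `cargo build --release`:
    #   strip target/release/<binary>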

And that's all. I just want to note that there is more we could do to optimize binary size. By default, Rust ships its standard library with every program. It's statically linked into our binary, and we don't have control over how that library is compiled: for example, it's optimized for speed instead of size, and we can't control its panic behavior. To fix that we can recompile the standard library using Xargo, or get rid of it entirely using #![no_std].