
Performance-Profiling Swift on Linux: Getting Started

source link: https://www.raywenderlich.com/31390549-performance-profiling-swift-on-linux-getting-started


Learn how to profile Server-Side Swift with perf on Linux. You’ll discover the basic principles of profiling and how to view events, collect call graph traces and perform basic analysis.

By Kelvin Ma · Sep 12 2022 · Article (35 mins) · Advanced



It happens to app developers. Sooner or later, every application you build hits its natural limits, and you’ll have to optimize for performance. If you’re developing on macOS, Instruments is the tool for the job. But if you’re developing on Linux, you might be wondering what to use instead.

Note: Perf works differently across architectures such as Intel and Apple Silicon. There are also differences when running Docker on macOS and Linux. Any differences you might encounter when following this tutorial will be called out.

Fortunately, Linux developers aren’t strangers to performance optimization. In this tutorial, you’ll profile a cryptocurrency trading application, using the same workflow you might apply to one of your own projects. You’ll use the perf tool to collect data about the performance of the application and generate a performance profile that shows which parts of the application are slow, and why they are slow. Along the way, you’ll learn:

  • How to use perf record to gather data about an application.
  • How to use perf script to view columns of performance data.
  • How to read and interpret performance data.
  • How to configure perf to produce performance profiles tailored specifically for Swift binaries.
  • How to use swift demangle to post-process a performance profile.

Let’s get started!

Note: This tutorial assumes you know how to clone a Git repository, run terminal commands and run a Docker container. It also assumes you know how to build a simple Swift Package Manager project. For an overview of the Swift Package Manager, check our Introduction to the Swift Package Manager tutorial. For a gentle introduction to Docker, see our Docker on macOS: Getting Started tutorial.

Getting Started

First, click the Download Materials button at the top or bottom of this tutorial. This archive contains a simple application called crypto-bot. Unzip the file, and navigate to the starter project directory. It contains a Swift Package Manager project with a Package.swift manifest file. It also contains the dockerfile you’ll use to build the Docker image for this tutorial.

The Data File

Alongside the manifest and the dockerfile is the ftx.json data file. It contains serialized market data from the FTX cryptocurrency exchange. Open this file in a text file viewer. For example, to view it with the cat tool, run the following command in a terminal:

cat ftx.json
Warning: This file is large! Use a viewer designed for displaying large text files.

You should see a stream of cryptocurrency market data formatted as JSON. This is real data, captured from FTX’s public market data API. Although the file is over 8 MB in size, it represents just a few seconds of trading activity from a few markets in a single exchange. This should give you a sense of how much data real-world trading applications handle!

The Trading Application

Start building the Docker image by running the following command in the starter project directory:

docker build . --file dockerfile --tag swift-perf

This might take a few minutes to complete.

Meanwhile, look at the files in Sources/crypto-bot. For the purposes of this tutorial, we’ve removed all the components of crypto-bot not related to parsing or decoding market data. Open example.swift and find Main.main(), reproduced below:

static func main() throws {
  // 1
  let data = try File.read([UInt8].self, from: "ftx.json")
  // 2
  var input = ParsingInput<Grammar.NoDiagnostics<[UInt8]>>(data)

  var updates: [String: Int] = [:]
  // 3
  while let json = input.parse(as: JSON.Rule<Array<UInt8>.Index>.Root?.self) {
    // 4
    guard let message = decode(message: json) else {
      continue
    }
    updates[message.market, default: 0] +=
      message.orderbook.bids.count + message.orderbook.asks.count
  }
  // 5
  try input.parse(as: Grammar.End<Array<UInt8>.Index, UInt8>.self)
  // 6
  for (market, updates) in updates.sorted(by: { $0.value > $1.value }) {
    print("\(market): \(updates) update(s)")
  }
}

Key points about this function:

  1. This loads the market data file as a [UInt8] array. It’s faster than loading it as a String because the file is UTF-8-encoded, and the JSON module supports parsing JSON directly from UTF-8 data.
  2. This creates a ParsingInput view over the [UInt8] array and disables parsing diagnostics. These APIs are part of the JSON module.
  3. This loop tries to parse one complete JSON message at a time until it exhausts the parsing input. These APIs are also part of the JSON module.
  4. This attempts to decode the fields of an FTX.Message instance from a parsed JSON value, skipping it if it’s not a valid FTX.Message.
  5. This asserts that the parsing loop exited because the parser reached the end of the input — not because it encountered a parsing error.
  6. This prints out the number of events recorded in each cryptocurrency market.

Below main() is a function called decode(message:), reproduced below:

@inline(never)
func decode(message json: JSON) -> FTX.Message? {
    try? .init(from: json)
}

This function takes a parameter of type JSON, which is a Decoder, and passes it to FTX.Message’s Decodable implementation. It’s marked @inline(never) to make it easier to measure its performance.

Below decode(message:) is a similar function named decodeFast(message:). As its name suggests, it contains a significantly more efficient JSON decoding implementation. At the end of this tutorial, you will compare its performance to that of decode(message:).

The Supporting Code

The other files in Sources/crypto-bot are less important to this tutorial. Here’s an overview:

  • ftx.swift: This contains the definition of FTX.Message, which models a kind of market event called an orderbook update. Most of the market events in ftx.json are orderbook updates. The ftx.swift file also directs the compiler to synthesize a Codable implementation for FTX.Message.
  • decimal.swift: This defines a simple decimal type and directs the compiler to synthesize a Codable implementation for it.
  • system.swift: This contains helper code that main() uses to load ftx.json from disk.

Running the Application

If the Docker image has finished building, start a container with it:

docker run -it --rm --privileged --name swift-perf-instance -v $PWD:/crypto-bot swift-perf

This is a normal Docker run command, except for the --privileged option. You need it because performance profiling uses specialized hardware in your CPU, which Docker containers can’t access by default.

Your prompt should look like this:

/crypto-bot$

Note that the shell has already changed into the project directory.

Within the container, build crypto-bot with the Swift Package Manager. Ensure you compile it in release mode, because running performance benchmarks on unoptimized code would not be meaningful.

swift build -c release

Try running crypto-bot in the container. It should run for a second or two, and then print a long list of markets and orderbook update counts:

.build/release/crypto-bot

Output:

FTM-PERP: 7094 update(s)
AXS-PERP: 3595 update(s)
FTT-PERP: 3458 update(s)
...
MEDIA-PERP: 64 update(s)
Note: If you see warnings about missing symbols when running on Apple Silicon or ARM, you can safely ignore them.

That’s fast, but remember that ftx.json only holds a few seconds of market data. The application is nowhere near fast enough to keep up with the exchange in real time, which means you are going to have to optimize.

Measuring Performance

Before jumping into perf itself, let’s clarify what kind of performance you’re optimizing for. Broadly speaking, there are two categories of performance:

  • Latency is the amount of time that passes between the moment you need to compute a result and the moment you receive that result.
  • Resource utilization is the amount of processor cycles a computation consumes.

The difference between the two is important because you measure them with different tools. Timers are good at measuring latency. However, timers aren’t great at measuring resource utilization because they keep ticking when a different task is running. If your application awaits a lot, timing won’t tell you much about its resource utilization.

Samplers are an alternative to timers. Samplers interrupt your application randomly and record what code is executing at that instant. Samplers are good at measuring resource utilization because they collect samples only while your application is running.

You can sample an application manually by running it many times and randomly interrupting it to see what stack trace it halts on. The perf tool automates this process by calling into a CPU component called a performance monitoring unit (PMU). A PMU can sample your application thousands of times per run.
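To see why random sampling works as a measurement technique, consider a toy simulation: if a program spends 70% of its CPU time in one function, roughly 70% of randomly placed samples will land there. The sketch below simulates this with awk’s random number generator; the function names and the 70/30 split are made up purely for illustration.

```shell
# Simulate random sampling of a program that spends 70% of its CPU time
# in functionB and 30% in functionA. The proportion of samples landing
# in each function estimates its share of the total runtime.
awk 'BEGIN {
  srand(42)  # fixed seed for repeatability
  for (i = 0; i < 100000; i++) {
    if (rand() < 0.7) b++; else a++
  }
  printf "functionA: ~%d%% of samples\n", 100 * a / (a + b)
  printf "functionB: ~%d%% of samples\n", 100 * b / (a + b)
}'
```

With 100,000 simulated samples, the estimate lands well within a percentage point of the true split; the thousands of real samples perf collects per second give similarly tight estimates.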

The crypto-bot application is compute-constrained, not network-constrained. This means it has a resource utilization problem, so you’ll need to use a sampler like perf to optimize it.

Profiling With Perf

The Docker image you built already comes with perf installed. To test the perf installation, sample the sleep command:

perf record sleep 1

You should see something like this:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (8 samples) ]

Since sleep spends most of its time sleeping, perf only recorded a handful of samples from it.

Try sampling crypto-bot by running the following commands:

perf record .build/release/crypto-bot

This will result in something similar to:

FTM-PERP: 7094 update(s)
...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.210 MB perf.data (5099 samples) ]

Above the perf record footer, you’ll see that perf reprinted all the normal output from crypto-bot as it was running.

If you look closer, you’ll notice perf collected many more samples from crypto-bot than it did from sleep, and it saved them to a file called perf.data. This file is called a performance profile. The perf tool has another command — script — that you can use to load samples from a profile.

Loading Samples With Perf Script

Try running perf script without any arguments:

perf script

This command is interactive. Press space or arrow keys to navigate and Q to exit. If you’re running Docker natively on Linux you’ll see something like:

perf       16 7056.026690:    161 cycles:ppp: ffffffffa1c735bb native_write_msr ([kernel.kallsyms])
perf       16 7056.026696:   1031 cycles:ppp: ffffffffa1c735b4 native_write_msr ([kernel.kallsyms])
perf       16 7056.026702:   6650 cycles:ppp: ffffffffa1c0d345 intel_tfa_pmu_enable_all ([kernel.kallsyms])
crypto-bot 16 7056.026712:  41385 cycles:ppp: ffffffffa208f6d5 apparmor_bprm_committing_creds ([kernel.kallsyms])
crypto-bot 16 7056.026763: 155508 cycles:ppp: ffffffffa1e58909 unmap_region ([kernel.kallsyms])
crypto-bot 16 7056.027434: 233299 cycles:ppp: ffffffffa2296747 _extract_crng ([kernel.kallsyms])
...

If you’re running Docker on macOS you’ll see something like:

crypto-bot   857 338493.651057:     250000 cpu-clock:  ffff800010c7030c [unknown] ([kernel.kallsyms])
crypto-bot   857 338493.651306:     250000 cpu-clock:      ffff838a6ce4 _dl_relocate_object (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.651558:     250000 cpu-clock:  ffff800010643a10 [unknown] ([kernel.kallsyms])
crypto-bot   857 338493.651807:     250000 cpu-clock:      ffff838a6cd8 _dl_relocate_object (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652056:     250000 cpu-clock:      ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652305:     250000 cpu-clock:      ffff838a4c6c do_lookup_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652556:     250000 cpu-clock:      ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.652806:     250000 cpu-clock:      ffff838a4c2c do_lookup_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.653056:     250000 cpu-clock:      ffff838b3b10 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.653305:     250000 cpu-clock:      ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.653556:     250000 cpu-clock:  ffff800010643a7c [unknown] ([kernel.kallsyms])
crypto-bot   857 338493.653806:     250000 cpu-clock:      ffff838b3b24 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.654056:     250000 cpu-clock:      ffff838b3ab0 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot   857 338493.654306:     250000 cpu-clock:      ffff82ff4808 malloc_consolidate (/usr/lib64/libc-2.26.so)
crypto-bot   857 338493.654556:     250000 cpu-clock:  ffff800010c7030c [unknown] ([kernel.kallsyms])
...

This output contains a human-readable description of each sample in perf.data. By default, perf record names its output perf.data, so by default, perf script looks for a file with this name.

Note: perf script output can run wide in the terminal. Some of the following output snippets use ellipses (...) to represent elided output.

You can name the profile something else with the --output option, or -o for short. Try generating a profile named profile.perf by rerunning perf record with the following options:

perf record -o profile.perf .build/release/crypto-bot

To load it in perf script, use the --input option, or -i for short:

perf script -i profile.perf

Each line represents a single sample, so if perf record recorded 5099 samples, perf script emits 5099 lines of output. You can customize the output with the --fields option, or -F for short.

Getting Task Information

The default perf script output is complex, so try running perf script again with the -F comm option:

perf script -i profile.perf -F comm

The “comm” field stands for “command”; accordingly, you’ll see perf script print the command associated with each sample:

perf 
perf 
crypto-bot 
crypto-bot 

You should see many samples associated with crypto-bot and a smaller number of samples associated with perf itself.
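You can produce this tally yourself with standard shell tools by grouping and counting the comm column. A minimal sketch, with five captured lines standing in for the live output; on a real profile you would pipe `perf script -i profile.perf -F comm` into the same pipeline:

```shell
# Count how many samples belong to each command. The printf lines below
# stand in for real `perf script -F comm` output.
printf '%s\n' perf perf crypto-bot crypto-bot crypto-bot \
  | sort | uniq -c | sort -rn
```

The most frequently sampled command sorts to the top, which is a quick way to confirm your application, and not perf itself, dominates the profile.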

You can get the thread group and thread identifier for each sample by enabling the pid and tid fields. Pass multiple fields to perf script as a comma-separated list:

perf script -i profile.perf -F comm,pid,tid

You’ll see a new column of output containing two numbers. The first number is the thread group identifier; the second number is the thread identifier. The perf script tool always prints the fields in the same order, no matter what order you entered them in.

perf       542/542
perf       542/542
crypto-bot 542/542
crypto-bot 542/542

Because crypto-bot doesn’t use concurrency, every sample comes from the same thread. If you examined a performance profile from a concurrent application, you would see samples with many thread identifiers, but the same thread group identifier.

Getting Timeline Information

You can get the time stamp for each sample by enabling the time field. Try adding time to the fields list:

perf script -i profile.perf -F comm,pid,tid,time
perf       542/542 7040.081164: 
perf       542/542 7040.081168: 
crypto-bot 542/542 7040.081176: 
crypto-bot 542/542 7040.081243: 

The timestamps represent seconds elapsed since the system booted up.
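One practical use of the time field is estimating how long the sampled portion of the run took: subtract the first timestamp from the last. A sketch with three timestamps standing in for real `perf script -F time` output; awk’s numeric coercion ignores the trailing colon:

```shell
# Compute the span between the first and last sample timestamps.
# On a real profile: perf script -i profile.perf -F time | awk ...
awk 'NR == 1 { first = $1 } { last = $1 }
     END { printf "sampled span: %.3f s\n", last - first }' <<'EOF'
7040.081164:
7040.081168:
7040.087495:
EOF
# prints: sampled span: 0.006 s
```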

Performance profiles can store many kinds of events. To get the event type for each sample, enable the event field:

perf script -i profile.perf -F comm,pid,tid,time,event

You’ll see a new column containing the event type for each sample. If you have access to hardware events, you’ll see:

perf       542/542 7040.081164: cycles:ppp: 
perf       542/542 7040.081168: cycles:ppp: 
crypto-bot 542/542 7040.081176: cycles:ppp: 
crypto-bot 542/542 7040.081243: cycles:ppp: 

All samples in profile.perf came from the cycles:ppp event. The name means “(very) precise CPU cycles”. This event plugs into a cycle counter in the PMU, and triggers a sample at every multiple of some interval.

If you’re running in an environment where the Docker container doesn’t have access to hardware events (like on macOS), you’ll see:

crypto-bot   861/861   364407.450503: cpu-clock: 
crypto-bot   861/861   364407.450641: cpu-clock: 
crypto-bot   861/861   364407.450901: cpu-clock: 
crypto-bot   861/861   364407.452287: cpu-clock: 
crypto-bot   861/861   364407.452632: cpu-clock: 

In this instance, all samples came from the cpu-clock event. This event is driven by a software timer that perf creates, which triggers a sample at every multiple of some interval.

You can view the sampling intervals by enabling the period field:

perf script -i profile.perf -F comm,pid,tid,time,event,period

Note that perf script formats the period column before the event name. On macOS hosts, you’ll see a regular time interval:

crypto-bot   861/861   364407.450503:     250000 cpu-clock: 
crypto-bot   861/861   364407.450641:     250000 cpu-clock: 
crypto-bot   861/861   364407.450901:     250000 cpu-clock: 
crypto-bot   861/861   364407.452287:     250000 cpu-clock: 

On Linux, you’ll see a different period:

perf       542/542 7040.081164:   1012 cycles:ppp: 
perf       542/542 7040.081168:   8668 cycles:ppp: 
crypto-bot 542/542 7040.081176:  72998 cycles:ppp: 
crypto-bot 542/542 7040.081243: 358563 cycles:ppp: 

The sampling interval varies across samples because the kernel tunes it as your application runs. If you scroll down far enough, you’ll see that the kernel eventually settles it into an equilibrium.

crypto-bot 542/542 7040.086985: 325029 cycles:ppp: 
crypto-bot 542/542 7040.087241: 322303 cycles:ppp: 
crypto-bot 542/542 7040.087495: 321438 cycles:ppp: 
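You can convert these periods into approximate sampling rates. cpu-clock periods are expressed in nanoseconds, so a period of 250000 corresponds to 4,000 samples per second, matching perf’s default target frequency. cycles periods are expressed in CPU cycles, so the wall-clock interval depends on the clock speed; the 3 GHz figure below is an assumption for illustration:

```shell
# Back-of-the-envelope conversion of sampling periods into rates.
awk 'BEGIN {
  printf "cpu-clock period 250000 ns  -> %d samples/s\n", 1e9 / 250000
  printf "cycles period 322000 cycles -> ~%d samples/s at 3 GHz\n", 3.0e9 / 322000
}'
```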

Getting Landmark Information

Sampling works by taking snapshots of the CPU’s instruction pointer (IP). You can view the instruction pointers in the ip field:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip
perf       542/542 7040.081164:   1012 cycles:ppp: ffffffff878735b4
perf       542/542 7040.081168:   8668 cycles:ppp: ffffffff884016f2
crypto-bot 542/542 7040.081176:  72998 cycles:ppp: ffffffff87ad38f0
crypto-bot 542/542 7040.081243: 358563 cycles:ppp: ffffffff88277427

By itself, this field isn’t too useful. To get the string names of the symbols associated with the instruction pointers, enable the sym field. To get their distances from the instruction pointers, enable the symoff field, which stands for symbol offsets. Add sym and symoff to the list of fields:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff
...   1012 cycles:ppp: ffffffff878735b4 native_write_msr+0x4
...   8668 cycles:ppp: ffffffff884016f2 nmi_restore+0x25
...  72998 cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0
... 358563 cycles:ppp: ffffffff88277427 clear_page_erms+0x7

You can also view the binaries that the instructions originated from in the dso field. The name stands for dynamic shared object. Try adding it to the list of fields:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

You’ll see the file path to the binary in parentheses after the symbol name and offset, or [kernel.kallsyms] if the symbol lives in the kernel.

... cycles:ppp: ffffffff878735b4 native_write_msr+0x4 ([kernel.kallsyms])
... cycles:ppp: ffffffff884016f2 nmi_restore+0x25 ([kernel.kallsyms])
... cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0 ([kernel.kallsyms])
... cycles:ppp: ffffffff88277427 clear_page_erms+0x7 ([kernel.kallsyms])

Scroll down the list of samples and look for binaries you recognize. You should easily find:

  • /usr/lib/swift/linux/libswiftCore.so: This is the Swift standard library.
  • /crypto-bot/.build/release/crypto-bot: This is the crypto-bot executable itself.
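To see which binaries dominate the profile, you can group the dso column and count, just as you might with the comm column. A minimal sketch, with a handful of lines standing in for real `perf script -F dso` output:

```shell
# Tally samples per binary. On a real profile, replace the here-doc
# with: perf script -i profile.perf -F dso
sort <<'EOF' | uniq -c | sort -rn
([kernel.kallsyms])
(/crypto-bot/.build/release/crypto-bot)
(/usr/lib/swift/linux/libswiftCore.so)
(/crypto-bot/.build/release/crypto-bot)
(/usr/lib/swift/linux/libswiftCore.so)
(/crypto-bot/.build/release/crypto-bot)
EOF
```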

Working With Perf Events

It’s possible to capture events besides the CPU cycles. Although it’s only a subset of the events most PMUs support, perf record can also capture:

  • instructions: Similar to cycles, except it samples on a counter that tracks instructions instead of clock cycles. The two can give varying results because some instructions take multiple cycles to complete.
  • branches: Samples on a counter that tracks branch instructions. Branch instructions are critical because they are very expensive if incorrectly predicted.
  • branch-misses: Samples on a counter that tracks branch predictor misses.
  • cache-misses: Samples on a counter that tracks CPU cache misses.

In a normal optimization workflow, you would use the general cycles event to determine which parts of your code are slow, and then you’d use the more specialized events to figure out why those areas are slow.
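As an example of that second step, suppose the cycles event pointed you at a hot loop, and a branch-related recording gave you event totals for it. Dividing misses by branches yields a miss rate you can track across optimization attempts. The counts below are made-up illustrative numbers, not real measurements:

```shell
# Compute a branch-miss rate from two event totals.
awk 'BEGIN {
  branches = 1200000000  # hypothetical total branch instructions
  misses   = 18000000    # hypothetical branch-predictor misses
  printf "branch-miss rate: %.1f%%\n", 100 * misses / branches
}'
# prints: branch-miss rate: 1.5%
```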

You can get the full list of events your PMU supports with the perf list command:

perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  ...

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  ...

Your output might be different. Besides the name of the event, perf list tells you what kind of event (hardware, software, etc.) it is.

To record an event besides cycles, pass its name to perf record’s --event option, or its shortened form -e. To record multiple events at once, pass their names as a comma-separated list. For example, to sample branch misses and cache misses, pass the following options:

perf record -o branch-and-cache-misses.perf -e branch-misses,cache-misses .build/release/crypto-bot
Note: This will only work on Linux hosts. If you’re using macOS continue to the next section.

Try viewing the output in perf script:

perf script -i branch-and-cache-misses.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

You will see events of types branch-misses and cache-misses in place of cycles:ppp.

...  1 branch-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms])
...  1  cache-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms])
... 31 branch-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...
... 84  cache-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...

Collecting Call Graph Traces

So far, you’ve learned how to reconstruct a timeline of crypto-bot’s execution and read off some basic information about which function frame was at the top of the stack at each point in time. To get a better idea of what’s happening in an application, you don’t only want to know what it’s executing at a given moment, but also what sequence of calls it took to get there. This information is called a call graph trace. It’s like a stack dump, except there’s one for every sample in the recording.

Tracing With Frame Pointers

One way to capture a call graph trace is to follow the frame pointer (FP). In Swift and many other compiled languages, frame pointers form a linked list that debuggers and profilers can follow to reconstruct a call graph. Rerun perf record, this time with the --call-graph fp option:

perf record -o profile.perf --call-graph fp .build/release/crypto-bot

You’ll see perf write more than twice as much data to disk as it did last time.

...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.630 MB profile.perf (5121 samples) ]

Call graph traces take up a lot of space, which is why perf doesn’t record them by default.

Warning: Frame pointer call graphs require a frame pointer! If you compile a binary with -fomit-frame-pointer, you can’t generate call graph traces for it.

Viewing Call Graph Traces

Try viewing the regenerated profile in perf script:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

In place of the single ip, sym, symoff and dso fields, perf script now produces multiple rows of these fields for each sample in the profile. Because each sample now spans multiple lines, perf script separates samples with blank lines.

perf 14/14 8358.627472: 1 cycles:ppp: 
    7fff8e810345 intel_tfa_pmu_enable_all+0x800071803035 ([kernel.kallsyms])
    7fff8e80a598 x86_pmu_enable+0x800071803118 ([kernel.kallsyms])
    ...
    7fff8e806fd7 do_syscall_64+0x800071803057 ([kernel.kallsyms])
    7fff8f40308c entry_SYSCALL_64_after_hwframe+0x800071803044 ([kernel.k...

The first row is the same as the ip, sym, symoff and dso fields in the profile without call graph information. The subsequent rows list the function frames that were below it when perf recorded the sample.
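Because blank lines separate the samples, awk’s “paragraph mode” (setting the record separator RS to the empty string) treats each multi-line trace as a single record. That makes it easy to count samples whose call chain passes through a given function. A sketch with two stand-in samples; on a real profile, pipe the perf script output in instead of the here-doc:

```shell
# Count call-graph samples that pass through do_syscall_64.
awk -v RS='' '/do_syscall_64/ { n++ }
     END { print n + 0, "sample(s) pass through do_syscall_64" }' <<'EOF'
perf 14/14 8358.627472: 1 cycles:ppp:
    7fff8e80a598 x86_pmu_enable ([kernel.kallsyms])
    7fff8e806fd7 do_syscall_64 ([kernel.kallsyms])

crypto-bot 14/14 8358.627512: 1 cycles:ppp:
    7fff8e810345 clear_page_erms ([kernel.kallsyms])
EOF
# prints: 1 sample(s) pass through do_syscall_64
```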

The perf record tool supports other tracing methods besides frame pointer tracing, including:

  • dwarf: Uses compiler-generated metadata in the binaries in place of frame pointers. This generates more detailed information, but also increases profiling overhead.
  • lbr: Uses the last branch table in the CPU in place of frame pointers. Only Intel CPUs support this tracing method.

Working With Call Graph Traces

Call graph traces contain a lot of information, not all of which is relevant to your optimization problem.

Often, the startup phase of your application is different from its running phase. You can exclude the startup phase from the recordings by setting a delay with perf record’s --delay option, or -D for short. You express the delay in milliseconds. For example, to make perf record wait 500 milliseconds before recording:

perf record --delay=500 --call-graph fp -o profile-delayed.perf .build/release/crypto-bot

You should see perf write fewer samples to disk than it did without the delay.

...
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.336 MB profile-delayed.perf (2827 samples) ]

For more flexible time-slicing, you can attach a recorder to an already-running process with the --pid option:

.build/release/crypto-bot & perf record --call-graph fp -o profile-attached.perf --pid $(pidof crypto-bot)
Note: On powerful machines, like Apple Silicon Macs, the binary may finish running before perf starts! If this happens, you’ll see an error because pidof crypto-bot doesn’t return anything.

You can also limit the number of samples perf records overall with the --freq option, or -F for short. The name stands for “sampling frequency.” On average, perf will attempt to collect the number of samples you specify every second. Try setting it to 100 samples per second:

perf record --freq 100 --call-graph fp -o profile-lite.perf .build/release/crypto-bot

You will see perf recorded only a fraction of the samples it did before.

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.027 MB profile-lite.perf (114 samples) ]

Configuring Perf for Swift

If you scroll through enough of the crypto-bot samples in perf script, you might notice the call graphs are detailed for the C-language portions of the call chain, but relatively sparse for the Swift-language portions. You might see a lot of [unknown] symbols, short call chains and call chains where the only resolved Swift symbols are general-purpose entrypoints such as swift_retain and swift_release.

Enabling DWARF Metadata

You might not see a lot of call chain information because the Swift compiler performs a lot of transformations on the Swift code you write. The machine code coming out of the compiler ends up looking significantly different from the source code that went in. C compilers are comparatively transparent in how they transform C code into machine code, which makes it easier to map points in a call graph to meaningful landmarks in source code.

Note: There is a bug in perf when viewing DWARF metadata on ARM machines. So on Apple Silicon or other ARM machines you’ll need to continue to use frame pointers.

To address that problem, Swift provides the missing binary-to-source mappings as DWARF metadata. As you have seen, perf record supports loading DWARF metadata through its --call-graph option. Set it to dwarf mode:

perf record --call-graph dwarf -o profile.perf .build/release/crypto-bot

On ARM environments, run:

perf record --call-graph fp -F 999 -o profile.perf .build/release/crypto-bot

If using DWARF mode, you’ll see perf record much more data to profile.perf than it did in frame pointer mode. For long-running applications, perf in DWARF mode might record more data than you can manageably store, in which case you might need to reduce the sampling frequency.

Open profile.perf in perf script again, and scroll down to the crypto-bot samples:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso

You’ll see far more detailed information about crypto-bot’s Swift call chain:

crypto-bot 86/86 4678.431018: 2273989 cycles:ppp: 
    39eb03 $sSe4fromxs7Decoder_p_tKcfCTj+0xffff006bdedf2003 (/usr/lib/swift...
     27288 $s4JSONAAO10DictionaryV6decode_6forKeyqd__qd__m_xtKSeRd+0xffff54...
     28118 $s4JSONAAO10DictionaryVy_xGs30KeyedDecodingContainerProtocolAAsA...
     27f10 $s4JSONAAO10DictionaryVy_xGs30KeyedDecodingContainerProtocolAAsA...
    ...
     3aa9e $s10crypto_bot6decode7messageAA3FTXO7MessageVSg4JSONAIO_tF+0xfff...
     3f6b6 $s10crypto_bot4MainO4mainyyKFZTf4d_n+0xffff5446a31ba0a6 (/crypto...

Demangling Swift Function Names

You might recognize some function names in the call graph symbols. They look strange because the Swift compiler mangled them. Demangle them with the swift demangle tool, which is part of the Swift tool chain. For example, to demangle “s10crypto_bot6decode7messageAA3FTXO7MessageVSg4JSONAIO_tF”, run the following in a terminal:

swift demangle s10crypto_bot6decode7messageAA3FTXO7MessageVSg4JSONAIO_tF
crypto_bot.decode(message: JSON.JSON) -> crypto_bot.FTX.Message?
Note: You can also demangle Swift symbols in Swift by loading the "swift_demangle" symbol from the Swift runtime. However, that’s out of the scope of this tutorial.

You can pipe the output of another tool through swift demangle, in which case it automatically recognizes all the mangled Swift symbols in the input and replaces them with human-readable descriptions. To mass-demangle all the symbols from perf script, run the following command in a terminal:

perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso | swift demangle > profile.txt

The final shell redirection (>) saves the demangled call graphs to a file named profile.txt.

Putting It All Together

Sifting through profiling data is a science of its own, and this tutorial can only scratch the surface. However, one fast way to get actionable statistics about your application is to use the wc tool to count occurrences of function names you’re interested in.

Counting Function Names

First, regenerate a performance profile for crypto-bot, and save it to a file named unoptimized.perf for comparison:

perf record --call-graph dwarf -o unoptimized.perf .build/release/crypto-bot

On ARM and Apple Silicon run:

perf record --call-graph fp -F 999 -o unoptimized.perf .build/release/crypto-bot

Recall that the sample code contains an @inline(never) function called decode(message:) that wraps the JSON decoding implementation. The @inline(never) attribute prevents the compiler from inlining it, so you can be reasonably confident that the number of samples containing crypto_bot.decode(message:) in their call graph traces reflects the amount of time spent executing that function.

This isn’t an entirely sound method of measuring the performance of an application. Notably, it doesn’t account for the variability of the sampling period, which can skew the results. But in a pinch, it can be a useful proxy metric.
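One way to compensate for the variable sampling period is to weight each sample by its period field instead of counting lines, which approximates the share of CPU cycles rather than the share of samples. A sketch, with four demangled lines standing in for real `perf script -F period,sym | swift demangle` output; the period numbers are illustrative:

```shell
# Sum the period column for matching samples and divide by the total.
awk '{ total += $1; if ($0 ~ /decode\(message:/) hits += $1 }
     END { printf "decode(message:): %.1f%% of cycles\n", 100 * hits / total }' <<'EOF'
322303 crypto_bot.decode(message: JSON.JSON) -> crypto_bot.FTX.Message?
321438 swift_retain
325029 crypto_bot.decode(message: JSON.JSON) -> crypto_bot.FTX.Message?
320000 swift_release
EOF
# prints: decode(message:): 50.2% of cycles
```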

To count occurrences of crypto_bot.decode(message:, run the following in the terminal:

perf script -i unoptimized.perf -F ip,sym | swift demangle | grep crypto_bot\.decode\(message\: | wc -l

This command contains four piped subcommands:

  1. perf script: This deserializes the binary unoptimized.perf file. It only loads the ip and sym fields because the others are unnecessary here.
  2. swift demangle: This allows us to search for demangled function names instead of mangled names. If we knew the mangled name of crypto_bot.decode(message:) ahead of time, this step wouldn’t be necessary.
  3. grep: This searches for the string crypto_bot.decode(message:, which is distinct enough to not return any false positives.
  4. wc: This counts the number of lines piped to its input. Because grep prints each match on a separate line, this gives the number of matches grep returned.

The unoptimized profile should contain 2,000 to 3,000 instances of crypto_bot.decode(message:. If using frame pointers you’ll see around 1,000 instances.
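If you want to see the counting pipeline in isolation, here is the same grep | wc -l combination run against a tiny captured snippet of demangled output instead of a live profile; the sample lines are stand-ins:

```shell
# Count lines mentioning the decode(message:) symbol.
grep 'crypto_bot\.decode(message:' <<'EOF' | wc -l
3aa9e crypto_bot.decode(message: JSON.JSON) -> crypto_bot.FTX.Message?
27288 JSON.JSON parsing helper
3aa9e crypto_bot.decode(message: JSON.JSON) -> crypto_bot.FTX.Message?
EOF
# two of the three lines match, so the count is 2
```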

Observing Changes in Performance

To demonstrate what successful optimizations might look like when using this method, return to the while loop in Main.main in Sources/crypto-bot/example.swift. Replace the call to decode(message:) with a call to the more-efficient decodeFast(message:) function:

guard let message = decodeFast(message: json) else {
  continue
}

Recompile the application:

swift build -c release

Then, generate a new performance profile named optimized.perf:

perf record --call-graph dwarf -o optimized.perf .build/release/crypto-bot

If running on Apple Silicon or ARM, run:

perf record --call-graph fp -F 999 -o optimized.perf .build/release/crypto-bot

Rerun the commands from the last section, this time using the optimized.perf data file and searching for decodeFast(message: instead:

perf script -i optimized.perf -F ip,sym | swift demangle | grep crypto_bot\.decodeFast\(message\: | wc -l

This time, you should see only a few hundred occurrences of decodeFast(message: on x86 machines, or a few dozen on ARM. That’s compelling evidence that the decodeFast(message:) implementation is faster.

Where to Go From Here?

You can download the dockerfile and sources for crypto-bot by clicking the Download Materials button below this section.

In this tutorial, you’ve learned how to record an application with perf, how to view and interpret the recordings, how to generate call graph traces, and how to configure perf to produce detailed profiles specifically for Swift binaries. You’ve also learned some basic techniques for post-processing and sifting through this data programmatically.

This tutorial should give you enough of an understanding of the fundamentals of performance sampling for you to start applying these techniques to your own projects. The data science of analyzing a performance profile is a huge topic, and much more remains to discover on your own!

To learn more about profiling Swift on Linux, check the Server-Side Swift performance guide. If you’re more interested in profiling Swift on macOS, check our Instruments Tutorial with Swift: Getting Started. I’ve also written about Low-level Swift Optimization Tips on my own blog.

If you have any suggestions, questions or performance-profiling tips you’d like to share, join the discussion below.
