Performance-Profiling Swift on Linux: Getting Started [FREE]
source link: https://www.raywenderlich.com/31390549-performance-profiling-swift-on-linux-getting-started
It happens to app developers. Sooner or later, every application you build hits its natural limits, and you’ll have to optimize for performance. If you’re developing on macOS, Instruments is the tool for the job. But if you’re developing on Linux, you might be wondering what to use instead.
Fortunately, Linux developers aren’t strangers to performance optimization. In this tutorial, you’ll profile a cryptocurrency trading application, using the same workflow you might apply to one of your own projects. You’ll use the `perf` tool to collect data about the performance of the application and generate a performance profile that shows which parts of the application are slow, and why they are slow. Along the way, you’ll learn:
- How to use `perf record` to gather data about an application.
- How to use `perf script` to view columns of performance data.
- How to read and interpret performance data.
- How to configure `perf` to produce performance profiles tailored specifically for Swift binaries.
- How to use `swift demangle` to post-process a performance profile.
Let’s get started!
Getting Started
First, click the Download Materials button at the top or bottom of this tutorial. This archive contains a simple application called `crypto-bot`. Unzip the file, and navigate to the starter project directory. It contains a Swift Package Manager project with a Package.swift manifest file. It also contains the dockerfile you’ll use to build the Docker image for this tutorial.
The Data File
Alongside the manifest and the dockerfile is the ftx.json data file. It contains serialized market data from the FTX cryptocurrency exchange. Open this file in a text file viewer. For example, to view it with the `cat` tool, run the following command in a terminal:
cat ftx.json
You should see a stream of cryptocurrency market data formatted as JSON. This is real data, captured from FTX’s public market data API. Although the file is over 8 MB in size, it represents just a few seconds of trading activity from a few markets in a single exchange. This should give you a sense of how much data real-world trading applications handle!
The Trading Application
Start building the Docker image by running the following command in the starter project directory:
docker build . --file dockerfile --tag swift-perf
This might take a few minutes to complete.
Meanwhile, look at the files in Sources/crypto-bot. For the purposes of this tutorial, we’ve removed all the components of `crypto-bot` not related to parsing or decoding market data. Open example.swift and find `Main.main()`, reproduced below:
```swift
static func main() throws {
    // 1
    let data = try File.read([UInt8].self, from: "ftx.json")
    // 2
    var input = ParsingInput<Grammar.NoDiagnostics<[UInt8]>>(data)
    var updates: [String: Int] = [:]
    // 3
    while let json = input.parse(as: JSON.Rule<Array<UInt8>.Index>.Root?.self) {
        // 4
        guard let message = decode(message: json) else { continue }
        updates[message.market, default: 0] += message.orderbook.bids.count + message.orderbook.asks.count
    }
    // 5
    try input.parse(as: Grammar.End<Array<UInt8>.Index, UInt8>.self)
    // 6
    for (market, updates) in updates.sorted(by: { $0.value > $1.value }) {
        print("\(market): \(updates) update(s)")
    }
}
```
Key points about this function:
1. This loads the market data file as a `[UInt8]` array. It’s faster than loading it as a `String` because the file is UTF-8-encoded, and the `JSON` module supports parsing JSON directly from UTF-8 data.
2. This creates a `ParsingInput` view over the `[UInt8]` array and disables parsing diagnostics. These APIs are part of the `JSON` module.
3. This loop tries to parse one complete JSON message at a time until it exhausts the parsing input. These APIs are also part of the `JSON` module.
4. This attempts to decode the fields of an `FTX.Message` instance from a parsed JSON value, skipping it if it’s not a valid `FTX.Message`.
5. This asserts that the parsing loop exited because the parser reached the end of the input, not because it encountered a parsing error.
6. This prints out the number of events recorded in each cryptocurrency market.
Below `main()` is a function called `decode(message:)`, reproduced below:

```swift
@inline(never)
func decode(message json: JSON) -> FTX.Message? {
    try? .init(from: json)
}
```
This function takes a parameter of type `JSON`, which is a `Decoder`, and passes it to `FTX.Message`’s `Decodable` implementation. It’s marked `@inline(never)` to make it easier to measure its performance.
Below `decode(message:)` is a similar function named `decodeFast(message:)`. As its name suggests, it contains a significantly more efficient JSON decoding implementation. At the end of this tutorial, you’ll compare its performance to that of `decode(message:)`.
The Supporting Code
The other files in Sources/crypto-bot are less important to this tutorial. Here’s an overview:
- ftx.swift: This contains the definition of `FTX.Message`, which models a kind of market event called an orderbook update. Most of the market events in ftx.json are orderbook updates. The ftx.swift file also directs the compiler to synthesize a `Codable` implementation for `FTX.Message`.
- decimal.swift: This defines a simple decimal type and directs the compiler to synthesize a `Codable` implementation for it.
- system.swift: This contains helper code that `main()` uses to load ftx.json from disk.
Running the Application
If the Docker image has finished building, start a container with it:
docker run -it --rm --privileged --name swift-perf-instance -v $PWD:/crypto-bot swift-perf
This is a normal Docker `run` command, except it contains the option `--privileged`. You’ll need this because performance profiling uses specialized hardware in your CPU, which default Docker containers can’t access.
Your prompt should look like this:
/crypto-bot$
Note the shell has already `cd`’d into the project directory.
Within the container, build `crypto-bot` with the Swift Package Manager. Ensure you compile it in `release` mode, because running performance benchmarks on unoptimized code wouldn’t be meaningful.
swift build -c release
Try running `crypto-bot` in the container. It should run for a second or two, and then print a long list of markets and orderbook update counts:
.build/release/crypto-bot
Output:
FTM-PERP: 7094 update(s)
AXS-PERP: 3595 update(s)
FTT-PERP: 3458 update(s)
...
MEDIA-PERP: 64 update(s)
That’s fast, but remember that ftx.json only holds a few seconds of market data. The application is nowhere near fast enough to keep up with the exchange in real time, which means you are going to have to optimize.
Measuring Performance
Before jumping into `perf` itself, let’s clarify what kind of performance you’re optimizing for. Broadly speaking, there are two categories of performance:
- Latency is the amount of time that passes between the moment you need to compute a result and the moment you receive that result.
- Resource utilization is the amount of processor cycles a computation consumes.
The difference between the two is important because you measure them with different tools. Timers are good at measuring latency. However, timers aren’t great at measuring resource utilization because they keep ticking when a different task is running. If your application `await`s a lot, timing won’t tell you much about its resource utilization.
Samplers are an alternative to timers. Samplers interrupt your application randomly and record what code is executing at that instant. Samplers are good at measuring resource utilization because they collect samples only while your application is running.
You can sample an application manually by running it many times and randomly interrupting it to see what stack trace it halts on. The `perf` tool automates this process by calling into a CPU component called a performance monitoring unit (PMU). A PMU can sample your application thousands of times per run.
The `crypto-bot` application is compute-constrained, not network-constrained. This means it has a resource utilization problem, so you’ll need to use a sampler like `perf` to optimize it.
Profiling With Perf
The Docker image you built already comes with `perf` installed. To test the `perf` installation, sample the `sleep` command:
perf record sleep 1
You should see something like this:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (8 samples) ]
Since `sleep` spends most of its time sleeping, `perf` only recorded a handful of samples from it.
Try sampling `crypto-bot` by running the following command:
perf record .build/release/crypto-bot
This will result in something similar to:
FTM-PERP: 7094 update(s)
...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.210 MB perf.data (5099 samples) ]
Above the `perf record` footer, you’ll see that `perf` reprinted all the normal output from `crypto-bot` as it was running.
If you look closer, you’ll notice `perf` collected many more samples from `crypto-bot` than it did from `sleep`, and it saved them to a file called perf.data. This file is called a performance profile. The `perf` tool has another command, `script`, that you can use to load samples from a profile.
Loading Samples With Perf Script
Try running `perf script` without any arguments:
perf script
This command is interactive. Press Space or the arrow keys to navigate, and Q to exit. If you’re running Docker natively on Linux, you’ll see something like:
perf       16 7056.026690:    161 cycles:ppp: ffffffffa1c735bb native_write_msr ([kernel.kallsyms])
perf       16 7056.026696:   1031 cycles:ppp: ffffffffa1c735b4 native_write_msr ([kernel.kallsyms])
perf       16 7056.026702:   6650 cycles:ppp: ffffffffa1c0d345 intel_tfa_pmu_enable_all ([kernel.kallsyms])
crypto-bot 16 7056.026712:  41385 cycles:ppp: ffffffffa208f6d5 apparmor_bprm_committing_creds ([kernel.kallsyms])
crypto-bot 16 7056.026763: 155508 cycles:ppp: ffffffffa1e58909 unmap_region ([kernel.kallsyms])
crypto-bot 16 7056.027434: 233299 cycles:ppp: ffffffffa2296747 _extract_crng ([kernel.kallsyms])
...
If you’re running Docker on macOS you’ll see something like:
crypto-bot 857 338493.651057: 250000 cpu-clock: ffff800010c7030c [unknown] ([kernel.kallsyms])
crypto-bot 857 338493.651306: 250000 cpu-clock: ffff838a6ce4 _dl_relocate_object (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.651558: 250000 cpu-clock: ffff800010643a10 [unknown] ([kernel.kallsyms])
crypto-bot 857 338493.651807: 250000 cpu-clock: ffff838a6cd8 _dl_relocate_object (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.652056: 250000 cpu-clock: ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.652305: 250000 cpu-clock: ffff838a4c6c do_lookup_x (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.652556: 250000 cpu-clock: ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.652806: 250000 cpu-clock: ffff838a4c2c do_lookup_x (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.653056: 250000 cpu-clock: ffff838b3b10 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.653305: 250000 cpu-clock: ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.653556: 250000 cpu-clock: ffff800010643a7c [unknown] ([kernel.kallsyms])
crypto-bot 857 338493.653806: 250000 cpu-clock: ffff838b3b24 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.654056: 250000 cpu-clock: ffff838b3ab0 strcmp (/usr/lib64/ld-2.26.so)
crypto-bot 857 338493.654306: 250000 cpu-clock: ffff82ff4808 malloc_consolidate (/usr/lib64/libc-2.26.so)
crypto-bot 857 338493.654556: 250000 cpu-clock: ffff800010c7030c [unknown] ([kernel.kallsyms])
...
This output contains a human-readable description of each sample in perf.data. By default, `perf record` names its output perf.data, so by default, `perf script` looks for a file with this name.
Note: `perf script` output can run wide in the terminal. Some of the following output snippets use ellipses (`...`) to represent elided output.
You can name the profile something else with the `--output` option, or `-o` for short. Try generating a profile named profile.perf by rerunning `perf record` with the following options:
perf record -o profile.perf .build/release/crypto-bot
To load it in `perf script`, use the `--input` option, or `-i` for short:
perf script -i profile.perf
Each line represents a single sample, so if `perf record` recorded 5099 samples, `perf script` emits 5099 lines of output. You can customize the output with the `--fields` option, or `-F` for short.
Getting Task Information
The default `perf script` output is complex, so try running `perf script` again with the `-F comm` option:
perf script -i profile.perf -F comm
The `comm` keyword stands for “command”; accordingly, you’ll see `perf script` print the command associated with each sample:
perf
perf
crypto-bot
crypto-bot
You should see many samples associated with `crypto-bot` and a smaller number of samples associated with `perf` itself.
You can get the thread group and thread identifier for each sample by enabling the `pid` and `tid` fields. Pass multiple fields to `perf script` as a comma-separated list:
perf script -i profile.perf -F comm,pid,tid
You’ll see a new column of output containing two numbers. The first number is the thread group identifier; the second number is the thread identifier. The `perf script` tool always prints the fields in the same order, no matter what order you entered them in.
perf       542/542
perf       542/542
crypto-bot 542/542
crypto-bot 542/542
Because `crypto-bot` doesn’t use concurrency, every sample comes from the same thread. If you examined a performance profile from a concurrent application, you would see samples with many thread identifiers, but the same thread group identifier.
Getting Timeline Information
You can get the time stamp for each sample by enabling the `time` field. Try adding `time` to the fields list:
perf script -i profile.perf -F comm,pid,tid,time
perf       542/542 7040.081164:
perf       542/542 7040.081168:
crypto-bot 542/542 7040.081176:
crypto-bot 542/542 7040.081243:
The timestamps represent seconds elapsed since the system booted up.
Performance profiles can store many kinds of events. To get the event type for each sample, enable the `event` field:
perf script -i profile.perf -F comm,pid,tid,time,event
You’ll see a new column containing the event type for each sample. If you have access to hardware events, you’ll see:
perf       542/542 7040.081164: cycles:ppp:
perf       542/542 7040.081168: cycles:ppp:
crypto-bot 542/542 7040.081176: cycles:ppp:
crypto-bot 542/542 7040.081243: cycles:ppp:
All samples in profile.perf came from the `cycles:ppp` event. The name means “(very) precise CPU cycles”. This event plugs into a cycle counter in the PMU, and triggers a sample at every multiple of some interval.
If you’re running in an environment where the Docker container doesn’t have access to hardware events (like on macOS), you’ll see:
crypto-bot 861/861 364407.450503: cpu-clock:
crypto-bot 861/861 364407.450641: cpu-clock:
crypto-bot 861/861 364407.450901: cpu-clock:
crypto-bot 861/861 364407.452287: cpu-clock:
crypto-bot 861/861 364407.452632: cpu-clock:
In this instance, all samples came from the `cpu-clock` event. This event is driven by a timer `perf` creates, and it triggers a sample at every multiple of some interval.
You can view the sampling intervals by enabling the `period` field:
perf script -i profile.perf -F comm,pid,tid,time,event,period
Note that `perf script` formats the `period` column before the `event` name. On macOS hosts, you’ll see a regular time interval:
crypto-bot 861/861 364407.450503: 250000 cpu-clock:
crypto-bot 861/861 364407.450641: 250000 cpu-clock:
crypto-bot 861/861 364407.450901: 250000 cpu-clock:
crypto-bot 861/861 364407.452287: 250000 cpu-clock:
On Linux, you’ll see a different period:
perf       542/542 7040.081164:   1012 cycles:ppp:
perf       542/542 7040.081168:   8668 cycles:ppp:
crypto-bot 542/542 7040.081176:  72998 cycles:ppp:
crypto-bot 542/542 7040.081243: 358563 cycles:ppp:
The sampling interval varies across samples because the kernel tunes it as your application runs. If you scroll down far enough, you should see that the kernel eventually herds it into an equilibrium.
crypto-bot 542/542 7040.086985: 325029 cycles:ppp:
crypto-bot 542/542 7040.087241: 322303 cycles:ppp:
crypto-bot 542/542 7040.087495: 321438 cycles:ppp:
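You can estimate that equilibrium interval yourself by averaging the `period` column. The sketch below assumes the field layout from the `-F comm,pid,tid,time,event,period` output above, where the period is the fourth whitespace-separated field; a here-doc stands in for real `perf script` output:

```shell
# Average the sampling period (4th field) across samples.
# Pipe `perf script -i profile.perf -F comm,pid,tid,time,event,period`
# into this awk program in a real session.
awk '{ sum += $4; n++ } END { printf "%d samples, mean period %.0f cycles\n", n, sum / n }' <<'EOF'
crypto-bot 542/542 7040.086985: 325029 cycles:ppp:
crypto-bot 542/542 7040.087241: 322303 cycles:ppp:
crypto-bot 542/542 7040.087495: 321438 cycles:ppp:
EOF
```

For the three samples above, this prints a mean period of roughly 323,000 cycles, matching the equilibrium the kernel settled on.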
Getting Landmark Information
Sampling works by taking snapshots of the CPU’s instruction pointer (IP). You can view the instruction pointers in the `ip` field:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip
perf       542/542 7040.081164:   1012 cycles:ppp: ffffffff878735b4
perf       542/542 7040.081168:   8668 cycles:ppp: ffffffff884016f2
crypto-bot 542/542 7040.081176:  72998 cycles:ppp: ffffffff87ad38f0
crypto-bot 542/542 7040.081243: 358563 cycles:ppp: ffffffff88277427
By itself, this field isn’t too useful. To get the string names of the symbols associated with the instruction pointers, enable the `sym` field. To get their distances from the instruction pointers, enable the `symoff` field, which stands for symbol offsets. Add `sym` and `symoff` to the list of fields:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff
...   1012 cycles:ppp: ffffffff878735b4 native_write_msr+0x4
...   8668 cycles:ppp: ffffffff884016f2 nmi_restore+0x25
...  72998 cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0
... 358563 cycles:ppp: ffffffff88277427 clear_page_erms+0x7
You can also view the binaries that the instructions originated from in the `dso` field. The name stands for dynamic shared object. Try adding it to the list of fields:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
You’ll see the file path to the binary in parentheses after the symbol name and offset, or `[kernel.kallsyms]` if the symbol lives in the kernel.
... cycles:ppp: ffffffff878735b4 native_write_msr+0x4 ([kernel.kallsyms])
... cycles:ppp: ffffffff884016f2 nmi_restore+0x25 ([kernel.kallsyms])
... cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0 ([kernel.kallsyms])
... cycles:ppp: ffffffff88277427 clear_page_erms+0x7 ([kernel.kallsyms])
Scroll down the list of samples and look for binaries you recognize. You should easily find:

- /usr/lib/swift/linux/libswiftCore.so: This is the Swift standard library.
- /crypto-bot/.build/release/crypto-bot: This is the `crypto-bot` executable itself.
Working With Perf Events
It’s possible to capture events besides CPU `cycles`. Although it’s only a subset of the events most PMUs support, `perf record` can also capture:

- instructions: Similar to `cycles`, except it samples on a counter that tracks instructions instead of clock cycles. The two can give varying results because some instructions take multiple cycles to complete.
- branches: Samples on a counter that tracks branch instructions. Branch instructions are critical because they are very expensive if incorrectly predicted.
- branch-misses: Samples on a counter that tracks branch predictor misses.
- cache-misses: Samples on a counter that tracks CPU cache misses.
In a normal optimization workflow, you would use the general `cycles` event to determine which parts of your code are slow, and then you’d use the more specialized events to figure out why those areas are slow.

You can get the full list of events your PMU supports with the `perf list` command:
perf list
List of pre-defined events (to be used in -e):
  branch-instructions OR branches    [Hardware event]
  branch-misses                      [Hardware event]
  bus-cycles                         [Hardware event]
  ...
  alignment-faults                   [Software event]
  bpf-output                         [Software event]
  ...
Your output might be different. Besides the name of the event, `perf list` tells you what kind of event (hardware, software and so on) it is.

To record an event besides `cycles`, pass its name to `perf record`’s `--event` option, or its shortened form `-e`. To record multiple events at once, pass their names as a comma-separated list. For example, to sample branch misses and cache misses, pass the following options:
perf record -o branch-and-cache-misses.perf -e branch-misses,cache-misses .build/release/crypto-bot
Try viewing the output in `perf script`:
perf script -i branch-and-cache-misses.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
You’ll see events of types `branch-misses` and `cache-misses` in place of `cycles:ppp`.
...  1 branch-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms])
...  1 cache-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms])
... 31 branch-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...
... 84 cache-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...
Collecting Call Graph Traces
So far, you’ve learned how to reconstruct a timeline of `crypto-bot`’s execution and read off some basic information about which function frame was at the top of the stack at each point in time. To get a better idea of what’s happening in an application, you don’t only want to know what it’s executing at a given moment, but also what sequence of calls it took to get there. This information is called a call graph trace. It’s like a stack dump, except there’s one for every sample in the recording.
Tracing With Frame Pointers
One way to capture a call graph trace is to follow the frame pointer (FP). In Swift and many other compiled languages, frame pointers form a linked list that debuggers and profilers can follow to reconstruct a call graph. Rerun `perf record`, this time with the `--call-graph fp` option:
perf record -o profile.perf --call-graph fp .build/release/crypto-bot
You’ll see `perf` write more than twice as much data to disk as it did last time.
...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.630 MB profile.perf (5121 samples) ]
Call graph traces take up a lot of space, which is why `perf` doesn’t record them by default.

Note: If a binary was compiled with `-fomit-frame-pointer`, you can’t generate call graph traces for it.
Viewing Call Graph Traces
Try viewing the regenerated profile in `perf script`:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
In place of the `ip`, `sym`, `symoff` and `dso` fields, `perf script` now produces multiple rows of these fields for each sample in the profile. Because each sample now spans multiple lines, `perf script` separates samples with empty lines.
perf 14/14 8358.627472: 1 cycles:ppp:
    7fff8e810345 intel_tfa_pmu_enable_all+0x800071803035 ([kernel.kallsyms])
    7fff8e80a598 x86_pmu_enable+0x800071803118 ([kernel.kallsyms])
    ...
    7fff8e806fd7 do_syscall_64+0x800071803057 ([kernel.kallsyms])
    7fff8f40308c entry_SYSCALL_64_after_hwframe+0x800071803044 ([kernel.k...
The first row contains the same `ip`, `sym`, `symoff` and `dso` fields as the profile without call graph information. The subsequent rows list the function frames that were below it when `perf` recorded the sample.
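Because samples are now blank-line-separated records, awk’s “paragraph mode” (`RS = ""`) is a convenient way to process one sample at a time. As a small sketch, with a here-doc standing in for real call-graph output, this counts the samples in a trace:

```shell
# Count call-graph samples: with --call-graph, perf script separates
# samples with blank lines, so setting RS = "" makes awk read one
# whole sample per record. The here-doc imitates that layout.
awk 'BEGIN { RS = "" } END { print NR, "samples" }' <<'EOF'
perf 14/14 8358.627472: 1 cycles:ppp:
    7fff8e810345 intel_tfa_pmu_enable_all+0x35 ([kernel.kallsyms])
    7fff8e80a598 x86_pmu_enable+0x118 ([kernel.kallsyms])

crypto-bot 14/14 8358.627490: 4100 cycles:ppp:
    3aa9e decode+0x2e (/crypto-bot/.build/release/crypto-bot)
    3f6b6 main+0xa6 (/crypto-bot/.build/release/crypto-bot)
EOF
```

The same paragraph-mode trick lets you filter whole stacks — for example, keeping only records that mention a particular function — instead of matching isolated lines.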
The `perf record` tool supports other tracing methods besides frame pointer tracing, including:

- dwarf: Uses compiler-generated metadata in the binaries in place of frame pointers. This generates more detailed information, but also increases profiling overhead.
- lbr: Uses the last branch record (LBR) facility in the CPU in place of frame pointers. Only Intel CPUs support this tracing method.
Working With Call Graph Traces
Call graph traces contain a lot of information, not all of which is relevant to your optimization problem.
Often, the startup phase of your application differs from its running phase. You can exclude the startup phase from the recording by setting a delay with `perf record`’s `--delay` option, or `-D` for short. You express the delay in milliseconds. For example, to make `perf record` wait 500 milliseconds before recording:
perf record --delay=500 --call-graph fp -o profile-delayed.perf .build/release/crypto-bot
You should see `perf` record fewer samples than it did without the delay.
...
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.336 MB profile-delayed.perf (2827 samples) ]
For more flexible time-slicing, you can attach a recorder to an already-running process with the `--pid` option:
.build/release/crypto-bot & perf record --call-graph fp -o profile-attached.perf --pid $(pidof crypto-bot)
Note: If `crypto-bot` exits before `perf record` attaches to it, `pidof crypto-bot` doesn’t return anything.
You can also limit the number of samples `perf` records overall with the `--freq` option, or `-F` for short. The name stands for “sampling frequency.” On average, `perf` will attempt to collect the number of samples you specify every second. Try setting it to 100 samples per second:
perf record --freq 100 --call-graph fp -o profile-lite.perf .build/release/crypto-bot
You’ll see `perf` recorded only a fraction of the samples it did before.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.027 MB profile-lite.perf (114 samples) ]
Configuring Perf for Swift
If you scroll through enough of the `crypto-bot` samples in `perf script`, you might notice the call graphs are detailed for the C-language portions of the call chain, but relatively sparse for the Swift-language portions. You might see a lot of `[unknown]` symbols, short call chains and call chains where the only resolved Swift symbols are general-purpose entry points such as `swift_retain` and `swift_release`.
Enabling DWARF Metadata
You might not see a lot of call chain information because the Swift compiler performs a lot of transformations on the Swift code you write. The machine code coming out of the compiler ends up looking significantly different from the source code that went in. C compilers are comparatively transparent in how they transform C code into machine code, which makes it easier to map points in a call graph to meaningful landmarks in source code.
To address that problem, Swift provides the missing binary-to-source mappings as DWARF metadata. As you’ve seen, `perf record` supports loading DWARF metadata through its `--call-graph` option. Set it to `dwarf` mode:
perf record --call-graph dwarf -o profile.perf .build/release/crypto-bot
On ARM environments, run:
perf record --call-graph fp -F 999 -o profile.perf .build/release/crypto-bot
If using DWARF mode, you’ll see `perf` record much more data to profile.perf than it did in frame pointer mode. For long-running applications, `perf` in DWARF mode might record more data than you can manageably store, in which case you might need to reduce the sampling frequency.
Open profile.perf in `perf script` again, and scroll down to the `crypto-bot` samples:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
You’ll see far more detailed information about `crypto-bot`’s Swift call chain:
crypto-bot 86/86 4678.431018: 2273989 cycles:ppp:
    39eb03 $sSe4fromxs7Decoder_p_tKcfCTj+0xffff006bdedf2003 (/usr/lib/swift...
    27288 $s4JSONAAO10DictionaryV6decode_6forKeyqd__qd__m_xtKSeRd+0xffff54...
    28118 $s4JSONAAO10DictionaryVy_xGs30KeyedDecodingContainerProtocolAAsA...
    27f10 $s4JSONAAO10DictionaryVy_xGs30KeyedDecodingContainerProtocolAAsA...
    ...
    3aa9e $s10crypto_bot6decode7messageAA3FTXO7MessageVSg4JSONAIO_tF+0xfff...
    3f6b6 $s10crypto_bot4MainO4mainyyKFZTf4d_n+0xffff5446a31ba0a6 (/crypto...
Demangling Swift Function Names
You might recognize some function names in the call graph symbols. They look strange because the Swift compiler mangled them. Demangle them with the `swift demangle` tool, which is part of the Swift toolchain. For example, to demangle `s10crypto_bot6decode7messageAA3FTXO7MessageVSg4JSONAIO_tF`, run the following in a terminal:
swift demangle s10crypto_bot6decode7messageAA3FTXO7MessageVSg4JSONAIO_tF
crypto_bot.decode(message: JSON.JSON) -> crypto_bot.FTX.Message?
Note: You can also demangle symbols programmatically by calling the `swift_demangle` symbol from the Swift runtime. However, that’s out of the scope of this tutorial.
You can pipe the output of another tool through `swift demangle`, in which case it will automatically recognize all the mangled Swift symbols in the input and replace them with human-readable descriptions. To mass-demangle all the symbols from `perf script`, run the following command in a terminal:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso | swift demangle > profile.txt
The final shell redirection (`>`) in this command saves the demangled call graphs to a file named profile.txt.
Putting It All Together
Sifting through profiling data is a science of its own, and this tutorial can only scratch the surface. However, one fast way to get actionable statistics about your application is to use the `grep` and `wc` tools to count occurrences of function names you’re interested in.
Counting Function Names
First, regenerate a performance profile for `crypto-bot`, and save it to a file named unoptimized.perf for comparison:
perf record --call-graph dwarf -o unoptimized.perf .build/release/crypto-bot
On ARM and Apple Silicon run:
perf record --call-graph fp -F 999 -o unoptimized.perf .build/release/crypto-bot
Recall that the sample code contains an `@inline(never)` function called `decode(message:)` that wraps its JSON decoding implementation. The `@inline(never)` attribute forbids the compiler from restructuring its invocation, so you can be reasonably confident the number of samples containing `crypto_bot.decode(message:)` in their call graph traces reflects the amount of time spent executing that function.
This isn’t an entirely sound method of measuring the performance of an application. Notably, it doesn’t account for the variability of the sampling period, which can skew the results. But in a pinch, it can be a useful proxy metric.
To count occurrences of `crypto_bot.decode(message:`, run the following in the terminal:
perf script -i unoptimized.perf -F ip,sym | swift demangle | grep crypto_bot\.decode\(message\: | wc -l
This command contains four piped subcommands:

- perf script: This deserializes the binary unoptimized.perf file. It only loads the `ip` and `sym` fields because the others are unnecessary here.
- swift demangle: This allows you to search for demangled function names instead of mangled names. If you knew the mangled name of `crypto_bot.decode(message:)` ahead of time, this step wouldn’t be necessary.
- grep: This searches for the string `crypto_bot.decode(message:`, which is distinct enough to not return any false positives.
- wc: This counts the number of lines piped to its input. Because `grep` prints each match on a separate line, this gives the number of matches `grep` returned.
The unoptimized profile should contain 2,000 to 3,000 instances of `crypto_bot.decode(message:`. If you’re using frame pointers, you’ll see around 1,000 instances.
Observing Changes in Performance
To demonstrate what a successful optimization might look like when using this method, return to the `while` loop in `Main.main` in Sources/crypto-bot/example.swift. Replace the call to `decode(message:)` with a call to the more efficient `decodeFast(message:)` function:
guard let message = decodeFast(message: json) else { continue }
Recompile the application:
swift build -c release
Then, generate a new performance profile named optimized.perf:
perf record --call-graph dwarf -o optimized.perf .build/release/crypto-bot
If running on Apple Silicon or ARM, run:
perf record --call-graph fp -F 999 -o optimized.perf .build/release/crypto-bot
Rerun the commands from the last section, this time using the optimized.perf data file and the search string `decodeFast(message:`:
perf script -i optimized.perf -F ip,sym | swift demangle | grep crypto_bot\.decodeFast\(message\: | wc -l
This time, you should see only a few hundred occurrences of `decodeFast(message:` on x86 machines, or a few tens on ARM, which is compelling evidence that the `decodeFast(message:)` implementation is faster.
Where to Go From Here?
You can download the dockerfile and sources for `crypto-bot` by clicking the Download Materials button below this section.
In this tutorial, you’ve learned how to record an application with `perf`, how to view and interpret the recordings, how to generate call graph traces, and how to configure `perf` to produce detailed profiles specifically for Swift binaries. You’ve also learned some basic techniques for post-processing and sifting through this data programmatically.
This tutorial should give you enough of an understanding of the fundamentals of performance sampling for you to start applying these techniques to your own projects. The data science of analyzing a performance profile is a huge topic, and much more remains to discover on your own!
To learn more about profiling Swift on Linux, check the Server-Side Swift performance guide. If you’re more interested in profiling Swift on macOS, check our Instruments Tutorial with Swift: Getting Started. I’ve also written about Low-level Swift Optimization Tips on my own blog.
If you have any suggestions, questions or performance-profiling tips you’d like to share, join the discussion below.