Linker notes on Power ISA
source link: https://maskray.me/blog/2023-02-26-linker-notes-on-power-isa
UNDER CONSTRUCTION
This article describes target-specific things about Power ISA in ELF linkers. The architecture was originally named "PowerPC". In 2006 the architecture was rebranded as "Power ISA". The ISA manual says: "In 2006, Freescale and IBM collaborated on the creation of the Power ISA Version 2.03, which represented the reunification of the architecture by combining Book E content with the more general purpose PowerPC Version 2.02."
The terms "PowerPC" and "powerpc" remain popular in numerous places, including the powerpc-*-*-* and powerpc64-*-*-* in official target triple names. The abbreviation "PPC" ("ppc") is used in numerous places as well. For simplicity, I will refer to the 32-bit architecture as "PPC32" and the 64-bit architecture as "PPC64".
ABI documents
- Power Architecture™ 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded revised in 2011.
- 64-bit PowerPC ELF Application Binary Interface Supplement 1.9. This is commonly referred to as ELFv1 and is obsolete. Some targets still use this ABI.
- 64-Bit ELF V2 ABI Specification: Power Architecture
The 32-bit ELF ABI receives little attention from maintainers and remains relevant only among some enthusiasts. In 2019, I spent one week studying the PPC32 ABI and added the PPC32 port to ld.lld.
For a 64-bit object file, the presence of a section .opd is a good indicator for ELFv1. e_flags being 2 is a good indicator for ELFv2. e_flags being 0 is either an ELFv1 object file, or an object file not using any feature affected by the differences.
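The heuristic above can be sketched as a small classifier. This is only an illustration, not how any particular linker does it: the function names read_e_flags and ppc64_abi_hint are made up, the mask EF_PPC64_ABI (3) and the value 1 for ELFv1 come from the ELF header definitions rather than from the text above, and the caller is assumed to have checked for a .opd section separately.

```python
import struct

EM_PPC64 = 21       # e_machine for 64-bit PowerPC
EF_PPC64_ABI = 3    # the low two bits of e_flags hold the ABI version


def read_e_flags(header: bytes) -> int:
    """Return e_flags from the first 64 bytes of an ELF64 PPC64 file."""
    if len(header) < 64 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if header[4] != 2:  # EI_CLASS: 2 means ELFCLASS64
        raise ValueError("not a 64-bit ELF file")
    endian = "<" if header[5] == 1 else ">"  # EI_DATA: 1 means little-endian
    (e_machine,) = struct.unpack_from(endian + "H", header, 18)
    if e_machine != EM_PPC64:
        raise ValueError("not a PPC64 object")
    (e_flags,) = struct.unpack_from(endian + "I", header, 48)
    return e_flags


def ppc64_abi_hint(e_flags: int, has_opd: bool) -> str:
    """Guess the ABI from e_flags and the presence of .opd (hypothetical helper)."""
    abi = e_flags & EF_PPC64_ABI
    if abi == 2:
        return "ELFv2"
    if abi == 1 or has_opd:
        return "ELFv1"
    # e_flags is 0 and there is no .opd: the file does not say.
    return "ELFv1 or ABI-neutral"
```

The last case is deliberately left ambiguous, matching the observation that an e_flags of 0 does not distinguish ELFv1 from an object that is unaffected by the differences.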
A new ABI for little-endian PowerPC64 Design & Implementation (2014) describes the motivation for introducing ELFv2.
Global Offset Table
PPC32 GOT
On PPC32, _GLOBAL_OFFSET_TABLE_ is defined at the start of the section .got. .got has 3 reserved entries. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc sysdeps/powerpc/powerpc32/dl-machine.h.
_GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map).
.plt is like .got.plt for other architectures. .plt[n] holds the address of a PLT entry (somewhere in .glink).
Like x86-32, PPC32 lacks memory loads with PC-relative addressing. As a poor man's replacement, PPC32 sets up r30 to hold a GOT base for PIC code. The GOT base is different for small PIC and large PIC.
- For -fpic and -fpie, r30 refers to _GLOBAL_OFFSET_TABLE_ in the component.
- For -fPIC and -fPIE, r30 points into .got2 (at .got2+0x8000) for the current translation unit. This has implications for PLT-generating relocations as we will see below.
.section ".got2","aw"
.align 2
.LCTOC1 = .+32768
.LC0:
.long var
...
bcl 20,31,.L2
.L2:
mflr 30 # r30 = lr
addis 30,30,.LCTOC1-.L2@ha
addi 30,30,.LCTOC1-.L2@l # finish setting up the GOT base
lwz 9,.LC0-.LCTOC1(30) # load the address of var relative to the GOT base
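The addis/addi pair above relies on the @ha and @l operators splitting a 32-bit value into two 16-bit immediates. A minimal sketch of that arithmetic follows; the helper names ha, lo, and addis_addi are illustrative only, and the model ignores everything about the instructions except the immediate arithmetic.

```python
def lo(value: int) -> int:
    """@l: the low 16 bits, interpreted as a signed displacement."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


def ha(value: int) -> int:
    """@ha: the high 16 bits, adjusted so that (ha << 16) + lo == value."""
    return ((value + 0x8000) >> 16) & 0xFFFF


def addis_addi(base: int, value: int) -> int:
    """Model `addis rD,rA,value@ha` followed by `addi rD,rD,value@l` (32-bit)."""
    high = ha(value)
    if high & 0x8000:  # addis sign-extends its 16-bit immediate
        high -= 0x10000
    return (base + (high << 16) + lo(value)) & 0xFFFFFFFF
```

The +0x8000 in ha compensates for the low half being sign-extended: when the low 16 bits have the top bit set, lo is negative, so the high half has to be one larger.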
The component may have multiple translation units and each has a different .got2. In the output file, .got2 in one file may have an arbitrary offset relative to the output .got2.
PPC64 GOT
On PPC64, .got has 1 reserved entry, which holds the link-time address of the symbol .TOC. (the trailing dot is part of the name). .TOC. is defined at the start of the section .got plus 0x8000.
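One way to read the 0x8000 bias: loads and adds that take a signed 16-bit displacement or immediate can reach 32KiB on either side of r2, so placing .TOC. 0x8000 past the start of .got makes the first 64KiB addressable with a single instruction. The sketch below only illustrates this arithmetic; the function names and the example address used with them are hypothetical.

```python
TOC_BIAS = 0x8000


def toc_base(got_start: int) -> int:
    """.TOC. as described above: the start of .got plus 0x8000."""
    return got_start + TOC_BIAS


def reachable_with_d16(toc: int, addr: int) -> bool:
    """True if addr - toc fits in a signed 16-bit displacement."""
    return -0x8000 <= addr - toc <= 0x7FFF
```

For a hypothetical .got at 0x10020000, toc_base gives 0x10028000, and every address from 0x10020000 through 0x1002FFFF passes reachable_with_d16.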
PPC64 ELFv2 Table of Contents (TOC)
Unlike most architectures, PPC64 uses .toc instead of .got to hold the addresses of global variables and address-taken functions.
extern int var0, var1;
int foo() { return var0 + var1; }
addis 3, 2, .LC0@toc@ha
addis 4, 2, .LC1@toc@ha
ld 3, .LC0@toc@l(3)
ld 4, .LC1@toc@l(4)
lwz 3, 0(3)
lwz 4, 0(4)
add 3, 4, 3
extsw 3, 3
blr
.section .toc,"aw",@progbits
.LC0:
.tc var0[TC],var0
.LC1:
.tc var1[TC],var1
With .got, relocatable object files do not reference .got directly. The TOC scheme, by contrast, may be thought of as a compiler-managed GOT: .toc is explicit in relocatable object files. A .tc directive is a fancy way to produce a R_PPC64_ADDR64 relocation. If the linker decides to create a TOC entry, the entry will be a link-time constant (-no-pie) or be associated with a dynamic relocation (-pie or -shared).
The TOC layout is under the control of the compiler, and presumably the compiler can leverage better information to optimize the layout for locality. Well, I disagree with this point. The compiler does not have global information. A linker is better placed to do such link-time optimization.
.plt is like .got.plt for other architectures. .plt has the type SHT_NOBITS and an alignment of 4.
TOC-indirect to TOC-relative optimization
See All about Global Offset Table#GOT optimization.
Procedure Linkage Table
PPC32 PLT
Power Architecture® 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded specifies two PLT ABIs: BSS-PLT and Secure-PLT.
BSS-PLT is the older one. While .plt on other architectures is created by the linker, BSS-PLT lets ld.so generate the PLT entries. This has the advantage that the section can be made SHT_NOBITS and therefore does not occupy file size. The downside is the security concern of writable and executable memory pages. Even worse, as an implementation issue, GNU ld places .plt in the text segment and therefore the whole text segment is writable and executable. -z relro -z now has no effect.
In the newer Secure-PLT ABI, .plt holds the table of function addresses. .plt is like .got.plt for other architectures.
The linker synthesizes .glink, which is like .plt for other architectures. Unlike most architectures, .glink has a footer rather than a header. Each PLT entry is either b footer or a nop falling through to the footer. In ld.lld, we only use b footer for simplicity. See https://reviews.llvm.org/D75394 for PPC32GlinkSection in ld.lld.
000102b4 <.glink>:
b 0x102c0 <.glink+0xc>
b 0x102c0 <.glink+0xc>
b 0x102c0 <.glink+0xc>
addis 11, 11, 0 # start of the resolver
mflr 0
bcl 20, 31, 0x102cc <.glink+0x18>
addi 11, 11, 24
mflr 12
mtlr 0
sub 11, 11, 12
addis 12, 12, 1
lwz 0, 184(12)
lwz 12, 188(12)
mtctr 0
add 0, 11, 11
add 11, 0, 11
bctr
nop
nop
For non-PIC code, a possibly preemptible branch uses the relocation type R_PPC_REL24.
bl foo # R_PPC_REL24
bl foo # R_PPC_REL24
If the call target is preemptible, the linker creates a non-PIC call stub and redirects the caller's branch instruction to the call stub. The non-PIC call stub uses absolute addressing to load .plt[n] into r11 (call-clobbered) and branches there. This is different from most other architectures where the caller can branch directly to the PLT entry.
bl 00000000.plt_call32.f
bl 00000000.plt_call32.f
...
00000000.plt_call32.f:
lis 11, .plt[n]@ha
lwz 11, .plt[n]@l(11)
mtctr 11
bctr
For PIC code, a branch to a possibly preemptible target uses R_PPC_PLTREL24 as the PLT-generating relocation type. The addend encodes r30 set up by the caller. Yes, this is unusual.
- For -fpic and -fpie, the addend is 0.
- For -fPIC and -fPIE, the addend is 0x8000. Linking this relocatable object file in -r mode may increase the addend.
If the call target is preemptible, the linker creates a PIC call stub and redirects the caller's branch instruction to the call stub. GNU ld names a small PIC call stub *.plt_pic32.* and a large PIC call stub *.got2.plt_pic32.*.
The call stub knows the value of r30 (GOT base) set up by the caller. The distance from .plt[n] to r30 is a constant. The call stub computes the address of .plt[n], loads the entry, and branches there.
00000000.plt_pic32.f:
## If the GOT offset is beyond 64KiB
addis 11, 30, .plt[n]-_GLOBAL_OFFSET_TABLE_@ha
lwz 11, .plt[n]-_GLOBAL_OFFSET_TABLE_@l(11)
mtctr 11
bctr
## If the GOT offset is within 64KiB
# lwz 11, .plt[n]-_GLOBAL_OFFSET_TABLE_(30)
# mtctr 11
# bctr
# nop
00000000.got2.plt_pic32.f:
## .got2 refers to the copy belonging to the current translation unit.
## Different translation units have to use different stubs.
addis 11, 30, .plt[n]-(.got2+0x8000)@ha
lwz 11, .plt[n]-(.got2+0x8000)@l(11)
mtctr 11
bctr
## The case when the GOT offset is within 64KiB is similar to plt_pic32.f.
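The choice between the short and the long stub form can be sketched as follows. This is a simplified model under stated assumptions, not GNU ld's or ld.lld's actual code: the function names caller_r30 and pic_call_stub, their parameters, and all addresses used with them are hypothetical, an addend of at least 0x8000 is taken to mean large PIC, and "within 64KiB" is modelled as the offset fitting in a signed 16-bit displacement.

```python
def lo(value: int) -> int:
    """@l as a signed 16-bit displacement."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


def ha(value: int) -> int:
    """@ha as a signed 16-bit immediate."""
    high = ((value + 0x8000) >> 16) & 0xFFFF
    return high - 0x10000 if high & 0x8000 else high


def caller_r30(got_base: int, got2_of_file: int, addend: int) -> int:
    """The r30 value implied by the R_PPC_PLTREL24 addend."""
    if addend >= 0x8000:
        # Large PIC: r30 points into this file's piece of .got2.
        return got2_of_file + addend
    # Small PIC: r30 is _GLOBAL_OFFSET_TABLE_.
    return got_base


def pic_call_stub(plt_slot: int, r30: int) -> list:
    """Return the stub as a list of assembly lines."""
    offset = plt_slot - r30
    if ha(offset) == 0:
        # A single load with a 16-bit displacement reaches the slot.
        return [f"lwz 11, {lo(offset)}(30)", "mtctr 11", "bctr", "nop"]
    return [
        f"addis 11, 30, {ha(offset)}",
        f"lwz 11, {lo(offset)}(11)",
        "mtctr 11",
        "bctr",
    ]
```

Because caller_r30 depends on which file's .got2 piece is involved, two translation units calling the same function may end up with different large PIC stubs.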
Setting up r30 is extremely expensive. A function that tail calls another one requires all of the following instructions:
<foo>:
stwu 1, -16(1) # allocate stack
mflr 0
bcl 20, 31, 0x1bc # set lr to PC
stw 30, 8(1) # save r30 which is used as the GOT base
mflr 30
addis 30, 30, 2 # high 16 bits of the GOT base (.got2+0x8000)
stw 0, 20(1) # save lr (copied to r0)
addi 30, 30, 32140 # low 16 bits of the GOT base (.got2+0x8000)
bl 0x1f0
lwz 0, 20(1)
lwz 30, 8(1)
addi 1, 1, 16
mtlr 0
blr
PPC64 ELFv2 PLT
.plt is like .got.plt for other architectures. .plt[n] holds the address of a PLT entry (somewhere in .glink).
.glink is like .plt for other architectures. .glink has a header of 60 bytes. Each PLT entry consists of one instruction b .plt. The PLT header subtracts the address of the first PLT entry from r12 to compute the PLT index.
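The index computation can be sketched in a few lines. The 60-byte header comes from the text above; the 4-byte entry size follows from each entry being a single instruction. The function names and the .glink address used with them are hypothetical, and the sketch assumes r12 holds the address of the entry that was branched to.

```python
GLINK_HEADER_SIZE = 60  # bytes, per the description above
GLINK_ENTRY_SIZE = 4    # one 4-byte branch instruction per entry


def glink_entry_address(glink_start: int, index: int) -> int:
    """Address of the index-th PLT entry inside .glink."""
    return glink_start + GLINK_HEADER_SIZE + index * GLINK_ENTRY_SIZE


def plt_index(glink_start: int, r12: int) -> int:
    """Recover the PLT index from r12, as the header does."""
    first_entry = glink_start + GLINK_HEADER_SIZE
    delta = r12 - first_entry
    if delta < 0 or delta % GLINK_ENTRY_SIZE:
        raise ValueError("r12 does not point at a PLT entry")
    return delta // GLINK_ENTRY_SIZE
```

Keeping every entry the same size is what makes the subtraction sufficient: no per-entry index needs to be stored.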
An unconditional branch instruction b/bl may use either R_PPC64_REL24 or R_PPC64_REL24_NOTOC. R_PPC64_REL24 indicates that the caller uses TOC. R_PPC64_REL24_NOTOC indicates that the caller does not use TOC or preserve r2.
If a PLT entry is needed, the linker creates a traditional or PC-relative PLT call stub, and redirects the caller's branch instruction to the call stub. This is different from most other architectures where an indirection is unneeded.
Thread Local Storage
Both PPC32 and PPC64 use TLS Variant I: the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block.
The linker performs TLS optimization.
See All about thread-local storage.
Workaround for old IBM XL compilers
R_PPC64_TLSGD or R_PPC64_TLSLD is required to mark bl __tls_get_addr for General Dynamic/Local Dynamic code sequences.
addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA
addi r3, r3, x@got@tlsgd@l # R_PPC64_GOT_TLSGD16_LO
bl __tls_get_addr(x@tlsgd) # R_PPC64_TLSGD followed by R_PPC64_REL24
nop
However, there are two deviations from the above:
- Direct call to __tls_get_addr. This is essential to implement rtld in glibc/musl/FreeBSD.
bl __tls_get_addr
nop
This is only used in a -shared link, and thus not subject to the GD/LD to IE/LE relaxation issue below.
- Missing R_PPC64_TLSGD/R_PPC64_TLSLD for compiler generated TLS references.
According to Stefan Pintilie, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD."
It is unfortunate, but in short ld.lld needs to work around the old IBM XL compiler issue. Otherwise, if the object file is linked in -no-pie or -pie mode, the result will be incorrect because the 4 instructions are only partially rewritten (the latter 2 are left unchanged).
Range extension thunks
On PPC32, an unconditional branch instruction b/bl has a range of ±32MiB and may use 3 relocation types: R_PPC_LOCAL24PC, R_PPC_REL24, and R_PPC_PLTREL24. If the target is not reachable from the instruction location, a range extension thunk will be used. R_PPC_LOCAL24PC is a useless relocation. All occurrences can be replaced with R_PPC_REL24.
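The reachability test behind this can be sketched as below. The limits follow from the 24-bit word offset in b/bl, which gives a signed 26-bit byte offset. The function names are illustrative only and do not correspond to any linker's API.

```python
BRANCH_MIN = -(1 << 25)     # -32 MiB
BRANCH_MAX = (1 << 25) - 4  # +32 MiB minus one instruction
LI_MASK = 0x03FFFFFC        # the 24-bit LI field of b/bl, shifted left by 2


def needs_range_extension_thunk(branch_addr: int, target_addr: int) -> bool:
    """True if a b/bl at branch_addr cannot reach target_addr directly."""
    offset = target_addr - branch_addr
    return not (BRANCH_MIN <= offset <= BRANCH_MAX)


def encode_rel24(insn: int, branch_addr: int, target_addr: int) -> int:
    """Patch the LI field of a b/bl instruction word with the PC-relative offset."""
    offset = target_addr - branch_addr
    if offset % 4:
        raise ValueError("branch targets must be 4-byte aligned")
    if needs_range_extension_thunk(branch_addr, target_addr):
        raise ValueError("out of range: a thunk is needed")
    return (insn & ~LI_MASK) | (offset & LI_MASK)
```

When the check fails, the branch is pointed at a nearby thunk instead, and the thunk performs the long-distance jump.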
Interop between PC-relative and TOC functions
TODO --power10-stubs/--no-power10-stubs