zig.internals/internals.rst at master · mikdusan/zig.internals · GitHub

 2 years ago
source link: https://github.com/mikdusan/zig.internals/blob/master/internals.rst

Zig Compiler Internals

note: Due to limitations of this article format we overload diff syntax highlighting to achieve a highlighting effect for code listings. Consequently, highlighted lines will display an exclamation-mark at the beginning of each line. This mark should be ignored.

1   Introduction

The Zig compiler is implemented mostly in C++ with some parts in Zig userland.

Long-term goals (in no particular order) for the compiler are as follows:

  • become self hosting
  • add a fast backend (non-optimizing machine code generator)
  • add fine-grained incremental builds
  • continue to improve safe-mode code generation

2   Abstract

This article aims to document various internal aspects of the Zig Programming Language and bootstrap newcomers interested in debugging/contributing to the project.

3   Compiler Pipeline

The Zig compiler architecture pipeline is as follows:

  • consume Zig source code
  • generate tokens (LEX)
  • generate abstract syntax tree (AST)
  • generate Src internal representation (SIR)
  • generate Gen internal representation (GIR)
  • generate LLVM internal representation (LLVM-IR)
  • emit machine code

3.1   Generate LEX

Source code is consumed and tokens are generated by tokenizer.cpp .

3.2   Generate AST

Tokens are consumed and the AST is generated by parser.cpp .

3.3   Generate SIR

AST is consumed and SIR is generated by analyze.cpp and ir.cpp .

  • execute comptime
  • resolve comptime types
  • apply result location semantics

Zig has two parts to its internal representation, SIR and GIR, where the "S" in Src-IR indicates that it comes from the source side of the pipeline and the "G" in Gen-IR indicates that it is heading towards the machine code generation side.

Both SIR and GIR are colloquially known as IR.

3.4   Generate GIR

SIR is consumed and GIR is generated by ir.cpp .

3.5   Generate LLVM-IR

GIR is consumed and LLVM-IR is generated by codegen.cpp .
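
As a compact summary of sections 3.1 to 3.5, the stage-to-source-file mapping can be written down as a small lookup table. This is only an illustrative sketch: the Stage struct, stages() and sources_for() are hypothetical names and do not exist in the compiler; only the file names are taken from the text above.

```cpp
#include <string>
#include <vector>

// One row per pipeline stage, in the order described above.
struct Stage {
    std::string output;   // what the stage produces
    std::string sources;  // C++ file(s) named in the text
};

// Hypothetical table; the real compiler has no such structure.
const std::vector<Stage>& stages() {
    static const std::vector<Stage> table = {
        {"LEX", "tokenizer.cpp"},
        {"AST", "parser.cpp"},
        {"SIR", "analyze.cpp, ir.cpp"},
        {"GIR", "ir.cpp"},
        {"LLVM-IR", "codegen.cpp"},
    };
    return table;
}

// Returns the source file(s) that produce `output`, or "" if unknown.
std::string sources_for(const std::string& output) {
    for (const Stage& s : stages()) {
        if (s.output == output) return s.sources;
    }
    return "";
}
```

For example, sources_for("GIR") yields "ir.cpp", which is a reasonable first place to set a breakpoint when a bug appears between SIR and GIR.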

4   Reading IR

This section briefly describes the textual representation of IR, using the following example source, reduction.zig:

export fn reduction() u64 {
    var i: u64 = 999;
    i += 333;
    return i;
}

4.1   SIR

SIR listing for reduction.zig:

fn reduction() { // (IR)
Entry_0:
    #1  | ResetResult           | (unknown)   | - | ResetResult(none)
    #2  | ResetResult           | (unknown)   | - | ResetResult(none)
    #3  | ResetResult           | (unknown)   | - | ResetResult(none)
    #4  | Const                 | type        | 2 | u64
    #5  | EndExpr               | (unknown)   | - | EndExpr(result=none,value=u64)
    #6  | Const                 | bool        | 2 | false
    #7  | AllocaSrc             | (unknown)   | 1 | Alloca(align=(null),name=i)
    #8  | ResetResult           | (unknown)   | - | ResetResult(var(#7))
    #9  | ResetResult           | (unknown)   | - | ResetResult(none)
    #10 | Const                 | comptime_int| 2 | 999
    #11 | EndExpr               | (unknown)   | - | EndExpr(result=none,value=999)
    #12 | ImplicitCast          | (unknown)   | 1 | @implicitCast(u64,999)
    #13 | EndExpr               | (unknown)   | - | EndExpr(result=var(#7),value=#12)
    #14 | DeclVarSrc            | void        | - | var i = #7 // comptime = false
    #15 | ResetResult           | (unknown)   | - | ResetResult(none)
    #16 | ResetResult           | (unknown)   | - | ResetResult(none)
    #17 | VarPtr                | (unknown)   | 2 | &i
    #18 | LoadPtr               | (unknown)   | 1 | #17.*
    #19 | ResetResult           | (unknown)   | - | ResetResult(none)
    #20 | Const                 | comptime_int| 2 | 333
    #21 | EndExpr               | (unknown)   | - | EndExpr(result=none,value=333)
    #22 | BinOp                 | (unknown)   | 1 | #18 + 333
    #23 | StorePtr              | void        | - | *#17 = #22
    #24 | Const                 | void        | 2 | {}
    #25 | EndExpr               | (unknown)   | - | EndExpr(result=none,value={})
    #26 | CheckStatementIsVoid  | (unknown)   | - | @checkStatementIsVoid({})
    #27 | ResetResult           | (unknown)   | - | ResetResult(none)
    #28 | ResetResult           | (unknown)   | - | ResetResult(return)
    #29 | VarPtr                | (unknown)   | 1 | &i
    #30 | LoadPtr               | (unknown)   | 4 | #29.*
    #31 | EndExpr               | (unknown)   | - | EndExpr(result=return,value=#30)
    #32 | AddImplicitReturnType | (unknown)   | - | @addImplicitReturnType(#30)
    #35 | TestErrSrc            | (unknown)   | 2 | @testError(#30)
    #36 | TestComptime          | (unknown)   | 3 | @testComptime(#35)
    #37 | CondBr                | noreturn    | - | if (#35) $ErrRetErr_33 else $ErrRetOk_34 // comptime = #36
ErrRetErr_33:
    #39 | SaveErrRetAddr        | (unknown)   | - | @saveErrRetAddr()
    #40 | Br                    | noreturn    | - | goto $RetStmt_38 // comptime = #36
ErrRetOk_34:
    #41 | Br                    | noreturn    | - | goto $RetStmt_38 // comptime = #36
RetStmt_38:
    #42 | Return                | noreturn    | - | return #30
}

Each line represents one SIR instruction in tabular form, with columns as follows:

  1. debug-id which is unique to the function body
  2. trimmed C++ struct name representing an instruction type
  3. Zig type for the instruction as an expression
  4. reference count for the instruction
  5. syntax (string representation) of the instruction

Intermixed between instructions are basic-block labels in style <name>_<debug-id>:
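
When a listing gets long it can help to post-process it with a small tool. The following is a minimal sketch, assuming the five columns are always separated by '|' exactly as in the listing above; IrLine and parse_ir_line are hypothetical names and are not part of the compiler.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The five columns of one listing line, kept as text.
struct IrLine {
    std::string debug_id;   // e.g. "#22"
    std::string inst;       // trimmed C++ struct name, e.g. "BinOp"
    std::string type;       // Zig type, e.g. "u64" or "(unknown)"
    std::string ref_count;  // "-" when the listing shows no count
    std::string syntax;     // string representation of the instruction
};

// Strip leading and trailing spaces and tabs.
static std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Split on the first four '|' separators only, so a '|' inside the
// syntax column stays part of that column. Returns false for lines
// that are not instructions (block labels, braces, blank lines).
bool parse_ir_line(const std::string& line, IrLine& out) {
    std::vector<std::string> cols;
    std::size_t start = 0;
    while (cols.size() < 4) {
        const std::size_t bar = line.find('|', start);
        if (bar == std::string::npos) return false;
        cols.push_back(trim(line.substr(start, bar - start)));
        start = bar + 1;
    }
    cols.push_back(trim(line.substr(start)));
    // Instruction ids start with '#' (or ':' in analyzed listings).
    if (cols[0].empty() || (cols[0][0] != '#' && cols[0][0] != ':')) return false;
    out = IrLine{cols[0], cols[1], cols[2], cols[3], cols[4]};
    return true;
}
```

Block labels such as Entry_0: contain no '|' and are rejected, so a caller can treat any line that fails to parse as a label or as function framing.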

4.2   GIR

GIR listing for reduction.zig:

fn reduction() { // (analyzed)
Entry_0:
    #16 | StorePtr              | void        | - | *#12 = 999
    :12 | AllocaGen             | *u64        | 2 | Alloca(align=0,name=i)
    #17 | DeclVarGen            | void        | - | var i: u64 align(8) = #12 // comptime = false
    #20 | VarPtr                | *u64        | 2 | &i
    #21 | LoadPtrGen            | u64         | 1 | loadptr(#20)result=(null)
    #26 | BinOp                 | u64         | 1 | #21 + 333
    #27 | StorePtr              | void        | - | *#20 = #26
    #33 | VarPtr                | *u64        | 1 | &i
    #34 | LoadPtrGen            | u64         | 1 | loadptr(#33)result=(null)
    #39 | Return                | noreturn    | - | return #34
}

GIR closely resembles SIR but contains fewer instructions, because many SIR instructions have already been consumed by the pipeline. Bear in mind a few things:

  • the debug-ids from GIR have no correlation to those from SIR
  • many SIR instructions are illegal in GIR
  • all types are resolved

One entry in column 1 looks different: :12 rather than #12. The instruction is an AllocaGen, and #16 on the line before it references it as #12. The : prefix indicates that the previous instruction references this one, but that it is not code-generated at that position in the listing. Instead, all AllocaGen instructions are code-generated at the very beginning of the function, before anything else.
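
To see the order in which instructions are actually emitted, the :-marked lines can be moved to the front. This is a simplified sketch that operates on the instruction lines of a listing as plain strings (block labels are not handled); is_hoisted and codegen_order are hypothetical helper names, not compiler functions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when the first non-blank character is ':', the marker the
// listing uses for instructions emitted at function entry.
bool is_hoisted(const std::string& line) {
    const std::size_t i = line.find_first_not_of(" \t");
    return i != std::string::npos && line[i] == ':';
}

// Returns the lines with every ':'-marked instruction moved to the
// front. Relative order within each group is preserved.
std::vector<std::string> codegen_order(const std::vector<std::string>& listing) {
    std::vector<std::string> hoisted;
    std::vector<std::string> rest;
    for (const std::string& line : listing) {
        (is_hoisted(line) ? hoisted : rest).push_back(line);
    }
    hoisted.insert(hoisted.end(), rest.begin(), rest.end());
    return hoisted;
}
```

Applied to the first three lines of the GIR listing above, the AllocaGen line :12 comes first, followed by #16 and #17 in their original order.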

5   Common IR Instruction Set

5.1   general

5.1.1   BinOp

IrInstructionBinOp represents a binary operation.

syntax:

<BinOp> ::= <op1> <op_id> <op2>

op1
    first operand
op_id
    one of: BoolOr, BoolAnd, CmpEq, CmpNotEq, CmpLessThan, CmpGreaterThan, CmpLessOrEq, CmpGreaterOrEq, BinOr, BinXor, BinAnd, BitShiftLeftLossy, BitShiftLeftExact, BitShiftRightLossy, BitShiftRightExact, Add, AddWrap, Sub, SubWrap, Mult, MultWrap, DivUnspecified, DivExact, DivTrunc, DivFloor, RemUnspecified, RemRem, RemMod, ArrayCat, ArrayMult, MergeErrorSets
op2
    second operand

source-reduction → GIR:

export fn reduction(one: u64, two: u64) void {
    var a: u64 = one + two;
}
  fn reduction() { // (analyzed)
  Entry_0:
      #10 | VarPtr                | *const u64  | 1 | &one
!     #11 | LoadPtrGen            | u64         | 1 | loadptr(#10)result=(null)
      #14 | VarPtr                | *const u64  | 1 | &two
!     #15 | LoadPtrGen            | u64         | 1 | loadptr(#14)result=(null)
!     #17 | BinOp                 | u64         | 1 | #11 + #15
      #20 | StorePtr              | void        | - | *#19 = #17
      :19 | AllocaGen             | *u64        | 2 | Alloca(align=0,name=a)
      #22 | DeclVarGen            | void        | - | var a: u64 align(8) = #19 // comptime = false
      #26 | Return                | noreturn    | - | return {}
  }
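
The op_id list pairs several operators with a Wrap variant. The sketch below models the difference for Add and AddWrap on u64 operands, assuming Add corresponds to Zig's + (overflow is detected) and AddWrap to +% (result wraps). The OpId enum and eval_binop are hypothetical and only illustrate the semantics; they are not the compiler's implementation.

```cpp
#include <cstdint>
#include <limits>

// Two of the op_id values listed above; the enum itself is hypothetical.
enum class OpId { Add, AddWrap };

// Evaluates `op1 <op_id> op2` for u64 operands. Returns false when a
// checked Add would overflow; AddWrap always succeeds.
bool eval_binop(OpId op, std::uint64_t op1, std::uint64_t op2, std::uint64_t& result) {
    const std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
    if (op == OpId::Add && op1 > max - op2) {
        return false;  // overflow: a safety-checked build would panic here
    }
    result = op1 + op2;  // unsigned addition wraps modulo 2^64 in C++
    return true;
}
```

With the values from reduction.zig, eval_binop(OpId::Add, 999, 333, r) succeeds and stores 1332 in r.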

5.1.2   Const

IrInstructionConst is a compile-time instruction.

syntax:

<Const> ::= <value>

value
    comptime value

source-reduction → SIR:

export fn reduction() void {
   _ = true;
}
  fn reduction() { // (IR)
  Entry_0:
      #1  | ResetResult           | (unknown)   | - | ResetResult(none)
      #2  | ResetResult           | (unknown)   | - | ResetResult(none)
      #3  | ResetResult           | (unknown)   | - | ResetResult(none)
      #4  | Const                 | *void       | 1 | *_
      #5  | ResetResult           | (unknown)   | - | ResetResult(inst(*_))
      #6  | Const                 | bool        | 1 | true
      #7  | EndExpr               | (unknown)   | - | EndExpr(result=inst(*_),value=true)
!     #8  | Const                 | void        | 2 | {}
      #9  | EndExpr               | (unknown)   | - | EndExpr(result=none,value={})
      #10 | CheckStatementIsVoid  | (unknown)   | - | @checkStatementIsVoid({})
      #11 | Const                 | void        | 0 | {}
      #12 | Const                 | void        | 3 | {}
      #13 | EndExpr               | (unknown)   | - | EndExpr(result=none,value={})
      #14 | AddImplicitReturnType | (unknown)   | - | @addImplicitReturnType({})
      #15 | Return                | noreturn    | - | return {}
  }

5.2   terminators

5.2.1   Br

IrInstructionBr unconditionally transfers control flow to another basic-block.

syntax:

<Br> ::= "goto" "$"<dest_block>

dest_block
    branch to take

source-reduction → GIR:

export fn reduction(cond: bool) void {
    var a: u64 = 999;
    if (cond) {
        a += 333;
    }
}
  fn reduction() { // (analyzed)
  Entry_0:
      #16 | StorePtr              | void        | - | *#12 = 999
      :12 | AllocaGen             | *u64        | 2 | Alloca(align=0,name=a)
      #17 | DeclVarGen            | void        | - | var a: u64 align(8) = #12 // comptime = false
      #20 | VarPtr                | *const bool | 1 | &cond
      #21 | LoadPtrGen            | bool        | 1 | loadptr(#20)result=(null)
      #27 | CondBr                | noreturn    | - | if (#21) $Then_25 else $Else_26
  Then_25:
      #30 | VarPtr                | *u64        | 2 | &a
      #31 | LoadPtrGen            | u64         | 1 | loadptr(#30)result=(null)
      #36 | BinOp                 | u64         | 1 | #31 + 333
      #37 | StorePtr              | void        | - | *#30 = #36
!     #47 | Br                    | noreturn    | - | goto $EndIf_43
  Else_26:
!     #50 | Br                    | noreturn    | - | goto $EndIf_43
! EndIf_43:
      #57 | Return                | noreturn    | - | return {}
  }

5.2.2   CondBr

IrInstructionCondBr conditionally transfers control flow to other basic-blocks.

syntax:

<CondBr> ::= "if" "(" <condition> ")" "$"<then_block> "else" "$"<else_block>

condition
    is evaluated as a bool
then_block
    branch taken if condition == true
else_block
    branch taken if condition == false

source-reduction → GIR:

export fn reduction(cond: bool) void {
    var a: u64 = 999;
    if (cond) {
        a += 333;
    } else {
        a -= 333;
    }
}
  fn reduction() { // (analyzed)
  Entry_0:
      #16 | StorePtr              | void        | - | *#12 = 999
      :12 | AllocaGen             | *u64        | 2 | Alloca(align=0,name=a)
      #17 | DeclVarGen            | void        | - | var a: u64 align(8) = #12 // comptime = false
      #20 | VarPtr                | *const bool | 1 | &cond
      #21 | LoadPtrGen            | bool        | 1 | loadptr(#20)result=(null)
!     #27 | CondBr                | noreturn    | - | if (#21) $Then_25 else $Else_26
! Then_25:
      #30 | VarPtr                | *u64        | 2 | &a
      #31 | LoadPtrGen            | u64         | 1 | loadptr(#30)result=(null)
      #36 | BinOp                 | u64         | 1 | #31 + 333
      #37 | StorePtr              | void        | - | *#30 = #36
      #60 | Br                    | noreturn    | - | goto $EndIf_56
! Else_26:
      #44 | VarPtr                | *u64        | 2 | &a
      #45 | LoadPtrGen            | u64         | 1 | loadptr(#44)result=(null)
      #50 | BinOp                 | u64         | 1 | #45 - 333
      #51 | StorePtr              | void        | - | *#44 = #50
      #63 | Br                    | noreturn    | - | goto $EndIf_56
  EndIf_56:
      #70 | Return                | noreturn    | - | return {}
  }
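
One way to read a listing with several basic blocks is to walk it by hand as a small state machine. The sketch below does this for the CondBr example above; the Block enum and run_reduction are hypothetical, and the function returns the final value of a only so that it can be observed (the original function returns {}).

```cpp
#include <cstdint>

// Basic blocks from the listing above, named after their labels.
enum class Block { Entry_0, Then_25, Else_26, EndIf_56 };

// Hand-written walk over the control-flow graph of the GIR listing.
std::uint64_t run_reduction(bool cond) {
    std::uint64_t a = 0;
    Block bb = Block::Entry_0;
    for (;;) {
        switch (bb) {
        case Block::Entry_0:
            a = 999;                                      // #16 StorePtr
            bb = cond ? Block::Then_25 : Block::Else_26;  // #27 CondBr
            break;
        case Block::Then_25:
            a += 333;              // #36 BinOp, #37 StorePtr
            bb = Block::EndIf_56;  // #60 Br
            break;
        case Block::Else_26:
            a -= 333;              // #50 BinOp, #51 StorePtr
            bb = Block::EndIf_56;  // #63 Br
            break;
        case Block::EndIf_56:
            return a;              // #70 Return
        }
    }
}
```

Every block ends in exactly one terminator (CondBr, Br or Return), which is why each case assigns bb or returns exactly once.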

5.2.3   Return

IrInstructionReturn unconditionally transfers control flow back to the caller.

syntax:

<Return> ::= "return" <value>

value
    value returned to the caller, {} for void

source-reduction → GIR:

export fn reduction() void {}
 fn reduction() { // (analyzed)
 Entry_0:
!    #5  | Return                | noreturn    | - | return {}
 }

6   Compiler Building

6.1   Overview

  • cmake
  • compile common C++ sources
  • compile userland.o C++ sources
  • link zig0 stage0 compiler
  • compile libuserland.a Zig sources
  • link zig stage1 compiler

userland.o
    This is a shim implementation of libuserland.a, written entirely in C++. All exported symbols must match those of libuserland.a. zig0 links against the shim but never calls into it; every shim is implemented as a panic.

zig0
    Also known as the stage0 compiler. It links against userland.o and is functionally limited, but it is robust enough to build libuserland.a.

zig0 can build Zig source code, run tests and produce executables. It can be debugged with a native debugger such as gdb or lldb. But it cannot do things like zig0 build ... because part of that functionality is implemented in libuserland.a.

During Zig compiler development it may be useful to develop against zig0 in an iterative fashion.

Here is an example of using stage0 to emit IR and LLVM-IR:

$ _build/zig0 --override-std-dir std --override-lib-dir . build-obj reduction.zig --verbose-ir --verbose-llvm-ir

and a corresponding example of launching lldb debugger:

$ lldb _build/zig0 -- --override-std-dir std --override-lib-dir . build-obj reduction.zig

libuserland.a
    This is a support library implemented in Zig userland. It replaces all shims from userland.o with implementations. zig links against this library instead of userland.o.

zig
    Also known as the stage1 compiler. It links against libuserland.a and is a fully functional compiler. It can be debugged with a native debugger such as gdb or lldb.

7   How-To: Common Tasks

7.1   iteratively build compiler

note: for stage1 replace zig0 with zig:

using make:

$ make -C _build zig0
$ _build/zig0 --override-std-dir std --override-lib-dir . version

using ninja:

$ ninja -C _build zig0
$ _build/zig0 --override-std-dir std --override-lib-dir . version

7.2   debug compiler

note: for stage1 replace zig0 with zig:

using gdb:

$ _build/zig0 --override-std-dir std --override-lib-dir . build-obj foobar.zig
segmentation fault
$ gdb --args _build/zig0 --override-std-dir std --override-lib-dir . build-obj foobar.zig

using lldb:

$ _build/zig0 --override-std-dir std --override-lib-dir . build-obj foobar.zig
segmentation fault
$ lldb _build/zig0 -- --override-std-dir std --override-lib-dir . build-obj foobar.zig

7.3   debug: print instruction source location

using lldb:

  (lldb) frame variable instruction
  (IrInstructionSliceSrc *) instruction = 0x0000000108156910
! (lldb) p instruction->base.source_node->src()
  ~/zig/work/bounds1.zig:3:23

7.4   print IR listing

note: for stage1 replace zig0 with zig:

$ _build/zig0 --override-std-dir std --override-lib-dir . build-obj reduction.zig --verbose-ir

pro-tip: to reduce IR noise add this to reduction.zig:

// override panic handler to reduce IR noise
pub fn panic(msg: []const u8, error_return_trace: ?*@import("builtin").StackTrace) noreturn {
    while (true) {}
}

7.5   configure for ninja

$ cd ~/zig/work
$ mkdir _build
$ cmake -G Ninja -S . -B _build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/opt/zig -DCMAKE_PREFIX_PATH=/opt/llvm-8.0.1

7.6   behavior tests

These are language-fundamental tests like flow-control, types, alignment, pointers, optionals, slices, arrays. It is crucial the compiler can pass these tests after making internal changes.

direct

The most fine-grained way to run tests is via the zig test ... command. Here we run the unit tests for while flow control:

_build/zig0 --override-std-dir std --override-lib-dir . test test/stage1/behavior/while.zig

1/20 test "while loop"...OK
2/20 test "static eval while"...OK
3/20 test "continue and break"...OK
4/20 test "return with implicit cast from while loop"...OK
5/20 test "while with continue expression"...OK
6/20 test "while with else"...OK
7/20 test "while with optional as condition"...OK
8/20 test "while with optional as condition with else"...OK
9/20 test "while with error union condition"...OK
10/20 test "while on optional with else result follow else prong"...OK
11/20 test "while on optional with else result follow break prong"...OK
12/20 test "while on error union with else result follow else prong"...OK
13/20 test "while on error union with else result follow break prong"...OK
14/20 test "while on bool with else result follow else prong"...OK
15/20 test "while on bool with else result follow break prong"...OK
16/20 test "break from outer while loop"...OK
17/20 test "continue outer while loop"...OK
18/20 test "while bool 2 break statements and an else"...OK
19/20 test "while optional 2 break statements and an else"...OK
20/20 test "while error 2 break statements and an else"...OK

and it can be restricted even further with simple filtering:

_build/zig0 --override-std-dir std --override-lib-dir . test test/stage1/behavior/while.zig --test-filter bool

1/3 test "while on bool with else result follow else prong"...OK
2/3 test "while on bool with else result follow break prong"...OK
3/3 test "while bool 2 break statements and an else"...OK
All tests passed.
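
The filtered run above is consistent with plain substring matching on the test name. The sketch below reproduces that behaviour on a handful of names from the listing. Substring matching is an assumption about how --test-filter works, and sample_names and filter_tests are hypothetical helpers, not part of the test runner.

```cpp
#include <string>
#include <vector>

// A few of the test names from the while.zig run above.
std::vector<std::string> sample_names() {
    return {
        "while loop",
        "while with else",
        "while on bool with else result follow else prong",
        "while on bool with else result follow break prong",
        "break from outer while loop",
        "while bool 2 break statements and an else",
    };
}

// Keep the names that contain `filter` as a substring.
std::vector<std::string> filter_tests(const std::vector<std::string>& names,
                                      const std::string& filter) {
    std::vector<std::string> kept;
    for (const std::string& name : names) {
        if (name.find(filter) != std::string::npos) kept.push_back(name);
    }
    return kept;
}
```

Filtering sample_names() with "bool" keeps the same three tests as the filtered run shown above.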

via build

When the compiler is able to compile build.zig, larger test suites can be used. Here we run all the behavior tests with the following restrictions:

  • skip repeating tests for the --release-safe and --release-fast compiler modes
  • skip repeating tests for non-native platforms (run for host only)
  • still run tests for target permutations such as freestanding, libc, single-threaded and multi-threaded
  • filter for tests with break in the name

_build/zig build --override-std-dir std --override-lib-dir . test-behavior -Dskip-release -Dskip-non-native -Dtest-filter=break

8   Best Practices

8.1   Always direct stage0 to workspace

It is recommended to override std and lib dirs for zig0.

zig build functionality is responsible for completing a compiler install. Since zig0 development is likely to involve writing tests and making userland changes, those files cannot be installed until development has progressed to stage1. Overriding both directories points zig0 at the workspace copies instead:

$ _build/zig0 --override-std-dir std --override-lib-dir . build-obj reduction.zig

8.2   Reduce and Reduce and Reduce Again

Whether tracking down a bug or investigating compiler internals it's a good idea to reduce exposure to unrelated things.

  1. Source related issues should be reduced as much as possible. Any superfluous source can easily lead to an unnecessary loss of clarity and wasted time.
  2. When tracking compiler segfaults try also to reduce the compiler environment:
    • if crashing during zig run, zig test or zig build then try zig build-obj instead
    • check file/directory permissions, including zig-cache if active (remember, there are 2 caches)
    • identify where the segfault is coming from: userland or compiler?
    • sanity-check the compiler's dependencies against the official build instructions
