基于 LLVM 自制编译器(6)——语言扩展：自定义运算符

发表于 2022-08-28

更新于 2022-09-09

| 分类于 Compiler

目前为止，Kaleidoscope 已经是一门功能齐全且有用的编程语言了。但是，它仍然存在一个很大的问题。当前的 Kaleidoscope 缺少很多有用的运算符，比如：取反、比较等。

本章，我们将为 Kaleidoscope 支持自定义运算符，从而让 Kaleidoscope 具备更强大的编程能力。

我们希望为 Kaleidoscope 支持的 运算符重载（Operator Overloading）能够比 C++ 等语言更加通用。在 C++ 中，我们只能够重新定义已存在的运算符，我们不能以编程方式修改其语法，也不能引入新的运算符、修改运算符优先级等。在本章中，我们将为 Kaleidoscope 支持这种更加强大的能力。

目前为止，我们通过手写的方式实现了一个递归下降的解析器，能够解析目前我们所定义的表达式，包括语法、运算符优先级等。通过运算符优先级解析，我们可以非常简单地引入新的运算符。

本文我们将扩展两种运算符：

一元运算符（Unary Operator）
二元运算符（Binary Operator）

如下所示，为 Kaleidoscope 中自定义一元运算符和二元运算符的示例。

# Logical unary not.
def unary!(v)
  if v then
    0
  else
    1;

# Define > with the same precedence as <.
def binary> 10 (LHS RHS)
  RHS < LHS;

# Binary "logical or", (note that it does not "short circuit")
def binary| 5 (LHS RHS)
  if LHS then
    1
  else if RHS then
    1
  else
    0;

# Define = with slightly lower precedence than relationals.
def binary= 9 (LHS RHS)
  !(LHS < RHS | LHS > RHS);

很多语言希望能够在语言本身中实现其标准运行时库。在本章中，我们会在 Kaleidoscope 中将自定义运算符在其标准库中实现。

词法分析器扩展

无论是一元运算符还是二元运算符，对于词法分析器而言，两者只不过是关键词不同，分别是：unary 和 binary。因此，我们为 unary 和 binary 分别定义对应的 token 类型和 token 值，如下所示。

enum Token {
  ...
  // operators
  tok_binary = -11,
  tok_unary = -12
};
...
static int gettok() {
...
    if (IdentifierStr == "for")
      return tok_for;
    if (IdentifierStr == "in")
      return tok_in;
    if (IdentifierStr == "binary")
      return tok_binary;
    if (IdentifierStr == "unary")
      return tok_unary;
    return tok_identifier;

运算符的定义解析

从上述的目标可以看出，一元运算符和二元运算符均使用函数的方式进行定义。与普通函数定义不同的是，一元运算符和二元运算符的函数原型略有不同。因此，我们需要对 PrototypeAST 进行修改，从而支持解析两种类型的运算符。

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its argument names as well as if it is an operator.
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
  bool IsOperator;
  unsigned Precedence;  // Precedence if a binary op.

public:
  PrototypeAST(const std::string &name, std::vector<std::string> Args,
               bool IsOperator = false, unsigned Prec = 0)
  : Name(name), Args(std::move(Args)), IsOperator(IsOperator),
    Precedence(Prec) {}

Function *codegen();
  const std::string &getName() const { return Name; }

bool isUnaryOp() const { return IsOperator && Args.size() == 1; }
  bool isBinaryOp() const { return IsOperator && Args.size() == 2; }

char getOperatorName() const {
    assert(isUnaryOp() || isBinaryOp());
    return Name[Name.size() - 1];
  }

unsigned getBinaryPrecedence() const { return Precedence; }
};

我们在原始的 PrototypeAST 定义的基础上，新增两个属性用于表示普通的函数原型和自定义运算符。

IsOperator 表示是否是运算符。
Precedence 表示运算符的优先级。优先级仅用于二元运算符。

通过修改 PrototypeAST，使其支持识别一元运算符和二元运算符的定义后，我们来进一步实现具体的解析逻辑，如下所示。

/// prototype
///   ::= id '(' id* ')'
///   ::= binary LETTER number? (id, id)
static std::unique_ptr<PrototypeAST> ParsePrototype() {
  std::string FnName;

unsigned Kind = 0;  // 0 = identifier, 1 = unary, 2 = binary.
  unsigned BinaryPrecedence = 30;

switch (CurTok) {
  default:
    return LogErrorP("Expected function name in prototype");
  case tok_identifier:
    FnName = IdentifierStr;
    Kind = 0;
    getNextToken();
    break;
 case tok_unary:
    getNextToken();
    if (!isascii(CurTok))
      return LogErrorP("Expected unary operator");
    FnName = "unary";
    FnName += (char)CurTok;
    Kind = 1;
    getNextToken();
    break;
  case tok_binary:
    getNextToken();
    if (!isascii(CurTok))
      return LogErrorP("Expected binary operator");
    FnName = "binary";
    FnName += (char)CurTok;
    Kind = 2;
    getNextToken();

// Read the precedence if present.
    if (CurTok == tok_number) {
      if (NumVal < 1 || NumVal > 100)
        return LogErrorP("Invalid precedence: must be 1..100");
      BinaryPrecedence = (unsigned)NumVal;
      getNextToken();
    }
    break;
  }

if (CurTok != '(')
    return LogErrorP("Expected '(' in prototype");

std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return LogErrorP("Expected ')' in prototype");

// success.
  getNextToken();  // eat ')'.

// Verify right number of names for operator.
  if (Kind && ArgNames.size() != Kind)
    return LogErrorP("Invalid number of operands for operator");

return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames), Kind != 0,
                                         BinaryPrecedence);
}

在 ParsePrototype 的解析逻辑中，我们通过解析运算符得到 FnName 变量，将其作为自定义运算符的构建名称，比如：为 @ 操作符构建 binary@ 的名称。由于 LLVM 符号表中的符号允许包含任何字符，所以我们可以将构建名称（如：binary@）存入符号表。

这里，一元运算符和二元运算符的定义的解析逻辑非常相似，唯一的区别在于，二元运算符需要额外解析运算符优先级。

运算符的表达式解析

上一节，我们介绍了如何解析运算符定义。本节，我们来介绍如何解析运算符表达式。

在第 2 章中，我们已经介绍了 Kaleidoscope 中定义的三种类型的 AST 结构，分别是：表达式、原型、函数。其中，我们已经定义了一种表达式子类型 BinaryExprAST，用于表示二元运算符表达式。因此，我们只需要额外定义一种新的表达式子类型 UnaryExprAST 表示一元运算符表达式即可。

UnaryExprAST 的具体定义如下所示。其中，Opcode 表示运算符符号，Operand 表示运算符所作用的操作数。

/// UnaryExprAST - Expression class for a unary operator.
class UnaryExprAST : public ExprAST {
  char Opcode;
  std::unique_ptr<ExprAST> Operand;

public:
  UnaryExprAST(char Opcode, std::unique_ptr<ExprAST> Operand)
    : Opcode(Opcode), Operand(std::move(Operand)) {}

Value *codegen() override;
};

接下来，我们来分别看有一下二元运算符和一元运算符的代码生成实现逻辑。

二元表达式代码生成

如下所示，为二元表达式实现代码生成的逻辑。我们仅仅是在原有的代码生成逻辑中进行了扩展。

Value *BinaryExprAST::codegen() {
  Value *L = LHS->codegen();
  Value *R = RHS->codegen();
  if (!L || !R)
    return nullptr;

switch (Op) {
  case '+':
    return Builder.CreateFAdd(L, R, "addtmp");
  case '-':
    return Builder.CreateFSub(L, R, "subtmp");
  case '*':
    return Builder.CreateFMul(L, R, "multmp");
  case '<':
    L = Builder.CreateFCmpULT(L, R, "cmptmp");
    // Convert bool 0/1 to double 0.0 or 1.0
    return Builder.CreateUIToFP(L, Type::getDoubleTy(TheContext),
                                "booltmp");
  default:
    break;
  }

// If it wasn't a builtin binary operator, it must be a user defined one. Emit
  // a call to it.
  Function *F = getFunction(std::string("binary") + Op);
  assert(F && "binary operator not found!");

Value *Ops[2] = { L, R };
  return Builder.CreateCall(F, Ops, "binop");
}

在解析二元表达式时，我们需要处理两种情况：

处理 默认运算符，比如：+、-、*、< 等。我们只需要调用 Builder 生成对应的 LLVM IR 即可。
处理 自定义运算符，比如：@、| 等。我们只需要在符号表中查找对应的运算符，并生成对函数（如：binary@）的调用，从而生成 LLVM IR。由于自定义运算符只是作为普通函数构建的，所以不存在额外的特殊逻辑。

一元表达式代码生成

如下所示，为一元表达式实现代码生成的逻辑。

Value *UnaryExprAST::codegen() {
  Value *OperandV = Operand->codegen();
  if (!OperandV)
    return nullptr;

Function *F = getFunction(std::string("unary") + Opcode);
  if (!F)
    return LogErrorV("Unknown unary operator");

return Builder.CreateCall(F, OperandV, "unop");
}

其生成逻辑非常简单，通过在符号表中查找对应的运算符，并生成对函数（如：binary!）的调用，从而生成 LLVM IR。

通用表达式解析优化

由于现在我们新增了一种表达式类型 UnaryExprAST，因此，我们需要对原始的通用表达式解析函数进行优化，以支持解析一元表达式。

/// expression
///   ::= unary binoprhs
///
static std::unique_ptr<ExprAST> ParseExpression() {
  auto LHS = ParseUnary();
  if (!LHS)
    return nullptr;

return ParseBinOpRHS(0, std::move(LHS));
}

ParseExpression 的实现逻辑非常简单，使用 ParseUnary 解析左操作数。然后继续解析结果，再解析右操作数。

由于表达式的右部可能也包含一元表达还是，因此，我们还需要修改 ParseBinOpRHS，支持解析一元运算符，如下所示。

/// binoprhs
///   ::= ('+' unary)*
static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
                                              std::unique_ptr<ExprAST> LHS) {
  ...
    // Parse the unary expression after the binary operator.
    auto RHS = ParseUnary();
    if (!RHS)
      return nullptr;
  ...
}

上述的修改最终都会调用 ParseUnary 解析函数，其具体实现如下所示。

/// unary
///   ::= primary
///   ::= '!' unary
static std::unique_ptr<ExprAST> ParseUnary() {
  // If the current token is not an operator, it must be a primary expr.
  if (!isascii(CurTok) || CurTok == '(' || CurTok == ',')
    return ParsePrimary();

// If this is a unary operator, read it.
  int Opc = CurTok;
  getNextToken();
  if (auto Operand = ParseUnary())
    return std::make_unique<UnaryExprAST>(Opc, std::move(Operand));
  return nullptr;
}

ParseUnary 解析函数的实现非常简单。当我们在解析主表达式时遇到一元运算符时，我们会将运算符符作为前缀，并将剩余部分作为另一个一元运算符进行解析。这样使得我们能够处理多个一元运算符串联的场景，比如： !!x。注意，一元运算符不能像二元运算符那样具有二义性的解析，因此不需要设置优先级。

运算符函数调用

无论是一元运算符还是二元运算符，最终的实现逻辑均由函数体实现。因此，我们需要通过扩展 FunctionAST 的解析函数 codegen()，从而生成对应的 LLVM IR，具体如下所示。

Function *FunctionAST::codegen() {
  // Transfer ownership of the prototype to the FunctionProtos map, but keep a
  // reference to it for use below.
  auto &P = *Proto;
  FunctionProtos[Proto->getName()] = std::move(Proto);
  Function *TheFunction = getFunction(P.getName());
  if (!TheFunction)
    return nullptr;

// If this is an operator, install it.
  if (P.isBinaryOp())
    BinopPrecedence[P.getOperatorName()] = P.getBinaryPrecedence();

// Create a new basic block to start insertion into.
  BasicBlock *BB = BasicBlock::Create(TheContext, "entry", TheFunction);

...

if (P.isBinaryOp())
    BinopPrecedence.erase(P.getOperatorName());
  return nullptr;

其主要是在为 FunctionAST 生成代码时，判断是否是自定义运算符的函数定义，如果是，则设置运算符优先级，并在代码生成后移除对应的运算符优先级。通过这种方式，我们可以同时实现一元运算符和二元运算符的底层代码生成逻辑。

通过以上简单的扩展，我们开发出了一门真正的语言。基于 Kaleidoscope，我们可以做很多有趣的事情，包括：I/O、数学运算等。比如，我们添加一个串联操作符，如下所示。注意：printd 用于打印指定值或换行符。

ready> extern printd(x);
Read extern:
declare double @printd(double)

ready> def binary : 1 (x y) 0;
Read function definition:
define double @"binary:"(double %x, double %y) {
entry:
  ret double 0.000000e+00
}
ready> printd(123) : printd(456) : printd(789);
123.000000
456.000000
789.000000
Evaluated to 0.000000

我们还可以定义一系列的原始操作符，如下所示。

# Logical unary not.
def unary!(v)
  if v then
    0
  else
    1;

# Unary negate.
def unary-(v)
  0-v;

# Define > with the same precedence as <.
def binary> 10 (LHS RHS)
  RHS < LHS;

# Binary logical or, which does not short circuit.
def binary| 5 (LHS RHS)
  if LHS then
    1
  else if RHS then
    1
  else
    0;

# Binary logical and, which does not short circuit.
def binary& 6 (LHS RHS)
  if !LHS then
    0
  else
    !!RHS;

# Define = with slightly lower precedence than relationals.
def binary = 9 (LHS RHS)
  !(LHS < RHS | LHS > RHS);

# Define ':' for sequencing: as a low-precedence operator that ignores operands
# and just returns the RHS.
def binary : 1 (x y) y;

至此，Kaleidoscope 开始成为一种真实而强大的语言。

通过本章，我们增强了 Kaleidoscope 语言，在库中添加了扩展语言的能力。此时，Kaleidoscope 可以构建各种功能性的应用程序，并且可以调用具有副作用的函数，但它实际上不能定义和改变变量本身。

然而，变量更新是大多数编程语言的一个重要特性。下一章，我们将描述如何在不在前端构建 SSA 的情况下支持变量更新。

Kaleidoscope: Extending the Language: User-defined Operators

基于 LLVM 自制编译器(6)——语言扩展：自定义运算符

基于 LLVM 自制编译器(6)——语言扩展：自定义运算符

词法分析器扩展

运算符的定义解析

运算符的表达式解析

二元表达式代码生成

一元表达式代码生成

通用表达式解析优化

运算符函数调用

Recommend

Twitter Will Now Let You Schedule Up to 10 Spaces Broadcasts at a Time

Linkerd 与 Ingress-Nginx 结合使用以及对服务的访问限制

All Apple iPhone 14 models come with 6GB RAM

Chevrolet Bolt EUV review: Bigger is better

Java 序列化10倍性能优化对比测试

The Legendary Frank Drake Shaped the Search for Alien Life

Is the Next Generation of Your Family Business Entrepreneurial Enough?

The End of Kiwi Farms, the Web’s Most Notorious Stalker Site

达梦数据库（DM8）操作、问题记录及解决

PASS Data Platform Summit 2022 SQL Karaoke Party

About Joyk