Analysis of TVM's export_library Function

Posted on 2021-10-08, updated on 2021-10-15, filed under TVM, Runtime

This article walks through how the export_library function works, covering both the compilation of DSO modules and the serialization of imported modules.

def export_library(self, file_name, fcompile=None, addons=None, **kwargs):
    return self.module.export_library(file_name, fcompile, addons, **kwargs)

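Calling relay.build ultimately produces a GraphExecutorFactoryModule object, defined in python/tvm/relay/backend/executor_factory.py. When you export a shared library with export_library, the call is forwarded to the export_library function of the module held by that GraphExecutorFactoryModule. The module itself is created by the tvm.graph_executor_factory.create function, which is implemented on the C++ side.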

fcreate = get_global_func("tvm.graph_executor_factory.create")
args = []
for k, v in params.items():
    args.append(k)
    args.append(ndarray.array(v))
self.module = fcreate(graph_json_str, libmod, libmod_name, *args)

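Next, in src/runtime/graph_executor/graph_executor_factory.cc, the TVM_REGISTER_GLOBAL macro exposes the function to the Python side; its body is a lambda expression. The lambda first checks the number of arguments, then regroups them by storing each parameter under its name in the params object, and creates a GraphExecutorFactory object (module_name defaults to default). It then imports the runtime module that was passed in, and finally wraps the GraphExecutorFactory object in a Module and returns it.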

TVM_REGISTER_GLOBAL("tvm.graph_executor_factory.create")
    .set_body([](TVMArgs args, TVMRetValue* rv) {
        ICHECK_GE(args.num_args, 3) << "The expected number of arguments for "
            "graph_executor_factory.create needs at least 3, "
            "but it has "
            << args.num_args;
        // The argument order is graph_json, module, module_name, param0_name, param0_tensor,
        // [param1_name, param1_tensor], ...
        ICHECK_EQ((args.size() - 3) % 2, 0);
        std::unordered_map<std::string, tvm::runtime::NDArray> params;
        for (size_t i = 3; i < static_cast<size_t>(args.size()); i += 2) {
            std::string name = args[i].operator String();
            params[name] = args[i + 1].operator tvm::runtime::NDArray();
        }
        // graph_json, params, module_name
        auto exec = make_object<GraphExecutorFactory>(args[0], params, args[2]);
        // module
        exec->Import(args[1]);
        *rv = Module(exec);
    });
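The argument layout shared by the Python and C++ sides can be sketched in plain Python. This is only an illustration of the flattening and the two checks: pack_factory_args and unpack_factory_args are hypothetical names, and plain lists stand in for NDArray tensors.

```python
def pack_factory_args(graph_json, libmod, libmod_name, params):
    """Flatten params into the positional layout the C++ lambda expects."""
    args = [graph_json, libmod, libmod_name]
    for name, tensor in params.items():
        args.append(name)
        args.append(tensor)
    return args


def unpack_factory_args(args):
    """Mirror the C++ checks: at least 3 args, then (name, tensor) pairs."""
    if len(args) < 3:
        raise ValueError(f"expected at least 3 arguments, got {len(args)}")
    if (len(args) - 3) % 2 != 0:
        raise ValueError("parameters must come in (name, tensor) pairs")
    params = {args[i]: args[i + 1] for i in range(3, len(args), 2)}
    return args[0], args[1], args[2], params


# Hypothetical inputs: strings and lists stand in for real modules and tensors.
packed = pack_factory_args("{}", "libmod", "default", {"w": [1.0], "b": [0.0]})
graph_json, libmod, name, params = unpack_factory_args(packed)
```

The fixed three-argument prefix is what lets the C++ side validate the call with a simple count check followed by a parity check.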

2.1. Module

2.1.1. export_library

The Module class lives in python/tvm/runtime/module.py and contains the export_library function. Its main job is to export the module and all of its imported modules as a single shared library.

2.1.1.1. Collect DSO Module

First, collect all DSO modules (LLVM modules and C modules).

modules = self._collect_dso_modules()

2.1.1.2. Save File

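Once all DSO modules have been collected, each one is written to a file with the runtime module's save function. The loop below iterates over the DSO modules, picks a file suffix based on each module's type key, saves the module with that suffix, and appends the resulting path to the files list.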

for index, module in enumerate(modules):
    if fcompile is not None and hasattr(fcompile, "object_format"):
        if module.type_key == "c":
            object_format = "c"
            has_c_module = True
        else:
            object_format = fcompile.object_format
    else:
        if module.type_key == "llvm":
            object_format = "o"
        else:
            assert module.type_key == "c"
            object_format = "c"
            if "cc" in kwargs:
                if kwargs["cc"] == "nvcc":
                    object_format = "cu"
            has_c_module = True
    path_obj = os.path.join(workspace_dir, f"lib{index}.{object_format}")
    module.save(path_obj)
    files.append(path_obj)
    is_system_lib = (
        module.type_key == "llvm" and module.get_function("__tvm_is_system_module")()
    )
    llvm_target_triple = (
        module.type_key == "llvm" and module.get_function("_get_target_triple")()
    )
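The suffix selection in the loop above can be isolated into a small pure function, which makes the decision table easier to see. This is a sketch, not TVM code: pick_object_format is a hypothetical helper, and fcompile_object_format stands for the optional object_format attribute of fcompile.

```python
def pick_object_format(type_key, fcompile_object_format=None, cc=None):
    """Return the file suffix used when saving one DSO module.

    type_key is "llvm" or "c"; fcompile_object_format is None when fcompile
    is absent or has no object_format attribute; cc mirrors kwargs.get("cc").
    """
    if fcompile_object_format is not None:
        # C source modules are always saved as source, whatever fcompile wants.
        return "c" if type_key == "c" else fcompile_object_format
    if type_key == "llvm":
        return "o"
    if type_key != "c":
        raise ValueError(f"unexpected DSO module type: {type_key}")
    # A C module compiled by nvcc is saved as a CUDA source file.
    return "cu" if cc == "nvcc" else "c"


formats = [
    pick_object_format("llvm"),
    pick_object_format("c"),
    pick_object_format("c", cc="nvcc"),
]
```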

2.1.1.2.1 The save function

The save function calls the C++ function ModuleSaveToFile.

def save(self, file_name, fmt=""):
    """Save the module to file.

    This do not save the dependent device modules.
    See also export_shared

    Parameters
    ----------
    file_name : str
        The name of the file.
    fmt : str
        The format of the file.

    See Also
    --------
    runtime.Module.export_library : export the module to shared library.
    """
    _ffi_api.ModuleSaveToFile(self, file_name, fmt)

On the C++ side, this function delegates to each module's own SaveToFile function to write the module to a file.

TVM_REGISTER_GLOBAL("runtime.ModuleSaveToFile")
    .set_body_typed([](Module mod, tvm::String name, tvm::String fmt) {
        mod->SaveToFile(name, fmt);
    });

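2.1.1.3. Imported Modules

Next, check whether there are any imported modules (such as CUDA or OpenCL). The module type is not restricted here: as soon as any imported module exists, a file named devc.o or devc.c is created so that the binary blob data of the imported modules can be embedded in the shared library. ModulePackImportsToLLVM or ModulePackImportsToC is then called to perform the module serialization.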

if self.imported_modules:
    if enabled("llvm") and llvm_target_triple:
        path_obj = os.path.join(workspace_dir, f"devc.{object_format}")
        m = _ffi_api.ModulePackImportsToLLVM(self, is_system_lib, llvm_target_triple)
        m.save(path_obj)
        files.append(path_obj)
    else:
        path_cc = os.path.join(workspace_dir, "devc.c")
        with open(path_cc, "w") as f:
            f.write(_ffi_api.ModulePackImportsToC(self, is_system_lib))
        files.append(path_cc)

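Note: whether PackImportsToLLVM or PackImportsToC is used depends on whether LLVM is enabled in TVM (and, per the code above, on an LLVM target triple being available). Both serve the same purpose.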

2.1.1.4. Function Compile

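Finally, fcompile is called to build the shared library (.so). If the user does not supply fcompile, the default is _cc.create_shared, which uses the cc compiler. If the output file name ends with .tar, _tar.tar is used instead, which packs the files into a tar archive rather than compiling them.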

if not fcompile:
    if file_name.endswith(".tar"):
        fcompile = _tar.tar
    else:
        fcompile = _cc.create_shared

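If the user does specify a compiler, which is typical when deploying to a device that needs a cross-compilation toolchain, that compiler is used. Its arguments are the output file name, the list of files to compile, and the compile options.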

fcompile(file_name, files, **kwargs)

C source modules are compiled and then linked together with the DSO modules.
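The dispatch described in this section can be modeled without TVM. In the sketch below, tar_files and create_shared are hypothetical stand-ins for _tar.tar and _cc.create_shared that only record what they were asked to do, and run_fcompile is an illustrative name for the selection logic.

```python
def tar_files(output, files, **kwargs):
    """Stand-in for _tar.tar: would pack the files into a tar archive."""
    return ("tar", output, list(files), kwargs)


def create_shared(output, files, **kwargs):
    """Stand-in for _cc.create_shared: would compile and link a shared library."""
    return ("cc", output, list(files), kwargs)


def run_fcompile(file_name, files, fcompile=None, **kwargs):
    """Pick the default fcompile from the output name unless one is supplied."""
    if not fcompile:
        fcompile = tar_files if file_name.endswith(".tar") else create_shared
    return fcompile(file_name, files, **kwargs)


default_so = run_fcompile("net.so", ["lib0.o", "devc.o"])
default_tar = run_fcompile("net.tar", ["lib0.o", "devc.o"])
# A user-supplied callable (for example a cross-compiler wrapper) wins over both.
custom = run_fcompile("net.so", ["lib0.o"], fcompile=lambda out, fs, **kw: ("custom", out))
```

Because fcompile is just a callable taking the output name, the file list, and keyword options, any toolchain wrapper with that shape can be plugged in.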

2.1.2. Module Serialization

src/target/codegen.cc registers the global functions runtime.ModulePackImportsToC and runtime.ModulePackImportsToLLVM so that they can be called from the Python side.

// Export two auxiliary function to the runtime namespace.
TVM_REGISTER_GLOBAL("runtime.ModulePackImportsToC").set_body_typed(PackImportsToC);
TVM_REGISTER_GLOBAL("runtime.ModulePackImportsToLLVM").set_body_typed(PackImportsToLLVM);

2.1.2.1 SerializeModule

Both PackImportsToC and PackImportsToLLVM call SerializeModule to serialize the runtime module.

std::string PackImportsToC(const runtime::Module& mod, bool system_lib) {
    std::string bin = SerializeModule(mod);
    ...
}
runtime::Module PackImportsToLLVM(const runtime::Module& mod, bool system_lib,
                                  const std::string& target_triple) {
    std::string bin = SerializeModule(mod);
    ...
}

2.1.2.2. ModuleSerializer

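SerializeModule first creates a helper class, ModuleSerializer. The module is passed to its constructor, which performs some initialization such as assigning module indices, and then the class's own SerializeModule function is called to serialize the module.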

std::string SerializeModule(const runtime::Module& mod) {
    std::string bin;
    dmlc::MemoryStringStream ms(&bin);
    dmlc::Stream* stream = &ms;

    ModuleSerializer module_serializer(mod);
    module_serializer.SerializeModule(stream);

    return bin;
}

When the module_serializer object is created, its constructor calls Init to perform the initialization.

explicit ModuleSerializer(runtime::Module mod) : mod_(mod) { Init(); }

2.1.2.2.1. Init

Init in turn calls CreateModuleIndex and CreateImportTree, which create the module indices and the import tree.

void Init() {
    CreateModuleIndex();
    CreateImportTree();
}

2.1.2.2.1.1. CreateModuleIndex

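CreateModuleIndex uses DFS (depth-first search) to walk the import relationships and assign an index to each module. Note that the root module is always at position 0. For example: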

llvm_mod:imported_modules
  - cuda_mod
  - opencl_mod

Here the LLVM module gets index 0, the CUDA module gets index 1, and the OpenCL module gets index 2.
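The indexing step can be sketched as a preorder depth-first walk. This is a simplified illustration that reproduces the indices stated above; it is not TVM's implementation, which may visit siblings in a different order and handles DSO modules specially. The create_module_index name and the dict-based module graph are assumptions made for the example.

```python
def create_module_index(root, imports):
    """Assign indices by preorder DFS; imports maps a module to its imports."""
    index = {}

    def visit(name):
        if name in index:
            return  # a module imported from several parents is indexed once
        index[name] = len(index)
        for child in imports.get(name, []):
            visit(child)

    visit(root)
    return index


imports = {"llvm_mod": ["cuda_mod", "opencl_mod"]}
module_index = create_module_index("llvm_mod", imports)
```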

2.1.2.2.1.2. CreateImportTree

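Once the module indices exist, CreateImportTree builds the import tree, which is used to restore the import relationships when the exported library is loaded back. The tree is stored in CSR (Compressed Sparse Row) format: each row corresponds to a parent index, and the entries of that row are the indices of its children.

import_tree_row_ptr_ holds the row offsets, that is, the position in the child-index array where each row's first element starts. import_tree_child_indices_ holds the child index values.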
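For the three-module example (an LLVM root importing a CUDA and an OpenCL module), the CSR arrays can be worked out with a short sketch. create_import_tree and children_of are hypothetical helpers, and the module order is assumed to be the index order 0, 1, 2 described earlier.

```python
def create_import_tree(order, imports):
    """Build CSR arrays; order lists modules by index, imports maps parent to children."""
    index = {name: i for i, name in enumerate(order)}
    row_ptr = [0]
    child_indices = []
    for name in order:
        for child in imports.get(name, []):
            child_indices.append(index[child])
        # Each row ends where the next one begins.
        row_ptr.append(len(child_indices))
    return row_ptr, child_indices


def children_of(parent, row_ptr, child_indices):
    """Recover the children of one parent index from the CSR arrays."""
    return child_indices[row_ptr[parent]:row_ptr[parent + 1]]


order = ["llvm_mod", "cuda_mod", "opencl_mod"]
row_ptr, child_indices = create_import_tree(order, {"llvm_mod": ["cuda_mod", "opencl_mod"]})
# row_ptr == [0, 2, 2, 2] and child_indices == [1, 2]
```

Row 0 spans positions 0 to 2 of child_indices, so the root has children 1 and 2, while rows 1 and 2 are empty because the leaf modules import nothing.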

2.1.2.2.2. SerializeModule

After these two initialization steps, SerializeModule can serialize the module. Its logic assumes the following serialization format:

binary_blob_size
binary_blob_type_key
binary_blob_logic
binary_blob_type_key
binary_blob_logic
...
_import_tree
_import_tree_logic

binary_blob_size

This is the number of blobs written during serialization. If there is only one DSO module and it is the root module, no import_tree_ is produced.

// Only have one DSO module and it is in the root, then
// we will not produce import_tree_.
bool has_import_tree = true;
if (DSOExportable(mod_.operator->()) && mod_->imports().empty()) {
    has_import_tree = false;
}

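The value written to binary_blob_size depends on whether an import tree exists. Without one, the field keeps the old behaviour and holds the number of imported modules (mod_->imports().size()). With one, it holds the number of module groups plus one extra entry for the _import_tree key.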

uint64_t sz = 0;
if (has_import_tree) {
    // we will append one key for _import_tree
    // The layout is the same as before: binary_size, key, logic, key, logic...
    sz = mod_group_vec_.size() + 1;
} else {
    // Keep the old behaviour
    sz = mod_->imports().size();
}
stream->Write(sz);

binary_blob_type_key

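This is the blob type key of the module. For LLVM or C modules the blob type key is _lib. For other modules it is the module's own type key, for example cuda for a CUDA module and opencl for an OpenCL module. The type key is obtained through module->type_key().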

for (const auto& group : mod_group_vec_) {
    ICHECK_NE(group.size(), 0) << "Every allocated group must have at least one module";
    if (!DSOExportable(group[0])) {
        ICHECK_EQ(group.size(), 1U) << "Non DSO module is never merged";
        std::string mod_type_key = group[0]->type_key();
        stream->Write(mod_type_key);
        group[0]->SaveToBinary(stream);
    } else {
        // DSOExportable: do not need binary
        if (has_import_tree) {
            std::string mod_type_key = "_lib";
            stream->Write(mod_type_key);
        }
    }
}

binary_blob_logic

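This is the logic that handles the blob. For most blobs (such as CUDA and OpenCL), the module's SaveToBinary function is called to serialize it to binary. For LLVM or C modules, only the _lib key is written, which marks the module as a DSO module.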

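_import_tree

The _import_tree key is written in every case except when the module is a single DSO module that is also the root. When the exported library is loaded back, it is used to rebuild the import relationships between modules.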

// Write _import_tree key if we have
if (has_import_tree) {
    std::string import_key = "_import_tree";
    stream->Write(import_key);
    stream->Write(import_tree_row_ptr_);
    stream->Write(import_tree_child_indices_);
}

_import_tree_logic

The contents of the import_tree_row_ptr_ and import_tree_child_indices_ arrays are written to the stream.
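The layout as a whole can be mocked up in Python for the case where an import tree exists. The byte encoding here is an assumption made for illustration (little-endian 64-bit length prefixes for strings and arrays); the real encoding is whatever dmlc::Stream and each module's SaveToBinary produce. All function names are hypothetical, and the payloads are opaque placeholder bytes.

```python
import struct


def write_u64(buf, value):
    buf.extend(struct.pack("<Q", value))


def write_str(buf, text):
    data = text.encode()
    write_u64(buf, len(data))  # assumed length prefix
    buf.extend(data)


def write_u64_list(buf, values):
    write_u64(buf, len(values))  # assumed length prefix
    for value in values:
        write_u64(buf, value)


def serialize_with_import_tree(groups, row_ptr, child_indices):
    """groups: (type_key, payload) pairs; payload None marks a DSO group."""
    buf = bytearray()
    write_u64(buf, len(groups) + 1)  # binary_blob_size: groups + _import_tree
    for type_key, payload in groups:
        if payload is None:
            write_str(buf, "_lib")  # DSO group: marker key only, no binary
        else:
            write_str(buf, type_key)  # binary_blob_type_key
            buf.extend(payload)  # binary_blob_logic (stand-in for SaveToBinary)
    write_str(buf, "_import_tree")
    write_u64_list(buf, row_ptr)
    write_u64_list(buf, child_indices)
    return bytes(buf)


groups = [("llvm", None), ("cuda", b"PTX"), ("opencl", b"CL")]
blob = serialize_with_import_tree(groups, [0, 2, 2, 2], [1, 2])
```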

2.1.2.3. Pack

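After serialization, the stream is packed into a symbol (runtime::symbol::tvm_dev_mblob) so that the module contents can be recovered from the shared library when needed. The symbol written into the library is __tvm_dev_mblob, an array sized from the serialized stream: const unsigned char __tvm_dev_mblob[bin.length() + sizeof(nbytes)]{}.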

uint64_t nbytes = bin.length();
os << "const unsigned char " << runtime::symbol::tvm_dev_mblob << "["
   << bin.length() + sizeof(nbytes) << "] = {\n  ";

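The first eight bytes store the size in bytes of the serialized stream, least significant byte first, written as hexadecimal literals.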

for (size_t i = 0; i < sizeof(nbytes); ++i) {
    // separators
    if (i != 0) {
        os << ",";
    }
    os << "0x" << ((nbytes >> (i * 8)) & 0xffUL);
}

The serialized data follows, 20 bytes per line (the line width is controlled by nunit).

for (size_t i = 0; i < bin.length(); ++i) {
    // separators
    if ((i + sizeof(nbytes)) % nunit == 0) {
        os << ",\n  ";
    } else {
        os << ",";
    }
    int c = bin[i];
    os << "0x" << (c & 0xff);
}
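The packing step can be reproduced in a few lines of Python to see what the generated C source looks like and how a loader could read the size header back. This is a sketch under assumptions: pack_blob_to_c and unpack_blob are hypothetical names, and the 20-bytes-per-line width follows the description above.

```python
def pack_blob_to_c(blob, symbol="__tvm_dev_mblob", per_line=20):
    """Emit a C array: 8-byte little-endian length header, then the blob."""
    data = len(blob).to_bytes(8, "little") + blob
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append(",".join(f"0x{byte:x}" for byte in chunk))
    body = ",\n  ".join(lines)
    return f"const unsigned char {symbol}[{len(data)}] = {{\n  {body}\n}};\n"


def unpack_blob(raw):
    """Inverse view: read the length header, then slice out the payload."""
    nbytes = int.from_bytes(raw[:8], "little")
    return raw[8:8 + nbytes]


payload = b"\x01\x02\xff"
c_source = pack_blob_to_c(payload)
recovered = unpack_blob(len(payload).to_bytes(8, "little") + payload)
```

Storing the length in front of the data lets the loader find the end of the blob from the symbol address alone, without any separate size symbol.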
