3

C++ 调用Rust

 1 year ago
source link: http://bean-li.github.io/Rust-FFi/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

软件开发中,没有银弹,我们有时需要多种语言互相调用支援,发挥各自的优势。本文介绍Rust的FFI 外部函数接口。

本文重点介绍C或者C++调用Rust library的方法。

Rust在实现std::slice::sort_unstable的时候,用了一种新的快速排序变种PDQSort,相对其它语言里面普遍用到的IntroSort有较大的性能提升,我们偷个懒,不自己实现C的pdqsort,调用Rust的pdqsort,作为本次练习的任务。

Rust 侧

cargo new pdqsort --lib

执行完毕后,我们得到了:

ROG-Manjaro Rust/pdqsort ‹master*› » tree
.
├── Cargo.lock
├── Cargo.toml
└── src
    └── lib.rs

2 directories, 3 files

我们修改src/lib.rs

#[no_mangle]
pub unsafe extern fn rust_u32sort(elements: *mut u32, size: u64) {
    let elements = std::slice::from_raw_parts_mut(elements, size as usize);
    elements.sort_unstable();
}

同时修改Cargo.toml :

[lib]
crate-type = ["cdylib"]

crate-type = ["cdylib"]会创建一个动态链接的库。 可查看 Cargo 文档的动态或静态库,了解更多信息.

cdylib在 RFC 1510 中引入,并改善了现有的dylib文件,减小其大小,和导出更少符号。

然后我们可以通过调用:

ROG-Manjaro Rust/pdqsort ‹master*› » cargo build --release
   Compiling pdqsort v0.1.0 (/home/manu/CODE/Rust/pdqsort)
    Finished release [optimized] target(s) in 1.49s
ROG-Manjaro Rust/pdqsort ‹master*› » tree
.
├── Cargo.lock
├── Cargo.toml
├── src
│   └── lib.rs
└── target
    ├── CACHEDIR.TAG
    └── release
        ├── build
        ├── deps
        │   ├── libpdqsort.so
        │   └── pdqsort.d
        ├── examples
        ├── incremental
        ├── libpdqsort.d
        └── libpdqsort.so

8 directories, 8 files

可以看出我们已经可以编译出了libpdqsort.so。

如果想用静态链接:


[lib]
crate-type = ["staticlib"]

会编译出来如下内容:

ROG-Manjaro Rust/pdqsort ‹master*› » tree
.
├── Cargo.lock
├── Cargo.toml
├── src
│   └── lib.rs
└── target
    ├── CACHEDIR.TAG
    └── release
        ├── build
        ├── deps
        │   ├── libpdqsort-4b11be617319d092.a
        │   └── pdqsort-4b11be617319d092.d
        ├── examples
        ├── incremental
        ├── libpdqsort.a
        └── libpdqsort.d

C++侧,我们写测试代码,即对100万个unsigned int数进行排序。

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

extern "C" int rust_u32sort(uint32_t* elements, uint64_t size);

std::vector<uint32_t> generate_test_vec(size_t n) {
    auto test_vec = std::vector<uint32_t> {};
    while (test_vec.size() < n) {
        test_vec.push_back((uint32_t) std::rand());
    }
    return test_vec;
}

int main() {
    for (int round = 0; round < 3; round++) {
        std::printf("round %d:\n", round);

        auto test_vec1 = generate_test_vec(10000000);
        auto test_vec2 = test_vec1;

        auto start_time1 = std::clock();
        std::sort(test_vec1.begin(), test_vec1.end());
        std::printf("std::sort time: %.3lf sec\n", (std::clock() - start_time1) * 1.0 / CLOCKS_PER_SEC);

        auto start_time2 = std::clock();
        rust_u32sort(test_vec2.data(), test_vec2.size());
        std::printf("rust_u32sort time: %.3lf sec\n", (std::clock() - start_time2) * 1.0 / CLOCKS_PER_SEC);

        assert(test_vec2 == test_vec1);
    }
}
 动态链接:
 c++  -std=c++20 -O3  test.cpp -L ../../Rust/pdqsort/target/release/ -lpdqsort -o sort   
 静态链接:
 c++  -std=c++20 -O3  test.cpp -L ../../Rust/pdqsort/target/release/ -lpdqsort -o sort 
动态链接:
-----------
LD_LIBRARY_PATH=./target/release ../../C++/pdqsort/sort
round 0:
std::sort time: 1.746 sec
rust_u32sort time: 0.797 sec
round 1:
std::sort time: 1.786 sec
rust_u32sort time: 0.841 sec
round 2:
std::sort time: 1.829 sec
rust_u32sort time: 0.834 sec


静态链接:
---------
ROG-Manjaro C++/pdqsort » ./sort
round 0:
std::sort time: 1.787 sec
rust_u32sort time: 0.721 sec
round 1:
std::sort time: 1.848 sec
rust_u32sort time: 0.776 sec
round 2:
std::sort time: 1.831 sec
rust_u32sort time: 0.745 sec

Summary

结果上可以看到Rust的排序比C++ std::sort快了一倍多,结果比较出乎意料。目前在读PDQSort的代码,分析性能比传统快排更优的原因是使用了一种新的快排变种BlockQuickSort,这个变种算法可以较大改善传统快排中的分支预言失败的情况,具体实现还没有读完,后续再补充上。

附上BlockQuickSort的论文,2016年出的,BlockQuicksort: Avoiding Branch Mispredictions in Quicksort.

我们将上述测试放到物理机上跑:

[david@david-latitude3510 pdqsort]$ LD_LIBRARY_PATH=./target/release perf stat ../../C++/pdqsort/sort
round 0:
std::sort time: 0.782 sec
round 1:
std::sort time: 0.785 sec
round 2:
std::sort time: 0.779 sec

 Performance counter stats for '../../C++/pdqsort/sort':

          3,029.93 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
            11,694      page-faults:u                    #    3.859 K/sec
     9,780,539,711      cycles:u                         #    3.228 GHz                         (74.95%)
     7,320,156,912      instructions:u                   #    0.75  insn per cycle              (75.02%)
     1,664,764,519      branches:u                       #  549.440 M/sec                       (75.05%)
       305,855,559      branch-misses:u                  #   18.37% of all branches             (74.97%)

       3.030752533 seconds time elapsed

       2.800187000 seconds user
       0.133345000 seconds sys


[david@david-latitude3510 pdqsort]$ LD_LIBRARY_PATH=./target/release perf stat ../../C++/pdqsort/sort
round 0:
rust_u32sort time: 0.342 sec
round 1:
rust_u32sort time: 0.341 sec
round 2:
rust_u32sort time: 0.343 sec

 Performance counter stats for '../../C++/pdqsort/sort':

          1,619.17 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
            11,702      page-faults:u                    #    7.227 K/sec
     5,300,322,850      cycles:u                         #    3.273 GHz                         (74.99%)
     9,081,461,937      instructions:u                   #    1.71  insn per cycle              (74.99%)
     1,185,526,644      branches:u                       #  732.182 M/sec                       (74.99%)
        69,121,012      branch-misses:u                  #    5.83% of all branches             (75.02%)

       1.619959094 seconds time elapsed

       1.566298000 seconds user
       0.049854000 seconds sys

我们可以看到branch-misses,对于改进后的快排,只有5.83%的分支预测失败,而传统的快排,18.37%的branch-misses。

Rust的排序比C++快了一倍

C++ pdqsort排序


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK