15x Faster TypedArrays: Vector Addition in WebAssembly @ 154GB/s

by @bwasti


Vector addition is the process of adding vector components:

$\vec{A} + \vec{B} = (a_0 + b_0,\ a_1 + b_1,\ \ldots)$

Over my holiday break, I tried to optimize this operation in JavaScript. I got to around ~12 million additions per second of 1024-dimensional vectors. If you use Chrome or Firefox, you can try out the benchmark in your browser.

Here's a chart of the results across a spread of the techniques described below:

(All run on an M1 MacBook Air with node.js 17, full code here)

Starting with pure JavaScript

let a = new Array(N).fill(0);
let b = new Array(N).fill(0);
let c = new Array(N).fill(0);

function add() {
  for (let i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}

You'd expect this to be slow.

Let's try a few sizes: 4, 64, and 1024.
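
(For reference, the numbers below come from a timing loop roughly like this. It's a minimal sketch, not the exact harness; it assumes throughput is counted as the 3 * N * 4 bytes touched per call, which matches the reported figures.)

// Minimal benchmark sketch. `bench` and the iteration counts are
// illustrative, not the exact harness used for the numbers below.
function bench(name, fn, N, iters = 1000000) {
  for (let i = 0; i < 1000; ++i) fn(); // warm up the JIT
  const t0 = performance.now();
  for (let i = 0; i < iters; ++i) fn();
  const sec = (performance.now() - t0) / 1000;
  const bytes = 3 * N * 4 * iters; // 2 inputs + 1 output, 4 bytes per element
  const gbps = bytes / sec / 1e9;
  console.log(`  ${name}: ${gbps.toFixed(2)} GB/s (${(iters / sec).toFixed(2)} iters/sec)`);
}

bench("pure javascript", add, N);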

benchmarking vec add of size 4
  pure javascript:     6.43 GB/s (133952036.48 iters/sec)
benchmarking vec add of size 64
  pure javascript:     13.71 GB/s (17850781.69 iters/sec)
benchmarking vec add of size 1024
  pure javascript:     14.96 GB/s (1217559.02 iters/sec)

Ok cool, that's our baseline. ~15GB/s.

+ TypedArrays

Instead of boring old Array, let's use Float32Array. A typed version should let V8 really do its thing. On top of that, Array stores doubles, and floats are half the size. All we have to do is swap out the allocations.

let a = new Float32Array(N);
let b = new Float32Array(N);
let c = new Float32Array(N);

And we're off to the races:

benchmarking vec add of size 4
  pure javascript:     6.54 GB/s (136224559.41 iters/sec)
  typed arrays:        5.73 GB/s (119443844.4 iters/sec)
benchmarking vec add of size 64
  pure javascript:     13.86 GB/s (18053198.87 iters/sec)
  typed arrays:        10.81 GB/s (14080397.49 iters/sec)
benchmarking vec add of size 1024
  pure javascript:     14.96 GB/s (1217559.02 iters/sec)
  typed arrays:        10.88 GB/s (885410.91 iters/sec)

Huh. The TypedArray version seems to cap out around 11 GB/s?

Even with larger sizes...

benchmarking vec add of size 4096
  pure javascript:     15.08 GB/s (306770.86 iters/sec)
  typed arrays:        11.03 GB/s (224453.88 iters/sec)

It seems the JIT does a really good job with JavaScript arrays. Disabling warmup in the benchmark helps to demonstrate this:

benchmarking vec add of size 4 [no warmup]
  pure javascript:     0.04 GB/s (782572.09 iters/sec)
  typed arrays:        0.13 GB/s (2701225.26 iters/sec)
benchmarking vec add of size 64 [no warmup]
  pure javascript:     0.12 GB/s (154383.82 iters/sec)
  typed arrays:        0.14 GB/s (180306.89 iters/sec)
benchmarking vec add of size 1024 [no warmup]
  pure javascript:     9.15 GB/s (744839.13 iters/sec)
  typed arrays:        6.66 GB/s (542320.53 iters/sec)

There's a great blog post by the V8 team on how this is done: https://v8.dev/blog/elements-kinds. The aggregate element type of an Array is tracked, which means the type hint provided by TypedArrays doesn't give them as much of an intrinsic advantage. Floats are smaller, though, so I was still a bit perplexed as to why this is the case. (Solved! Matheus28 on HackerNews pointed out that the addition is being done in double precision, requiring a conversion at each step. Using Float64Array speeds things up to match pure JavaScript.)
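
For reference, the fix is just a one-line swap of the element type (sketch):

// Float64Array matches the double-precision arithmetic JavaScript uses
// anyway, so no float<->double conversion is needed per element.
let a = new Float64Array(N);
let b = new Float64Array(N);
let c = new Float64Array(N);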

Firefox uses a different JavaScript engine than node.js and Chromium, and it shows the expected results (TypedArrays are faster). Thanks to viraptor on lobste.rs for pointing this out!

+ Emscripten

Emscripten is a framework that helps compile C and C++ projects directly to WebAssembly. It does a lot of heavy lifting, including shimming system calls and the C standard library.

WebAssembly is designed as a simple instruction set that can be executed at near native performance. Here's the spec: https://webassembly.github.io/spec/core/

+ Memory passing

On top of its simplicity, WebAssembly is also designed to be very safe. This makes it a little bit awkward to interact with when dealing with large amounts of memory.

For example, every WebAssembly instance has its own linear memory. Out of the box, the only function you're given is Memory.grow. This is a lot like mmap and less like malloc, which is a bit of a pain.

const instance = new WebAssembly.Instance(...);
let mem = instance.exports.memory; // this needs to be manually exported
mem.grow(1); // grow by a single page (64KiB)
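
To actually read and write that memory from JavaScript, you create TypedArray views over its buffer. One gotcha worth knowing (shown as a sketch): growing a non-shared memory detaches the old ArrayBuffer, so views have to be rebuilt.

// View the instance's linear memory as 32-bit floats.
let view = new Float32Array(mem.buffer);
view[0] = 1.5;

// grow() detaches the old ArrayBuffer, so rebuild any views afterwards.
mem.grow(1);
view = new Float32Array(mem.buffer); // the old `view` is now unusable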

Emscripten comes to the rescue and actually implements malloc for us, but the WebAssembly instance still owns the memory, so we'll need to do all of our vector addition there.

Commonly, you'll see functions like this written when folks use Emscripten:

function emscripten_array(len) {
  var ptr = Module._malloc(len * 4);
  return [new Float32Array(Module.HEAPF32.buffer, ptr, len), ptr];
}

The above code uses Emscripten's malloc and has JavaScript reinterpret the returned pointer as an offset into Emscripten's heap. This gives us two variables that refer to the same memory: the former can be manipulated as a TypedArray, and the latter can be passed to Emscripten calls.

let [a, a_] = emscripten_array(N);
let [b, b_] = emscripten_array(N);
let [c, c_] = emscripten_array(N);
const add = Module._add;
// initialize a, b
for (let i = 0; i < N; ++i) {
  a[i] = Math.random();
  b[i] = Math.random();
}
add(a_, b_, c_, N); // the length is the fourth argument
// results are now in c

Now we're all set to run our freshly written Emscripten module. Be sure to compile with -O3 for performance.

// emcc add.cc -O3 -s EXPORTED_FUNCTIONS="['_add']"
extern "C" {

void add(const float* a, const float* b, float* c, int len) {
  for (auto i = 0; i < len; ++i) {
    c[i] = a[i] + b[i];
  }
}

}
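
For completeness, here's roughly how I'd load and call this from node (a sketch assuming a MODULARIZE-style build; exact flags differ across Emscripten versions, and newer ones may also want HEAPF32 listed in EXPORTED_RUNTIME_METHODS):

// emcc add.cc -O3 -s MODULARIZE -s EXPORTED_FUNCTIONS="['_add','_malloc']" -o add.js
const createModule = require("./add.js");

createModule().then((Module) => {
  const N = 1024;
  const ptr = Module._malloc(N * 4); // bytes, as in emscripten_array above
  const a = new Float32Array(Module.HEAPF32.buffer, ptr, N);
  // ...allocate b and c the same way, fill them, then:
  // Module._add(a_ptr, b_ptr, c_ptr, N);
});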

Unfortunately we're still quite slow for smaller sizes, but we see a noticeable speedup at size 1024.

benchmarking vec add of size 4
  pure javascript:     6.83 GB/s (142214376.78 iters/sec)
  typed arrays:        5.71 GB/s (119016249.72 iters/sec)
  emscripten:          1.33 GB/s (27608487.76 iters/sec)
benchmarking vec add of size 64
  pure javascript:     13.46 GB/s (17524173.64 iters/sec)
  typed arrays:        10.64 GB/s (13853318.79 iters/sec)
  emscripten:          11.91 GB/s (15505845.36 iters/sec)
benchmarking vec add of size 1024
  pure javascript:     15.06 GB/s (1225731.4 iters/sec)
  typed arrays:        10.92 GB/s (888576.43 iters/sec)
  emscripten:          22.81 GB/s (1856272.37 iters/sec)

+ SIMD

Can we do any better? Surely -O3 isn't the only flag available to us.

Modern processors can apply a single instruction to multiple elements at once, and WebAssembly has limited support for this. We can tell Emscripten to use that feature by compiling with -msimd128. This gives us a nice win:

benchmarking vec add of size 4
  pure javascript:     6.62 GB/s (137877529.08 iters/sec)
  typed arrays:        5.76 GB/s (120071457.96 iters/sec)
  emscripten (simd):   1.33 GB/s (27729893.7 iters/sec)
benchmarking vec add of size 64
  pure javascript:     13.3 GB/s (17319586.98 iters/sec)
  typed arrays:        10.59 GB/s (13795225.43 iters/sec)
  emscripten (simd):   18.31 GB/s (23845788.63 iters/sec)
benchmarking vec add of size 1024
  pure javascript:     15.07 GB/s (1226252.88 iters/sec)
  typed arrays:        10.96 GB/s (892301.4 iters/sec)
  emscripten (simd):   75.14 GB/s (6114821.99 iters/sec)

75 GB/s is really good! But there are still some inefficiencies here. For example, we're running a function that can handle any vector size. Would there be more optimizations available if we limited the function to the vector size we care about at the time of benchmarking?

+ wasmblr

Working in C++ with Emscripten is great, and for 99% of use cases I wouldn't change that workflow. However, in this rather contrived case, I want to go a bit deeper and do some dynamic code generation. Unfortunately, Emscripten can often take seconds (if not minutes, for larger projects) to compile.

So I wrote a header file that lets me emit pure WebAssembly directly from C++: wasmblr (https://github.com/bwasti/wasmblr). It's fast because it doesn't do much beyond exposing the specification with C++ semantics. In this writeup, most of the calls to it run in about 1.5 microseconds. It takes 125 microseconds to instantiate a WebAssembly module, so we're bounded to around 8,000 recompilations per second. That's a lot faster than Emscripten for this use case.
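
On the JavaScript side, instantiating the emitted bytes is just the synchronous WebAssembly API (a sketch; `bytes` is whatever cg.emit() returns, copied into a Uint8Array):

// Compile + instantiate the wasmblr output. Both calls are synchronous,
// which is what makes rapid recompilation practical.
const module = new WebAssembly.Module(bytes);
const instance = new WebAssembly.Instance(module);
const { add, mem } = instance.exports; // names we'll export in the generators below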

+ Unroll

Since we know the size of the loop, we can unroll it. This is a common technique used to avoid executing instructions dealing with loops and pointer arithmetic.

In wasmblr, all of the cg.* calls emit WebAssembly instructions. This gives us a meta-programming-like ability: we can use C++ loops to emit a load, add, and store for every element of the vectors.

std::vector<uint8_t> gen_add_unroll(int len) {
  assert(len % 4 == 0);
  wasmblr::CodeGenerator cg;
  
  auto pages = (len * 3 * 4) / (1 << 16) + 1;
  cg.memory(pages).export_("mem");
  
  auto add_func = cg.function({cg.i32, cg.i32, cg.i32}, {}, [&]() {
    // no loop emitted at all,
    // C++ is generating len / 4 groups of instructions
    for (auto i = 0; i < len / 4; ++i) {
      cg.local.get(2);

      cg.local.get(0);
      cg.v128.load(0, i * 16);

      cg.local.get(1);
      cg.v128.load(0, i * 16);

      cg.v128.f32x4_add();

      cg.v128.store(0, i * 16);
    }
  });
  cg.export_(add_func, "add");
  return cg.emit();
}

If you're not familiar with WebAssembly, the above code might need some explaining.

First, that pages variable. Since wasmblr is dirt simple, there is no malloc implementation we can invoke from JavaScript, so I just hacked in a pre-allocation of the full size of the inputs and outputs (2 inputs + 1 output, 4-byte floats, all of length len, with a page size of 1<<16 bytes).

Second, what is that cg.local.get(2) doing at the top? cg.local.get(X) refers to the function's arguments, so that's a reference to the third one (our output pointer). Surprisingly, it needs to be pushed at the top because the cg.v128.store way down at the bottom consumes it as an argument from the stack. WebAssembly is a stack-based instruction set rather than a register-based one, so it can get kind of funky to read.

Third, load and store take immediate arguments. This is a part of the WebAssembly specification that we're exploiting: you can hardcode an offset into a load/store, and it's added to the pointer on the stack. That means instead of hardcoding 0 and incrementing pointers into memory, we can leave the pointers alone and bake the offsets into the code itself. This wouldn't be possible in Emscripten, because its generated code has to handle arbitrary lengths.
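
Putting it together from JavaScript: since there's no malloc, the caller just picks byte offsets into the exported memory (the 0 / N*4 / N*8 layout here is my own choice, not anything wasmblr dictates):

// `bytes` holds the output of gen_add_unroll(N) as a Uint8Array.
const { add, mem } = new WebAssembly.Instance(
  new WebAssembly.Module(bytes)
).exports;

// a at byte 0, b at byte N*4, c at byte N*8 (my own layout choice)
const a = new Float32Array(mem.buffer, 0, N);
const b = new Float32Array(mem.buffer, N * 4, N);
const c = new Float32Array(mem.buffer, N * 8, N);

// fill a and b, then call with byte offsets as the "pointers"
add(0, N * 4, N * 8); // results land in c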

The results are pretty good on the sizes we've chosen, about 2x Emscripten. We're hitting up to 154 GB/s in this latest benchmark!

benchmarking vec add of size 4
  pure javascript:         6.47 GB/s (134745835.5 iters/sec)
  typed arrays:            5.94 GB/s (123764397.54 iters/sec)
  emscripten (simd):       1.34 GB/s (27821574.95 iters/sec)
  wasmblr:                 2.73 GB/s (56838383.47 iters/sec)
benchmarking vec add of size 64
  pure javascript:         13.82 GB/s (17996280.78 iters/sec)
  typed arrays:            10.79 GB/s (14052220.48 iters/sec)
  emscripten (simd):       18.26 GB/s (23773963.85 iters/sec)
  wasmblr:                 37.41 GB/s (48715916.95 iters/sec)
benchmarking vec add of size 1024
  pure javascript:         15.04 GB/s (1223936.07 iters/sec)
  typed arrays:            10.91 GB/s (887874.77 iters/sec)
  emscripten (simd):       75.07 GB/s (6109401.71 iters/sec)
  wasmblr:                 154.89 GB/s (12605111.46 iters/sec)

But, what about larger sizes?

benchmarking vec add of size 16384
  pure javascript:         15.02 GB/s (76393.04 iters/sec)
  typed arrays:            10.89 GB/s (55389.2 iters/sec)
  emscripten (simd):       90.96 GB/s (462658.42 iters/sec)
  wasmblr:                 88.09 GB/s (448069.81 iters/sec)

Damn. We must have fallen out of instruction cache. Unrolling has its limits, and we've gone a bit too far: at size 16384 that's 4096 unrolled groups of load/add/store instructions, a program of ~100KB for a simple for loop.

Guess we'll have to take the hard route and also keep track of pointers:

std::vector<uint8_t> gen_add_mix(int len, int unroll) {
  assert(len % (unroll * 4) == 0);
  wasmblr::CodeGenerator cg;
  auto pages = (len * 3 * 4) / (1 << 16) + 1;
  cg.memory(pages).export_("mem");
  auto add_func = cg.function({cg.i32, cg.i32, cg.i32}, {}, [&]() {
    // keep track of the iteration with this variable
    auto iter = cg.local(cg.i32);
    cg.i32.const_(0);
    cg.local.set(iter);

    cg.loop(cg.void_);

    // same as above, but now locals are mutated
    for (auto i = 0; i < unroll; ++i) {
      cg.local.get(2);

      cg.local.get(0);
      cg.v128.load(0, i * 16);

      cg.local.get(1);
      cg.v128.load(0, i * 16);

      cg.v128.f32x4_add();

      cg.v128.store(0, i * 16);
    }

    // pointer/iterator math below
    cg.local.get(0);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(0);

    cg.local.get(1);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(1);

    cg.local.get(2);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(2);

    cg.local.get(iter);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(iter);

    // condition to continue the loop: br_if on a `loop` label branches
    // back to the top, so keep going while iter < len * 4 bytes
    cg.i32.const_(len * 4);  // bytes
    cg.local.get(iter);
    cg.i32.gt_s();
    cg.br_if(0);

    cg.end(); // need to end the loop! this is important to remember
  });
  cg.export_(add_func, "add");
  return cg.emit();
}

Now we can unroll by a more reasonable factor of 16.

benchmarking vec add of size 4
  pure javascript:         6.32 GB/s (131599401.31 iters/sec)
  typed arrays:            5.88 GB/s (122460888.5 iters/sec)
  emscripten (simd):       1.33 GB/s (27803398.21 iters/sec)
  wasmblr:                 2.73 GB/s (56812962.22 iters/sec)
benchmarking vec add of size 64
  pure javascript:         13.82 GB/s (18000013.97 iters/sec)
  typed arrays:            10.8 GB/s (14057150.76 iters/sec)
  emscripten (simd):       18.24 GB/s (23755889.12 iters/sec)
  wasmblr:                 36.98 GB/s (48154136.31 iters/sec)
benchmarking vec add of size 1024
  pure javascript:         15.04 GB/s (1224069.46 iters/sec)
  typed arrays:            10.85 GB/s (882788.85 iters/sec)
  emscripten (simd):       73.71 GB/s (5998175.91 iters/sec)
  wasmblr:                 138.9 GB/s (11303927.39 iters/sec)
benchmarking vec add of size 16384
  pure javascript:         15.27 GB/s (77662.85 iters/sec)
  typed arrays:            11.09 GB/s (56398.08 iters/sec)
  emscripten (simd):       90.31 GB/s (459346.92 iters/sec)
  wasmblr:                 86.71 GB/s (441049.92 iters/sec)

Well, that was a lot of work for nothing, and our performance at size 1024 seems to have gone down! Playing around with the unroll factor, though, we can probably claw back some performance...

+ Tuning

Why manually play with unroll factors when we can recompile and reload wasmblr code in 1/8000th of a second? We may as well just try a bunch of unroll factors and keep the best one.

// gen_wasmblr(N, unroll) is assumed to compile a fresh module and
// return its add function (plus views into its memory)
let best = 0;
let best_time = 1e9;
for (let i = 0; Math.pow(2, i) < Math.min(1024, N / 4 + 2); ++i) {
  let [fn, w_a, w_b, w_c] = gen_wasmblr(N, Math.pow(2, i));
  for (let _ = 0; _ < 100; ++_) {
    fn();
  }
  const t = performance.now();
  for (let _ = 0; _ < 1000; ++_) {
    fn();
  }
  const diff = performance.now() - t;
  if (diff < best_time) {
    best = i;
    best_time = diff;
  }
}

Alright, let's see what happens! And let's add another, bigger benchmark size for good measure. As the vectors get larger, we become bound by RAM bandwidth rather than instruction execution, so I'd expect things to even out.

benchmarking vec add of size 4
  pure javascript:         6.58 GB/s (137027121.14 iters/sec)
  typed arrays:            5.94 GB/s (123815568.64 iters/sec)
  emscripten (simd):       1.33 GB/s (27750315.65 iters/sec)
  wasmblr:                 2.73 GB/s (56932301.96 iters/sec)
  wasmblr (tuned 2):       2.73 GB/s (56956492.7 iters/sec)
benchmarking vec add of size 64
  pure javascript:         13.81 GB/s (17975517.63 iters/sec)
  typed arrays:            10.78 GB/s (14037142.11 iters/sec)
  emscripten (simd):       18.21 GB/s (23713778.4 iters/sec)
  wasmblr:                 36.9 GB/s (48052212.55 iters/sec)
  wasmblr (tuned 4):       36.31 GB/s (47284842.5 iters/sec)
benchmarking vec add of size 1024
  pure javascript:         14.79 GB/s (1203713.72 iters/sec)
  typed arrays:            10.8 GB/s (878615.27 iters/sec)
  emscripten (simd):       73.35 GB/s (5969393.52 iters/sec)
  wasmblr:                 144.1 GB/s (11727035.06 iters/sec)
  wasmblr (tuned 16):      144.05 GB/s (11722845.16 iters/sec)
benchmarking vec add of size 16384
  pure javascript:         14.92 GB/s (75866.72 iters/sec)
  typed arrays:            11.03 GB/s (56121.47 iters/sec)
  emscripten (simd):       89.49 GB/s (455146.48 iters/sec)
  wasmblr:                 85.06 GB/s (432642.22 iters/sec)
  wasmblr (tuned 4):       153.31 GB/s (779787.52 iters/sec)
benchmarking vec add of size 262144
  pure javascript:         14.92 GB/s (4744.22 iters/sec)
  typed arrays:            10.99 GB/s (3493.11 iters/sec)
  emscripten (simd):       93.77 GB/s (29809.79 iters/sec)
  wasmblr:                 87.29 GB/s (27748.19 iters/sec)
  wasmblr (tuned 2):       93.79 GB/s (29814.89 iters/sec)

Woo! We've recovered some perf and even improved on certain sizes. I find it cool that tuning let us climb back up the cliff for size 16,384 and actually stay in cache. At the larger size of 262,144 it really doesn't matter much what we do; the vectors are so big that we're mostly just waiting on memory.

In conclusion, if your vectors are small, you're probably fine with pure JavaScript. If you've got very large vectors (larger than cache), Emscripten does a great job. If you've got a very specific size and a lot of time on your hands, you can check out wasmblr (or any other WebAssembly emitter).

Thanks so much for reading!

