SIMD Array-of-Structures-of-Arrays in nalgebra and comparison with ultraviolet
source link: https://www.rustsim.org/blog/2020/03/23/simd-aosoa-in-nalgebra/
Hello everyone!
In this post I'd like to introduce the next major change that will be released in nalgebra at the end of this month (March 2020): support for SIMD AoSoA. I'll explain what I mean by SIMD AoSoA (Array-of-Structures-of-Arrays with explicit SIMD) and how it relates to SoA (Structure-of-Arrays) and AoS (Array-of-Structures). To give you an idea, SIMD AoSoA is what the recent ultraviolet crate has been using to achieve its impressive performance.
Here is a benchmark including the next (to-be-released) version of nalgebra. The best times within a 2.5% range of the minimum of each row are highlighted. Here, the ultraviolet column uses the non-wide types of ultraviolet (`Vec2`, `Mat2`, `Isometry3`, etc.) while the ultraviolet_f32x4 column uses its wide types (`Wec2`, `Wat2`, `WIsometry3`, etc.):

Please see the benchmark sections at the end of this post for a more comprehensive benchmark (including the use of `f32x8` and `f32x16` with nalgebra) and details about the benchmark conditions.
What is AoSoA with explicit SIMD?
The data layout I call here AoSoA with explicit SIMD (or SIMD AoSoA for short) is a straightforward variant of the AoSoA (Array-of-Structures-of-Arrays) layout, which is itself a combination of two more common layouts: SoA (Structure-of-Arrays) and AoS (Array-of-Structures). So let's look at the differences between all of these layouts with a simple example using 3D vectors (vectors with three components `x`, `y`, `z`): given two arrays of 1024 3D vectors, we want to compute the sum of each pair of vectors with the same index.
Note: The explanations here are merely a superficial introduction to the AoS vs SoA vs AoSoA concepts. I just want to explain some of the differences and some of their advantages/disadvantages. I won't give any detailed analysis of the assembly generated by compiling the examples provided. The benchmarks at the end of this post will show the performance difference between AoS and SIMD AoSoA. You may be interested in that article too.
Note 2: For iterating through the arrays, I'll be using explicit indices `for i in 0..1024` instead of iterators. This is on purpose, to make the number of iterations explicit and to make the code easier to understand for readers who are not used to Rust iterators.
Array-of-Structures (AoS)
The Array-of-Structures layout is the most common and intuitive layout. You define a structure as the aggregation of all its fields. And multiple structures of the same type are simply stored one after the other inside of an array. Here is what our 3D vector would look like:
```rust
struct Vector3 {
    x: f32,
    y: f32,
    z: f32,
}

/// An array containing 1024 vectors.
type SetOfVector3 = [Vector3; 1024];

fn add_arrays_of_vectors(a: &mut SetOfVector3, b: &SetOfVector3) {
    for i in 0..1024 {
        a[i].x += b[i].x;
        a[i].y += b[i].y;
        a[i].z += b[i].z;
    }
}
```
Here, we need to iterate through each pair of vectors, one from each set, and execute our sum. This is arguably the most intuitive way of doing this, but not necessarily the most efficient. All Rust linear algebra crates from this benchmark can work with this layout.
Pros of AoS
- It is easy to read/write/modify each vector individually.
- AoS is generally easier to reason about when designing algorithms.
Cons of AoS
- Not as efficient as other layouts when working on a large number of elements of the same type.
Structure-of-Arrays (SoA)
The Structure-of-Arrays layout is less intuitive to work with because it stores each field of a struct in its own array. Thus, our set of vectors would look like this:
```rust
struct SetOfVector3 {
    x: [f32; 1024],
    y: [f32; 1024],
    z: [f32; 1024],
}
```
There is no explicit `Vector3` structure here because the vectors are all packed into the arrays. Accessing the components of the i-th vector of the set means we access `set.x[i]`, `set.y[i]` and `set.z[i]`. With this structure, our vector sum becomes the following:
```rust
fn add_arrays_of_vectors(a: &mut SetOfVector3, b: &SetOfVector3) {
    for i in 0..1024 {
        a.x[i] += b.x[i];
    }

    for i in 0..1024 {
        a.y[i] += b.y[i];
    }

    for i in 0..1024 {
        a.z[i] += b.z[i];
    }
}
```
There is more code here than in our AoS example because we need to iterate over each component individually. However, this will be much more efficient than our AoS approach because it is extremely vectorization-friendly: the compiler will be able to group consecutive vector entries for each component and use SIMD instructions to perform 2, 4, 8, or even 16 additions at a time instead of a single one.
Pros of SoA
- Extremely fast.
Cons of SoA
- Algorithms need to be explicitly designed with SoA in mind to be efficient.
- Manipulating vectors individually is more difficult and can be less efficient than AoS.
- It is more difficult to think in terms of SoA than in terms of AoS.
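To make the last two cons concrete, here is a small std-only sketch of what reading or writing a single vector looks like with this layout. The `get`/`set` helpers are hypothetical (not from any crate); note how one logical vector is scattered across three separate arrays:

```rust
// The SoA set from above: one array per component.
pub struct SetOfVector3 {
    pub x: [f32; 1024],
    pub y: [f32; 1024],
    pub z: [f32; 1024],
}

impl SetOfVector3 {
    /// Gather the i-th vector from the three component arrays.
    pub fn get(&self, i: usize) -> [f32; 3] {
        [self.x[i], self.y[i], self.z[i]]
    }

    /// Scatter a vector back into the three component arrays.
    pub fn set(&mut self, i: usize, v: [f32; 3]) {
        self.x[i] = v[0];
        self.y[i] = v[1];
        self.z[i] = v[2];
    }
}
```

Each individual access touches three distant memory locations, which is exactly why per-vector manipulation can be slower than with AoS.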
Array-of-Structures-of-Arrays (AoSoA)
Now let's combine both SoA and AoS to obtain the Array-of-Structures-of-Arrays (AoSoA) layout. The idea is to first define a wide 3D vector, i.e., we pack several vectors into a single struct:
```rust
struct WideVector3 {
    x: [f32; 4],
    y: [f32; 4],
    z: [f32; 4],
}
```
The term wide 3D vector is inspired by the terminology used in the ultraviolet crate's documentation.

Here, our `WideVector3` actually represents 4 vectors laid out in an SoA fashion. If we want our set of vectors to contain more than just 4 vectors, we can define an array of `WideVector3` (so we end up with an array of structures of arrays) with only `1024 / 4` elements, because each element already represents 4 vectors:
```rust
type SetOfWideVector3 = [WideVector3; 1024 / 4];
```
Our vector sum then becomes quite similar to the AoS approach, except that we have to add each group of 4 components in the inner loop in an SoA fashion:
```rust
fn add_arrays_of_vectors(a: &mut SetOfWideVector3, b: &SetOfWideVector3) {
    for i in 0..1024 / 4 {
        // NOTE: each of those groups of four sums can be executed as
        // a single SIMD instruction.
        a[i].x[0] += b[i].x[0];
        a[i].x[1] += b[i].x[1];
        a[i].x[2] += b[i].x[2];
        a[i].x[3] += b[i].x[3];

        a[i].y[0] += b[i].y[0];
        a[i].y[1] += b[i].y[1];
        a[i].y[2] += b[i].y[2];
        a[i].y[3] += b[i].y[3];

        a[i].z[0] += b[i].z[0];
        a[i].z[1] += b[i].z[1];
        a[i].z[2] += b[i].z[2];
        a[i].z[3] += b[i].z[3];
    }
}
```
So with this data layout, we still achieve great performance because components are grouped by 4, and thus their sum can easily be auto-vectorized into a single SIMD instruction. To manipulate an individual vector we first have to access the corresponding wide vector, and then its components in the `x`, `y`, `z` arrays.
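That two-step indexing can be sketched as follows (the `get` helper is hypothetical, just to illustrate the addressing): the i-th logical vector lives in wide block `i / 4`, at lane `i % 4`.

```rust
// A wide vector packing 4 logical 3D vectors in SoA fashion.
#[derive(Clone, Copy)]
pub struct WideVector3 {
    pub x: [f32; 4],
    pub y: [f32; 4],
    pub z: [f32; 4],
}

// Hypothetical helper: read the i-th logical vector out of an AoSoA set.
pub fn get(set: &[WideVector3], i: usize) -> [f32; 3] {
    let (block, lane) = (i / 4, i % 4);
    [set[block].x[lane], set[block].y[lane], set[block].z[lane]]
}
```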
Pros of AoSoA
- Great performance compared to AoS.
- Can even beat the performance of SoA, since AoSoA can be more cache-friendly on modern architectures.
- Designing algorithms around AoSoA is somewhat easier than with plain SoA.
- Using AoSoA, most linear-algebra operations can be easily vectorized as long as they don't rely too much on branching.
Cons of AoSoA
- We still need to design our algorithms carefully to use AoSoA efficiently.
Array-of-Structures-of-Arrays with explicit SIMD (SIMD AoSoA)
Finally, let's talk about what I call here SIMD AoSoA. This layout is actually exactly the same as the AoSoA presented before, except that we use explicit SIMD primitives, e.g., the types from the packed_simd crate, instead of plain arrays. For example we can use the 4-lane SIMD float type `packed_simd::f32x4` instead of `[f32; 4]`:
```rust
struct WideVector3 {
    x: packed_simd::f32x4,
    y: packed_simd::f32x4,
    z: packed_simd::f32x4,
}

type SetOfWideVector3 = [WideVector3; 1024 / 4];
```
Then our sum will be exactly the same as in the AoSoA approach, except that we don't have to deal with each of the 4 lanes explicitly because `packed_simd::f32x4` implements the `Add` trait:
```rust
fn add_arrays_of_vectors(a: &mut SetOfWideVector3, b: &SetOfWideVector3) {
    for i in 0..1024 / 4 {
        // NOTE: each 4-lane sum will be executed as a single SIMD operation.
        a[i].x += b[i].x;
        a[i].y += b[i].y;
        a[i].z += b[i].z;
    }
}
```
If we wanted to use 16-lane SIMD vectors here (based on AVX-512 instructions), we could simply do:
```rust
struct WideVector3 {
    x: packed_simd::f32x16,
    y: packed_simd::f32x16,
    z: packed_simd::f32x16,
}

type SetOfWideVector3 = [WideVector3; 1024 / 16];
```
Another benefit of using SIMD types explicitly here is that we no longer rely on the compiler's auto-vectorization. So we can get SIMD instructions even in debug mode, and in some situations where the compiler would fail to auto-vectorize properly.
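To see why operator traits are what make this layout pleasant to write, here is a std-only sketch with a hypothetical scalar stand-in for `packed_simd::f32x4` (the real crate implements each lane-wise `+=` as a single hardware instruction; this stand-in only mimics the API shape):

```rust
use std::ops::AddAssign;

// Hypothetical scalar stand-in for `packed_simd::f32x4`.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct F32x4(pub [f32; 4]);

impl AddAssign for F32x4 {
    fn add_assign(&mut self, rhs: Self) {
        // The real SIMD type does these four additions in one instruction.
        for lane in 0..4 {
            self.0[lane] += rhs.0[lane];
        }
    }
}

pub struct WideVector3 {
    pub x: F32x4,
    pub y: F32x4,
    pub z: F32x4,
}

// The sum over the whole set: no lane indices appear anywhere.
pub fn add_sets(a: &mut [WideVector3], b: &[WideVector3]) {
    for i in 0..a.len() {
        a[i].x += b[i].x;
        a[i].y += b[i].y;
        a[i].z += b[i].z;
    }
}
```

Because the lane loop is hidden behind `AddAssign`, the wide code reads exactly like the scalar AoS code.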
Using SIMD AoSoA for linear-algebra in Rust: ultraviolet and nalgebra
As far as I know, the first crate that implemented this concept for (gamedev) linear algebra in Rust is ultraviolet . At the end of this month (March 2020), you will also be able to use this approach with nalgebra , in its upcoming version 0.21.0.
With ultraviolet, you have the choice between two families of types: regular types (`Vec2`, `Mat3`, `Isometry3`, etc.) and wide types (`Wec2`, `Wat3`, `WIsometry3`, etc.).

The regular types are designed to be usable with the common AoS layout, with vector/matrix components set to `f32`. The wide types, on the other hand, are designed to be used with the SIMD AoSoA layout: each wide type packs the corresponding four non-wide types into a single structure (e.g. one `Wec3` represents four `Vec3`), and the wide vector/matrix components are of type `f32x4` from the wide crate. As of today, ultraviolet is limited to 32-bit floats and 4-lane 32-bit SIMD floats. You can't use it with SIMD integers, with 64-bit floats, or with 32-bit SIMD floats with a number of lanes different from 4.
With nalgebra, all types of vectors/matrices/transformations are generic with respect to their component type. Therefore, for an AoS layout, you can use, e.g., the `Vector2<f32>` type. And if you want to leverage SIMD AoSoA, you can use `Vector2<f32x4>` instead, and that will give you four 2D vectors for the price of one. Note that the `f32x4` type here comes from the new simba crate and is a newtype for the `f32x4` from packed_simd (the upcoming standard safe SIMD implementation in Rust). A newtype was required because of the orphan rules and the need to implement some traits from the num crate for the SIMD types. Simba is not limited to `f32x4` though, as it defines a newtype for every single numeric type of packed_simd. Therefore nalgebra will also support all the integer and float SIMD types, with 2, 4, 8, or 16 lanes. You can for example write and use `Isometry2<simba::f32x16>` as an SoA set of 16 `Isometry2<f32>`.

Finally, nalgebra supports SIMD on all the platforms supported by packed_simd itself.
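The idea behind this genericity can be sketched with a toy `Vector2<T>` (this is not nalgebra's actual definition): a single `impl` written against arithmetic traits works for `f32` and, unchanged, for any SIMD component type implementing the same traits.

```rust
use std::ops::Add;

// Toy generic 2D vector, illustrating (not reproducing) nalgebra's design:
// the component type T is a generic parameter.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vector2<T> {
    pub x: T,
    pub y: T,
}

// One impl serves every component type with an `Add` implementation,
// whether it is a plain f32 or a SIMD type like simba's f32x4.
impl<T: Add<Output = T> + Copy> Add for Vector2<T> {
    type Output = Self;

    fn add(self, rhs: Self) -> Self {
        Vector2 {
            x: self.x + rhs.x,
            y: self.y + rhs.y,
        }
    }
}
```

With `T = f32` this is an ordinary AoS vector; substituting a 4-lane SIMD component type turns the very same code into the SIMD AoSoA sum, four vectors at a time.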
Here are some (subjective) pros and cons for ultraviolet and nalgebra :
- Pros of ultraviolet: no generics, so it is simple to use and efficient even without compiler optimizations enabled. Works on stable Rust.
- Pros of nalgebra: generics allow the use of all SIMD types. More feature-complete and tested than ultraviolet. Based on packed_simd, which has great platform support, though we could make it work with the wide crate too.
- Cons of ultraviolet: cannot use 64-bit floats nor SIMD types other than 32-bit 4-lane floats. It has a more limited feature set (though that may be enough for gamedev).
- Cons of nalgebra: generics make it harder to use and make the docs harder to browse. Also, nalgebra is much less efficient without compiler optimizations enabled. Because its SIMD AoSoA support is based on packed_simd, the use of SIMD types will only work with the nightly compiler.
Benchmarks of Rust linear-algebra crates
Now is the time to see if SIMD AoSoA is useful at all. The benchmarks I run here are a modified version of the mathbench-rs benchmark suite you may have already heard of. For example, it was used when glam and ultraviolet were published.

If you want to run the benchmarks on your machine, you can clone my fork and run either `cargo bench` or `RUSTFLAGS='-C target-cpu=native' cargo bench`.
The modifications I made to the benchmark were the following:

- I added `codegen-units = 1` to the `[profile.bench]` section. This allows us to get as much optimization as we can from the compiler (this is typically what you want before releasing a binary). It turns out this can affect the performance of nalgebra, which benefits significantly from compiler optimizations.
- I added support for ultraviolet regular types (identified by the column ultraviolet in the benchmarks) as well as its wide types (identified by the column ultraviolet_f32x4).
- Because ultraviolet uses the concept of rotors instead of quaternions for its rotations, I used its `Rotor`/`WRotor` types for all its quaternion benchmark rows.
- I added support for using nalgebra with f32 SIMD with 4, 8, and 16 lanes, identified by nalgebra_f32x4, nalgebra_f32x8 and nalgebra_f32x16.
- I added benchmarks for 3D isometries for both ultraviolet and nalgebra.
- I modified the benchmarks of unary and binary operations so they measure the time to process 16 elements. For example, when benchmarking 2D matrix transposition, we are actually measuring the computation time of transposing 16 matrices (instead of just one). This allows us to measure the gain of SIMD AoSoA without penalizing either non-AoSoA crates or AoSoA crates.

This last modification follows the same principle as the one introduced for the benchmarks presented by ultraviolet in their README.
Benchmark with -C target-cpu=native
In this benchmark I am compiling with the `RUSTFLAGS='-C target-cpu=native'` option so that the compiler emits AVX instructions for `f32x8` and `f32x16`. It appears that some crates (namely glam and vek) are not as efficient as they could be when this option is enabled, so you will find a second benchmark without this option in the next section.

Benchmark options:

- command line: `RUSTFLAGS='-C target-cpu=native' cargo bench`
- profiling option in Cargo.toml: `codegen-units = 1`
- CPU: AMD Ryzen 9 3900X 12-Core Processor

To get a better comparative view of the performance of AoSoA, here are the benchmarks restricted to ultraviolet's wide types (`Wec2`, `Wat2`, `WIsometry2`, etc.) and nalgebra types parameterized by SIMD types (`Vector2<f32x4>`, `Matrix2<f32x8>`, `Isometry2<f32x16>`, etc.):
As we can see, both ultraviolet_f32x4 and nalgebra_f32x4 yield roughly the same computation times. The most efficient option for my processor is nalgebra with `f32x8`: it often reaches the best performance, while `f32x16` comes with little performance gain, and even significant regressions for some tests.
Benchmark without -C target-cpu=native
It appears some Rust linear algebra crates do not perform as well as they could when `-C target-cpu=native` is passed to the compiler. This is for example the case for glam and vek. However, not using `-C target-cpu=native` makes some methods of the non-wide types of ultraviolet much less efficient.

Because we don't use `-C target-cpu=native` here, both `f32x8` and `f32x16` won't emit 8- and 16-lane AVX instructions, making them only as efficient as `f32x4`.
Here is the same benchmark with:

- command line: `cargo bench`
- option in Cargo.toml: `codegen-units = 1`
- CPU: AMD Ryzen 9 3900X 12-Core Processor
Conclusion
The use of SIMD AoSoA is extremely promising for efficiently doing lots of linear algebra on large arrays of entities of the same type. This will be possible with nalgebra in its next version, 0.21.0, which will perform as efficiently as the current implementation of ultraviolet when using `f32x4`, and will also be compatible with other SIMD types (e.g. `f32x16`, `f64x2`, `i32x4`, `u8x4`, etc.). Those types are newtypes wrapping the types from packed_simd. They will come from a new crate named simba, which I will present in more depth in the next edition of the This month in Rustsim blog posts.
Finally, here are two things to keep in mind:

- Compilation flags matter a lot when measuring performance. `-C target-cpu` or `codegen-units = 1` can have significant impacts on your release performance.
- SIMD AoSoA comes with a price: algorithms must be carefully designed to obtain as much gain as possible compared to an AoS approach.
Thank you all for reading, and for your support on Patreon or GitHub Sponsors!
Don't hesitate to share your thoughts on this topic, or to correct me if something seems off to you.