proposal: add package for using SIMD instructions · Issue #53171 · golang/go ·...

 2 years ago
source link: https://github.com/golang/go/issues/53171

Comments

mpldr commented 16 days ago

SIMD has the potential to greatly increase performance of a lot of data processing applications. Since #35307 was closed with the remark

We agree that there is an opportunity here, but we don't know what it looks like. Also, this is an area that is likely to be affected by generics, which are in progress. For this specific proposal, there is no change in consensus. Closing.

Generics are now available, so I want to use this opportunity to necrobump and suggest a simd package that allows using SIMD instructions via a high-level API. I think a simple API like Rust's experimental SIMD support, or the previously linked WebAssembly and Dart APIs, would help a lot of developers improve their data processing performance in a simple and easy way.

OneOfOne, erifan, Jorropo, changkun, arl, j178, 5hay, mrg0lden, pourfar, inkeliz, and 12 more reacted with thumbs up emoji; guidog reacted with thumbs down emoji; carlmjohnson, lemon-mint, and mugli reacted with heart emoji

Author

mpldr commented 16 days ago

As for what the API might look like, the following yenc encoder may provide some insight.

fn yenc(input: [u8;64]) -> [u8;128] {
    use core::simd::*;
    let mut indata = u8x64::from_array(input);
    indata += u8x64::splat(42);

    // mark special characters
    let mut mask = indata.lanes_eq(u8x64::splat(0));
    mask |= indata.lanes_eq(u8x64::splat(10));
    mask |= indata.lanes_eq(u8x64::splat(13));
    mask |= indata.lanes_eq(u8x64::splat(61));

    indata += mask.select(
        u8x64::splat(64), // add 64 where the mask is set to true
        u8x64::splat(0)   // and don't do anything where it isn't
    );
    
    let escape = mask.select(
        u8x64::splat(61), // add '=' (ASCII 61) where the mask is set to true
        u8x64::splat(0)   // and don't do anything where it isn't
    );
    
    let mut result: [u8;128] = [0;128];
    
    for i in 0..64 {
        result[i*2] = escape[i];
        result[i*2+1] = indata[i];
    }
    
    return result;
}

(Rust Playground)

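package simd

[Code listing garbled in extraction. Surviving structure: a package clause for simd, one type declared from a slice or array type, one generic function taking two parameters, and one single-parameter function.]

package simd/mask

[Code listing garbled in extraction. Surviving structure: a package clause for mask, one type declared from a slice or array type, one function taking two arguments (the last typed from package simd), one method taking a single parameter, and one method taking two simd-typed arguments and returning a simd type.]

package simd/{sse2,sse3,…}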

Equivalent Go code would look something like this (assuming the API design above):

package yenc

import (
    "simd"
    "simd/mask"
)

func yenc(input [64]byte) [128]byte {
    indata := simd.Uint8x64(input)
    indata += simd.Uint8x64Splat(42)

    // mark special characters
    escapeMask := mask.Equals(indata, simd.Uint8x64Splat(0))
    escapeMask.Or(mask.Equals(indata, simd.Uint8x64Splat(10)))
    escapeMask.Or(mask.Equals(indata, simd.Uint8x64Splat(13)))
    escapeMask.Or(mask.Equals(indata, simd.Uint8x64Splat(61)))

    indata += escapeMask.Select(
        simd.Uint8x64Splat(64), // add 64 where the mask is set to true
        simd.Uint8x64Splat(0),  // and don't do anything where it isn't
    )

    escape := escapeMask.Select(
        simd.Uint8x64Splat(61), // put '=' (ASCII 61) where the mask is set to true
        simd.Uint8x64Splat(0),  // and leave 0 where it isn't
    )

    var result [128]byte

    for i := 0; i < 64; i++ {
        result[i*2] = escape[i]
        result[i*2+1] = indata[i]
    }

    return result
}
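
For anyone who wants to run the example today, here is a minimal sketch of the same encoder on top of a pure-Go stand-in for such a package. Every name in it (Uint8x64, Mask64, Splat, Add, Equals, Or, Select) is an assumption made for illustration, not an existing or proposed API, and each operation is a plain loop rather than a SIMD instruction. Arithmetic is spelled as methods because Go has no operators on array types.

```go
package main

import "fmt"

// Uint8x64 is a hypothetical 64-lane byte vector (not a real Go type).
type Uint8x64 [64]uint8

// Mask64 is a hypothetical per-lane boolean mask.
type Mask64 [64]bool

// Splat returns a vector with every lane set to v.
func Splat(v uint8) Uint8x64 {
	var out Uint8x64
	for i := range out {
		out[i] = v
	}
	return out
}

// Add returns the lane-wise wrapping sum; a is a copy, so the caller's value is untouched.
func (a Uint8x64) Add(b Uint8x64) Uint8x64 {
	for i := range a {
		a[i] += b[i]
	}
	return a
}

// Equals marks the lanes where a and b hold the same value.
func Equals(a, b Uint8x64) Mask64 {
	var m Mask64
	for i := range a {
		m[i] = a[i] == b[i]
	}
	return m
}

// Or returns the lane-wise OR of two masks.
func (m Mask64) Or(o Mask64) Mask64 {
	for i := range m {
		m[i] = m[i] || o[i]
	}
	return m
}

// Select takes the lane from t where the mask is set and from f elsewhere.
func (m Mask64) Select(t, f Uint8x64) Uint8x64 {
	for i := range m {
		if m[i] {
			f[i] = t[i]
		}
	}
	return f
}

func yenc(input [64]byte) [128]byte {
	indata := Uint8x64(input).Add(Splat(42))

	// Mark special characters (NUL, LF, CR, '=').
	mask := Equals(indata, Splat(0)).
		Or(Equals(indata, Splat(10))).
		Or(Equals(indata, Splat(13))).
		Or(Equals(indata, Splat(61)))

	indata = indata.Add(mask.Select(Splat(64), Splat(0)))
	escape := mask.Select(Splat(61), Splat(0))

	var result [128]byte
	for i := 0; i < 64; i++ {
		result[i*2] = escape[i]
		result[i*2+1] = indata[i]
	}
	return result
}

func main() {
	var in [64]byte
	in[0] = 'A'
	in[1] = 214 // 214 + 42 wraps around to 0, which must be escaped
	out := yenc(in)
	fmt.Println(out[:4])
}
```

A compiler-backed package could keep the same call shapes and replace each loop with one vector instruction where the hardware has one.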

I think this should be part of the standard library mainly for two reasons:

  • This would allow writing performant SIMD code within the standard library without using assembler (making the code more readable)
  • Lowers the barrier of entry for using SIMD by providing a reliable API
tigerwill90 reacted with thumbs up emoji; guidog reacted with thumbs down emoji

Contributor

ianlancetaylor commented 16 days ago

The difficulty with a general purpose approach to SIMD, which is what you are suggesting, is that the performance can be dramatically different on different processors. Also, for specific processors, performance is not optimal as not all special purpose instructions are available.

(The difficulty with a processor-specific approach to SIMD is that you have to write different code for each processor.)

(As a side note, I don't see any reason to have a package like simd/sse2 in your description. Instead, we would arrange to use the appropriate implementation when building the simd package.)

inkeliz, mrg0lden, guidog, and apocelipes reacted with thumbs up emoji

Contributor

Zheng-Xu commented 16 days ago

I hope the proposal can support different architectures. Arm SVE has a feature called Vector Length Agnostic (VLA) programming: different hardware may implement different vector sizes. See: a sneak peek into sve and vla programming

On the API level, it is better that we can decide the vector length at runtime instead of hard coding it as 8x64. Reference: OpenJDK Vector API

On the compiler side, so far Go ABI needs the frame size to be decided at compile time, which doesn't support VLA programming very well yet.
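
To make the vector-length-agnostic idea concrete, here is a rough sketch in plain Go. The lanes parameter is a stand-in for a width that would be queried from the hardware at run time; both the function name and that parameter are assumptions for illustration, and the inner loop only models what a single vector instruction would do.

```go
package main

import "fmt"

// addVLA adds a and b element-wise into dst. lanes models a vector width
// that is only known at run time (hypothetical; real code would ask the CPU).
// a and b must be at least as long as dst.
func addVLA(dst, a, b []float32, lanes int) {
	if lanes < 1 {
		lanes = 1
	}
	n := len(dst)
	a, b = a[:n], b[:n]

	i := 0
	// Full-width steps: each pass of the inner loop stands for one vector add.
	for ; i+lanes <= n; i += lanes {
		for j := 0; j < lanes; j++ {
			dst[i+j] = a[i+j] + b[i+j]
		}
	}
	// Tail: the elements that do not fill a whole vector.
	for ; i < n; i++ {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	b := []float32{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
	dst := make([]float32, len(a))

	// The same source works for any width; only the step size changes.
	for _, lanes := range []int{4, 8, 16} {
		addVLA(dst, a, b, lanes)
		fmt.Println(lanes, dst)
	}
}
```

The point is that nothing in the caller depends on the width, which is what a fixed type such as 8x64 cannot express.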

jimwei reacted with thumbs up emoji; klauspost reacted with thumbs down emoji

Author

mpldr commented 16 days ago

we would arrange to use the appropriate implementation when building the simd package

That might not be as easy as it sounds, since it would make binaries potentially less portable. Adding a fallback for cases where a specific instruction set is not supported would help offset this. (Yes, I'm aware that this is also not as simple as it sounds.)

(As a side note, I don't see any reason to have a package like simd/sse2 in your description. Instead, we would arrange to use the appropriate implementation when building the simd package.)

I think we should provide different packages for different instruction sets, because that helps programmers design programs that are compatible with different hardware.
It also helps maintain compatibility: the behavior of a specific instruction set depends on the hardware, so each package would provide an API based on that specific instruction set.

ericlagergren, mpldr, and klauspost reacted with thumbs up emoji; erifan, mrg0lden, and guidog reacted with thumbs down emoji

Contributor

ianlancetaylor commented 15 days ago

@mpldr We already have a mechanism for making binaries more or less portable: the GOAMD64 environment variable and friends. See https://pkg.go.dev/cmd/go#hdr-Environment_variables.

mpldr reacted with thumbs up emoji

inkeliz commented 15 days ago

I think we should provide different packages for different instruction sets.

Isn't that the same as writing an assembly file for each CPU? I think that defeats the purpose of a SIMD package.


I think the Zig approach is better (and I think Rust is similar). For instance, Zig provides one @Vector function which works on any CPU; you can read more about that here. That makes the code as portable as the non-SIMD version, and it will use SIMD instructions when available.

mrg0lden, gmeligio, tdakkota, jimwei, and tigerwill90 reacted with thumbs up emoji; klauspost reacted with thumbs down emoji

Author

mpldr commented 15 days ago

I think the Zig approach is better (and I think Rust is similar). For instance, Zig [provides one `@Vector`](https://ziglang.org/documentation/master/#Vectors) function which works on any CPU, can [read more about that here](https://zig.news/michalz/fast-multi-platform-simd-math-library-in-zig-2adn). That makes the code as portable as non-simd version.
I also like the approach of just allowing the various operations to be applied to it, but this would require a language change. (Also adding 60-something new types to the language seems rather contrary to "the Go spirit")

I think the Zig approach is better (and I think Rust is similar). For instance, Zig provides one @Vector function which works on any CPU; you can read more about that here. That makes the code as portable as the non-SIMD version, and it will use SIMD instructions when available.

This seems to make the code more portable, and I support it.

Jorropo, beoran, mrg0lden, jimwei, sirkon, and tigerwill90 reacted with thumbs up emoji

beoran commented 12 days ago

Indeed, a vectorize package that is portable and that uses the CPU relevant instructions would be the best solution. On platforms that do not have such instructions, the default implementation could still be useful for portability and optimized for performance.

qiulaidongfeng and mpldr reacted with thumbs up emoji

The current consensus on portability is to use appropriate implementations when building SIMD packages, which I support.

Contributor

rsc commented 2 days ago

I came across https://github.com/google/highway a few weeks ago. Is there anything we can learn from that about portable API for SIMD?

jackielii, franchb, and kscooo reacted with heart emoji

Contributor

rsc commented 2 days ago

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

sJJdGG and zliang-min reacted with heart emoji; tdakkota, tigerwill90, and inkeliz reacted with eyes emoji

rsc

moved this from Incoming to Active in Proposals

2 days ago

Contributor

klauspost commented 15 hours ago

Overall, as others have mentioned, a platform-independent implementation of SIMD is at best a half implementation.

While some of the aspects are platform independent, and it may be possible to port a fraction, there is simply too much difference between platforms to make anything that would be genuinely useful. While it is "neat"

Falling back to a Go implementation of SIMD would in many cases be much worse than straight-up Go, and designing for this will just slow down availability. Example: MPSADBW. If there is no HW support, the fallback will be horrible.

SIMD intrinsics should be able to live alongside Go code. The compiler controls register use. This will allow using simd and other instructions without the now forced function call overhead. SIMD should be guarded by platform tags. SIMD types should be available for all platforms, but the functions shouldn't be abstracted, and should match underlying instructions, maybe with some compound functions.

Here are some of my previous observations when looking at intrinsics for Go:

Feature detection

There is a huge number of individual features. Compile-time feature specification will always just select a very low common denominator, and GOAMD64 only provides for a sub-set anyway and cannot be used for this.

It should be easy to determine (maybe by looking at imports) which features to check for. The detection should be part of the intrinsic offering.

A quite tricky thing is that some instructions come in several "forms" that require different features. For example, ANDPS xmm1, xmm2 (SSE) also has a non-destructive 3-register version, VANDPS xmm1, xmm2, xmm3 (AVX). Other intrinsics implementations have made it hard to select which version is used, and either used SSE non-optimally in an AVX code path or (even worse) used AVX in an SSE code path, causing a crash.
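
As a rough illustration of run-time feature selection, here is a sketch in plain Go. The cpuFeatures struct and detect function are hypothetical placeholders (detect reports nothing), and sumWide is ordinary Go that only mimics the shape of an 8-lane code path. The one thing it is meant to show is that the implementation is chosen once, from detected features, and the wide path is never reachable on a CPU that lacks them.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for what a detection API might report.
type cpuFeatures struct {
	HasAVX2 bool
}

// detect is a placeholder: real code would query the CPU. It reports no features.
func detect() cpuFeatures {
	return cpuFeatures{}
}

// sumGeneric is the portable path.
func sumGeneric(xs []int32) int32 {
	var total int32
	for _, v := range xs {
		total += v
	}
	return total
}

// sumWide mimics an 8-lane path: eight partial sums, then a horizontal add.
func sumWide(xs []int32) int32 {
	var acc [8]int32
	i := 0
	for ; i+8 <= len(xs); i += 8 {
		for j := 0; j < 8; j++ {
			acc[j] += xs[i+j]
		}
	}
	var total int32
	for _, v := range acc {
		total += v
	}
	for ; i < len(xs); i++ {
		total += xs[i]
	}
	return total
}

// pickSum selects an implementation from the detected features.
func pickSum(f cpuFeatures) func([]int32) int32 {
	if f.HasAVX2 {
		return sumWide
	}
	return sumGeneric
}

// sum is bound once at start-up, so callers never check features themselves.
var sum = pickSum(detect())

func main() {
	xs := make([]int32, 20)
	for i := range xs {
		xs[i] = int32(i + 1)
	}
	fmt.Println(sum(xs))
}
```

Binding the function once keeps the feature check out of hot loops, at the cost of an indirect call.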

Data types

The tricky part about data types is that they will often require type conversion, or be untyped. Data loaded as a []byte should be available as *[][8]float32 or similar types. There should be specific types that map to registers, for instance a [8]float32 type for YMM registers.

Some intrinsics have no clear type. For example, PXOR operates on bits, and can be mixed with operations that operate on [16]uint8, [8]uint16, or [4]uint32. The same goes for signed/unsigned values. This means the compiler should be able to "convert" between these as a no-op, or just have a single type per register size.

I don't have a ready-to-go solution, but having to copy input from a []byte to a []float32 or vice versa must be avoidable.
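
For reference, a copy-free view is already possible with package unsafe, although it is exactly the kind of code a typed intrinsics API would want to hide. The sketch below uses unsafe.Slice (available since Go 1.17); the helper names are made up for illustration, and the alignment check is an assumption about what a careful caller would want, not something the language requires on every platform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesAsFloat32 reinterprets b as []float32 without copying.
// It reports false if the data is not aligned for float32.
// Trailing bytes that do not fill a whole float32 are ignored.
func bytesAsFloat32(b []byte) ([]float32, bool) {
	if len(b) == 0 {
		return nil, true
	}
	p := unsafe.Pointer(&b[0])
	if uintptr(p)%unsafe.Alignof(float32(0)) != 0 {
		return nil, false
	}
	return unsafe.Slice((*float32)(p), len(b)/4), true
}

// float32AsBytes reinterprets f as []byte without copying.
func float32AsBytes(f []float32) []byte {
	if len(f) == 0 {
		return nil
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(&f[0])), len(f)*4)
}

func main() {
	f := []float32{1, 2, 3}
	b := float32AsBytes(f)

	g, ok := bytesAsFloat32(b)
	if !ok {
		fmt.Println("unaligned input")
		return
	}
	g[1] = 7 // writes through to f because both slices share memory
	fmt.Println(len(b), f)
}
```

Both slices alias the same memory, so the usual unsafe caveats apply: the byte order seen through the []byte view is the machine's, and the original slice must stay alive while the view is in use.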

The compiler cannot currently enforce that an argument is a constant. Take PSHUFD, which has an 8-bit immediate value. This must be resolvable at compile time. With current function definitions that isn't really doable, so some handling would be needed for this.

Pointer arguments are a little tricky. There aren't that many, but there are prefetch instructions, gather/scatter, and of course loading and storing.
Loading and storing involve more than a straight load/store from memory (for example VPMASKMOVD), so expect there to be platform-specific implementations.

Edit: Final word on portable SIMD: I am not against it, but Go should supply the tools to write portable SIMD packages, that cover the feasible subsets.

mpldr, tdakkota, zephyrtronium, smasher164, and changkun reacted with thumbs up emoji; beoran reacted with confused emoji

beoran commented 15 hours ago

@rsc Highway seems to be a good idea. Since Go now has generics, and Highway is Apache licensed, is there any reason why someone interested could not port it to Go? That would be the first step, I think.

@klauspost It sounds more like you want to use inline SIMD assembly than have a portable vector API. While the Go compiler already supports assembly in separate files, I don't think inline SIMD assembly in Go is a good idea.

Edit: a third approach would be for the Go compiler to optimize certain array and float operations using SIMD if possible. This should be documented then, though.

Contributor

klauspost commented 15 hours ago

@beoran That is the title of the proposal.

While portable solutions and automatic vectorization are neat, they only provide a band-aid solution with quite limited usability.
