Implement dynamic byte-swizzle prototype by workingjubilee · Pull Request #334 ·...

 1 year ago
source link: https://github.com/rust-lang/portable-simd/pull/334

Conversation

Contributor

This is meant to be a reference implementation to test a Rust intrinsic against; the intrinsic will eventually replace it. The interface is fairly direct and doesn't address the more nuanced or interesting permutations one can do, never mind permutations on types other than bytes.

The ultimate goal is direct LLVM support for this.

Contributor

Author

The API isn't perfect: if we do indeed want to support this function beyond an N of 128~256, we will need some way for it to use 16-bit indices instead, as RVV allows.

However, that's not necessarily the biggest concern, and this functionality is too important to get stuck on "what if it needs two functions, one for each index type?" Then it needs two functions! We will endure.

Are 16-bit indices as universal as 8-bit ones?

{
/// Swizzle a vector of bytes according to the index vector.
/// Indices within range select the appropriate byte.
/// Indices "out of bounds" instead select 0.
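The behaviour described by this doc comment can be sketched as a plain scalar loop. This is only an illustrative model, not the code from the PR: the function name `swizzle_dyn_scalar` and the use of arrays instead of SIMD vectors are assumptions made for the example.

```rust
/// Scalar model of the documented behaviour (hypothetical helper, not the PR's code):
/// an in-range index selects that byte, an out-of-range index selects 0.
fn swizzle_dyn_scalar<const N: usize>(bytes: [u8; N], idxs: [u8; N]) -> [u8; N] {
    let mut out = [0u8; N];
    for (o, &i) in out.iter_mut().zip(idxs.iter()) {
        // `get` returns `None` for an out-of-bounds index, which maps to 0.
        *o = bytes.get(usize::from(i)).copied().unwrap_or(0);
    }
    out
}
```

A model like this is the kind of thing a property test can compare a vectorized implementation against, lane by lane.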

Maybe add a note that this really needs build-std to work correctly

Contributor

Author

Well, it kinda doesn't, does it? The vectors are generic, so this will get instantiated at compile time. What it needs is to be combined with target_feature configuration, either dynamic multiversioning or compile-time versioning or whatnot.

The cfgs will depend on the features std is built with, unfortunately

Contributor

Author

Oooh good point hmmmmmm...

...I guess I could make this dynamically multiversioned lol.

Contributor

Author

That part, at least, will be fixed upon promoting this into an intrinsic.
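The difference between the two selection strategies discussed in this thread can be sketched as follows. The `Backend` enum and both function names are hypothetical and exist only for this illustration. The point is that `cfg!(target_feature = ...)` is resolved when the crate containing it is compiled (for code that lives inside std, that means when std was built), whereas `is_x86_feature_detected!` asks the running CPU.

```rust
/// Hypothetical backend tags, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Ssse3,
    Scalar,
}

/// Compile-time versioning: the answer is fixed when the crate that
/// contains this function is compiled.
fn compile_time_backend() -> Backend {
    if cfg!(target_feature = "avx2") {
        Backend::Avx2
    } else if cfg!(target_feature = "ssse3") {
        Backend::Ssse3
    } else {
        Backend::Scalar
    }
}

/// Dynamic multiversioning: the answer is decided on the machine
/// that actually runs the code.
fn run_time_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if std::arch::is_x86_feature_detected!("ssse3") {
            return Backend::Ssse3;
        }
    }
    Backend::Scalar
}
```

With a default build on x86_64, `compile_time_backend()` would typically report `Scalar` even on a CPU where `run_time_backend()` reports `Avx2`, which is the mismatch raised above.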

Contributor

Author

Hmm. Honestly, 16-bit indices are only used, afaik, by RISC-V's Vector extension. So I think saying "nah, use a target intrinsic for that" would be fair. It's mostly "if we find a way to seamlessly transition to larger indices, that would be cool".

afaict avx512 supports 16-bit indexes for vpermi2w and related operations.

SimpleV supports 8/16/32/64-bit indexes iirc.

Contributor

Author

The AVX512 instruction VPERMI2{W,D,Q} doesn't really matter to the abstract operation we're defining, because its indices have the same size as the element type: it overwrites the index vector with the results (subject to the mask). That's to be expected for AVX512F because, without the AVX512BW extension, you can't do byte-level operations at all. But we don't really care about that: destructive update of the indices is a pretty unusual pattern, and a form using that instruction should be yielded by combining it with a masked store (or the intrinsic, obv). What is actually relevant for what we're doing is whether a u8 will be big enough.
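To make the "indices have the same size as the element type" point concrete, here is a rough scalar model of a two-table word permute in the style of VPERMI2W at 512 bits. It ignores the writemask, the function name is hypothetical, and it is a sketch of the instruction's documented behaviour rather than anything taken from the PR.

```rust
/// Scalar model of a two-table 16-bit permute in the style of VPERMI2W
/// (512-bit form, writemask ignored). Hypothetical helper for illustration.
fn permute2_words_in_place(idx: &mut [u16; 32], a: &[u16; 32], b: &[u16; 32]) {
    for slot in idx.iter_mut() {
        // The low 5 bits pick one of the 32 words in a table.
        let lane = usize::from(*slot & 0x1f);
        // Bit 5 picks the table; higher bits are ignored.
        let from_b = (*slot & 0x20) != 0;
        // The result overwrites the index itself, so each index slot
        // has to be as wide as the element it will end up holding.
        *slot = if from_b { b[lane] } else { a[lane] };
    }
}
```

The 16-bit index width here falls out of the destructive update, not out of a need to address more than 256 lanes, which is why it says little about whether `u8` indices suffice for a byte swizzle.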

Contributor

Author

workingjubilee left a comment

As funny as it would be to package a CPUID implementation to handle the which-AVX-version stuff, I think I am gonna skip it for now. I could have completely skipped implementing this "up here" at the "tip", but I wanted to have full testing in our suite against proptest first, which is very good at finding counterexamples.

// This is ordering sensitive, and LLVM will order these how you put them.
// Most AVX2 impls use ~5 "ports", and only 1 or 2 are capable of permutes.
// But the "compose" step will lower to ops that can also use at least 1 other port.
// So this tries to break up permutes so composition flows through "open" ports.
// Comparative benches should be done on multiple AVX2 CPUs before reordering this.
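The "permute, then compose" structure these comments refer to can be modelled in scalar code. The sketch below is one possible decomposition of a 32-byte swizzle into 16-byte shuffles, written under the assumption of a PSHUFB-like primitive. It is not the PR's actual AVX2 code, both function names are hypothetical, and a scalar model says nothing about port pressure or ordering.

```rust
/// Scalar model of a 16-byte shuffle in the style of PSHUFB: a set top bit
/// in the index yields 0, otherwise the low 4 bits select a byte.
fn shuffle16(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, i) in out.iter_mut().zip(idx) {
        *o = if i & 0x80 != 0 { 0 } else { table[usize::from(i & 0x0f)] };
    }
    out
}

/// Hypothetical 32-byte swizzle built from four 16-byte shuffles plus a
/// compose step. Out-of-range indices (>= 32) select 0.
fn swizzle32(bytes: [u8; 32], idxs: [u8; 32]) -> [u8; 32] {
    let mut lo = [0u8; 16];
    let mut hi = [0u8; 16];
    lo.copy_from_slice(&bytes[..16]);
    hi.copy_from_slice(&bytes[16..]);

    let mut out = [0u8; 32];
    for half in 0..2 {
        let mut idx = [0u8; 16];
        idx.copy_from_slice(&idxs[half * 16..half * 16 + 16]);
        // Route each index to exactly one table; 0x80 means "produce 0".
        let idx_lo = idx.map(|i| if i < 16 { i } else { 0x80 });
        let idx_hi = idx.map(|i| if (16..32).contains(&i) { i - 16 } else { 0x80 });
        // Two independent permutes per half...
        let from_lo = shuffle16(lo, idx_lo);
        let from_hi = shuffle16(hi, idx_hi);
        // ...followed by the compose step: at most one side is non-zero per lane.
        for k in 0..16 {
            out[half * 16 + k] = from_lo[k] | from_hi[k];
        }
    }
    out
}
```

In a real AVX2 lowering, the order in which the permutes and the compose operations are emitted is what the comments above are about.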

Contributor

Author

Having all this commentary here isn't strictly necessary but I'm going to transplant more-or-less the same remarks into rustc (and maybe into LLVM???) later, so writing this down matters.

workingjubilee merged commit 4f0d822 into master on Apr 23, 2023. 75 checks passed.

Reviewers: calebzulawski and jhorstmann (both left review comments)