joadnacer/jdz

master

Go to file

Code

Files

Permalink

Latest commit message

Commit time

bench

Initial Release

September 3, 2023 16:28

README.md

This library contains Base64 encoders and decoders implemented using the incubator Vector API, as well as (slower) scalar implementations. This is the fastest Java-written Base64 library, with both the vector-based and scalar methods outperforming any other Java written library. These methods are faster than intrinsified java.util for RFC2045/MIME encoding/decoding, but slower for RFC4648 (see "java.util Intrinsification" and "Benchmark" sections). Requires Java 16 or higher.

The encoding and decoding methods used for the Vector implementation, which can be found in VectorUtils.java, were heavily inspired by Wojciech Mula's articles, available here:

http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html

http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html

Usage

As the Vector API is currently an incubator project, you will need to add the flag --add-modules jdk.incubator.vector to your java run command in order to use this library's vector methods.

This library will be hosted on the central Maven repository once the Vector API is officially released. In the meanwhile this may only be used via local installation.

Decoder/Encoder Instantiation

For convenience, statically instantiated encoders/decoders can be accessed via the Base64Vector and Base64Scalar classes. You may also instantiate the indidual encoders/decoders yourself.

Fast vs Normal Decode

The fast decode methods assume that input is valid Base64 encoded according to the relevant spec. The normal decode methods do not, and handle invalid input explicitly.

For RFC4648 decoding, the normal decode methods will throw an IllegalArgumentException in the case that an invalid byte is detected, with an exception message that includes the position of the first invalid byte. With RFC2045 decoding, the invalid input bytes are simply ignored (as long as total valid input length is divisible by 4).

Using the fast decoding methods on invalid Base64 input will likely result in either an ArrayOutOfBoundsException due to invalid lookup table indices/invalid encoded input length, or in silently erroneous output.

Vector Sizing

Java should select the optimal vector size thanks to the use of the Vector API's SPECIES_PREFERRED. However, this may not select the optimal vector size on certain systems.

The automated vector length selection can be overriden with the jdz.vector.size system property, which may be set to values 128, 256 or 512, representing the vector bit size. Selecting a value greater than what your CPU supports will make encoding/decoding 100-1000x slower. Experiment with this value if the library appears suspiciously slow.

java.util Intrinsification

The java.util encode and decode methods rely on the use of an encodeBlock and decodeBlock method. These methods are both intrinsinc candidates since Java 11 and 16 respectively, meaning that the JVM can replace them with an extremely fast SIMD method. From my testing, these methods get intrinsifed after around 5000 calls in a JMH benchmark. So how many bytes need to be decoded in order to hit 5000 calls to these methods?

If you're encoding or decoding according the RFC4648/RFC4648_URL spec, the intrinsic methods get called once for each call to encode or decode, which obviously allows for the fastest speeds as this minimizes callbacks into Java. Unfortunately, this means that the high level methods must be called ~5000 times before users benefit from intrinsification (*). If you are working with large arrays, this can take a while: 33 seconds of continuous encoding of 10MB arrays were required for the java.util encoding to intrinsify on my M1 mac. Similarly, if your application does not encode/decode 5000+ times in its lifetime, you will never benefit from this. If your application is encoding/decoding base64 sparsely rather than continuously, this might also impede intrinsification.

The java.util RFC2045 methods are more consistent in their intrinsification and unaffected by input size. The encoding method will encode data in 76 byte blocks, before return to Java to write linebreak characters, while the decoding method will return to Java on each invalid base64 character (which includes linebreak characters every 76 bytes). Although this guarantees faster intrinsification due to the IntrinsicCandidate methods being called at least once per 76 bytes of input, this slows down the methods considerably, allowing the jdz vector implementation to be faster.

Intrinsification also does not necessarily occur on every system or JVM. For example, the java.util methods were not intrinsified on an AWS m4.large or on my VMWare Ubuntu VM. As a result, jdz methods strongly outperform java.util on any such system, as java.util will be performing scalar encoding/decoding.

Benchmarks

Benchmarks were all ran on Java 20 (Temurin), using non byte array allocating methods for all libraries supporting this. These are JMH benchmarks consisting of 3 forks of 5 warmups and 5 iterations. All benchmarks were measured for size of unencoded data, input size for encoding and output size for decoding.

The overall theme in these benchmarks is that it is hard to predict relative performance of jdz vs java.util across different architectures. If your application is dependent on efficient base64 encoding, make sure to test your system yourself to determine what is fastest.

jdz scalar methods are generally the best performing scalar methods for both specs and all architectures - however, these were mostly implemented for reference, and are only worth using over the vector/java.util methods if your cpu does not support 128+ bit vector sizes/does not intrinsify java.util.

As a quick summary, the following table presents the generally fastest library for each spec/method combo. The overall benchmarks assume that java.util is intrinsified - if this is not the case, the jdz vector methods will always be fastest.

Spec/Method	Overall Encoding	Overall Fast Decoding	Overall Decoding	Scalar Encoding	Scalar Decoding	Scalar Fast Decoding
RFC4648	java.util	java.util	java.util	jdz scalar	jdz scalar	jdz scalar
RFC2045	jdz vector/java.util	jdz vector	jdz vector	jdz scalar	jdz scalar/java.util	jdz scalar

Throughput

These benchmarks represent MB/s throughput of encoding/decoding methods given 1MB inputs. java.util scalar throughput was meansured by disabling intrinsification.

RFC4648

RFC4648 Encoding

java.util scalar	java.util intrinsic	jdz vector	jdz scalar	migBase64	iharder	Apache Commons
Apple M1 (128 bits)	1707MB/s	16126MB/s	6277MB/s	2845MB/s	1793MB/s	1560MB/s	165MB/s
Vs Java Scalar	1.00x	9.45x	3.68x	1.67x	1.05x	0.91x	0.10x
vs Java Intrinsic	0.11x	1.00x	0.39x	0.18x	0.11x	0.10x	0.01x
m4.large (256 bits)	886MB/s	N/A	1515MB/s	1171MB/s	609MB/s	567MB/s	133MB/s
vs Java Scalar	1.0x	N/A	1.70x	1.32x	0.69x	0.64x	0.15x
vs Java Intrinsic	N/A	N/A	N/A	N/A	N/A	N/A	N/A
m6i.large	1441MB/s	12644MB/s	10891MB/s	2018MB/s	954MB/s	984MB/s	178MB/s
vs Java Scalar	1.0x	8.77x	7.56x	1.40x	0.66x	0.68x	0.12x
vs Java Intrinsic	0.11x	1.0x	0.86x	0.16x	0.06x	0.08x	0.01x

RFC4648 Decoding

java.util scalar	java.util intrinsic	jdz vector	jdz vector fast	jdz scalar	jdz scalar fast	migBase64	migBase64 fast	iharder	Apache Commons
Apple M1 (128 bits)	1478MB/s	9149MB/s	4613MB/s	6801MB/s	2503MB/s	2777MB/s	637MB/s	1096MB/s	323MB/s	266MB/s
Vs Java Scalar	1.00x	6.19x	3.12x	4.60x	1.69x	1.88x	0.43x	0.74x	0.22x	0.18x
vs Java Intrinsic	0.17x	1.00x	0.50x	0.74x	0.27x	0.30x	0.07x	0.12x	0.04x	0.03x
m4.large (256 bits)	799MB/s	N/A	1453MB/s	2434MB/s	1105MB/s	1066MB/s	289MB/s	479MB/s	139MB/s	120MB/s
vs Java Scalar	1.00x	N/A	1.82x	0.30x	1.38x	1.33x	0.36x	0.60x	0.17x	0.20x
vs Java Intrinsic	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
m6i.large	1360MB/s	14884MB/s	6585MB/s	9258MB/s	1270MB/s	1351MB/s	363MB/s	791MB/s	218MB/s	194MB/s
vs Java Scalar	1.00x	10.20x	4.85x	6.80x	0.93x	0.99x	0.26x	0.58x	0.16x	0.14x
vs Java Intrinsic	0.09x	1.00x	0.44x	0.62x	0.09x	0.09x	0.02x	0.05x	0.01x	0.01x

RFC2045

RFC2045 Encoding

java.util scalar	java.util intrinsic	jdz vector	jdz scalar	migBase64	iharder	Apache Commons
Apple M1 (128 bits)	1466MB/s	5721MB/s	5814MB/s	2704MB/s	1572MB/s	1540MB/s	171MB/s
Vs Java Scalar	1.0x	3.90x	3.97x	1.84x	1.07x	1.05x	0.12x
vs Java Intrinsic	0.26x	1.0x	1.03x	0.48x	0.28x	0.27x	0.03x
m4.large (256 bits)	686MB/s	N/A	1192MB/s	1087MB/s	464MB/s	494MB/s	125MB/s
vs Java Scalar	1.0x	N/A	1.74x	1.58x	0.67x	0.72x	0.18x
vs Java Intrinsic	N/A	N/A	N/A	N/A	N/A	N/A	N/A
m6i.large	1129MB/s	3526MB/s	5722MB/s	1859MB/s	861MB/s	872MB/s	178MB/s
vs Java Scalar	1.0x	3.12x	5.07x	1.65x	0.76x	0.77x	0.16x
vs Java Intrinsic	0.32x	1.0x	1.622x	0.52x	0.24x	0.25	0.05x

RFC2045 Decoding

java.util scalar	java.util intrinsic	jdz vector	jdz vector fast	jdz scalar	jdz scalar fast	migBase64	migBase64 fast	iharder	Apache Commons
Apple M1 (128 bits)	716MB/s	919MB/s	1805MB/s	6685MB/s	757MB/s	2728MB/s	390MB/s	928MB/s	305MB/s	255MB/s
Vs Java Scalar	1.00x	1.28x	2.52x	9.33x	1.06x	3.81x	0.54x	1.29x	0.42x	0.35x
vs Java Intrinsic	0.78x	1.00x	1.96x	7.27x	0.82x	2.97x	0.42x	1.01x	0.33x	0.28x
m4.large (256 bits)	433MB/s	N/A	731MB/s	1809MB/s	412MB/s	1157MB/s	197MB/s	479MB/s	138MB/s	120MB/s
vs Java Scalar	1.00x	N/A	1.67x	4.17x	0.95x	2.67x	0.45x	1.10x	0.32x	0.28x
vs Java Intrinsic	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
m6i.large	652MB/s	761MB/s	1729MB/s	5077MB/s	583MB/s	1351/MB/s	393MB/s	791MB/s	218MB/s	195MB/s
vs Java Scalar	1.00x	1.17x	2.65x	7.79x	0.89x	2.07x	0.60x	1.21x	0.33x	0.30x
vs Java Intrinsic	0.86x	1.00x	2.27x	6.77x	0.77x	1.77x	0.52x	1.04x	0.29x	0.26x

Vector Sizing

This benchmark measures different vector sizes on a system supporting up to 512 bits (m6i.large). As expected, the largest supported vector size is fastest, and this is naturally the one automatically selected via use of SPECIES_PREFERRED.

Method/Vector Size	128 bits	256 bits	512 bits
RFC4648 encode	4000MB/s	6733MB/s	11157MB/s
RFC4648 decode	2457MB/s	3945MB/s	5491MB/s
RFC4648 decodeFast	3517MB/s	5646MB/s	9128MB/s
RFC2045 encode	3099MB/s	4926MB/s	5724MB/s
RFC2045 decode	1150MB/s	1482MB/s	1730MB/s
RFC2045 decodeFast	2965MB/s	4224MB/s	6733MB/s

Unencoded Length Benchmarks

The following benchmarks measure the time per operation to encode/decode a certain unencoded length - lower is better. These were all carried out on an M1 mac using 128 bit vectors.

RFC4648

RFC4648 Encoding

Method/Unencoded Length	10B	100B	1KB	1MB	10MB	100MB
jdz-scalar encode	11ns	48ns	343ns	346µs	3486µs	34.88ms
jdz-vector encode	11ns	40ns	195ns	142µs	1447µs	14.31ms
java.util encode	11ns	15ns	73ns	60µs	5784µs	57.81ms
migBase64 encode	22ns	64ns	562ns	545µs	5498µs	55.12ms

RFC4648 Decoding

Method/Unencoded Length	10B	100B	1KB	1MB	10MB	100MB
jdz-scalar decode	11ns	55ns	409ns	405µs	4079µs	40.76ms
jdz-scalar decodeFast	11ns	52ns	371ns	364µs	3673µs	36.79ms
jdz-vector decode	12ns	52ns	299ns	205µs	2088µs	20.90ms
jdz-vector decodeFast	11ns	43ns	170ns	139µs	1409µs	14.02ms
java.util decode	14ns	49ns	138ns	109µs	6727µs	66.92ms
migBase64 decode	30ns	157ns	1472ns	1554µs	15673µs	171.8ms
migBase64 decodeFast	28ns	104ns	955ns	894µs	9000µs	98.96ms

As only 5 warmups were performed, we can see java.util methods cease to be intrinsified for 10MB+ data (see "java.util Intrinsification").

RFC2045 (MIME)

RFC2045 Encoding

Method/Unencoded Length	10B	100B	1KB	1MB	10MB	100MB
jdz-scalar encode	10ns	50ns	349ns	358µs	3799us	39.80ms
jdz-vector encode	10ns	38ns	179ns	169µs	1887µs	20.78ms
java.util encode	12ns	28ns	181ns	171µs	1767µs	17.30ms
migBase64 encode	16ns	72ns	644ns	620µs	6239µs	62.39ms

RFC2045 Decoding

Method/Unencoded Length	10B	100B	1KB	1MB	10MB	100MB
jdz-scalar decode	29ns	247ns	2.64µs	1608µs	16.12ms	133.60ms
jdz-scalar decodeFast	22ns	49ns	376ns	358µs	3.569ms	36.06ms
jdz-vector decode	29ns	61ns	575ns	564µs	5.698ms	57.258ms
jdz-vector decodeFast	23ns	38ns	167ns	151µs	1.528ms	14.77ms
java.util decode	29ns	132ns	1284ns	1087µs	10.92ms	135.35ms
migBase64 decode	408ns	306ns	2.89µs	2383µs	24.11ms	286.59ms
migBase64 decodeFast	29ns	105ns	987ns	1052µs	10.72ms	125.53ms

GitHub - joadnacer/jdz: A Java Vector-API base64 library.

joadnacer/jdz

Files

README.md

Usage

Decoder/Encoder Instantiation

Fast vs Normal Decode

Vector Sizing

java.util Intrinsification

Benchmarks

Throughput

RFC4648

RFC4648 Encoding

RFC4648 Decoding

RFC2045

RFC2045 Encoding

RFC2045 Decoding

Vector Sizing

Unencoded Length Benchmarks

RFC4648

RFC4648 Encoding

RFC4648 Decoding

RFC2045 (MIME)

RFC2045 Encoding

RFC2045 Decoding

Recommend

Papermark - Open source DocSend alternative for secure document sharing | Produc...

Google's new use of generative AI could boost 'fuzzing,' a longtime cybersecurit...

Key Management Service: The Foundation of Cloud Security

Vision Pro的11个新细节披露，苹果画饼上瘾了…

Reshaping Productive Workflows Integrating AI Capabilities and UX | UX Collectiv...

VisualVM SQL profiler SQL

Disney VFX Workers File for Union Election

1Panel 推荐一款非常好用的Linux服务器管理面板

云与传统数据中心：哪个适合您？

英伟达力挺，这家「AI 算力黄牛」4 年估值 560 亿

About Joyk