2

Truncating UTF String to the given number of bytes while preserving its validity...

 2 years ago
source link: https://blog.jakubholy.net/2007/11/02/truncating-utf-string-to-the-given/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]

November 2, 2007
Often you need to insert a String from Java into a database column with a fixed length specified in bytes. Using
string.substring(0, DB_FIELD_LENGTH);
isn't enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.

Approach 1: Replace the invalid trailing byte(s) with a 'rectangle'

This is as simple as:
int maxLen = DB_FIELD_LENGTH-2;
string = new String( string.getBytes("UTF-8") , 0, maxLen, "UTF-8");
The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 - therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn't valid and this 1 byte was replaced by \uFFFD (3B).

Approach 2: Skip the invalid trailing byte(s) altogether

If you don't want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:
import java.nio.*; import java.nio.charset.*;
Charset utf8Charset = Charset.forName("UTF-8");
CharsetDecoder cd = utf8Charset.newDecoder();
byte[] sba = string.getBytes("UTF-8");
// Ensure truncating by having byte buffer = DB_FIELD_LENGTH
ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // len in [B]
CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B]
// Ignore an incomplete character
cd.onMalformedInput(CodingErrorAction.IGNORE)
cd.decode(bb, cb, true);
cd.flush(cb);
string = new String(cb.array(), 0, cb.position());
The string will end with the last valid character in the given range.

Approach 3: Manually remove the invalid trailing bytes

As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I'd be glad for the code :-)
Tags: java

Are you benefitting from my writing? Consider buying me a coffee or supporting my work via GitHub Sponsors. Thank you! You can also book me for a mentoring / pair-programming session via Codementor or (cheaper) email.

Allow me to write to you!

Let's get in touch! I will occasionally send you a short email with a few links to interesting stuff I found and with summaries of my new blog posts. Max 1-2 emails per month. I read and answer to all replies.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK