Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]

November 2, 2007

Often you need to insert a String from Java into a database column with a fixed length specified in bytes. Using

string.substring(0, DB_FIELD_LENGTH);

isn't enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.

Approach 1: Replace the invalid trailing byte(s) with a 'rectangle'

This is as simple as:

int maxLen = DB_FIELD_LENGTH-2;
string = new String( string.getBytes("UTF-8") , 0, maxLen, "UTF-8");

The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 - therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn't valid and this 1 byte was replaced by \uFFFD (3B).

Approach 2: Skip the invalid trailing byte(s) altogether

If you don't want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:

import java.nio.*; import java.nio.charset.*;
Charset utf8Charset = Charset.forName("UTF-8");
CharsetDecoder cd = utf8Charset.newDecoder();
byte[] sba = string.getBytes("UTF-8");
// Ensure truncating by having byte buffer = DB_FIELD_LENGTH
ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // len in [B]
CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B]
// Ignore an incomplete character
cd.onMalformedInput(CodingErrorAction.IGNORE)
cd.decode(bb, cb, true);
cd.flush(cb);
string = new String(cb.array(), 0, cb.position());

The string will end with the last valid character in the given range.

Approach 3: Manually remove the invalid trailing bytes

As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I'd be glad for the code :-)

Tags: java

Are you benefitting from my writing? Consider buying me a coffee or supporting my work via GitHub Sponsors. Thank you! You can also book me for a mentoring / pair-programming session via Codementor or (cheaper) email.

Allow me to write to you!

Let's get in touch! I will occasionally send you a short email with a few links to interesting stuff I found and with summaries of my new blog posts. Max 1-2 emails per month. I read and answer to all replies.

Truncating UTF String to the given number of bytes while preserving its validity...

Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]

Approach 1: Replace the invalid trailing byte(s) with a 'rectangle'

Approach 2: Skip the invalid trailing byte(s) altogether

Approach 3: Manually remove the invalid trailing bytes

Allow me to write to you!

Recommend

Fulcro Troubleshooting Decision Tree

The Ultimate Guide to FinOps - Part 4

Tip: Multiple webservice implementation classes available at the same time under...

How I managed to deploy a JSF/Seam portlet to JBoss after all

Book review: Refactoring by Martin Fowler

Amazon Workers Vote to Unionize in Stunning Win for Organized Labor

教堂与集市的有机结合，昇思MindSpore的发展之路_AI_InfoQ编辑部_InfoQ精选文章

清华博士提出「Chiplet精算师」登顶会！越接近摩尔极限，多芯片集成越划算

巴西宠物经济崛起！这几个细分市场也值得关注！

iQOO Neo6 will be unveiled on April 13

About Joyk