XML document read as Latin1 but half converted to UTF-8

I'm hitting my head off a brick wall with a bizarre problem that I know there will be an obvious answer to, but I can't see it for the life of me. It's all to do with encoding. Before the code, a simple description: I want to take in an XML document which is Latin1 (ISO-8859-1) encoded, and then send the thing completely unchanged over an HttpURLConnection.

I have a small test class and the raw XML which shows my problem. The XML file contains the Latin1 byte 0xa2 (a cent character), which on its own is invalid UTF-8; I'm deliberately using this as my test case. The XML declaration says ISO-8859-1. I can read it in no bother, but when I convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 byte gets converted to the UTF-8 encoded cent character (0xc2 0xa2), while the declaration stays as ISO-8859-1. In other words, one byte becomes two, and anything that trusts the declaration will read them as two characters. Totally wrong.
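
To make the byte-level symptom concrete, here is a minimal sketch (the class and method names are made up for illustration, they are not part of my test class) that decodes the single Latin1 byte 0xa2 into a Java String and re-encodes it as UTF-8. It shows the same one-byte-to-two-bytes expansion I see in the output file:

```java
import java.nio.charset.StandardCharsets;

public class CentBytes {
    // Formats a byte array as space-separated two-digit hex values.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decodes bytes as ISO-8859-1, then re-encodes the text as UTF-8.
    public static byte[] latin1ToUtf8(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xa2 };               // cent sign in Latin1
        System.out.println(hex(latin1ToUtf8(latin1))); // prints "c2 a2"
    }
}
```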

The code that reproduces the problem:

// Parse the input file; the parser honours the encoding in the XML declaration.
FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );
input.close();

// Serialize the DOM to bytes. No output encoding is set here.
Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();

// Dump the bytes to a file for inspection.
FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
fos.close();

I'm just writing it to a file at the moment while I figure out what on earth is converting this character. The input-file has 0xa2, the output-file contains 0xc2 0xa2. One way to fix this is to put this line in the second-to-last block, before the call to transform():

transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");

However, not all XML documents that I'll be dealing with will be Latin1; most, indeed, will be UTF-8 when they come in. I'm assuming I shouldn't have to work out the encoding myself just to feed it to the transformer? I mean, surely it should be working this out for itself, and I'm just doing something else wrong?

A thought had occurred to me that I could just query the document to find out the encoding and thus the extra line could just do the trick:

transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());

However, I then determined that this wasn't the answer, as document.getInputEncoding() returns a different String when I run it in a terminal on the Linux box than when I run it within Eclipse on my Mac.

Any hints would be appreciated. I fully accept I'm missing out on something obvious.

Yes, by default the Transformer writes XML documents as UTF-8, so you need to explicitly tell it to use a different encoding. Your last edit is almost the "trick"; use getXmlEncoding(), which returns the encoding declared in the XML declaration, so that the output always matches the input XML encoding:

transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());

The only question is, do you really need to maintain the input encoding?
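
If you do need to keep it, here is a rough end-to-end sketch of the idea rather than a definitive implementation. The class name, the helper methods and the sample document are invented for illustration. One assumption worth calling out: getXmlEncoding() returns null when the document has no encoding declaration, so the sketch falls back to UTF-8 in that case.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class PreserveEncoding {
    // Parses the XML bytes and serializes them again in the declared encoding.
    public static byte[] roundTrip(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document document =
                factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        // Encoding from the XML declaration; null if none was declared.
        String encoding = document.getXmlEncoding();
        if (encoding == null) {
            encoding = "UTF-8"; // assumption: fall back to the XML default
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toByteArray();
    }

    // Hypothetical sample: a Latin1 document containing the cent sign (0xa2).
    public static byte[] sample() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>5\u00a2</price>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns true if the needle byte sequence occurs anywhere in the haystack.
    public static boolean contains(byte[] haystack, byte... needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length && match; j++) {
                match = haystack[i + j] == needle[j];
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        byte[] out = roundTrip(sample());
        System.out.println("UTF-8 cent sequence present: "
                + contains(out, (byte) 0xc2, (byte) 0xa2));
    }
}
```

Reading the encoding from the declaration keeps the bytes and the declaration in agreement, which is what the receiver at the other end of the HttpURLConnection will rely on.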
