[Golang] Unmarshal UTF-16 XML

source link: https://siongui.github.io/2020/07/27/go-decode-utf-16-xml/

July 27, 2020

How to parse a UTF-16 XML file in Go.

The Go standard encoding/xml package provides basic methods for parsing XML files. However, it only decodes UTF-8 on its own; for any other declared encoding it relies on a user-supplied Decoder.CharsetReader. After a lot of searching and trial and error [1] [2] [3], I finally got my UTF-16 XML parsed correctly. I am not sure exactly why it works, so the code is shown here for reference.
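
To make the limitation concrete, here is a minimal standard-library sketch that is not part of the original post. When the XML declaration names an encoding other than UTF-8, the decoder calls Decoder.CharsetReader and stops with an error if that field is nil. The Note type, the sample document and the helper names are made up for illustration, and ISO-8859-1 is used only because an ASCII sample can be written inline.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Note is a hypothetical type used only for this illustration.
type Note struct {
	Text string `xml:"text,attr"`
}

// sample declares a non-UTF-8 encoding, but its bytes are plain ASCII,
// so they are also valid UTF-8.
const sample = `<?xml version="1.0" encoding="ISO-8859-1"?><note text="hello"/>`

// decodeNote parses sample with the given CharsetReader (which may be nil).
func decodeNote(cr func(label string, input io.Reader) (io.Reader, error)) (Note, error) {
	var n Note
	d := xml.NewDecoder(strings.NewReader(sample))
	d.CharsetReader = cr
	err := d.Decode(&n)
	return n, err
}

// passThrough returns the input unchanged. That is only correct when the
// bytes are already valid UTF-8, as they are for this ASCII sample.
func passThrough(label string, input io.Reader) (io.Reader, error) {
	return input, nil
}

func main() {
	// No CharsetReader: the declared encoding makes Decode return an error.
	_, err := decodeNote(nil)
	fmt.Println("without CharsetReader:", err)

	// With a CharsetReader installed, decoding proceeds.
	n, err := decodeNote(passThrough)
	fmt.Println("with CharsetReader:", n.Text, err)
}
```

A real UTF-16 file adds a second problem: the decoder cannot even read the XML declaration from raw UTF-16 bytes, so the input has to be converted to UTF-8 before it reaches xml.NewDecoder.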

import (
      "encoding/xml"
      "io"

      "golang.org/x/net/html/charset"
)

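// BypassReader is a CharsetReader that returns the input unchanged.
// It is only correct when the input has already been converted to UTF-8.
func BypassReader(label string, input io.Reader) (io.Reader, error) {
      return input, nil
}

// DecodeUtf16XML reads UTF-16 encoded XML from r and unmarshals it into v.
func DecodeUtf16XML(r io.Reader, v interface{}) (err error) {
      // https://www.tipitaka.org/romn/cscd/vin01m.mul.toc.xml
      // The Tipiṭaka XML is encoded in UTF-16
      // Google search: golang xml utf-16
      // https://stackoverflow.com/questions/6002619/unmarshal-an-iso-8859-1-xml-input-in-go
      // https://groups.google.com/forum/#!topic/golang-nuts/tXcECEKC2rs

      // Convert the input to UTF-8. charset.NewReader treats its second
      // argument as a content type and picks the encoding with
      // charset.DetermineEncoding, which also looks at the leading bytes.
      nr, err := charset.NewReader(r, "utf-16")
      if err != nil {
              return
      }
      decoder := xml.NewDecoder(nr)
      // If the XML declaration names a non-UTF-8 encoding (such as UTF-16),
      // the decoder calls CharsetReader. The bytes are already UTF-8 at
      // this point, so they are passed through unchanged.
      decoder.CharsetReader = BypassReader
      err = decoder.Decode(v)
      return
}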
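
A likely explanation for why this works is that it is a two-step process: first convert the bytes to UTF-8, then tell the XML decoder not to convert again. The sketch below illustrates the same idea using only the standard library, with a hand-written conversion in place of golang.org/x/net/html/charset. It is not the author's code. It assumes the input starts with a byte order mark, and the names utf16ToUTF8, decodeUTF16XML, passThrough and encodeUTF16LE are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"unicode/utf16"
)

// utf16ToUTF8 converts UTF-16 bytes to UTF-8.
// Assumption: the input starts with a byte order mark (FF FE or FE FF).
func utf16ToUTF8(b []byte) ([]byte, error) {
	if len(b) < 2 || len(b)%2 != 0 {
		return nil, errors.New("utf16ToUTF8: input too short or of odd length")
	}
	var order binary.ByteOrder
	switch {
	case b[0] == 0xFF && b[1] == 0xFE:
		order = binary.LittleEndian
	case b[0] == 0xFE && b[1] == 0xFF:
		order = binary.BigEndian
	default:
		return nil, errors.New("utf16ToUTF8: missing byte order mark")
	}
	b = b[2:] // drop the byte order mark
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = order.Uint16(b[2*i:])
	}
	// utf16.Decode combines surrogate pairs into single runes.
	return []byte(string(utf16.Decode(units))), nil
}

// passThrough is a CharsetReader that performs no conversion.
func passThrough(label string, input io.Reader) (io.Reader, error) {
	return input, nil
}

// decodeUTF16XML converts data to UTF-8 first, then unmarshals it into v.
func decodeUTF16XML(data []byte, v interface{}) error {
	u8, err := utf16ToUTF8(data)
	if err != nil {
		return err
	}
	d := xml.NewDecoder(bytes.NewReader(u8))
	// The declaration may still say encoding="UTF-16"; the bytes are
	// already UTF-8, so the decoder must not convert them again.
	d.CharsetReader = passThrough
	return d.Decode(v)
}

// encodeUTF16LE builds BOM-prefixed little-endian UTF-16 for the demo.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	doc := encodeUTF16LE(`<?xml version="1.0" encoding="UTF-16"?><tree text="Tipiṭaka"/>`)
	var t struct {
		Text string `xml:"text,attr"`
	}
	if err := decodeUTF16XML(doc, &t); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(t.Text)
}
```

This version reads the whole document into memory, which is fine for small files. The streaming reader returned by charset.NewReader avoids that.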

Usage:

import (
      "encoding/xml"
      "os"
      "testing"
)

type Tree struct {
      XMLName xml.Name `xml:"tree"`
      Trees   []Tree   `xml:"tree"`
      Text    string   `xml:"text,attr"`
      Src     string   `xml:"src,attr"`
      Action  string   `xml:"action,attr"`
}

func TestDecodeUtf16XML(t *testing.T) {
      // CheckDownload is not defined in this article; it is assumed to be a
      // helper that saves the URL to dst.
      dst := "/tmp/romn/cscd/vin01m.mul.toc.xml"
      err := CheckDownload("https://www.tipitaka.org/romn/cscd/vin01m.mul.toc.xml", dst, false)
      if err != nil {
              t.Fatal(err)
      }

      f16, err := os.Open(dst)
      if err != nil {
              t.Fatal(err)
      }
      defer f16.Close()

      tree := Tree{}
      if err = DecodeUtf16XML(f16, &tree); err != nil {
              t.Fatal(err)
      }
      t.Log(tree)
}
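
To see how the Tree struct tags map onto a nested document without downloading anything, here is a small self-contained sketch. The sample XML is invented for illustration and is not taken from the Tipiṭaka file; it is plain UTF-8, so xml.Unmarshal can read it directly. parseSample and countNodes are hypothetical helpers.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Tree mirrors the struct from the usage example: each <tree> element may
// contain child <tree> elements and carries text, src and action attributes.
type Tree struct {
	XMLName xml.Name `xml:"tree"`
	Trees   []Tree   `xml:"tree"`
	Text    string   `xml:"text,attr"`
	Src     string   `xml:"src,attr"`
	Action  string   `xml:"action,attr"`
}

// sampleTOC is a made-up table of contents, used only for this demo.
const sampleTOC = `<tree>
  <tree text="Book A" action="toc/a.xml">
    <tree text="Chapter 1" src="a1.xml"/>
  </tree>
  <tree text="Book B" src="b.xml"/>
</tree>`

// parseSample unmarshals sampleTOC into a Tree.
func parseSample() (Tree, error) {
	var t Tree
	err := xml.Unmarshal([]byte(sampleTOC), &t)
	return t, err
}

// countNodes returns the number of <tree> elements, including the root.
func countNodes(t Tree) int {
	n := 1
	for _, child := range t.Trees {
		n += countNodes(child)
	}
	return n
}

func main() {
	t, err := parseSample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("nodes:", countNodes(t))
	for _, child := range t.Trees {
		fmt.Println(child.Text, child.Src, child.Action)
	}
}
```

Attributes that are absent from an element, such as src on "Book A", are left as empty strings.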

Tested on: Ubuntu Linux 20.04, Go 1.12.17, 1.14.4.


References:

[4] decode utf-16 xml · siongui/gopalilib@370ba1f · GitHub
[5] [Golang] Auto-Detect and Convert Encoding of HTML to UTF-8

