goquery Handle Non-UTF8 HTML Web Page

source link: http://siongui.github.io/2018/10/09/goquery-handle-non-utf8-html-webpage/

October 09, 2018

goquery handles only UTF-8 encoded web pages. The goquery wiki [1] describes a way to handle non-UTF-8 HTML pages when the character encoding (charset) of the page is known: use iconv to convert the content to UTF-8 before passing it to goquery. I rewrote the code from the wiki to make it more modular.
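
To illustrate why the conversion step matters, here is a minimal sketch (standard library only, not taken from the wiki) that checks whether a byte sequence is valid UTF-8. The helper name `needsConversion` is hypothetical, and the sample bytes are the Big5 encoding of "中文", chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsConversion is a hypothetical helper: it reports whether b is not
// valid UTF-8 and therefore has to be converted before parsing.
func needsConversion(b []byte) bool {
	return !utf8.Valid(b)
}

func main() {
	big5 := []byte{0xA4, 0xA4, 0xA4, 0xE5} // "中文" encoded as Big5
	utf := []byte("中文")                    // the same text as UTF-8

	fmt.Println(needsConversion(big5)) // true
	fmt.Println(needsConversion(utf))  // false
}
```

Note that this check only works in one direction: invalid UTF-8 means the page certainly needs conversion, but a page that happens to contain only ASCII bytes passes the check whatever its declared charset is.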

Install goquery and the Go iconv binding:

$ go get -u github.com/PuerkitoBio/goquery
$ go get -u github.com/djimenez/iconv-go

Source code

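package main

import (
      "net/http"

      "github.com/PuerkitoBio/goquery"
      iconv "github.com/djimenez/iconv-go"
)

// NewDocumentFromNonUtf8Url fetches url, converts the response body from
// charset to UTF-8 with iconv, and parses the result with goquery.
func NewDocumentFromNonUtf8Url(url, charset string) (doc *goquery.Document, err error) {
      resp, err := http.Get(url)
      if err != nil {
              return
      }
      defer resp.Body.Close()

      // Wrap the body in a reader that converts charset to UTF-8 on the fly.
      utfBody, err := iconv.NewReader(resp.Body, charset, "utf-8")
      if err != nil {
              return
      }

      // goquery only ever sees UTF-8 input.
      doc, err = goquery.NewDocumentFromReader(utfBody)
      return
}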

Example: Read Big5 webpage

func main() {
      doc, err := NewDocumentFromNonUtf8Url("http://shenfang.com.tw/product.htm", "big5")
      if err != nil {
              panic(err)
      }

      // do something with the doc
      _ = doc // placeholder so the example compiles until doc is actually used
}
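
The example above hard-codes the charset. When the charset is not known in advance, one possible approach is to read it from the HTTP Content-Type header. The sketch below uses only the standard library; the helper name `charsetFromContentType` is hypothetical and not part of goquery or iconv-go.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper that extracts the charset
// parameter from a Content-Type header value. It returns "" when the header
// is malformed or does not declare a charset.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=Big5")) // big5
	fmt.Printf("%q\n", charsetFromContentType("text/html"))        // ""
}
```

Inside NewDocumentFromNonUtf8Url, the header value would come from resp.Header.Get("Content-Type"), with the caller-supplied charset as a fallback when the result is empty. Many pages declare their charset only in a meta tag, which this sketch does not cover.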

Tested on: Ubuntu Linux 18.04, Go 1.10.1.


References:

