goquery Handle Non-UTF8 HTML Web Page
source link: http://siongui.github.io/2018/10/09/goquery-handle-non-utf8-html-webpage/
October 09, 2018
goquery handles only UTF-8 encoded web pages. The goquery wiki [1] provides a method for handling non-UTF-8 HTML pages when the character encoding (charset) of the page is known. The trick is to use iconv to convert the content to UTF-8 before passing it to goquery. I rewrote the code from the wiki to make it more modular.
Install goquery and the Go iconv binding:
$ go get -u github.com/PuerkitoBio/goquery
$ go get -u github.com/djimenez/iconv-go
Source code
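package main

import (
	"net/http"

	"github.com/PuerkitoBio/goquery"
	iconv "github.com/djimenez/iconv-go"
)

// NewDocumentFromNonUtf8Url fetches url, converts the response body from
// charset to UTF-8 with iconv, and parses the result with goquery.
func NewDocumentFromNonUtf8Url(url, charset string) (doc *goquery.Document, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	// Wrap the body in a reader that transcodes from charset to UTF-8.
	utfBody, err := iconv.NewReader(resp.Body, charset, "utf-8")
	if err != nil {
		return
	}

	doc, err = goquery.NewDocumentFromReader(utfBody)
	return
}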
Example: Read Big5 webpage
func main() {
	doc, err := NewDocumentFromNonUtf8Url("http://shenfang.com.tw/product.htm", "big5")
	if err != nil {
		panic(err)
	}
	// do something with the doc
	_ = doc // placeholder: Go rejects unused local variables at compile time
}
Tested on: Ubuntu Linux 18.04, Go 1.10.1.
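The function above needs the charset to be known in advance. When a server declares the charset in its `Content-Type` response header, one possible way to obtain it is sketched below with the standard `mime` package. `charsetFromContentType` is a hypothetical helper that is not in the original article, and the approach assumes the header is present and accurate, which is not always true for older pages.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper. It returns the lowercased
// charset parameter of a Content-Type header value, or "" if the header
// cannot be parsed or declares no charset.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=Big5")) // prints "big5"
	fmt.Println(charsetFromContentType("text/html") == "")         // prints "true"
}
```

With a real response, the input would be `resp.Header.Get("Content-Type")`, and a non-empty result could be passed as the `charset` argument of `NewDocumentFromNonUtf8Url`.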
References: