[Golang] Determine Encoding of HTML Document

source link: https://siongui.github.io/2018/10/26/determine-encoding-of-html-document-in-go/

October 26, 2018

Given a URL, determine the encoding of an HTML document in Go using the golang.org/x/net/html and golang.org/x/text packages. I came across the code snippet in [1], so I extracted and reorganized it here to make it easier to find through search engines.

Install the packages first:

$ go get -u golang.org/x/text
$ go get -u golang.org/x/net/html

The following code shows how to determine the encoding of an HTML document given the URL:

url.go
package guess

import (
	"bufio"
	"fmt"
	"io"
	"net/http"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
)

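// UrlEncoding fetches url and guesses the encoding of the returned HTML
// document. certain is true only when the encoding was determined
// definitively (e.g. from a byte order mark) rather than heuristically.
func UrlEncoding(url string) (name string, certain bool, err error) {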
	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		err = fmt.Errorf("response status code: %d", resp.StatusCode)
		return
	}

	_, name, certain, err = DetermineEncodingFromReader(resp.Body)
	return
}

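// DetermineEncodingFromReader peeks at up to the first 1024 bytes of r and
// guesses the encoding of the HTML document from them.
func DetermineEncodingFromReader(r io.Reader) (e encoding.Encoding, name string, certain bool, err error) {
	// Peek returns io.EOF when the document is shorter than 1024 bytes.
	// The bytes it did read are still usable, so that case is not an error.
	preview, err := bufio.NewReader(r).Peek(1024)
	if err != nil && err != io.EOF {
		return
	}
	err = nil

	e, name, certain = charset.DetermineEncoding(preview, "")
	return
}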

Usage of the above code:

url_test.go
package guess

import (
	"testing"
)

func TestUrlEncoding(t *testing.T) {
	name, _, err := UrlEncoding("http://shenfang.com.tw/")
	if err != nil {
		t.Error(err)
		return
	}
	if name != "big5" {
		t.Error("bad guess!")
		return
	}

	name, _, err = UrlEncoding("https://siongui.github.io/")
	if err != nil {
		t.Error(err)
		return
	}
	if name != "utf-8" {
		t.Error("bad guess!")
		return
	}
}
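
UrlEncoding passes an empty string as the second argument of charset.DetermineEncoding, so only the document bytes are examined. That argument is the Content-Type value, in which servers often declare the charset. As a minimal, standard-library-only sketch of how a declared charset can be read from such a value (the helper name declaredCharset is hypothetical and not part of the original code):

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// declaredCharset is a hypothetical helper: it returns the lower-cased
// charset parameter of a Content-Type value, or "" if no charset is
// declared or the value cannot be parsed.
func declaredCharset(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(declaredCharset("text/html; charset=Big5")) // prints "big5"
	fmt.Println(declaredCharset("text/html") == "")         // prints "true"
}
```

If the server's declaration should be taken into account, one option is to pass resp.Header.Get("Content-Type") to charset.DetermineEncoding in place of the empty string.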

To convert non-UTF-8 HTML to UTF-8, see [3].


Tested on: Ubuntu 18.04, Go 1.11.1


References:

[2] golang: a small crawler written with /x/net/html to scrape novels - 简书 (Jianshu)

