3

一个Go升级引起的血案

 3 years ago
source link: https://ieevee.com/tech/2021/04/21/no-cached-connection.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

一个Go升级引起的血案

by 伊布

直接现象:使用Go 1.13编译kube-controller-manager,运行一段时间后,发现controller不工作,查看日志,发现打印“http2: no cached connection was available”。

I0328 09:48:59.925056       1 round_trippers.go:383] GET https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager
I0328 09:48:59.925085       1 round_trippers.go:390] Request Headers:
I0328 09:48:59.925094       1 round_trippers.go:393]     User-Agent: kube-controller-manager/v1.11.1 (linux/amd64) kubernetes/b1b2997/leader-election
I0328 09:48:59.925102       1 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf, */*
I0328 09:48:59.925148       1 round_trippers.go:408] Response Status:  in 0 milliseconds
E0328 09:48:59.925199       1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: http2: no cached connection was available
I0328 09:48:59.925216       1 leaderelection.go:190] failed to acquire lease kube-system/kube-controller-manager

使用1.13会有上面的现象。使用1.12版本没有问题。k8s版本是1.11。

从go代码查询“no cached connection was available”,可以看到是http2noCachedConnError错误。

// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type http2noCachedConnError struct{}

func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}

func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}

var http2ErrNoCachedConn error = http2noCachedConnError{}

由于问题必现,所以不妨使用delve跟踪。

复现方法:配置kube-controller-manage连接1个API Server,当kube-controller-manag正常运行时,将API Server关闭,然后重新启动API Server,之后去查看kube-controller-manager的日志,会打印”http2: no cached connection was available”。

从代码上分析,http2ErrNoCachedConn 只在 src/net/http/h2_bundle.go 中会返回,那么判断可能是在函数getClientConn出错了。来看下Go源码的实现。

func (p *http2clientConnPool) getClientConn(req *Request, addr string, dialOnMiss bool) (*http2ClientConn, error) {
	if http2isConnectionCloseRequest(req) && dialOnMiss {
		// It gets its own connection.
		http2traceGetConn(req, addr)
		const singleUse = true
		cc, err := p.t.dialClientConn(addr, singleUse)
		if err != nil {
			return nil, err
		}
		return cc, nil
	}
	p.mu.Lock()
	for _, cc := range p.conns[addr] {
		if st := cc.idleState(); st.canTakeNewRequest {
			if p.shouldTraceGetConn(st) {
				http2traceGetConn(req, addr)
			}
			p.mu.Unlock()
			return cc, nil
		}
	}
	if !dialOnMiss {
		p.mu.Unlock()
		return nil, http2ErrNoCachedConn
	}

dlv attach pid后,尝试对getClientConn打断点,发现有2个getClientConn,一个在上面看到的在Go源码中,而另一个是在k8s源码的vendor目录下。那么,实际运行中,会调用到哪个呢?没关系,都打上断点。

(dlv) b getClientConn
Command failed: Location "getClientConn" ambiguous: net/http.(*http2clientConnPool).getClientConn, k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn…
(dlv) b net/http.(*http2clientConnPool).getClientConn
Breakpoint 3 (enabled) set at 0x7b6b93 for net/http.(*http2clientConnPool).getClientConn() .usr/local/go/src/net/http/h2_bundle.go:749
(dlv) b k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
Breakpoint 4 (enabled) set at 0x841a03 for k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn() .export/hubt/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55

继续执行,会进入断点,可以看到,调用的是k8s源码vendor目录下的getClientConn

 0  0x0000000000851ff3 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x000000000085355e in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x000000000086d985 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:313
 4  0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/configure_transport.go:75
 5  0x00000000008161aa in net/http.(*Transport).roundTrip
    at .usr/local/go/src/net/http/transport.go:486
 6  0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
    at .usr/local/go/src/net/http/roundtrip.go:17
 7  0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
 8  0x00000000007bfafa in net/http.send
    at .usr/local/go/src/net/http/client.go:250
 9  0x00000000007bf51a in net/http.(*Client).send

来看看它的实现。

var ErrNoCachedConn = errors.New("http2: no cached connection was available")

func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
	if isConnectionCloseRequest(req) && dialOnMiss {
		// It gets its own connection.
		const singleUse = true
		cc, err := p.t.dialClientConn(addr, singleUse)
		if err != nil {
			return nil, err
		}
		return cc, nil
	}
	p.mu.Lock()
	for _, cc := range p.conns[addr] {
		if cc.CanTakeNewRequest() {
			p.mu.Unlock()
			return cc, nil
		}
	}
	if !dialOnMiss {
		p.mu.Unlock()
		return nil, ErrNoCachedConn
	}
	call := p.getStartDialLocked(addr)
	p.mu.Unlock()
	<-call.done
	return call.res, call.err
}

可以看到,这里会返回 ErrNoCachedConn(从堆栈上看,rt为noDialH2RoundTripper,所以dialOnMiss值为false),而ErrNoCachedConn的值,的确是kube-controller-manager打印的字符串。

从前面的堆栈看,错误码会一路返回到 configure_transport.goRoundTrip。来看下这块代码。

func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	res, err := rt.t.RoundTrip(req)
	if err == ErrNoCachedConn {
		return nil, http.ErrSkipAltProtocol
	}
	return res, err
}

从这里可以看到,如果返回的错误码是 noDialH2RoundTripper,会把它转为http.ErrSkipAltProtocol再返回。错误不是这里打印的。

继续往回看堆栈,在Go源码的roundTrip中,针对http.ErrSkipAltProtocol的确有特殊处理,针对这种错误不会直接返回报错。

继续单步跟踪下去,可以看到,在pconn的roundTrip中返回了ErrNoCachedConn,但在http2isNoCachedConnError判断时,认为不是同一个错误,没有执行t.removeIdleConn(pconn),并进入了 pconn.shouldRetryRequest,并最终返回了ErrNoCachedConn

// roundTrip implements a RoundTripper over HTTP.
func (t *Transport) roundTrip(req *Request) (*Response, error) {
...
	if t.useRegisteredProtocol(req) {
		altProto, _ := t.altProto.Load().(map[string]RoundTripper)
		if altRT := altProto[scheme]; altRT != nil {
			if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
				return resp, err
			}
		}
	}
...
	for {
...
      pconn, err := t.getConn(treq, cm)
		if err != nil {
			t.setReqCanceler(req, nil)
			req.closeBody()
			return nil, err
		}

		var resp *Response
		if pconn.alt != nil {
			// HTTP/2 path.
			t.setReqCanceler(req, nil) // not cancelable with CancelRequest
			resp, err = pconn.alt.RoundTrip(req)
		} else {
			resp, err = pconn.roundTrip(treq)
		}
		if err == nil {
			return resp, nil
		}
		if http2isNoCachedConnError(err) {
			t.removeIdleConn(pconn)
		} else if !pconn.shouldRetryRequest(req, err) {
			// Issue 16465: return underlying net.Conn.Read error from peek,
			// as we've historically done.
			if e, ok := err.(transportReadFromServerError); ok {
				err = e.err
			}
			return nil, err
		}

给shouldRetryRequest也打上断点(同样的,它也出现了2处)。

b net/http.(*persistConn).shouldRetryRequest
b k8s.io/kubernetes/vendor/golang.org/x/net/http2.shouldRetryRequest

在复现时查看到堆栈如下,打印下入参 err ,确实是ErrNoCachedConn

 0  0x0000000000816413 in net/http.(*persistConn).shouldRetryRequest
    at .usr/local/go/src/net/http/transport.go:569
 1  0x0000000000815ce9 in net/http.(*Transport).roundTrip
    at .usr/local/go/src/net/http/transport.go:543
 2  0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
    at .usr/local/go/src/net/http/roundtrip.go:17
 3  0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
 4  0x00000000007bfafa in net/http.send
    at .usr/local/go/src/net/http/client.go:250
 5  0x00000000007bf51a in net/http.(*Client).send
(dlv) p err
error(*errors.errorString) *{
	s: "http2: no cached connection was available",}

我们来看下shouldRetryRequest的实现。

func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
	if http2isNoCachedConnError(err) {
		// Issue 16582: if the user started a bunch of
		// requests at once, they can all pick the same conn
		// and violate the server's max concurrent streams.
		// Instead, match the HTTP/1 behavior for now and dial
		// again to get a new TCP connection, rather than failing
		// this request.
		return true
	}
...
	return false // conservatively

可以看到,如果 no cached conn,是会返回true,即表示应该会retry。但单独跟踪下去发现并没有retry,这样从前面的代码分析,会返回错误码到 http.send,也就是打印出来的日志。

所以,可以看到,问题是出在 http2isNoCachedConnError。前面有贴它的实现,再贴一下。

func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}

写个小代码检查下,ErrNoCachedConn 是否满足 http2isNoCachedConnError(https://play.golang.org/p/fB8ilA6-VPZ)。

package main

import (
	"errors"
	"fmt"
)

type http2noCachedConnError struct{}

func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}

func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}
func main(){
	var ErrNoCachedConn = errors.New("http2: no cached connection was available")
	res := http2isNoCachedConnError(ErrNoCachedConn)
	fmt.Printf("result 1 %v\n", res)

	var err = http2noCachedConnError{}
	res = http2isNoCachedConnError(err)
	fmt.Printf("result 2 %v\n", res)

	if ErrNoCachedConn.Error() == err.Error() {
		fmt.Printf("same result!")
	}
}

可以看到,虽然他们的值是相同的,但 ErrNoCachedConn 并不符合 http2isNoCachedConnError 的要求,这样也就导致了上面一直报错,没有发起重连。

当然这个问题解决很简单,不要用Go 1.13版本编译,用Go 1.12版本就不会问题。 1.12版本的重连逻辑与1.13不一样,如何不同,有机会再分析。

当然,也可以修改k8s源码里vendor的实现,统一两边错误码的定义,这样可以使用后续版本的Go,不过最好不要改vendor。

事实上,Go在后续版本的实现里,也是这样做的,具体可以看这两个patch:

不过,这个patch考虑不是很周到,要是这样改,向前兼容会更好一些:

func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	if !ok {
		if err.Error() == http2ErrNoCachedConn.Error() {
			return true
		}
	}
	return ok
}

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK