grpc
Notes on a production error log
Problem description
On a Saturday afternoon, during peak hours, an interface had to be whitelisted so it could be accessed without a token. I changed the interface's configuration in the database and then restarted the business gateway service, apigateway.
The Feishu alert group immediately started popping gRPC call timeouts:
rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded
And another message was more explicit:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.16.253.251:8081: i/o timeout"
My first reaction was that the restart had caused it.
Why these two errors? First, DNS updates in k8s take time to propagate; second, the original client options did not configure keepalive. Two problems:
- The gateway service had been restarted, but the downstream services still held the old TCP connections to it, which led to the timeouts.
- After the gateway restarted, DNS resolution had not yet been updated.
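For contrast, here is a hypothetical sketch of the "before" client (the original dial code isn't shown in this post, so the target and options are assumptions). With grpc.Dial's default passthrough resolver and no keepalive, the name is resolved once at dial time, so a restarted gateway leaves the client holding a dead connection until a request times out on it:

```go
package main

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Hypothetical "before" configuration: no dns:/// prefix, no keepalive.
    // grpc.Dial's default passthrough resolver resolves the name once;
    // if the apigateway Pods are replaced, this client keeps the dead
    // TCP connection and only notices when a request fails with
    // DeadlineExceeded / i/o timeout -- exactly the alerts above.
    conn, err := grpc.Dial("apigateway:8081",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        panic(err)
    }
    defer conn.Close()
}
```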
Solution
The gRPC client configuration:
```go
err := dispatcher.NewClient("dns:///apigateway:8081",
    ggrpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                10 * time.Second,
        Timeout:             2 * time.Second,
        PermitWithoutStream: true,
    }),
    ggrpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
if err != nil {
    panic(err)
}
```

The dns:/// prefix explicitly tells gRPC to use its built-in DNS resolver, which re-resolves periodically (rate-limited to once per 30 seconds by default). That way the client picks up the latest Endpoint list behind the Service instead of caching a stale IP and keeping an unusable connection.
Setting keepalive tears down connections that have already gone dead, avoiding the connection timeouts.
Setting the default load-balancing policy to round_robin keeps every request from hitting the same service node.
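dispatcher.NewClient and ggrpc are in-house wrappers (ggrpc presumably aliasing google.golang.org/grpc). For reference, a self-contained sketch of the same options using grpc-go directly; the insecure transport credentials are my assumption for plaintext in-cluster traffic:

```go
package main

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/keepalive"
)

func main() {
    conn, err := grpc.Dial("dns:///apigateway:8081",
        // Assumed: plaintext traffic inside the cluster.
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        // Ping every 10s even with no active streams; drop the
        // connection if the ping is not acked within 2s, so dead
        // backends are detected proactively instead of on request.
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:                10 * time.Second,
            Timeout:             2 * time.Second,
            PermitWithoutStream: true,
        }),
        // Spread requests across all resolved endpoints.
        grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
    )
    if err != nil {
        panic(err)
    }
    defer conn.Close()
}
```

Note that round_robin only spreads load when the resolver returns multiple addresses; DNS for a regular ClusterIP Service resolves to a single virtual IP, so true per-Pod balancing requires a headless Service.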
Then comes the Deployment configuration: set a lifecycle hook.
When a Pod is deleted or restarted, Kubernetes terminates the container with SIGTERM. If the container exits right away, clients (such as gRPC clients) may still be using the old connections → i/o timeout, failed requests. A preStop hook runs before the kubelet sends SIGTERM, so it can delay the shutdown and do things like:

- sleep 3, giving clients time to notice the Pod is going offline
- shut down the HTTP/gRPC server gracefully: reject new requests first, then finish the in-flight ones

This keeps the service from being killed instantly and improves request stability. (A Go sketch of the graceful-shutdown part follows the snippet below.)
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - sleep
        - '3'
```
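The sleep only buys time; during that grace window the server itself should drain. A minimal grpc-go sketch of the shutdown sequence described above (the :8000 listener matches the manifest below; service registration is omitted):

```go
package main

import (
    "net"
    "os"
    "os/signal"
    "syscall"

    "google.golang.org/grpc"
)

func main() {
    lis, err := net.Listen("tcp", ":8000")
    if err != nil {
        panic(err)
    }
    srv := grpc.NewServer()
    // register gRPC services on srv here ...

    go func() {
        if err := srv.Serve(lis); err != nil {
            panic(err)
        }
    }()

    // Block until the SIGTERM that kubelet sends after the preStop hook.
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
    <-stop

    // Stop accepting new RPCs, finish in-flight ones, then return.
    srv.GracefulStop()
}
```

For reference, the full Deployment and Service as deployed: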
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    analysis.crane.io/resource-recommendation: |
      containers:
      - containerName: account
        target:
          cpu: 125m
          memory: 125Mi
    app.gitlab.com/app: booster-server-raven
    app.gitlab.com/env: dev
    prometheus.io/port: '9090'
    prometheus.io/scrape: 'true'
  labels:
    app: account
    ref: dev
  name: account
  namespace: raven
  resourceVersion: '6066716413'
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: account
      ref: dev
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: ''
        qcloud-redeploy-timestamp: ''
      creationTimestamp: null
      labels:
        app: account
        pod-template-hash: 6f79b965c6
        ref: dev
    spec:
      containers:
        - env:
            - name: PATH
              value: '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
            - name: APOLLO_META
              value: 'http://cfg-apollo-configservice.apollo:8080'
            - name: APOLLO_ACCESS_KEY
              value: **************************
          image: '******************************'
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '3'
          livenessProbe:
            failureThreshold: 3
            grpc:
              port: 8000
              service: ''
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: account
          ports:
            - containerPort: 8000
              name: grpc
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            grpc:
              port: 8000
              service: ''
            periodSeconds: 20
            successThreshold: 1
            timeoutSeconds: 5
          resources: {}
          securityContext:
            privileged: false
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/localtime
              name: localtime
              readOnly: true
      dnsPolicy: ClusterFirst
      imagePullSecrets:
        - name: **********
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - hostPath:
            path: /etc/localtime
            type: File
          name: localtime
---
apiVersion: v1
kind: Service
metadata:
  annotations: {}
  labels:
    app: account
    ref: dev
  name: account
  namespace: raven
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: account
    ref: dev
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```