Recently, while testing etcd-operator on a newly built test cluster, I ran into a strange problem. The symptom: no matter what size was set when creating an etcd cluster, only one etcd instance, i.e. a single pod, was ever produced, which is far from what was expected. Where was the problem?
Symptoms
On a previously built Kubernetes 1.17 cluster, after installing etcd-operator, I tried to create a cluster, specifying `repository` here to speed up image pulls:
```yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.28"
  repository: "quay.azk8s.cn/coreos/etcd"
```
Checking the pods at this point:
```shell
$ kubectl get pod | grep etcd-cluster
example-etcd-cluster-fzdvb6rt24   0/1   Init:0/1   0   3m31s
```
There are a few odd things here:

- With size=3, three instances should be started and form a cluster, yet there is only one
- That one pod stayed in the Init state for a long time before finally becoming ready
Investigation
First, confirm whether etcd itself is healthy:
```shell
$ kubectl port-forward pod/example-etcd-cluster-fzdvb6rt24 2379
$ ETCDCTL_API=3 etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | e2ac63a1ba0e5009 | 3.2.28  | 25 kB   | true      | false      |         2 |          4 |                  0 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
It looks like what started is just a single-node etcd. Check the pod's startup command:
```shell
$ kubectl describe pod/example-etcd-cluster-fzdvb6rt24
...
    Command:
      /usr/local/bin/etcd
      --data-dir=/var/etcd/data
      --name=example-etcd-cluster-fzdvb6rt24
      --initial-advertise-peer-urls=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2380
      --listen-peer-urls=http://0.0.0.0:2380
      --listen-client-urls=http://0.0.0.0:2379
      --advertise-client-urls=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2379
      --initial-cluster=example-etcd-cluster-fzdvb6rt24=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2380
      --initial-cluster-state=new
      --initial-cluster-token=b5de0ca8-b855-472a-96a6-87b2f406dc31
```
The flags also show that a single-node cluster was started. After repeatedly checking the CRD spec and confirming it was correct, I turned my attention to etcd-operator.
Looking at the etcd-operator logs, I found the following errors:
```shell
$ kubectl logs pod/etcd-operator-etcd-operator-etcd-operator-56dfc86b9d-x6gwc
...
time="2020-03-06T11:49:06Z" level=error msg="failed to reconcile: add one member failed: creating etcd client failed context deadline exceeded" cluster-name=example-etcd-cluster cluster-namespace=default pkg=cluster
time="2020-03-06T11:49:20Z" level=error msg="failed to update members: context deadline exceeded" cluster-name=example-etcd-cluster cluster-namespace=default pkg=cluster
time="2020-03-06T11:49:33Z" level=error msg="failed to update members: list members failed: creating etcd client failed: context deadline exceeded" cluster-name=example-etcd-cluster cluster-namespace=default pkg=cluster
```
My guess at the startup logic: the operator first creates a single-node cluster, waits for it to finish initializing, then starts new nodes and adds them to it one by one. The current state is that the first node was created and then everything got stuck. Following the timestamps, let's look at the first error, `failed to reconcile: add one member failed`. Searching the source for this error message gives the following context:
```go
cfg := clientv3.Config{
	Endpoints:   c.members.ClientURLs(),
	DialTimeout: constants.DefaultDialTimeout,
	TLS:         c.tlsConfig,
}
etcdcli, err := clientv3.New(cfg)
if err != nil {
	return fmt.Errorf("add one member failed: creating etcd client failed %v", err)
}
defer etcdcli.Close()
```
The logic here is simple: creating the etcd client failed with a connection timeout, so I suspected something wrong with the Endpoints being dialed. Reading the code, the endpoints are DNS names assembled according to a fixed rule.
I tried to add some logging and rebuild, but found the project cannot be built by following the docs: the build container image referenced there has been deleted.
To be continued.
Skimming etcd-operator's commits and issues, it is by all appearances a dead project: PRs go unmanaged. Not optimistic.
Pitfall
In the end, the cause turned out to be a network problem in the test VMs.