etcd-operator Cluster Creation Produces Only a Single Instance

Recently, while testing etcd-operator on a newly built test cluster, I ran into a strange problem. The symptom: when creating a new etcd cluster, no matter what size is set, only one etcd instance, i.e. one pod, is ever created, which is far from what is expected. So where is the problem?

Symptoms

On a previously built Kubernetes 1.17 cluster, after installing etcd-operator, I tried to create a cluster, specifying repository to speed up image pulls:

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
  ## Adding this annotation make this cluster managed by clusterwide operators
  ## namespaced operators ignore it
  # annotations:
  #   etcd.database.coreos.com/scope: clusterwide
spec:
  size: 3
  version: "3.2.28"
  repository: "quay.azk8s.cn/coreos/etcd"
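
The manifest above can be applied as-is; the file name below is just an assumption for illustration:

$ kubectl apply -f example-etcd-cluster.yaml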

At this point, check the pods:

$ kubectl get pod |grep etcd-cluster
example-etcd-cluster-fzdvb6rt24 0/1 Init:0/1 0 3m31s

There are a few odd things here:

  1. With size=3, three instances should be started to form a cluster, yet there is only one
  2. That single pod stays in the Init state for a long time before it finally becomes ready (see the init-container check right after this list)
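
About point 2: as far as I know, etcd-operator injects a busybox init container that loops on nslookup until the pod's own peer DNS name resolves, so a pod stuck in Init:0/1 usually points at slow or broken cluster DNS. Assuming the init container keeps its default name (check-dns), its progress can be watched with:

$ kubectl logs example-etcd-cluster-fzdvb6rt24 -c check-dns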

Troubleshooting

First, confirm whether etcd itself is working properly:

$ kubectl port-forward pod/example-etcd-cluster-fzdvb6rt24 2379
$ ETCDCTL_API=3 etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | e2ac63a1ba0e5009 |  3.2.28 |   25 kB |      true |      false |         2 |          4 |                  0 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
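
As an extra check, the member list can be queried through the same port-forward; if the operator had ever managed to add more members, they would show up here:

$ ETCDCTL_API=3 etcdctl member list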

It looks like what started is indeed a single-node etcd. Check the pod's startup command:

$ kubectl describe pod/example-etcd-cluster-fzdvb6rt24

...
Command:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-fzdvb6rt24
--initial-advertise-peer-urls=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-fzdvb6rt24=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2380
--initial-cluster-state=new
--initial-cluster-token=b5de0ca8-b855-472a-96a6-87b2f406dc31
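
For comparison, when the operator does manage to grow the cluster it follows the standard etcd runtime-reconfiguration flow, so a successfully added second member would be started with flags roughly like the sketch below; the pod suffix abcdef1234 is made up, and only the shape of --initial-cluster and --initial-cluster-state=existing matters:

/usr/local/bin/etcd
--name=example-etcd-cluster-abcdef1234
--initial-cluster=example-etcd-cluster-fzdvb6rt24=http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2380,example-etcd-cluster-abcdef1234=http://example-etcd-cluster-abcdef1234.example-etcd-cluster.default.svc:2380
--initial-cluster-state=existing
...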

The flags also show that a single-node cluster was started. After checking the EtcdCluster settings over and over to confirm I had not made a mistake, I turned my attention to etcd-operator.

Looking at the etcd-operator logs, I found the following errors:

$ kubectl logs pod/etcd-operator-etcd-operator-etcd-operator-56dfc86b9d-x6gwc
...
time="2020-03-06T11:49:06Z" level=error msg="failed to reconcile: add one member failed: creating etcd client failed context deadline exceeded" cluster-name=example-etcd-cluster cluster-namespace=default pkg=cluster
time="2020-03-06T11:49:20Z" level=error msg="failed to update members: context deadline exceeded" cluster-name=example-etcd-cluster cluster-namespace=default pkg=cluster
time="2020-03-06T11:49:33Z" level=error msg="failed to update members: list members failed: creating etcd client failed: context deadline exceeded" cluster-name=example-etcd-cluster cluster-namespace=default pkg=cluster

My guess at the startup logic: first create a single seed node, and after it finishes initializing, start new nodes and add them to the cluster one by one. The current state is that the first node was created and then the process got stuck. Following the timeline, start with the first error, failed to reconcile: add one member failed. Open the source code and search for this error message to see its context:

cfg := clientv3.Config{
    Endpoints:   c.members.ClientURLs(),
    DialTimeout: constants.DefaultDialTimeout,
    TLS:         c.tlsConfig,
}
etcdcli, err := clientv3.New(cfg)
if err != nil {
    return fmt.Errorf("add one member failed: creating etcd client failed %v", err)
}
defer etcdcli.Close()

The logic here is simple: creating the etcd client fails with a connection timeout. I suspected something was wrong with the Endpoints being dialed; reading the code, the endpoints are DNS names assembled according to a fixed naming convention (the per-pod <pod-name>.<cluster-name>.<namespace>.svc addresses seen above).
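
Since those endpoints are cluster DNS names, a quick way to see whether they even resolve inside the cluster is a throwaway busybox pod, following the usual Kubernetes DNS-debugging recipe; if this hangs or fails, the operator's dials will hit the same context deadline exceeded:

$ kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
    nslookup example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc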

I tried to add some logging and rebuild the operator, but found that it could not be built by following the documentation: the builder container referenced in the docs has already been deleted.
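
With the documented builder image gone, one possible workaround is to build the operator binary directly with the Go toolchain; this is only a sketch, assuming the usual cmd/operator entry point and that the repo's vendored dependencies still build locally:

$ git clone https://github.com/coreos/etcd-operator.git
$ cd etcd-operator
$ go build -o bin/etcd-operator ./cmd/operator

The binary then still has to be packaged into an image and the operator Deployment pointed at it, which is exactly where the missing build tooling hurts.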

To be continued

Looking over etcd-operator's commits and issues, it is by now effectively a dead project, with PRs left unattended, which is not encouraging.

In the end, it turned out that the network of the test virtual machines was the problem.
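
For the record, one quick check that would have exposed this earlier is hitting the etcd pod's client URL from another pod; the busybox image and host name are the ones already used above, and /health is etcd's standard health endpoint. A hang or timeout here points at the node network rather than at the operator:

$ kubectl run -it --rm net-test --image=busybox:1.28 --restart=Never -- \
    wget -qO- http://example-etcd-cluster-fzdvb6rt24.example-etcd-cluster.default.svc:2379/health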