Symptoms
Rancher fails to start. The Rancher logs show the cluster repeatedly reporting this error:
Waiting on etcd startup: status 503
This clearly points to etcd as the component blocking cluster startup, so the next step is to enter the rancher container and check etcd:
etcdctl check datascale
{"level":"warn","ts":"2022-12-26T06:56:07.062Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-315441f2-5ef2-4474-91b4-249484ee17de/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1s
FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 10000
etcdctl check perf
FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 8994
FAIL: Throughput too low: 1 writes/s
PASS: Slowest request took 0.000000s
PASS: Stddev is NaNs
FAIL
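Both checks fail with too many errors, and every error is "etcdserver: mvcc: database space exceeded".
Print the endpoint status: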
etcdctl endpoint status --write-out table
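The endpoint status shows that the etcd database has used up its storage space, which raises a space alarm and leaves etcd unable to accept writes.
Compact and defragment
The official etcd documentation gives the fix: compact the etcd revision history, then defragment the database to release the freed space.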
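To make the "space is full" diagnosis concrete, the database size can be compared against the backend quota. The following is only a sketch under stated assumptions: the JSON string is a hypothetical, heavily trimmed stand-in for what etcdctl endpoint status --write-out=json returns, the helper names extract_db_size and usage_percent are made up for illustration, and the quota is assumed to be etcd's default of 2 GiB (the etcd embedded in Rancher may be configured differently).

```shell
# Hypothetical, trimmed sample of `etcdctl endpoint status --write-out=json`.
# Real output has more fields; only dbSize matters for this sketch.
sample_status='[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":1234},"dbSize":2147483648}}]'

# Assumption: the default etcd backend quota of 2 GiB is in effect.
quota_bytes=$((2 * 1024 * 1024 * 1024))

# Pull the numeric dbSize value out of the JSON text (illustrative helper).
extract_db_size() {
  printf '%s\n' "$1" | grep -E -o '"dbSize":[0-9]+' | grep -E -o '[0-9]+'
}

# Integer percentage of the quota that is in use (illustrative helper).
usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

db_size=$(extract_db_size "$sample_status")
echo "dbSize=${db_size} bytes, $(usage_percent "$db_size" "$quota_bytes")% of quota"
```

Against a live member, the sample string would be replaced by the actual command output. With several endpoints the helper prints one size per line, so it would need a loop.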
# Use the v3 API
export ETCDCTL_API=3
# List active alarms
etcdctl --endpoints=http://127.0.0.1:2379 alarm list
# Alarm output
memberID:10276657743932975437 alarm:NOSPACE
# Get the current revision
rev=$(etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# Compact away all old revisions
etcdctl --endpoints=http://127.0.0.1:2379 compact $rev
# Defragment to release the freed space
etcdctl --endpoints=http://127.0.0.1:2379 defrag
# Disarm the alarm
etcdctl --endpoints=http://127.0.0.1:2379 alarm disarm
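The revision lookup in the steps above is the least obvious part, so here is a small sketch of how that grep pipeline behaves. It runs on a hypothetical, trimmed JSON string rather than real etcdctl output, and the function name extract_revision is only illustrative. It uses grep -E, which is equivalent to the egrep used above.

```shell
# Hypothetical, trimmed sample of `etcdctl endpoint status --write-out=json`.
sample_status='[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":1234,"raft_term":2},"dbSize":2147483648}}]'

# Same two-stage filter as the rev=... line above:
#   stage 1 keeps only the text  "revision":1234
#   stage 2 keeps everything from the first digit onward, i.e. 1234
extract_revision() {
  printf '%s\n' "$1" | grep -E -o '"revision":[0-9]*' | grep -E -o '[0-9].*'
}

rev=$(extract_revision "$sample_status")

# Only print the command here; the real compaction is the etcdctl call above.
echo "would run: etcdctl compact ${rev}"
```

If the status query fails, rev comes back empty, so checking that it is non-empty before calling compact is a reasonable safeguard.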
After compaction and defragmentation, the database size is as follows: