Learnitweb

What Happens if etcd Fails in Kubernetes?

In Kubernetes, etcd is a distributed key-value store that acts as the brain of the cluster. It stores all cluster data—including node information, Pod definitions, ConfigMaps, Secrets, and more. If etcd fails, the entire Kubernetes control plane is affected.

1. What is etcd?

etcd is a high-availability, strongly consistent, distributed key-value store built on the Raft consensus algorithm. Kubernetes uses etcd to persist:

  • Cluster state (nodes, pods, namespaces)
  • Workload specifications (deployments, services)
  • Configuration (ConfigMaps, Secrets)
  • Role-based access controls (RBAC)
  • Events and service discovery

Without etcd, the control plane cannot function.
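If you have direct access to an etcd member, you can see this data yourself: Kubernetes persists its objects under the /registry prefix. A hedged sketch, assuming a kubeadm-provisioned control plane (the certificate paths below are kubeadm defaults and may differ in your environment):

```shell
# List a sample of the keys Kubernetes stores in etcd.
# Certificate paths assume a kubeadm-provisioned control plane.
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only --limit=10 \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Expect keys such as /registry/pods/..., /registry/secrets/...,
# and /registry/configmaps/...
```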

2. Which Kubernetes Components Rely on etcd?

  • kube-apiserver: Reads from and writes to etcd
  • kube-controller-manager: Watches etcd for changes
  • kube-scheduler: Needs etcd indirectly via the API server
  • kubectl: Communicates through the API server, which depends on etcd

Worker nodes and running Pods can continue running temporarily without etcd—but new scheduling or configuration changes will fail.
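Note that the API server is the only component that talks to etcd directly; everything else goes through it. On a kubeadm cluster you can inspect how the API server is wired to etcd in its static pod manifest (the manifest path is the kubeadm default, assumed here):

```shell
# Show the flags that point the API server at etcd.
# /etc/kubernetes/manifests/kube-apiserver.yaml is the kubeadm default path.
grep -E 'etcd-(servers|cafile|certfile|keyfile)' \
  /etc/kubernetes/manifests/kube-apiserver.yaml
```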

3. What Happens If etcd Fails?

The outcome depends on the type of failure:

1. etcd is Down or Crashed

kubectl commands hang or return errors:

The connection to the server <ip>:<port> was refused

Control plane logs (API server, controller-manager) show etcd-related errors. Dashboard and other control plane UIs are unresponsive. No new pods can be created. No configuration changes take effect.
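A quick way to observe the symptom without waiting for kubectl's default timeout is to bound the request time explicitly:

```shell
# Fails fast with an error instead of hanging when the API server
# cannot reach etcd.
kubectl get nodes --request-timeout=5s
```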

2. etcd Data is Corrupted or Deleted

This is more dangerous.

Consequences:

  • The entire cluster configuration is lost
  • Without a backup (snapshot), the state cannot be recovered
  • Nodes and existing Pods may keep running, but the cluster is effectively unmanageable

3. etcd Cluster Loses Quorum

In HA setups, etcd typically runs with 3, 5, or 7 members.

To operate, etcd must maintain quorum—a majority of healthy nodes.

If quorum is lost:

  • etcd rejects reads and writes
  • API server fails to respond
  • Control plane becomes read-only or unavailable

For a 3-node etcd cluster:

  • Minimum of 2 nodes must be healthy
  • If only 1 node survives → quorum is lost
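The quorum arithmetic follows directly from majority voting: a cluster of n members needs floor(n/2) + 1 healthy members and can therefore tolerate floor((n - 1)/2) failures. A quick sketch for the common cluster sizes:

```shell
# Quorum size and fault tolerance for common etcd cluster sizes.
for n in 3 5 7; do
  quorum=$(( n / 2 + 1 ))          # majority needed to operate
  tolerated=$(( (n - 1) / 2 ))     # failures the cluster survives
  echo "$n members: quorum=$quorum, tolerates $tolerated failure(s)"
done
# 3 members: quorum=2, tolerates 1 failure(s)
# 5 members: quorum=3, tolerates 2 failure(s)
# 7 members: quorum=4, tolerates 3 failure(s)
```

This is also why even-sized clusters are avoided: a 4-member cluster still needs 3 healthy members, so it tolerates no more failures than a 3-member cluster.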

4. How to Detect etcd Failure

1. Check etcd Pod Status (if using static pods or kubeadm)

kubectl get pods -n kube-system | grep etcd

Look for status like CrashLoopBackOff, Error, or Pending.

2. Check API Server Logs

If the API server runs as a systemd service:

journalctl -u kube-apiserver

On kubeadm clusters the API server runs as a static pod instead, so inspect its container logs directly on the control plane node (for example with crictl logs).

Errors may include:

etcdserver: request timed out
grpc: the client connection is closing

3. Query etcd Directly (if access is available)

ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

This tests etcd cluster health.
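Beyond a pass/fail health check, etcdctl can print per-member details (leader, database size, Raft term) as a table, which helps spot a member that is up but lagging. Same assumed kubeadm certificate paths as above:

```shell
# Per-member status of the etcd cluster, rendered as a table.
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```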

Recovery Steps

Scenario A: Temporary etcd Crash

Restart etcd:

systemctl restart etcd

Or, if etcd runs as a static pod on a Docker-based node:

docker ps -a | grep etcd
docker restart <etcd-container>

On containerd-based nodes, use crictl ps -a and crictl stop instead; the kubelet recreates static pod containers automatically.

Verify:

kubectl get nodes

Scenario B: etcd Quorum Lost (in HA cluster)

  • Restore quorum by restarting or fixing the failed etcd nodes.
  • If a node is permanently lost, remove it from the cluster:

etcdctl member remove <member-id>

  • Then add a replacement member:

etcdctl member add <new-member-name> --peer-urls=<peer-urls>
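Putting the steps together, a hedged sketch of replacing a dead member. The member name and peer URL below are placeholders for your environment, and the --cacert/--cert/--key flags from the health check above are omitted for brevity:

```shell
# 1. Find the failed member's hex ID.
ETCDCTL_API=3 etcdctl member list

# 2. Remove the dead member so the cluster stops counting it toward quorum.
ETCDCTL_API=3 etcdctl member remove <member-id>

# 3. Register the replacement (name and peer URL are placeholders).
ETCDCTL_API=3 etcdctl member add etcd-new \
  --peer-urls=https://<new-node-ip>:2380

# 4. Start etcd on the new node with --initial-cluster-state=existing
#    so it joins the running cluster instead of bootstrapping a new one.
```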

Scenario C: etcd Data Corruption

If you have a snapshot/backup, restore it:

ETCDCTL_API=3 etcdctl snapshot restore <snapshot.db> \
  --name <etcd-node-name> \
  --data-dir /var/lib/etcd-from-backup \
  --initial-cluster <initial-cluster> \
  --initial-cluster-token <token> \
  --initial-advertise-peer-urls https://<peer-ip>:2380

Then update etcd to use the new data directory and restart the cluster.
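On a kubeadm cluster, "updating etcd" typically means editing the etcd static pod manifest so both the --data-dir flag and the hostPath volume point at the restored directory; the kubelet then restarts etcd automatically. A minimal sketch, assuming the kubeadm default manifest path:

```shell
# Point the etcd static pod at the restored data directory.
# /etc/kubernetes/manifests/etcd.yaml is the kubeadm default path;
# this rewrites both --data-dir and the hostPath volume in one pass.
sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-from-backup|g' \
  /etc/kubernetes/manifests/etcd.yaml

# The kubelet detects the manifest change and restarts etcd. Verify:
kubectl get pods -n kube-system | grep etcd
```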