Learnitweb

What Happens if CoreDNS Fails Inside the Cluster?

1. Introduction

In Kubernetes, CoreDNS is a critical system component that provides DNS-based service discovery. Any service-to-service communication that addresses a peer by name relies on it. When CoreDNS fails, name resolution inside the cluster breaks, which can lead to cascading application failures.

2. What is CoreDNS?

CoreDNS is a DNS server that runs as a Deployment in the kube-system namespace. It answers DNS queries for services and pods in the cluster, enabling one pod to resolve the name of another.

When a pod issues a DNS query (e.g., trying to reach my-service.my-namespace.svc.cluster.local), the request is sent to CoreDNS, which then resolves the IP address of the service.
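To make the lookup path concrete, the sketch below models how a pod's resolver expands a short name before the query reaches CoreDNS. It assumes the usual pod /etc/resolv.conf layout (a search line plus options ndots:5) and simplifies the real resolver logic. dns_candidates is a hypothetical helper name, not a Kubernetes tool.

```shell
# Print the names a resolver would try for NAME, in order, based on a
# resolv.conf-style file. Simplified model: search list plus ndots only.
dns_candidates() {
  name=$1
  conf=${2:-/etc/resolv.conf}

  # A trailing dot means the name is already fully qualified.
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac

  # The last "search" line wins; ndots defaults to 1 if not set.
  search=$(awk '$1 == "search" { $1 = ""; s = $0 } END { print s }' "$conf")
  ndots=$(awk '$1 == "options" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^ndots:/) n = substr($i, 7)
    } END { print (n == "" ? 1 : n) }' "$conf")

  # Count the dots in the name.
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')

  # Names with at least ndots dots are tried as-is first, otherwise last.
  if [ "$dots" -ge "$ndots" ]; then printf '%s.\n' "$name"; fi
  for domain in $search; do
    printf '%s.%s.\n' "$name" "$domain"
  done
  if [ "$dots" -lt "$ndots" ]; then printf '%s.\n' "$name"; fi
}

# Usage inside a pod (uses the pod's own /etc/resolv.conf):
#   dns_candidates my-service
```

Every candidate in that list is a separate query to CoreDNS, which is why a CoreDNS outage affects short names and fully qualified names alike.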

3. What Happens When CoreDNS Fails?

When CoreDNS fails, internal DNS resolution stops working, meaning:

  1. Pods cannot resolve other services by DNS name.
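  2. Cluster add-ons that reach other services by DNS name may fail. kube-proxy is typically unaffected, since it runs on the host network and uses the node's resolver rather than CoreDNS.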
  3. Deployments may crash or hang if they wait for a dependent service via DNS.
  4. InitContainers or ReadinessProbes that depend on DNS will fail, stalling deployments.
  5. ExternalName Services, which rely on DNS, won’t function.
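Items 3 and 4 describe workloads that block on a dependency's DNS name. A minimal sketch of such a wait loop with a bounded number of attempts is shown below, so that a DNS outage produces a clear failure instead of an indefinite hang. wait_for_dns and the RESOLVER variable are hypothetical names, and the sketch assumes nslookup is available in the container image.

```shell
# Try to resolve NAME up to ATTEMPTS times, sleeping DELAY seconds between
# tries. RESOLVER is the lookup command (assumed default: nslookup).
wait_for_dns() {
  name=$1
  attempts=${2:-5}
  delay=${3:-2}
  resolver=${RESOLVER:-nslookup}

  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$resolver" "$name" >/dev/null 2>&1; then
      echo "resolved $name on attempt $i"
      return 0
    fi
    echo "attempt $i/$attempts: cannot resolve $name" >&2
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage, e.g. as an init container command:
#   wait_for_dns my-service.my-namespace.svc.cluster.local 10 3 || exit 1
```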

4. Common Symptoms

Some obvious signs that CoreDNS has failed:

  • Applications inside pods report “Name resolution errors”.
  • ping or curl using service names returns: ping: unknown host
  • nslookup or dig inside pods returns DNS failure.
  • Logs show: lookup my-service on 10.96.0.10:53: no such host
  • kubectl get pods -n kube-system shows CoreDNS pods in CrashLoopBackOff, Error, or Pending.

5. Root Causes of CoreDNS Failure

Several issues may lead to CoreDNS failure:

  1. Resource Constraints
    CPU or memory starvation causes pods to get evicted or crash.
  2. Configuration Errors
    Incorrect Corefile (CoreDNS configuration) leads to syntax or runtime errors.
  3. Network Issues
    Flannel/Calico/Cilium misconfiguration blocks communication on port 53 (DNS).
  4. Node Problems
    CoreDNS pods running on nodes that become NotReady stop answering queries until they are rescheduled elsewhere.
  5. Service/Endpoint Missing
    The kube-dns service or endpoints are accidentally deleted or misconfigured.
  6. Pod Scheduling Issues
    Taints or affinity rules prevent CoreDNS pods from scheduling.

6. How Kubernetes Reacts Internally

Here’s what happens under the hood when CoreDNS fails:

  • Pods can still run, but they cannot discover or talk to other services by name.
  • Service discovery fails, even though kubectl get svc may show everything as normal.
  • Kubelet and control plane components don’t depend on CoreDNS, so the cluster appears “healthy” from the outside.
  • Any service using environment variable-based discovery may still function, but this method is limited: the variables are set only for services that already existed when the pod started.
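As an illustration of the environment variable fallback, the sketch below reads the variables that Kubernetes injects for a service (the service name in upper case with dashes turned into underscores, followed by _SERVICE_HOST and _SERVICE_PORT). service_addr_from_env is a hypothetical helper, and it only finds services in the pod's own namespace that existed before the pod started.

```shell
# Print HOST:PORT for a service using the injected environment variables,
# e.g. my-service -> MY_SERVICE_SERVICE_HOST / MY_SERVICE_SERVICE_PORT.
service_addr_from_env() {
  # Accept only DNS-label characters, since the name is passed to eval.
  case $1 in
    ''|*[!a-z0-9-]*) return 2 ;;
  esac
  prefix=$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')
  eval "host=\${${prefix}_SERVICE_HOST:-}"
  eval "port=\${${prefix}_SERVICE_PORT:-}"
  if [ -z "$host" ] || [ -z "$port" ]; then
    return 1
  fi
  printf '%s:%s\n' "$host" "$port"
}

# Usage inside a pod, without any DNS lookup:
#   curl "http://$(service_addr_from_env my-service)/"
```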

7. Diagnosing CoreDNS Issues

1. Check CoreDNS Pod Status

kubectl get pods -n kube-system -l k8s-app=kube-dns
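When there are several replicas, it can help to filter that output down to the pods that need attention. The sketch below is one way to do it, assuming the default table output of kubectl get pods (NAME, READY, STATUS as the first three columns). unhealthy_pods is a hypothetical helper name.

```shell
# Read `kubectl get pods` table output on stdin and print every pod that is
# not Running or whose READY column (ready/total) is not fully ready.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2])
      print $1 ": " $3 " (" $2 " ready)"
  }'
}

# Usage:
#   kubectl get pods -n kube-system -l k8s-app=kube-dns | unhealthy_pods
```

An empty result means every CoreDNS pod reports Running and ready. It does not prove that queries are actually being answered.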

2. Inspect Logs

kubectl logs -n kube-system -l k8s-app=kube-dns

3. Test DNS Inside a Pod

kubectl run -it --rm test --image=busybox --restart=Never -- sh
# At the shell prompt inside the pod:
nslookup kubernetes.default

4. Check Corefile Configuration

kubectl -n kube-system get configmap coredns -o yaml
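A quick sanity check for one common class of Corefile mistakes, unbalanced braces, can be sketched as follows. This is only a heuristic and not a Corefile parser. corefile_braces_balanced is a hypothetical helper, and it assumes the Corefile has first been saved to a local file.

```shell
# Return 0 if the opening and closing braces in a Corefile match,
# ignoring anything after a '#' comment marker; return 1 otherwise.
corefile_braces_balanced() {
  awk '
    {
      line = $0
      sub(/#.*/, "", line)
      opens += gsub(/[{]/, "", line)
      closes += gsub(/[}]/, "", line)
      # A closing brace before its opening brace is also an error.
      if (closes > opens) bad = 1
    }
    END { exit ((bad || opens != closes) ? 1 : 0) }
  ' "$1"
}

# Usage:
#   kubectl -n kube-system get configmap coredns \
#     -o jsonpath='{.data.Corefile}' > Corefile
#   corefile_braces_balanced Corefile || echo "Corefile braces do not match"
```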

5. Ensure kube-dns Service Exists

kubectl get svc kube-dns -n kube-system
kubectl get endpoints kube-dns -n kube-system

6. Describe CoreDNS Pods for Events

kubectl describe pod <pod-name> -n kube-system

8. Recovery Steps

Step 1: Restart CoreDNS

kubectl delete pod -n kube-system -l k8s-app=kube-dns

Step 2: Fix Resource Limits (if pods are getting OOMKilled)

resources:
  limits:
    memory: "170Mi"
    cpu: "100m"
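One way to apply limits like these without opening an editor is a patch against the Deployment. The sketch below only builds the patch document. It assumes the container inside the coredns Deployment is named coredns (as in kubeadm-based clusters), and coredns_limits_patch is a hypothetical helper name. On managed clusters, an add-on manager may revert manual changes.

```shell
# Print a strategic merge patch that sets memory and CPU limits on the
# container named "coredns" (assumed name) in a Deployment pod template.
coredns_limits_patch() {
  memory=${1:-170Mi}
  cpu=${2:-100m}
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"limits":{"memory":"%s","cpu":"%s"}}}]}}}}\n' \
    "$memory" "$cpu"
}

# Usage (raises the memory limit; values are examples only):
#   kubectl -n kube-system patch deployment coredns \
#     --patch "$(coredns_limits_patch 256Mi 100m)"
```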

Step 3: Roll Back Misconfigured Corefile

kubectl -n kube-system edit configmap coredns

Step 4: Scale Up CoreDNS

kubectl scale deployment coredns -n kube-system --replicas=3

Step 5: Check Node and Network Health

  • Ensure nodes are Ready.
  • Ensure Calico/Cilium/Flannel pods are working.
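The node check can be sketched as a small filter over kubectl get nodes, assuming its default table output (NAME and STATUS as the first two columns). not_ready_nodes is a hypothetical helper name.

```shell
# Read `kubectl get nodes` table output on stdin and print every node whose
# STATUS is not exactly "Ready". Cordoned nodes show up too, because their
# status reads "Ready,SchedulingDisabled".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 ": " $2 }'
}

# Usage:
#   kubectl get nodes | not_ready_nodes
#   kubectl get pods -n kube-system -o wide   # CNI pods and their nodes
```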