1. Introduction
In Kubernetes, CoreDNS is a critical system component that provides DNS-based service discovery. Any service-to-service communication that addresses a Service by name relies on it. When CoreDNS fails, DNS resolution within the cluster breaks, which can lead to cascading application failures.
2. What is CoreDNS?
CoreDNS is a DNS server that runs as a Deployment in the kube-system namespace. It answers DNS queries for services and pods in the cluster, enabling one pod to resolve the name of another.
When a pod issues a DNS query (e.g., trying to reach my-service.my-namespace.svc.cluster.local), the request is sent to CoreDNS, which then resolves the IP address of the service.
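To make the naming scheme concrete, here is a minimal shell sketch that builds the fully qualified name of a Service and lists the names a resolver would typically try for a short name. The function names svc_fqdn and search_candidates are hypothetical helpers, and cluster.local is assumed as the cluster domain (it is the common default but can be changed per cluster). The search order mirrors the usual "search" line in a pod's /etc/resolv.conf; the sketch does not read the real file.

```shell
#!/bin/sh
# Build the fully qualified DNS name of a Service.
# Arguments: service name, namespace, optional cluster domain
# (assumed to be cluster.local when not given).
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# List the search-suffixed names a resolver would typically try for a
# short name, given the namespace of the pod that issues the query.
search_candidates() {
  name=$1 ns=$2 domain=${3:-cluster.local}
  for suffix in "$ns.svc.$domain" "svc.$domain" "$domain"; do
    printf '%s.%s\n' "$name" "$suffix"
  done
}
```

For example, svc_fqdn my-service my-namespace prints the name used in the paragraph above, and search_candidates my-service default shows why a bare service name only resolves from within the same namespace.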
3. What Happens When CoreDNS Fails?
When CoreDNS fails, internal DNS resolution stops working, meaning:
- Pods cannot resolve other services by DNS name.
- Cluster add-ons that rely on cluster DNS (e.g., metrics-server) may fail.
- Applications may crash or hang if they wait for a dependent service via DNS.
- Init containers or readiness probes that depend on DNS will fail, stalling rollouts.
- ExternalName Services, which rely on DNS, won’t function.
4. Common Symptoms
Some obvious signs that CoreDNS has failed:
- Applications inside pods report “Name resolution errors”.
- ping or curl using service names returns:
ping: unknown host
- nslookup or dig inside pods returns a DNS failure.
- Logs show:
lookup my-service on 10.96.0.10:53: no such host
- kubectl get pods -n kube-system shows CoreDNS pods in CrashLoopBackOff, Error, or Pending.
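The last symptom can be checked mechanically. The sketch below filters the default table output of kubectl get pods and prints only pods that are not Running or not fully ready. The function name unhealthy_pods is a hypothetical helper, and it assumes the default column order (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Read "kubectl get pods" table output on stdin and print the name and
# status of every pod that is not Running or whose READY count (x/y)
# shows fewer ready containers than expected.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}

# Usage (needs cluster access):
#   kubectl get pods -n kube-system -l k8s-app=kube-dns | unhealthy_pods
```

An empty result means every CoreDNS pod reports Running and ready; it does not by itself prove that DNS queries are being answered.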
5. Root Causes of CoreDNS Failure
Several issues may lead to CoreDNS failure:
- Resource Constraints: CPU or memory starvation causes pods to be throttled, OOMKilled, or evicted.
- Configuration Errors: An incorrect Corefile (the CoreDNS configuration) leads to syntax or runtime errors.
- Network Issues: A Flannel/Calico/Cilium misconfiguration blocks communication on port 53 (DNS).
- Node Problems: CoreDNS pods are scheduled to nodes that are NotReady or tainted.
- Service/Endpoint Missing: The kube-dns service or its endpoints are accidentally deleted or misconfigured.
- Pod Scheduling Issues: Taints or affinity rules prevent CoreDNS pods from scheduling.
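As a rough triage aid, the causes above can be tied to the pod status or termination reason that usually accompanies them. The mapping below is a heuristic sketch, not an authoritative diagnosis; the function name likely_cause is hypothetical, and a given status can have other explanations.

```shell
#!/bin/sh
# Map a pod status or termination reason to the root-cause category
# it most often points to. Heuristic only: always confirm with
# "kubectl describe pod" and the pod logs.
likely_cause() {
  case "$1" in
    Pending)          echo "scheduling: taints, affinity rules, or insufficient node resources" ;;
    OOMKilled)        echo "resources: memory limit too low" ;;
    Evicted)          echo "resources: node under resource pressure" ;;
    CrashLoopBackOff) echo "configuration: check the Corefile and the pod logs" ;;
    *)                echo "unknown: inspect events and logs" ;;
  esac
}
```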
6. How Kubernetes Reacts Internally
Here’s what happens under the hood when CoreDNS fails:
- Pods can still run, but they cannot discover or talk to other services by name.
- Service discovery fails, even though kubectl get svc may show everything as normal.
- Kubelet and control plane components don’t depend on CoreDNS, so the cluster appears “healthy” from the outside.
- Any service using environment variable-based discovery may still function, but this method is limited: the variables are injected only at pod start, and only for Services that already existed at that point.
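For reference, Kubernetes derives the discovery variable names from the Service name by upper-casing it and replacing dashes with underscores, then appending _SERVICE_HOST and _SERVICE_PORT. The sketch below reproduces that naming; the helper names svc_env_prefix, svc_host_var, and svc_port_var are hypothetical.

```shell
#!/bin/sh
# Turn a Service name into the prefix Kubernetes uses for its
# discovery environment variables (upper case, "-" becomes "_").
svc_env_prefix() {
  printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/-/_/g'
}

# Names of the variables that hold the Service's cluster IP and port.
svc_host_var() { printf '%s_SERVICE_HOST\n' "$(svc_env_prefix "$1")"; }
svc_port_var() { printf '%s_SERVICE_PORT\n' "$(svc_env_prefix "$1")"; }
```

Inside a pod, printenv "$(svc_host_var my-service)" would print the injected IP if the Service existed when the pod started, which is one way to reach a dependency while DNS is down.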
7. Diagnosing CoreDNS Issues
1. Check CoreDNS Pod Status
kubectl get pods -n kube-system -l k8s-app=kube-dns
2. Inspect Logs
kubectl logs -n kube-system -l k8s-app=kube-dns
3. Test DNS Inside a Pod
kubectl run -it --rm test --image=busybox --restart=Never -- nslookup kubernetes.default
4. Check Corefile Configuration
kubectl -n kube-system get configmap coredns -o yaml
5. Ensure kube-dns Service Exists
kubectl get svc,endpoints kube-dns -n kube-system
6. Describe CoreDNS Pods for Events
kubectl describe pod <pod-name> -n kube-system
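The non-interactive checks above can be collected into one function so they run in a fixed order during an incident. This is a convenience sketch: coredns_diagnose is a hypothetical name, the namespace and label selector are the ones used in the steps above, and the in-pod nslookup test is left out because it starts a new pod.

```shell
#!/bin/sh
# Run the read-only CoreDNS checks in sequence. Nothing here changes
# cluster state; every command is a get, logs, or describe.
coredns_diagnose() {
  ns=kube-system
  selector=k8s-app=kube-dns

  echo "== CoreDNS pods =="
  kubectl get pods -n "$ns" -l "$selector" -o wide

  echo "== Recent logs =="
  kubectl logs -n "$ns" -l "$selector" --tail=50

  echo "== Corefile =="
  kubectl get configmap coredns -n "$ns" -o yaml

  echo "== kube-dns Service =="
  kubectl get svc kube-dns -n "$ns"

  echo "== Pod events =="
  kubectl describe pods -n "$ns" -l "$selector"
}
```

Redirecting the output to a file (coredns_diagnose > coredns-report.txt) keeps a snapshot to compare against after recovery.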
8. Recovery Steps
Step 1: Restart CoreDNS
kubectl delete pod -n kube-system -l k8s-app=kube-dns
Step 2: Fix Resource Limits (if pods are getting OOMKilled)
resources:
  limits:
    memory: "170Mi"
    cpu: "100m"
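Before changing limits, it helps to confirm that the containers really were OOMKilled. The sketch below prints the last termination reason per CoreDNS pod and wraps kubectl set resources to raise the memory limit. Both function names are hypothetical, and the limit value is supplied by the caller; nothing here suggests a correct size for a given cluster. On clusters where an add-on manager owns the coredns Deployment, a manual change like this may be reverted.

```shell
#!/bin/sh
# Print each CoreDNS pod name with the reason its container(s) last
# terminated (for example OOMKilled). Empty if never terminated.
coredns_last_termination() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Set the memory limit on the coredns Deployment to the given value,
# e.g. coredns_set_memory_limit 256Mi. This triggers a rolling update.
coredns_set_memory_limit() {
  kubectl set resources deployment coredns -n kube-system --limits="memory=$1"
}
```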
Step 3: Roll Back Misconfigured Corefile
kubectl -n kube-system edit configmap coredns
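Since a broken edit is a common way to end up here, a crude pre-check can catch the simplest mistake before the ConfigMap is saved. The sketch below only counts opening and closing braces in a Corefile read from stdin; corefile_braces_balanced is a hypothetical helper, and balanced braces do not prove the file is valid.

```shell
#!/bin/sh
# Exit 0 if the Corefile on stdin has as many "{" as "}", else exit 1.
# This is a sanity check, not a parser: it ignores comments, quoting,
# and plugin syntax entirely.
corefile_braces_balanced() {
  awk '{ o += gsub(/[{]/, ""); c += gsub(/[}]/, "") }
       END { exit (o == c ? 0 : 1) }'
}

# Usage (needs cluster access):
#   # keep a copy to roll back to before editing
#   kubectl -n kube-system get configmap coredns -o yaml > coredns-backup.yaml
#   # check the live Corefile
#   kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' \
#     | corefile_braces_balanced && echo "braces balanced"
```

If the Corefile does not load the reload plugin, the pods need a restart (Step 1) before a corrected configuration takes effect.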
Step 4: Scale Up CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=3
Step 5: Check Node and Network Health
- Ensure nodes are Ready.
- Ensure Calico/Cilium/Flannel pods are working.
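The node check can be scripted in the same style as the commands above. The sketch below filters the default table output of kubectl get nodes and prints nodes whose status does not start with Ready. The function name not_ready_nodes is hypothetical, and the default column order (NAME, STATUS, ...) is assumed. Cordoned nodes (Ready,SchedulingDisabled) are not reported by this filter.

```shell
#!/bin/sh
# Read "kubectl get nodes" table output on stdin and print the names
# of nodes whose STATUS column does not begin with "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodes | not_ready_nodes
#   # CNI pods often run in kube-system, but the namespace depends on
#   # how the plugin was installed, so list across all namespaces:
#   kubectl get pods --all-namespaces -o wide
```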