Optimizing DNS in Big Kubernetes Clusters

If the phrase “paged at 3 a.m.” brings back memories of mysterious app errors and dependency timeouts, DNS issues in Kubernetes probably played a starring role. Scaling clusters introduces DNS latency, frustrating outages, and debugging marathons, especially when CoreDNS is a central bottleneck. For most ops teams managing thousands of nodes, DNS instability isn't just theory; it is a constant reality, until you’re introduced to node local DNS cache. Let's walk through why it works and how it's set up.

The usual DNS problems in Kubernetes

When clusters scale up, DNS struggles become performance blockers:

CoreDNS overload: Too many DNS queries flood central CoreDNS instances. This slows everything and risks outages.
Noisy neighbors: Pods sharing bandwidth can throttle each other's DNS lookups.
Latency spikes: Applications slow to a crawl due to backlogged DNS queries.
Single point of failure: If CoreDNS goes down, so do your applications.
Timeouts and disruptions: Peak loads leave some queries unresolved, breaking services.
Scaling limitations: Adding more CoreDNS replicas helps a bit, but cannot address basic network and design limits.

Node local DNS cache: A game changer

Node local DNS cache puts a tiny DNS resolver directly on each node via a DaemonSet. Here is the basic request flow:

Pod requests DNS
Node-local cache responds immediately if possible
If the cache misses, it forwards to CoreDNS
CoreDNS handles the resolution and cache refresh
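
The flow above can be sketched as a toy model. This is not the real node-local cache implementation: the `NodeLocalCache` class, the `fake_coredns` function, the fixed 30-second TTL, and the record shown are all assumptions made up for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class NodeLocalCache:
    """Toy model of a per-node DNS cache (illustrative only)."""

    def __init__(self, upstream: Callable[[str], str],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.upstream = upstream        # stands in for CoreDNS
        self.ttl_seconds = ttl_seconds  # hypothetical fixed TTL for every record
        self.clock = clock
        self.entries: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expiry)
        self.hits = 0
        self.misses = 0

    def resolve(self, name: str) -> str:
        now = self.clock()
        cached = self.entries.get(name)
        if cached is not None and cached[1] > now:
            self.hits += 1
            return cached[0]            # answered on the node, no extra hop
        self.misses += 1
        answer = self.upstream(name)    # cache miss: forward to CoreDNS
        self.entries[name] = (answer, now + self.ttl_seconds)
        return answer


def fake_coredns(name: str) -> str:
    # Hypothetical record, for illustration only.
    records = {"orders.default.svc.cluster.local": "10.96.12.7"}
    return records[name]


cache = NodeLocalCache(fake_coredns)
first = cache.resolve("orders.default.svc.cluster.local")   # miss, forwarded
second = cache.resolve("orders.default.svc.cluster.local")  # hit, served locally
print(first, second, cache.hits, cache.misses)
```

The second lookup never reaches the stand-in for CoreDNS, which is the whole point of the design: repeated names are answered on the node.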

Request flow before and after

Before:
Pod -> CoreDNS (central, often over network) -> upstream DNS

After:
Pod -> node-local DNS cache (on same node) -> [miss?] CoreDNS -> upstream DNS

Key benefits

Reliability: Pods can still resolve already-cached names even if CoreDNS is briefly unavailable, resulting in fewer timeouts and more resilience during short outages.
Lower latency: Local answers mean faster application start-up and lower query delay.
Scalability: Each node splits the DNS workload, avoiding bottlenecks.
Troubleshooting simplicity: Local logging makes debugging DNS issues easier.
Minimized external dependency: Less reliance on upstream DNS means fewer chances for external factors to derail service-level agreements.
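
To get a rough feel for the scalability benefit, here is a back-of-the-envelope sketch. It assumes a single cluster-wide cache hit ratio, which is a simplification, and the 50,000 QPS figure and the `coredns_qps` helper are hypothetical, not measurements from the tests in this post.

```python
def coredns_qps(total_qps: float, hit_ratio: float) -> float:
    """Queries per second that still reach CoreDNS after node-local caching."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Only cache misses are forwarded to CoreDNS.
    return total_qps * (1.0 - hit_ratio)


# Hypothetical cluster issuing 50,000 DNS queries per second in total.
for ratio in (0.0, 0.5, 0.9):
    print(f"hit ratio {ratio:.0%}: about {coredns_qps(50_000, ratio):,.0f} QPS reach CoreDNS")
```

Under this simple model, the load on CoreDNS falls in direct proportion to the hit ratio, so even a moderate hit ratio takes real pressure off the central service.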

Comparing old versus new DNS designs

Classic (CoreDNS only):

Pods send DNS queries to the CoreDNS service IP, routed based on DNS policy
CoreDNS checks if it is an internal (Kube API) or external (upstream DNS) request
Heavy reliance on multiple network hops and central service availability

Node local DNS cache design:

Pods send queries to the per-node cache IP
Cache resolves locally if possible, otherwise forwards the request to CoreDNS
Simple failover: If the cache dies, pods fall back to CoreDNS

Real-world testing: is it worth it?

AWS cluster example

Without cache: At high load (max QPS), about 0.08 percent of queries were lost and latency rose steeply.
With cache: Nearly zero query loss and a 96 percent improvement in latency at peak.

Azure cluster example

Cilium CNI already improves DNS performance, but node-local cache further cuts latency by over 96 percent.
High-volume queries (hundreds of thousands) show dramatic latency improvements.

Under external load

With cache, even with extra pressure, the cluster kept loss rates near zero and latency dropped by over 99 percent at the highest loads.

Monitoring and metrics

Tracking DNS performance requires:

Forward duration: Time spent waiting for upstream DNS
Response time: Cache hit time for the pod

Switching to node-local cache has been seen to bring average forward time down from 30 milliseconds to approximately 1.5 milliseconds, and response time from 15 milliseconds to about 250 microseconds.
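
As a quick check on what those figures mean in relative terms, the arithmetic can be worked through directly. The `percent_reduction` helper is just an illustrative name; the inputs are the averages quoted above, with 250 microseconds written as 0.25 milliseconds.

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, as a percentage of the original value."""
    return (before_ms - after_ms) / before_ms * 100.0


forward = percent_reduction(30.0, 1.5)    # forward duration: 30 ms -> 1.5 ms
response = percent_reduction(15.0, 0.25)  # response time: 15 ms -> 250 microseconds

print(f"forward duration reduced by {forward:.1f}%")
print(f"response time reduced by {response:.1f}%")
```

That works out to roughly a 95 percent drop in forward duration and a bit over 98 percent in response time.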

Operational paranoia: failure modes

Systems can fail, so always test them:

Cache failure: With the DaemonSet deleted, DNS resolution fell back to CoreDNS; service stayed up, but latency increased significantly.
CoreDNS failure: Cache handled existing cached requests, but new lookups stalled since CoreDNS is still the source of truth.
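
Both failure modes can be mimicked with a small simulation. This is a sketch under stated assumptions, not a reproduction of the tests: the `lookup` and `coredns` functions, the `CoreDNSDown` exception, and the records are hypothetical. TTLs are ignored for brevity, and a CoreDNS outage is modeled as an immediate error rather than a stalled query.

```python
from typing import Dict, Optional


class CoreDNSDown(Exception):
    """Raised when the stand-in for CoreDNS is unavailable."""


# Hypothetical records, for illustration only.
RECORDS = {"api.internal": "10.0.0.5", "db.internal": "10.0.0.9"}


def coredns(name: str, up: bool) -> str:
    if not up:
        raise CoreDNSDown(name)
    return RECORDS[name]


def lookup(name: str, cache: Optional[Dict[str, str]], coredns_up: bool) -> str:
    """Resolve a name; cache=None models a node whose cache is gone."""
    if cache is not None and name in cache:
        return cache[name]                # served from the node-local cache
    answer = coredns(name, coredns_up)    # miss or no cache: ask CoreDNS
    if cache is not None:
        cache[name] = answer
    return answer


warm: Dict[str, str] = {}
lookup("api.internal", warm, coredns_up=True)         # warms the cache

# Cache failure: resolution still works, straight through CoreDNS.
print(lookup("api.internal", None, coredns_up=True))

# CoreDNS failure: the cached name resolves, a new name does not.
print(lookup("api.internal", warm, coredns_up=False))
try:
    lookup("db.internal", warm, coredns_up=False)
except CoreDNSDown as exc:
    print(f"uncached lookup failed: {exc}")
```

The asymmetry matches the observations above: losing the cache costs latency, while losing CoreDNS costs every name that was not already cached.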

Lessons learned

Node local DNS cache gives big gains in latency and reliability.
Do not remove CoreDNS completely; it remains essential for fresh lookups.
Ensure graceful cache and CoreDNS failover in your setup.
Monitor both the cache and CoreDNS for complete visibility.

Recommendations

Adopt node local DNS cache in production clusters; performance metrics clearly support it.
Use built-in monitoring features and keep a close watch on both cache and CoreDNS health.
Test and tune TTL (time-to-live) settings for cache freshness and controlled DNS updates.
Scale resources thoughtfully for both cache and CoreDNS layers.

Closing thoughts

Node local DNS cache transforms DNS in large Kubernetes clusters from a mysterious bottleneck into reliable, lightning-fast infrastructure. If DNS woes haunt your deployments, the switch is worth the effort. Your next on-call shift will thank you.

If you want supporting YAML snippets, troubleshooting guides, or monitoring dashboard setups, share your interest in the comments. Let's make DNS boring again.
