PREQUEL-2025-0076
NATS Route Error caused by DNS Resolution Failure
Medium
Impact: 6/10
Mitigation: 4/10
Description
A NATS server establishes a TCP route, logs **“Route connection created”**, but within milliseconds DNS resolution for its peer fails; the server reports

```
Error trying to connect to route [nats://cluster-b:6222]:
lookup for host cluster-b no such host
```

and immediately closes the socket.
When this sequence happens repeatedly, the cluster oscillates between **full mesh** and **partitioned** states, leading to intermittent publish/subscribe errors and duplicate message deliveries.
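To judge whether the failure is a one-off or a repeating flap, it can help to count the DNS errors per peer in the server log. The following is a minimal sketch, assuming log lines contain the error text exactly as quoted above; the pattern `DNS_FAILURE` and the function `count_dns_route_failures` are hypothetical names introduced for illustration, and real log lines may differ by nats-server version.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed pattern, taken from the error text quoted in the description.
# Adjust it if your nats-server version formats the message differently.
DNS_FAILURE = re.compile(r"lookup for host (\S+) no such host")


def count_dns_route_failures(lines: Iterable[str]) -> Counter:
    """Count DNS lookup failures per peer hostname in the given log lines."""
    failures: Counter = Counter()
    for line in lines:
        match = DNS_FAILURE.search(line)
        if match:
            failures[match.group(1)] += 1
    return failures
```

A host whose count keeps growing between checks is a likely source of the mesh/partition oscillation.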
Mitigation
1. **Verify hostnames** listed in every server’s `routes:` stanza:
   ```bash
   dig +short cluster-b.default.svc.cluster.local
   ```
   The command should return one or more IPs.
2. **Use headless Services or IPs** when running inside Kubernetes:
   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: nats-route
   spec:
     clusterIP: None   # headless
   ```
3. **Short-term relief** – scale the Deployment/StatefulSet down to a single replica until DNS is stable.
4. **Tune reconnection back-off** so flaps don’t overload DNS:
   ```
   --connect-retries 60 --reconnect-delay 5s
   ```
   Option names vary by nats-server version; check `nats-server --help` and the `cluster` configuration block (for example `connect_retries`) before applying these.
5. **Add a DNS readiness probe** to catch CoreDNS regressions early:
   ```yaml
   readinessProbe:
     exec: { command: ["nslookup", "nats-route"] }
   ```
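The hostname check in step 1 can be automated across every route URL. The sketch below is one possible approach, not part of NATS itself: `resolve_routes` is a hypothetical helper, the default port 6222 is assumed from the route URL in the error message, and the `resolver` parameter exists only so the function can be exercised without real DNS.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional
from urllib.parse import urlsplit


def resolve_routes(
    route_urls: Iterable[str],
    resolver: Callable = socket.getaddrinfo,
) -> Dict[str, Optional[List[str]]]:
    """Map each route URL to its resolved IPs, or None if resolution fails."""
    results: Dict[str, Optional[List[str]]] = {}
    for url in route_urls:
        parts = urlsplit(url)
        host = parts.hostname
        if host is None:
            # Malformed URL: treat it the same as an unresolvable host.
            results[url] = None
            continue
        port = parts.port or 6222  # assumed default route port
        try:
            infos = resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            results[url] = None
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address.
        results[url] = sorted({info[4][0] for info in infos})
    return results
```

Calling `resolve_routes(["nats://cluster-b:6222"])` from inside a NATS pod would report `None` for any peer that the pod's resolver cannot find, which mirrors the condition that triggers the route error.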