Fix WireGuard Crashes & MTU Mismatch in GKE

Lovable's infrastructure team debugged a cluster-wide networking issue on Google Kubernetes Engine (GKE) that caused intermittent connection failures. Using an AI agent to scan ClickHouse logs, they discovered that anetd pods (Google's Cilium implementation) were crashing about 120 times per pod over six days, nearly once per hour. Crash dumps revealed a concurrent map-access panic in Google's WireGuard integration code, not in WireGuard itself.
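The "nearly once per hour" figure follows directly from the numbers above. A minimal sketch of the arithmetic, using only the approximate counts quoted in this summary (the function names are illustrative):

```python
# Figures quoted in the summary: about 120 crashes per pod over six days.
CRASHES_PER_POD = 120
DAYS_OBSERVED = 6


def crashes_per_hour(crashes: int, days: int) -> float:
    """Average crash rate per hour over the observation window."""
    return crashes / (days * 24)


def minutes_between_crashes(crashes: int, days: int) -> float:
    """Average gap between consecutive crashes, in minutes."""
    return (days * 24 * 60) / crashes


rate = crashes_per_hour(CRASHES_PER_POD, DAYS_OBSERVED)
gap = minutes_between_crashes(CRASHES_PER_POD, DAYS_OBSERVED)
print(f"{rate:.2f} crashes/hour, one every {gap:.0f} minutes")
# prints "0.83 crashes/hour, one every 72 minutes"
```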

First Fix: Disable Transparent Encryption

Google support recommended disabling node-to-node encryption to bypass the WireGuard bug. The team applied the change and restarted all anetd pods. The crashes stopped, but about four hours later users started seeing random connection failures to Valkey (their in-memory data store).

Second Bug: MTU Mismatch

Engineer Erik used tcpdump and Wireshark to capture and inspect packets. The smoking gun was an ICMP message: "Destination unreachable (Fragmentation needed)". Here's the cause:

With WireGuard enabled, cluster MTU was set to 1420 bytes (accounting for WireGuard's 80-byte encapsulation overhead).
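After disabling WireGuard, the MTU should have reverted to the standard 1500 bytes, but nodes that had not been restarted still used the old 1420-byte MTU.
Valkey connections crossing nodes with mismatched MTUs failed intermittently: packets that fit within 1420 bytes passed, while larger packets marked "Don't Fragment" were rejected with the ICMP error above.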
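The mismatch can be illustrated with a simplified model. This is a sketch under stated assumptions, not a reconstruction of the team's tooling: the node names are hypothetical, packets are assumed to carry the Don't Fragment bit, and the drop is modeled at the receiving node even though the source does not say at which hop it occurred.

```python
from dataclasses import dataclass

STANDARD_MTU = 1500      # MTU the nodes should have reverted to (per the summary)
WIREGUARD_OVERHEAD = 80  # encapsulation overhead cited in the summary
WIREGUARD_MTU = STANDARD_MTU - WIREGUARD_OVERHEAD  # 1420


@dataclass(frozen=True)
class Node:
    name: str  # hypothetical node name, for illustration only
    mtu: int   # MTU currently configured on the node


def delivered(packet_size: int, sender: Node, receiver: Node) -> bool:
    """Return True if a Don't Fragment packet of this size gets through.

    Simplified model: the sender never emits more than its own MTU, and
    the packet is rejected if it exceeds the receiver's MTU.
    """
    if packet_size > sender.mtu:
        raise ValueError("sender would not emit a packet above its own MTU")
    return packet_size <= receiver.mtu


restarted = Node("node-restarted", STANDARD_MTU)  # picked up the new MTU
stale = Node("node-stale", WIREGUARD_MTU)         # still on the old MTU

for size in (512, 1420, 1421, 1500):
    status = "ok" if delivered(size, restarted, stale) else "fragmentation needed"
    print(f"{size:>4} bytes: {status}")
```

In this model only packets between 1421 and 1500 bytes fail, which is consistent with failures that look random: small requests succeed and larger ones do not.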

Resolution

The fix: rolling restart of all nodes to ensure consistent MTU configuration across the cluster. This eliminated fragmentation errors and restored stability.
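A consistency check of this kind could be sketched as follows. It assumes the per-node MTU values have already been collected by some means (on Linux, an interface's MTU can be read from /sys/class/net/<interface>/mtu); the node names and values below are hypothetical, not data from the incident.

```python
def find_mtu_outliers(node_mtus: dict[str, int], expected_mtu: int) -> dict[str, int]:
    """Return the nodes whose MTU differs from the expected cluster-wide value."""
    return {name: mtu for name, mtu in node_mtus.items() if mtu != expected_mtu}


# Hypothetical snapshot: one node still carries the WireGuard-era MTU.
observed = {"node-a": 1500, "node-b": 1420, "node-c": 1500}

outliers = find_mtu_outliers(observed, expected_mtu=1500)
if outliers:
    print(f"nodes needing a restart: {sorted(outliers)}")
else:
    print("MTU is consistent across all nodes")
# prints "nodes needing a restart: ['node-b']"
```

Running such a check before and after the rolling restart would confirm that no node is left on the old value.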

Key Takeaways

The first bug was in Google's anetd integration of WireGuard — a concurrency bug in map access. It's specific to GKE's implementation.
Disabling encryption bypassed the panic but introduced an MTU mismatch that was only resolved by a rolling restart of all nodes.
AI agents helped surface the anetd crash pattern quickly from millions of log lines.

📖 Read the full source: HN AI Agents