Bug Hunt: WireGuard Crashes and MTU Mismatch in GKE

Lovable's infrastructure team debugged a cluster-wide networking issue on Google Kubernetes Engine (GKE) that caused intermittent connection failures. Using an AI agent to scan ClickHouse logs, they discovered that anetd pods (Google's Cilium implementation) were crashing about 120 times per pod over six days, nearly once per hour. Crash dumps revealed a concurrent map-access panic in Google's WireGuard integration code, not in WireGuard itself.
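The reported crash rate can be sanity-checked with a short calculation. The sketch below is illustrative only: the pod names and the `(pod_name, timestamp)` event shape are hypothetical, not the team's actual log schema or query.

```python
from collections import Counter


def crashes_per_pod(crash_events):
    """Count crash events per pod. Each event is a (pod_name, timestamp) pair."""
    return Counter(pod for pod, _ in crash_events)


def mean_hours_between_crashes(crash_count, window_days):
    """Average gap in hours between crashes over an observation window."""
    return window_days * 24 / crash_count


# Hypothetical sample: two pods, three crash events.
sample = [("anetd-a", "t1"), ("anetd-b", "t2"), ("anetd-a", "t3")]
counts = crashes_per_pod(sample)

# Figures from the article: about 120 crashes per pod over six days.
gap = mean_hours_between_crashes(120, 6)
print(counts, gap)
```

With the article's figures the average gap works out to roughly 1.2 hours per pod, which matches the "nearly once per hour" description.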
First Fix: Disable Transparent Encryption
Google support recommended disabling node-to-node encryption to bypass the WireGuard bug. The team applied the change and restarted all anetd pods. The crashes stopped, but about four hours later users started seeing random connection failures to Valkey (their in-memory data store).
Second Bug: MTU Mismatch
Engineer Erik used tcpdump and Wireshark to capture packets. The smoking gun was an ICMP message: "Destination unreachable (Fragmentation needed)". Here's the cause:
- With WireGuard enabled, cluster MTU was set to 1420 bytes (accounting for WireGuard's 80-byte encapsulation overhead).
- After disabling WireGuard, configs should have reverted to standard 1500 bytes, but some nodes weren't restarted — they still used the old 1420 MTU.
- Valkey connections crossing nodes with mismatched MTUs failed intermittently.
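The MTU arithmetic above can be sketched as a simplified model. It assumes that the usable packet size on a path is the smallest MTU of the nodes it crosses, and that a packet larger than that is rejected rather than fragmented. The function names are hypothetical, and real behavior also depends on the Don't Fragment flag and TCP MSS negotiation, which this sketch ignores.

```python
STANDARD_MTU = 1500        # standard Ethernet MTU named in the article
WIREGUARD_OVERHEAD = 80    # encapsulation overhead figure from the article


def wireguard_mtu(link_mtu=STANDARD_MTU, overhead=WIREGUARD_OVERHEAD):
    """MTU left for pod traffic once the WireGuard overhead is reserved."""
    return link_mtu - overhead


def path_mtu(node_mtus):
    """The smallest MTU along a path limits the largest packet that fits."""
    return min(node_mtus)


def fits(packet_size, node_mtus):
    """True if a packet of this size can cross every node unfragmented."""
    return packet_size <= path_mtu(node_mtus)


stale = wireguard_mtu()            # a node still on the old setting
mixed_path = [STANDARD_MTU, stale]

print(stale)                       # 1420
print(fits(1400, mixed_path))      # True: small packets still get through
print(fits(1500, mixed_path))      # False: full-size packets do not fit
```

Under this model only larger packets on paths that touch a stale node are affected, which is one plausible reason the failures looked random rather than constant.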
Resolution
The fix: rolling restart of all nodes to ensure consistent MTU configuration across the cluster. This eliminated fragmentation errors and restored stability.
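A consistency check of this kind could be run before and after the rollout. This is a minimal sketch, assuming the per-node MTU values have already been collected by some other means; the node names and the `stale_nodes` helper are hypothetical and are not part of any GKE tooling.

```python
def stale_nodes(node_mtus, expected=1500):
    """Return the names of nodes whose MTU differs from the expected value."""
    return sorted(name for name, mtu in node_mtus.items() if mtu != expected)


# Hypothetical snapshot taken before the rolling restart.
before = {"node-1": 1500, "node-2": 1420, "node-3": 1500, "node-4": 1420}
# Hypothetical snapshot taken after every node has been restarted.
after = {name: 1500 for name in before}

print(stale_nodes(before))  # ['node-2', 'node-4']
print(stale_nodes(after))   # []
```

An empty result after the rollout indicates that every node agrees on the MTU, which is the condition the rolling restart was meant to restore.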
Key Takeaways
- The first bug was in Google's anetd integration of WireGuard: a concurrency bug in map access. It's specific to GKE's implementation.
- Disabling encryption bypassed the panic but introduced an MTU mismatch that needed a full node rollout.
- AI agents helped surface the anetd crash pattern quickly from millions of log lines.
📖 Read the full source: HN AI Agents