Lessons from running multiple OpenClaw gateways in production

Production failures and their causes
A developer runs three or more OpenClaw gateways 24/7: one for personal use, one for a non-profit, and one for a community organization. The gateways failed repeatedly in production because changes to them were treated like scratch work rather than production deployments.
Specific failure scenarios
The upgrade that wouldn't die: Running pnpm add -g openclaw@latest caused the gateway to crash with MODULE_NOT_FOUND because the new version installed to a different path while the service file still had the old path hardcoded. A rescue script that restarted the gateway every 5 minutes couldn't distinguish between transient crashes (where a restart works) and structural failures (where the service file has to be fixed first), so it kept restarting a gateway that could never start.
Silent capability loss: After configuring new integrations and restarting the gateway, capabilities such as text-to-speech for board accessibility, email sending, and X.com posting appeared configured but were actually broken, either because API keys sat in the wrong config sections or because credentials had expired. These failures went undetected for days.
Root cause analysis
OpenClaw gateway configuration is spread across at least five locations:
- Main JSON file
- Environment variables in service files
- Docker flags
- Provider blocks
- Skills with their own credentials
Rotating a key in one location leaves others stale. Upgrading OpenClaw breaks hardcoded paths. Updating a skill causes credentials to silently stop loading. These are regressions that CI/CD would catch in software development, but there was no CI for the gateway infrastructure.
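The stale-key problem described above can be illustrated with a small consistency check. This is only a sketch under assumptions: it presumes the credentials from each location have already been loaded into plain dictionaries, that one copy is designated as the reference, and the names find_stale and fingerprint are hypothetical rather than part of OpenClaw.

```python
import hashlib
from typing import Dict, List


def fingerprint(secret: str) -> str:
    """Short hash so that reports never contain the secret itself."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def find_stale(canonical: Dict[str, str],
               locations: Dict[str, Dict[str, str]]) -> List[str]:
    """Compare every location's credentials against the reference copy."""
    findings = []
    for location, creds in sorted(locations.items()):
        for name, value in sorted(creds.items()):
            if name not in canonical:
                findings.append(f"{location}: {name} is not in the canonical set")
            elif fingerprint(value) != fingerprint(canonical[name]):
                findings.append(f"{location}: {name} differs from canonical")
    return findings
```

Running a check like this after every key rotation would surface the copies that were missed, instead of leaving them to fail silently later.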
Solution being implemented
Capability audit, run before and after any change:
- Parse config to enumerate claimed capabilities
- Verify each one actually works with live API tests (5-second timeout)
- Diff before/after snapshots
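The audit steps above could be sketched roughly as follows. This is a minimal illustration, not the developer's actual tooling: the probe callables, the status strings, and the function names audit and diff_snapshots are all hypothetical, and each probe is assumed to return True when its live API test succeeds.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple

PROBE_TIMEOUT_SECONDS = 5.0  # the 5-second limit mentioned in the list above


def audit(probes: Dict[str, Callable[[], bool]],
          timeout: float = PROBE_TIMEOUT_SECONDS) -> Dict[str, str]:
    """Run every probe and record 'ok', 'failed', or 'timeout' per capability."""
    snapshot = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(probes)))
    # Start all probes at once so one slow service does not delay the others.
    futures = {name: pool.submit(probe) for name, probe in probes.items()}
    for name, future in futures.items():
        try:
            snapshot[name] = "ok" if future.result(timeout=timeout) else "failed"
        except FutureTimeout:
            snapshot[name] = "timeout"
        except Exception:
            # A probe that raises counts as a broken capability.
            snapshot[name] = "failed"
    pool.shutdown(wait=False)
    return snapshot


def diff_snapshots(before: Dict[str, str],
                   after: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {capability: (old_status, new_status)} for everything that changed."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name, "absent")
        new = after.get(name, "absent")
        if old != new:
            changes[name] = (old, new)
    return changes
```

Taking one snapshot before a change and one after, then diffing them, turns "it looks configured" into an explicit list of capabilities whose status actually moved.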
Config validation gate, so that no edit goes directly to the live config:
- JSON validity check
- Timestamped backups
- Blocks known dangerous patterns
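A gate along these lines might look like the sketch below. The source does not say which patterns count as dangerous, so the two listed here (an empty API key and a leftover placeholder) are purely illustrative assumptions, as are the function names and the backup file naming.

```python
import json
import re
import shutil
import time
from pathlib import Path
from typing import List

# Hypothetical examples only; the real list of dangerous patterns is not given.
DANGEROUS_PATTERNS = [
    (re.compile(r'"api_?key"\s*:\s*""', re.IGNORECASE), "empty API key"),
    (re.compile(r"CHANGEME"), "placeholder value left in config"),
]


def validate(text: str) -> List[str]:
    """Return a list of problems; an empty list means the candidate passes."""
    problems = []
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON: {exc.msg} at line {exc.lineno}")
    for pattern, reason in DANGEROUS_PATTERNS:
        if pattern.search(text):
            problems.append(reason)
    return problems


def apply_change(live: Path, candidate_text: str) -> List[str]:
    """Write the candidate to the live path only if it passes validation."""
    problems = validate(candidate_text)
    if problems:
        return problems  # the live file is left untouched
    if live.exists():
        # Keep a timestamped copy of the config that is about to be replaced.
        stamp = time.strftime("%Y%m%dT%H%M%S")
        shutil.copy2(live, live.with_name(f"{live.name}.{stamp}.bak"))
    live.write_text(candidate_text)
    return []
```

Routing every edit through a function like apply_change means a malformed file never reaches the running gateway, and a rollback is always one file copy away.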
Reproducible environment:
- Version-agnostic service files (no hardcoded paths)
- One canonical credential file, with everything else deriving from it
- Crash-loop detection (3 failures = diagnose mode, not restart mode)
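The crash-loop rule in the last bullet can be sketched as a small state tracker. The threshold of 3 failures comes from the text above; the 15-minute window, the class name CrashLoopGuard, and the returned mode strings are assumptions made for illustration.

```python
from collections import deque
from typing import Deque


class CrashLoopGuard:
    """Decide between restarting and diagnosing based on recent failures."""

    def __init__(self, max_failures: int = 3, window_seconds: float = 900.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds  # assumed window, not from the source
        self._failures: Deque[float] = deque()

    def record_failure(self, now: float) -> str:
        """Record a crash at time `now` (in seconds) and return the mode to use."""
        self._failures.append(now)
        # Forget failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            return "diagnose"  # structural problem: stop restarting, inspect first
        return "restart"       # likely transient: a plain restart is reasonable
```

A rescue script that consults a guard like this would restart after the first two crashes and then stop to inspect the service file, rather than restarting every 5 minutes indefinitely.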
Regression detector:
- Daily comparison against known-good baseline
- Classify changes as improvement vs. degradation
- Alert on capability loss
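One way to express the baseline comparison is sketched below. It assumes each capability is recorded with a status string where "ok" means working; that convention, and the names classify and should_alert, are hypothetical.

```python
from typing import Dict, List


def classify(baseline: Dict[str, str],
             current: Dict[str, str]) -> Dict[str, List[str]]:
    """Sort capabilities into improvement, degradation, or unchanged."""
    report: Dict[str, List[str]] = {
        "improvement": [], "degradation": [], "unchanged": [],
    }
    for name in sorted(set(baseline) | set(current)):
        was_ok = baseline.get(name) == "ok"
        is_ok = current.get(name) == "ok"
        if was_ok and not is_ok:
            report["degradation"].append(name)   # worked before, broken now
        elif is_ok and not was_ok:
            report["improvement"].append(name)   # new or newly repaired
        else:
            report["unchanged"].append(name)
    return report


def should_alert(report: Dict[str, List[str]]) -> bool:
    """Alert only when a capability that used to work has been lost."""
    return bool(report["degradation"])
```

Run daily against a known-good baseline, a comparison like this would have flagged the broken text-to-speech and email capabilities within a day instead of leaving them unnoticed.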
The developer is sharing this work early and asks other AI infrastructure operators: "How do you handle gateway management?" and "What's your testing strategy for your openclaw?"
📖 Read the full source: r/openclaw