OpenClaw Gateway Silent Failures After 25 Days: Zombified State Fix

Gateway Failure Pattern

An OpenClaw user who ran the system daily for approximately 25 days, with 18+ cron jobs and Telegram integration, has documented a recurring reliability issue. The gateway does not crash outright; it enters a 'zombified' state in which its status still shows as 'running' while all functionality has stopped. Cron jobs stay stuck indefinitely, messages fail to deliver, and no alerts are generated, not even by the health monitor cron job.

Specific Issues Encountered

Invalid model in config: The gateway accepted an invalid configuration at write time, then failed silently on every agent turn instead of rejecting the configuration immediately.
Session hangs: Connection errors caused 15-minute blackouts with no auto-recovery or notification.
Session file locks held forever: Hung tool calls hold write locks indefinitely, blocking ALL cron jobs. The only fix is a full restart.
Gateway won't start on boot: LaunchAgent proved unreliable on macOS, requiring a @reboot sleep 30 crontab workaround.
Restarts reset cron timing: Jobs re-fire or miss windows after restart. Model aliases also break intermittently.
Cron delivery fails in isolated sessions: Message tool lacks delivery permissions in isolated sessions, requiring payload restructuring.
Major incident: Session write lock held for 4.3 hours with 7 cron jobs stuck in phantom 'running' state. Simultaneously, an update broke plugin paths and the model catalog module.
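
The first issue above (an invalid model accepted at write time) is the kind of failure that a fail-fast check can surface early. The following is a minimal Python sketch of that idea, not OpenClaw's actual code: the names validate_model, ConfigError, and KNOWN_MODELS, as well as the model names, are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: none of these names come from OpenClaw itself.
KNOWN_MODELS = {"example-model-small", "example-model-large"}  # assumed model catalog


class ConfigError(ValueError):
    """Raised when a config value cannot be resolved at write time."""


def validate_model(config, known_models):
    """Return the configured model name, or raise ConfigError immediately.

    Rejecting here means a bad value is reported when the config is
    written, rather than failing silently on every later agent turn.
    """
    model = config.get("model")
    if not isinstance(model, str) or model not in known_models:
        raise ConfigError(
            f"unknown model {model!r}; expected one of {sorted(known_models)}"
        )
    return model
```

Calling validate_model before persisting a config change would turn the silent per-turn failure into a single, visible error at the moment the bad value is entered.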

Proposed Fixes

Write lock timeouts (force-release after 10 minutes)
Gateway self-health loop (check model resolution, session writes, channel connectivity every 5 minutes)
Cron stuck detection (auto-reset jobs that have been 'running' for longer than 2x their timeout)
Update-safe restarts (npm update should trigger a graceful restart)
An openclaw cron reset <id> command to unstick jobs without a full restart
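
Two of the proposals above, write lock timeouts and cron stuck detection, reduce to simple age checks. The sketch below shows one way they might look in Python. It is an illustration under stated assumptions, not OpenClaw's implementation: the Job fields, the function names, and the "idle"/"running" status strings are all hypothetical; only the 10-minute and 2x-timeout thresholds come from the proposals.

```python
from dataclasses import dataclass

LOCK_MAX_AGE_S = 10 * 60  # proposed force-release threshold: 10 minutes


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    job_id: str
    status: str        # "idle" or "running"
    started_at: float  # epoch seconds when the current run began
    timeout_s: float   # configured per-job timeout in seconds


def find_stuck_jobs(jobs, now):
    """Return jobs that have been 'running' for longer than 2x their timeout."""
    return [
        job for job in jobs
        if job.status == "running" and now - job.started_at > 2 * job.timeout_s
    ]


def reset_stuck_jobs(jobs, now):
    """Mark stuck jobs as idle and return their ids, for logging or alerting."""
    stuck = find_stuck_jobs(jobs, now)
    for job in stuck:
        job.status = "idle"
    return [job.job_id for job in stuck]


def lock_is_stale(acquired_at, now, max_age_s=LOCK_MAX_AGE_S):
    """True when a write lock has been held longer than the allowed age."""
    return now - acquired_at > max_age_s
```

Under these assumptions, the 4.3-hour lock from the major incident (about 15,480 seconds) would have been flagged as stale after the first 10 minutes, and a periodic call to reset_stuck_jobs would have cleared the phantom 'running' jobs without a full restart.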