Practical Lessons from Building a Permanent Local AI Companion Agent

By OpenClawRadar | Published: March 24, 2026

Setup and Architecture

A developer has been running a self-hosted AI agent on an M4 Mac mini for several months. The setup uses a Rust runtime with qwen2.5:14b on Ollama for fast local inference. The system implements a model ladder that escalates to cloud models when tasks require more capability. Memory is handled with SQLite and local embeddings using nomic-embed-text for semantic recall across sessions. The agent runs 24/7 via launchd and performs various tasks including monitoring a trading bot, checking email, deploying websites, and delegating heavy implementation work to Claude Code through a task runner.

Key Lessons Learned

Memory architecture is everything: The breakthrough was hybrid recall, which runs BM25 keyword search alongside vector similarity search and merges the two result sets using weighted scores. In the developer's experience, a 14B model with good memory recall outperforms a 70B model that starts every conversation cold.
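The hybrid recall idea can be sketched in a few lines. This is a minimal illustration, not the developer's Rust implementation: the function names, the min-max normalization step, and the 0.4 keyword weight are all assumptions, and the vectors are assumed to come from an embedding model such as nomic-embed-text.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with plain Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            freq = tf[term]
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def normalize(values):
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def hybrid_recall(query, query_vec, docs, doc_vecs, keyword_weight=0.4, top_k=3):
    """Merge keyword and vector scores; keyword_weight is an assumed default."""
    if not docs:
        return []
    kw = normalize(bm25_scores(query, docs))
    vec = normalize([cosine(query_vec, v) for v in doc_vecs])
    merged = [keyword_weight * k + (1 - keyword_weight) * v
              for k, v in zip(kw, vec)]
    ranked = sorted(range(len(docs)), key=lambda i: merged[i], reverse=True)
    return [(docs[i], round(merged[i], 3)) for i in ranked[:top_k]]
```

Normalizing both score lists before merging keeps raw BM25 magnitudes from swamping cosine values, which always lie between -1 and 1. The weight itself would need tuning against real recall queries.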

The system prompt tax is real: The identity files started at ~10K tokens and were cut to ~2,800 tokens by removing anything the agent could look up on demand. The rule: if the agent needs something occasionally, put it in memory; if it needs it on every message, put it in the system prompt.

Local embeddings changed the economics: Using nomic-embed-text on Ollama alongside the conversation model makes every memory store and recall operation free, eliminating costs that previously accumulated from OpenAI embedding requests.

The model ladder matters more than the default model: The agent defaults to local qwen for conversation (free, fast) but can escalate to Minimax, Kimi, Haiku, Sonnet, or Opus depending on task requirements. The key insight: let the user switch models manually, with commands like /model sonnet for reasoning tasks and /model qwen for chatting, rather than trying to auto-detect which model a task needs.
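Manual switching of this kind needs very little machinery. The sketch below is a hypothetical handler: the `Session` class, the `handle_message` function, and the short aliases are illustrative, and mapping each alias to a concrete model identifier or provider is left out because the source does not give those details.

```python
from dataclasses import dataclass

# Aliases taken from the article; the tuple order does not imply a ranking.
MODELS = ("qwen", "minimax", "kimi", "haiku", "sonnet", "opus")
DEFAULT_MODEL = "qwen"

@dataclass
class Session:
    """Per-conversation state; only the active model is tracked here."""
    model: str = DEFAULT_MODEL

def handle_message(session, text):
    """Return (reply, prompt). A /model command yields a reply and no prompt;
    any other text is passed through as the prompt for the active model."""
    parts = text.strip().split()
    if not parts or parts[0] != "/model":
        return None, text
    if len(parts) != 2:
        return "Usage: /model <" + "|".join(MODELS) + ">", None
    choice = parts[1].lower()
    if choice not in MODELS:
        return f"Unknown model '{choice}'. Still using {session.model}.", None
    session.model = choice
    return f"Switched to {choice}.", None
```

An unknown alias leaves the active model unchanged, so a typo never silently routes a conversation to an unintended (and possibly paid) model.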

Tool iteration limits need headroom: The initial cap of 10 tool calls per message proved insufficient. Simple tasks burn 3-5 tool calls, while complex tasks need 15-20. The current setup allows 25 tool calls per message, with a rate limit of 200 actions per hour as a safety net.
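The two limits can be enforced by a single small guard. This is a sketch under assumptions: the `ActionBudget` class and its method names are hypothetical, and the hourly limit is modeled as a sliding one-hour window, which the source does not specify.

```python
import time
from collections import deque

class ActionBudget:
    """Per-message tool-call cap plus a sliding one-hour action limit."""

    def __init__(self, per_message=25, per_hour=200, clock=time.monotonic):
        self.per_message = per_message
        self.per_hour = per_hour
        self.clock = clock              # injectable so tests can fake time
        self.used_this_message = 0
        self.recent = deque()           # timestamps of actions in the window

    def start_message(self):
        """Reset the per-message counter when a new user message arrives."""
        self.used_this_message = 0

    def try_consume(self):
        """Return True and record the action if both limits allow it."""
        now = self.clock()
        # Drop timestamps that have aged out of the one-hour window.
        while self.recent and now - self.recent[0] >= 3600:
            self.recent.popleft()
        if self.used_this_message >= self.per_message:
            return False
        if len(self.recent) >= self.per_hour:
            return False
        self.used_this_message += 1
        self.recent.append(now)
        return True
```

The agent loop would call `start_message()` once per incoming message and `try_consume()` before each tool call, stopping the loop and reporting back to the user when it returns False.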

The hardest bug was cross-session memory: Memories stored explicitly via a store tool initially had no session_id, and recall queries filtered by current session_id. This made deliberately memorized facts invisible in future sessions. The fix was adding OR session_id IS NULL to the SQL query.
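The bug and its fix are easy to reproduce against an in-memory SQLite database. The table layout, column names other than session_id, and the sample rows below are assumptions made for illustration; only the `OR session_id IS NULL` clause comes from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, session_id TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO memories (session_id, content) VALUES (?, ?)",
    [
        ("session-a", "discussed deploy schedule"),
        (None, "user prefers short replies"),   # stored explicitly, no session
        ("session-b", "asked about email filters"),
    ],
)

def recall_buggy(conn, session_id):
    """Original filter: rows with a NULL session_id never match."""
    rows = conn.execute(
        "SELECT content FROM memories WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return [r[0] for r in rows]

def recall_fixed(conn, session_id):
    """Fixed filter: session-less memories are visible from every session."""
    rows = conn.execute(
        "SELECT content FROM memories "
        "WHERE session_id = ? OR session_id IS NULL ORDER BY id",
        (session_id,),
    )
    return [r[0] for r in rows]
```

In SQL, comparing a NULL column with `=` never evaluates to true, so the explicitly stored row drops out of the buggy query for every session. The fixed query returns it alongside the current session's own rows.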

📖 Read the full source: r/LocalLLaMA
