Local Qwen3-0.6B INT8 as Embedding Backbone for AI Memory System

A developer has shared their implementation of a local embedding system using Qwen3-0.6B quantized to INT8 via ONNX Runtime as the backbone for an AI memory lifecycle system that runs inside Claude Code.
Problem and Requirements
The system addresses the scaling problems of hosted embedding APIs: a typical AI coding assistant makes hundreds of embedding calls per day across 15-25 sessions, which adds latency to every write and creates a dependency on external services with variable pricing. The requirements were 1024-dimensional vectors, a cosine similarity above 0.75 that indicates genuine semantic relatedness, batch processing for 20+ entries, and zero API calls.
Model Selection and Implementation
The developer tested several models and found that Qwen3-0.6B at 1024 dimensions separated genuinely related entries from structural noise (session logs that share a format but not a topic) better than sentence-transformers models did.
The implementation uses ONNX Runtime with INT8 quantization. The cold start problem (3-second model loading) was solved with a persistent embedding server on localhost:52525 that loads the model once at system boot. Warm inference achieves ~12ms per batch, roughly 250x faster than cold start.
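The load-once pattern described above can be sketched with the Python standard library. This is only an illustration, not the project's code: `load_model`, `handle_request`, `make_server`, and the JSON request shape are hypothetical names and choices, and the stub returns placeholder vectors instead of running Qwen3-0.6B through ONNX Runtime. Only the port (52525) and the vector size (1024) come from the post.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DIM = 1024  # vector size stated in the post


def load_model():
    """Hypothetical stand-in for the slow one-time model load.

    A real version would build an ONNX Runtime session here; this stub
    returns placeholder vectors so the sketch runs without the model.
    """
    def embed(texts):
        # Placeholder only: these are NOT real embeddings.
        return [[float(len(t))] + [0.0] * (DIM - 1) for t in texts]
    return embed


def handle_request(embed, raw_body):
    """Turn a JSON request body into a JSON response body."""
    texts = json.loads(raw_body)["texts"]
    return json.dumps({"vectors": embed(texts)}).encode("utf-8")


def make_server(port=52525):
    embed = load_model()  # the expensive step happens once, at startup

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = handle_request(embed, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server().serve_forever()` would keep the process alive, so every later request reuses the already loaded model instead of paying the load cost again.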
System Architecture
- The server starts automatically via a startup hook
- If the server goes down, the system falls back to direct ONNX loading (slower but functional)
- All CPU-based, no GPU needed
- Single Python script, ~2,900 lines, SQLite + ONNX
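The fallback behaviour in the list above might look roughly like the following. This is a sketch under assumptions: the `/embed` path, the JSON shape, and the function names are hypothetical, and `direct_embed` stands for whatever slower in-process ONNX loading path the project uses.

```python
import json
import urllib.request

# Port from the post; the "/embed" path is an assumption.
SERVER_URL = "http://127.0.0.1:52525/embed"


def embed_via_server(texts, url=SERVER_URL, timeout=2.0):
    """Ask the persistent local server for embeddings."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"texts": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["vectors"]


def embed_with_fallback(texts, direct_embed, server_embed=embed_via_server):
    """Try the warm server first; fall back to direct loading if it fails.

    Returns (vectors, path) where path is "server" or "direct".
    """
    try:
        return server_embed(texts), "server"
    except (OSError, ValueError, KeyError):
        # OSError covers refused connections and timeouts (URLError is a
        # subclass); ValueError and KeyError cover malformed responses.
        return direct_embed(texts), "direct"
```

Returning the path taken alongside the vectors makes it easy to log how often the slow path is hit.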
Memory Lifecycle Phases
The system processes knowledge through 5 phases, with embeddings driving phases 2 through 4:
- Buffer
- Connect: New entries get linked to existing entries whose cosine similarity to them exceeds 0.75. Isolated entries fade over time while connected entries survive. Expiry is based on isolation, not time.
- Consolidate: Groups of 3+ connected entries get merged into proven knowledge by an LLM (Gemini Flash free tier)
- Route: Proven knowledge gets routed to the right config file based on embedding distance to existing content
- Age
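The embedding-driven phases can be illustrated with a small pure-Python sketch. It makes several assumptions that the post does not confirm: "groups of 3+ connected entries" is read as connected components of the link graph, routing is read as picking the target with the highest cosine similarity to a single representative vector, and all function names are hypothetical. Only the 0.75 threshold and the minimum group size of 3 come from the post.

```python
import math

THRESHOLD = 0.75  # similarity threshold from the post


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_entries(vectors, threshold=THRESHOLD):
    """Connect: return pairs (a, b), a < b, whose similarity exceeds threshold."""
    ids = sorted(vectors)
    return {
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(vectors[a], vectors[b]) > threshold
    }


def connected_groups(ids, links, min_size=3):
    """Consolidate candidates: connected components with at least min_size entries."""
    neighbours = {i: set() for i in ids}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in ids:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over the link graph
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) >= min_size:
            groups.append(sorted(component))
    return groups


def route(vector, targets):
    """Route: pick the target whose representative vector is most similar."""
    return max(targets, key=lambda name: cosine(vector, targets[name]))
```

With toy 2-D vectors `a=[1, 0]`, `b=[0.9, 0.1]`, `c=[0.8, 0.3]`, and `d=[0, 1]`, the first three link to each other and form one group, while `d` stays isolated and would be the entry that fades.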
Technical Details
- Model: Qwen3-0.6B quantized to INT8
- Vector dimensions: 1024
- Similarity threshold: 0.75 cosine similarity for genuine semantic relatedness
- Performance: ~12ms per batch for warm inference
- Hardware: Runs on any modern machine with CPU only
The project is open source at github.com/living0tribunal-dev/claude-memory-lifecycle with a detailed engineering story covering threshold decisions and failure modes after processing 3,874 memories.
📖 Read the full source: r/LocalLLaMA