Qwen3-0.6B INT8 Embedding Backbone for AI Memory System

A developer has shared a local embedding system built on Qwen3-0.6B, quantized to INT8 and run through ONNX Runtime. It serves as the backbone of an AI memory lifecycle system that runs inside Claude Code.

Problem and Requirements

The system addresses scaling issues with hosted embedding APIs: a typical AI coding assistant makes hundreds of API calls per day across 15-25 sessions, which adds latency to every write and creates a dependency on external services with variable pricing. The requirements were 1024-dimensional vectors, a cosine similarity above 0.75 indicating genuine semantic relatedness, batch processing of 20+ entries, and zero API calls.

Model Selection and Implementation

In the developer's tests of several models, Qwen3-0.6B at 1024 dimensions separated genuinely related entries from structural noise (session logs that share a format but not a topic) better than sentence-transformers models did.
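One way to put a number on that kind of separation is the gap between the weakest related pair and the strongest noise pair. The sketch below is an illustration only, not the developer's evaluation method; the function names and the pair lists are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def separation(related_pairs, noise_pairs):
    """Gap between the weakest related pair and the strongest noise pair.

    A positive result means some threshold cleanly splits the two sets;
    a negative result means they overlap.
    """
    related = [cosine(a, b) for a, b in related_pairs]
    noise = [cosine(a, b) for a, b in noise_pairs]
    return min(related) - max(noise)
```

Running the same labeled pairs through each candidate model and comparing the resulting gaps gives a rough basis for choosing both a model and a threshold.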

The implementation uses ONNX Runtime with INT8 quantization. The cold start problem (3 seconds to load the model) was solved with a persistent embedding server on localhost:52525 that loads the model once at system boot. Warm inference takes ~12 ms per batch, roughly 250x faster than paying the 3-second cold start on every call.
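The load-once pattern can be sketched with Python's standard library. This is a minimal illustration, not the project's actual server: `load_model` is a hypothetical stub standing in for ONNX session creation, and the `/embed` path and JSON shape are assumptions. The demo binds to port 0 (any free port); the system described here uses 52525.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

LOAD_COUNT = 0  # how many times the expensive load has run


def load_model():
    """Hypothetical stand-in for the slow model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # A real server would create its ONNX Runtime session here. This stub
    # maps each text to a toy 2-number vector so the sketch is runnable.
    return lambda text: [float(len(text)), float(text.count(" "))]


MODEL = load_model()  # the load cost is paid once, at process start


class EmbedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Any POST path is accepted; the body is {"texts": [...]}.
        length = int(self.headers.get("Content-Length", 0))
        texts = json.loads(self.rfile.read(length))["texts"]
        body = json.dumps({"vectors": [MODEL(t) for t in texts]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), EmbedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_embeddings(port, texts):
    """Send one batch of texts and return the vectors from the response."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/embed", body=json.dumps({"texts": texts}),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())["vectors"]
    finally:
        conn.close()
```

Every request reuses the module-level `MODEL`, so `LOAD_COUNT` stays at 1 no matter how many batches arrive. That is the property that turns a multi-second load into a one-time cost.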

System Architecture

The server starts automatically via a startup hook
If the server goes down, the system falls back to direct ONNX loading (slower but functional)
All CPU-based, no GPU needed
Single Python script, ~2,900 lines, SQLite + ONNX
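The fallback behavior in the list above might look like the following sketch. The `/embed` path, the JSON shape, and the `local` callable (standing in for direct ONNX loading) are assumptions for illustration; only the port comes from the write-up.

```python
import http.client
import json


def embed_via_server(texts, host="127.0.0.1", port=52525, timeout=2.0):
    """Ask the persistent server for vectors (endpoint and payload are assumed)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("POST", "/embed", body=json.dumps({"texts": texts}),
                     headers={"Content-Type": "application/json"})
        resp = conn.getresponse()
        if resp.status != 200:
            raise ConnectionError(f"embedding server returned {resp.status}")
        return json.loads(resp.read())["vectors"]
    finally:
        conn.close()


def embed(texts, remote=embed_via_server, local=None):
    """Try the warm server first; on a connection failure use the local loader.

    Returns (vectors, source) so callers can log which path was taken.
    """
    try:
        return remote(texts), "server"
    except (OSError, http.client.HTTPException):
        if local is None:
            raise  # no fallback configured, surface the original error
        return local(texts), "local"
```

Passing the two paths in as callables keeps the fallback decision in one place and makes it easy to exercise without a running server.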

Memory Lifecycle Phases

The system processes knowledge through 5 phases, with embeddings driving phases 2 through 4:

Buffer
Connect: New entries get linked to existing entries whose cosine similarity is above 0.75. Isolated entries fade over time while connected entries survive, so expiry is based on isolation, not time.
Consolidate: Groups of 3+ connected entries get merged into proven knowledge by an LLM (Gemini Flash free tier).
Route: Proven knowledge gets routed to the right config file based on embedding distance to existing content.
Age
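The embedding-driven phases can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the project's code: it treats "groups of 3+ connected entries" as connected components of the similarity graph, takes "closest" in routing to mean highest cosine similarity, and uses hypothetical entry ids and file names. The LLM merge step is left out.

```python
import math
from collections import defaultdict

THRESHOLD = 0.75  # similarity threshold quoted in the write-up


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def connect(vectors, threshold=THRESHOLD):
    """Connect: link every pair of entries whose similarity exceeds the threshold."""
    links = defaultdict(set)
    ids = list(vectors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                links[a].add(b)
                links[b].add(a)
    return links


def isolated(vectors, links):
    """Entries with no links, i.e. the candidates for fading out."""
    return sorted(e for e in vectors if not links.get(e))


def groups(vectors, links, min_size=3):
    """Consolidate: connected components with at least min_size entries."""
    seen, result = set(), []
    for start in vectors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over the link graph
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(links.get(node, set()) - component)
        seen |= component
        if len(component) >= min_size:
            result.append(sorted(component))
    return result


def route(vector, targets):
    """Route: pick the target whose embedding is most similar to the vector."""
    return max(targets, key=lambda name: cosine(vector, targets[name]))
```

In a full pipeline, each group returned by `groups` would be handed to the LLM for merging, and the merged text would be embedded again before being passed to `route`.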

Technical Details

Model: Qwen3-0.6B quantized to INT8
Vector dimensions: 1024
Similarity threshold: 0.75 cosine similarity for genuine semantic relatedness
Performance: ~12ms per batch for warm inference
Hardware: Runs on the CPU of any modern machine, with no GPU required

The project is open source at github.com/living0tribunal-dev/claude-memory-lifecycle, along with a detailed engineering write-up covering threshold decisions and failure modes observed after processing 3,874 memories.

📖 Read the full source: r/LocalLLaMA