Ninetails Memory Engine V4.5: Int8 Quantization + LRU Cache Cuts Local MCP Memory to 60MB

The Ninetails Memory Engine V4.5 addresses the memory bottleneck in local MCP (Model Context Protocol) tools by implementing Int8 scalar quantization combined with LRU cache eviction. The solution keeps the entire engine process running inside a Tauri desktop app at 40-60MB of RAM.
The Memory Problem
A standard 1536-dimension float32 embedding takes about 6144 bytes (~6KB). Storing 10,000 memories means ~60MB just for vectors, scaling to ~600MB for 100,000 memories. For a local tool running on SQLite, this resource consumption is unacceptable.
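The arithmetic behind these figures can be checked with a short sketch. The helper name `vector_storage_bytes` is hypothetical, and the calculation covers only the raw vector payload, not SQLite or Python object overhead.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_INT8 = 1
DIMS = 1536


def vector_storage_bytes(num_vectors: int, dims: int, bytes_per_dim: int) -> int:
    """Raw vector payload size in bytes (hypothetical helper, no storage overhead)."""
    return num_vectors * dims * bytes_per_dim


for count in (1, 10_000, 100_000):
    fp32 = vector_storage_bytes(count, DIMS, BYTES_PER_FLOAT32)
    int8 = vector_storage_bytes(count, DIMS, BYTES_PER_INT8)
    print(f"{count:>7} vectors: float32 {fp32 / 1e6:8.2f} MB | int8 {int8 / 1e6:8.2f} MB")
```

In decimal megabytes, the float32 totals come to about 61MB and 614MB, which the article rounds to ~60MB and ~600MB.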
Technical Implementation
Layer 1: Int8 Scalar Quantization
By compressing float32 (4 bytes/dim) down to int8 (1 byte/dim), storage volume is reduced to a quarter of its original size. The implementation calculates the numerical range of each dimension, maps floats to a -128 to 127 integer range, and dequantizes back to float32 during retrieval for cosine similarity.
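import numpy as np

# Quantize: float32 → int8
def quantize_vector(vector_fp32, scale, zero_point):
    # Rescale to integer steps, shift by the zero point, then clamp to the int8 range
    quantized = np.round(vector_fp32 / scale) + zero_point
    return np.clip(quantized, -128, 127).astype(np.int8)

# Dequantize: int8 → float32 (Approximation)
def dequantize_vector(vector_int8, scale, zero_point):
    # Undo the shift and rescale; rounding and clipping error is not recoverable
    return (vector_int8.astype(np.float32) - zero_point) * scale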
Real-world result: A 1536-dim vector drops from 6144 bytes to 1536 bytes. Factoring in global scale and zero_point overhead, the real compression ratio is around 3.8x - 4.0x.
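The source does not show how `scale` and `zero_point` are derived. The sketch below assumes a single global pair computed from the minimum and maximum of the stored values. The `compute_quant_params` helper and the random test data are illustrative assumptions, not the engine's code. It then round-trips one vector to show the size reduction and how closely the dequantized vector matches the original.

```python
import numpy as np


def compute_quant_params(vectors_fp32: np.ndarray) -> tuple:
    """Assumed scheme: one global (scale, zero_point) from the observed value range."""
    v_min = float(vectors_fp32.min())
    v_max = float(vectors_fp32.max())
    scale = (v_max - v_min) / 255.0 or 1.0  # fall back to 1.0 for constant input
    zero_point = int(round(-128 - v_min / scale))  # maps v_min to roughly -128
    return scale, zero_point


def quantize_vector(vector_fp32, scale, zero_point):
    quantized = np.round(vector_fp32 / scale) + zero_point
    return np.clip(quantized, -128, 127).astype(np.int8)


def dequantize_vector(vector_int8, scale, zero_point):
    return (vector_int8.astype(np.float32) - zero_point) * scale


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-in for real embeddings (assumption: roughly normal values)
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 1536)).astype(np.float32)

scale, zero_point = compute_quant_params(vectors)
original = vectors[0]
quantized = quantize_vector(original, scale, zero_point)
restored = dequantize_vector(quantized, scale, zero_point)

print("bytes:", original.nbytes, "->", quantized.nbytes)
print("cosine(original, restored):", cosine_similarity(original, restored))
print("max abs error:", float(np.max(np.abs(restored - original))))
```

A per-dimension range, as the description above also mentions, would store one scale and zero point per dimension instead of a single pair, trading a little extra overhead for tighter value ranges.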
Layer 2: LRU Cache Eviction
Quantized vectors are stored in a SQLite database (vector_cache.sqlite) using a Least Recently Used strategy with a hard cap of 10,000 entries. High-frequency vectors stay in RAM while stale ones are evicted.
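One way such a capped cache could look is sketched below. The table name, column names, and the `SqliteLruVectorCache` class are hypothetical, and a logical counter stands in for timestamps so that recency ordering is deterministic. This is a minimal illustration of LRU eviction on SQLite, not the engine's actual schema.

```python
import sqlite3
from typing import Optional


class SqliteLruVectorCache:
    """Hypothetical LRU cache of int8 vector blobs backed by SQLite."""

    def __init__(self, path: str = ":memory:", max_entries: int = 10_000) -> None:
        self.max_entries = max_entries
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS vector_cache ("
            " memory_id TEXT PRIMARY KEY,"
            " vector BLOB NOT NULL,"
            " last_used INTEGER NOT NULL)"
        )
        self.conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_vector_cache_last_used"
            " ON vector_cache(last_used)"
        )
        # Resume the logical clock from whatever is already stored
        row = self.conn.execute(
            "SELECT COALESCE(MAX(last_used), 0) FROM vector_cache"
        ).fetchone()
        self._clock = row[0]

    def _tick(self) -> int:
        self._clock += 1
        return self._clock

    def put(self, memory_id: str, vector_int8: bytes) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO vector_cache (memory_id, vector, last_used)"
                " VALUES (?, ?, ?)",
                (memory_id, vector_int8, self._tick()),
            )
            # Keep the max_entries most recently used rows, delete the rest
            self.conn.execute(
                "DELETE FROM vector_cache WHERE memory_id IN ("
                " SELECT memory_id FROM vector_cache"
                " ORDER BY last_used DESC LIMIT -1 OFFSET ?)",
                (self.max_entries,),
            )

    def get(self, memory_id: str) -> Optional[bytes]:
        with self.conn:
            row = self.conn.execute(
                "SELECT vector FROM vector_cache WHERE memory_id = ?", (memory_id,)
            ).fetchone()
            if row is None:
                return None
            # A read counts as a use, so refresh the recency marker
            self.conn.execute(
                "UPDATE vector_cache SET last_used = ? WHERE memory_id = ?",
                (self._tick(), memory_id),
            )
            return row[0]


cache = SqliteLruVectorCache(max_entries=2)
cache.put("a", b"\x01")
cache.put("b", b"\x02")
cache.get("a")           # "a" becomes the most recently used entry
cache.put("c", b"\x03")  # exceeds the cap, so "b" is evicted
print(cache.get("b"))    # None
```

With the cap set to 10,000, the same eviction query keeps the table bounded no matter how many memories are written over time.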
Precision Considerations
Int8 quantization is lossy but acceptable for memory retrieval because:
- The engine uses hybrid search: 70% vector similarity + 30% BM25. Even if quantization slightly skews vector ranking, exact keyword matching via BM25 pulls relevant memories back up.
- AI memory retrieval only needs to surface context into the Top-5 results, unlike recommendation algorithms that need absolute precision for the #1 spot.
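The 70/30 blend can be illustrated with a small sketch. The source gives only the weights, so the min-max normalization of BM25 scores, the function names, and the sample scores below are all assumptions made for illustration.

```python
def min_max_normalize(scores: dict) -> dict:
    """Scale raw BM25 scores into [0, 1] so they are comparable to cosine similarity."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(vector_scores: dict, bm25_scores: dict, top_k: int = 5,
                vector_weight: float = 0.7, bm25_weight: float = 0.3) -> list:
    """Blend the two signals and return the top_k (memory_id, score) pairs."""
    bm25_norm = min_max_normalize(bm25_scores)
    memory_ids = set(vector_scores) | set(bm25_scores)
    combined = {
        mid: vector_weight * vector_scores.get(mid, 0.0)
        + bm25_weight * bm25_norm.get(mid, 0.0)
        for mid in memory_ids
    }
    # Sort by descending score, then by id so that ties are deterministic
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Made-up scores: m1 narrowly wins on vector similarity, m2 has the keyword match
vector_scores = {"m1": 0.82, "m2": 0.80, "m3": 0.40}
bm25_scores = {"m1": 1.0, "m2": 7.5, "m3": 3.0}

ranked = hybrid_rank(vector_scores, bm25_scores)
print([memory_id for memory_id, _ in ranked])
```

Here `m2` overtakes `m1` despite a slightly lower vector score, which is the kind of correction the BM25 term provides when quantization noise perturbs close vector rankings.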
Clarification on "TurboQuant"
The engine uses standard Int8 scalar quantization for SQLite vector storage, not Google's TurboQuant (ICLR 2026), which is a 3-bit compression algorithm (PolarQuant + QJL) designed for KV Cache during LLM GPU inference. The branding "TurboQuant Compression" in the UI is a nod to the philosophy of aggressive bit-reduction.
Full Tech Stack
- Vector Compression: Int8 Scalar Quantization (~4x real compression)
- Cache Management: SQLite + LRU Eviction (Cap: 10,000 entries)
- Search Engine: Hybrid: 70% Vector Similarity + 30% BM25
- Profile Manager: Automatic STATIC/DYNAMIC fact extraction
- Fact Extraction: asyncio.to_thread background async LLM calls
- Data Storage: 3x SQLite Databases (100% Local)
- Desktop App: Tauri + Vue 3 + PyInstaller sidecar
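As a rough illustration of the `asyncio.to_thread` pattern named in the Fact Extraction entry, the sketch below moves a blocking call onto a worker thread so the event loop stays responsive. The `extract_facts_blocking` function is a hypothetical stand-in that sleeps and splits sentences; it does not call any real LLM API.

```python
import asyncio
import time


def extract_facts_blocking(text: str) -> list:
    """Hypothetical stand-in for a blocking LLM request."""
    time.sleep(0.05)  # simulates network latency
    return [part.strip() for part in text.split(".") if part.strip()]


async def extract_facts_in_background(text: str) -> list:
    # Run the blocking function in a worker thread instead of on the event loop
    return await asyncio.to_thread(extract_facts_blocking, text)


async def main() -> list:
    task = asyncio.create_task(
        extract_facts_in_background("User prefers Rust. User lives in Berlin.")
    )
    await asyncio.sleep(0)  # the loop is free to serve other work while the thread runs
    return await task


facts = asyncio.run(main())
print(facts)
```

`asyncio.to_thread` requires Python 3.9 or later.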
The engine is open-source under MIT License at GitHub: sunhonghua1/ninetails-memory-engine.
📖 Read the full source: r/LocalLLaMA