Ninetails Memory Engine V4.5: Int8 Quantization + LRU Cache Cuts Local MCP Memory to 60MB

✍️ OpenClawRadar | 📅 Published: April 1, 2026

The Ninetails Memory Engine V4.5 tackles the memory bottleneck in local MCP (Model Context Protocol) tools by combining Int8 scalar quantization with LRU cache eviction. Together, the two layers keep the entire engine process, which runs inside a Tauri desktop app, at 40-60MB of RAM.

The Memory Problem

A standard 1536-dimension float32 embedding takes 6144 bytes (1536 × 4, about 6KB). Storing 10,000 memories means ~60MB for vectors alone, scaling to ~600MB for 100,000 memories. For a local tool running on SQLite, that footprint is unacceptable.
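
The arithmetic behind those figures can be checked with a short sketch. The helper name `vector_store_bytes` is made up for illustration and is not part of the engine; the calculation counts raw vector bytes only and ignores SQLite and Python object overhead.

```python
DIMENSIONS = 1536          # embedding size quoted in the article
BYTES_PER_FLOAT32 = 4      # float32 uses 4 bytes per dimension


def vector_store_bytes(num_memories: int, bytes_per_dim: int = BYTES_PER_FLOAT32) -> int:
    """Raw bytes needed to hold num_memories embeddings (hypothetical helper)."""
    return num_memories * DIMENSIONS * bytes_per_dim


for count in (1, 10_000, 100_000):
    size = vector_store_bytes(count)
    print(f"{count:>7,} memories: {size:,} bytes ({size / 1e6:.1f} MB)")
```

Passing `bytes_per_dim=1` gives the corresponding int8 footprint.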

Technical Implementation

Layer 1: Int8 Scalar Quantization

Compressing float32 (4 bytes/dim) down to int8 (1 byte/dim) cuts vector storage to a quarter of its original size. The implementation calculates the numerical range of each dimension, maps floats to the integer range -128 to 127, and dequantizes back to float32 at retrieval time to compute cosine similarity.

import numpy as np

# Quantize: float32 → int8
def quantize_vector(vector_fp32, scale, zero_point):
    # Scale, shift by the zero point, then clamp to the int8 range
    quantized = np.round(vector_fp32 / scale) + zero_point
    return np.clip(quantized, -128, 127).astype(np.int8)

# Dequantize: int8 → float32 (Approximation)
def dequantize_vector(vector_int8, scale, zero_point):
    # Undo the shift and scaling; rounding error is not recoverable
    return (vector_int8.astype(np.float32) - zero_point) * scale

Real-world result: A 1536-dim vector drops from 6144 bytes to 1536 bytes. Factoring in global scale and zero_point overhead, the real compression ratio is around 3.8x - 4.0x.
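
The article does not show how `scale` and `zero_point` are derived. The sketch below assumes one common approach: a single global pair computed from the minimum and maximum over all stored values. The function `compute_quant_params` and the random test data are illustrative assumptions, not the engine's code, and a per-dimension variant would compute one pair per column instead.

```python
import numpy as np


def compute_quant_params(vectors_fp32):
    """Assumed scheme: one global (scale, zero_point) from the overall min/max."""
    v_min = float(vectors_fp32.min())
    v_max = float(vectors_fp32.max())
    scale = (v_max - v_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # all values identical; any non-zero scale works
    # Choose zero_point so that v_min maps to -128 and v_max to 127
    zero_point = int(round(-128 - v_min / scale))
    return scale, zero_point


def quantize_vector(vector_fp32, scale, zero_point):
    quantized = np.round(vector_fp32 / scale) + zero_point
    return np.clip(quantized, -128, 127).astype(np.int8)


def dequantize_vector(vector_int8, scale, zero_point):
    return (vector_int8.astype(np.float32) - zero_point) * scale


# Synthetic stand-in for real embeddings
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 1536)).astype(np.float32)

scale, zero_point = compute_quant_params(vectors)
original = vectors[0]
quantized = quantize_vector(original, scale, zero_point)
restored = dequantize_vector(quantized, scale, zero_point)

cosine = float(np.dot(original, restored)
               / (np.linalg.norm(original) * np.linalg.norm(restored)))
max_error = float(np.abs(original - restored).max())
print(f"bytes: {original.nbytes} -> {quantized.nbytes}")
print(f"cosine(original, restored) = {cosine:.6f}, max abs error = {max_error:.4f}")
```

On data like this, the per-element error stays within roughly half a quantization step, which is why the restored vector points in nearly the same direction as the original.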

Layer 2: LRU Cache Eviction

Quantized vectors are stored in a SQLite database (vector_cache.sqlite) managed with a Least Recently Used (LRU) strategy and a hard cap of 10,000 entries. Frequently accessed vectors stay in RAM, while stale ones are evicted.
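
A minimal sketch of LRU eviction on top of SQLite might look like the following. The table layout, the `VectorLRUCache` class, and the `last_used` counter are assumptions for illustration; the article does not describe the engine's actual schema.

```python
import sqlite3

MAX_ENTRIES = 10_000  # cap quoted in the article


class VectorLRUCache:
    """Hypothetical LRU cache of quantized vectors stored as SQLite BLOBs."""

    def __init__(self, path=":memory:", max_entries=MAX_ENTRIES):
        self.max_entries = max_entries
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors ("
            "memory_id TEXT PRIMARY KEY, "
            "data BLOB NOT NULL, "
            "last_used INTEGER NOT NULL)"
        )
        # Resume the access counter if the database already has rows
        row = self.db.execute("SELECT COALESCE(MAX(last_used), 0) FROM vectors").fetchone()
        self.tick = row[0]

    def _next_tick(self):
        self.tick += 1
        return self.tick

    def put(self, memory_id, blob):
        self.db.execute(
            "INSERT OR REPLACE INTO vectors (memory_id, data, last_used) VALUES (?, ?, ?)",
            (memory_id, blob, self._next_tick()),
        )
        # Keep the max_entries most recently used rows and delete the rest
        self.db.execute(
            "DELETE FROM vectors WHERE memory_id IN ("
            "SELECT memory_id FROM vectors ORDER BY last_used DESC LIMIT -1 OFFSET ?)",
            (self.max_entries,),
        )
        self.db.commit()

    def get(self, memory_id):
        row = self.db.execute(
            "SELECT data FROM vectors WHERE memory_id = ?", (memory_id,)
        ).fetchone()
        if row is None:
            return None
        # A read counts as a use, so refresh the row's recency
        self.db.execute(
            "UPDATE vectors SET last_used = ? WHERE memory_id = ?",
            (self._next_tick(), memory_id),
        )
        self.db.commit()
        return row[0]


cache = VectorLRUCache(max_entries=2)
cache.put("a", b"\x01\x02")
cache.put("b", b"\x03\x04")
cache.get("a")               # "a" is now more recent than "b"
cache.put("c", b"\x05\x06")  # exceeds the cap, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))
```

With NumPy vectors, the blob would typically be `vector_int8.tobytes()` on write and `np.frombuffer(blob, dtype=np.int8)` on read.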

Precision Considerations

Int8 quantization is lossy but acceptable for memory retrieval because:

  • The engine uses hybrid search: 70% vector similarity + 30% BM25. Even if quantization slightly skews vector ranking, exact keyword matching via BM25 pulls relevant memories back up.
  • AI memory retrieval only needs to surface context into the Top-5 results, unlike recommendation algorithms that need absolute precision for the #1 spot.
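
The 70/30 blend can be sketched as a weighted sum. The article gives only the weights, so everything else here is an assumption: the function `hybrid_rank`, the dictionary inputs, and the choice to normalize BM25 scores by their maximum so they share a 0-1 scale with cosine similarity.

```python
VECTOR_WEIGHT = 0.7  # weights quoted in the article
BM25_WEIGHT = 0.3


def hybrid_rank(vector_scores, bm25_scores, top_k=5):
    """Blend per-memory scores and return the top_k memory ids (illustrative)."""
    # Assumption: scale BM25 into 0..1 by dividing by the best BM25 score
    max_bm25 = max(bm25_scores.values(), default=0.0) or 1.0
    memory_ids = set(vector_scores) | set(bm25_scores)
    blended = {
        mid: VECTOR_WEIGHT * vector_scores.get(mid, 0.0)
        + BM25_WEIGHT * bm25_scores.get(mid, 0.0) / max_bm25
        for mid in memory_ids
    }
    return sorted(blended, key=blended.get, reverse=True)[:top_k]


# "a" narrowly wins on vector similarity, but "b" also matches keywords
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.40}
bm25_scores = {"b": 6.0, "c": 3.0}
print(hybrid_rank(vector_scores, bm25_scores))  # ['b', 'a', 'c']
```

In this example a small vector-ranking difference between "a" and "b" is outweighed by the keyword match, which is the effect the first bullet describes.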

Clarification on "TurboQuant"

The engine uses standard Int8 scalar quantization for SQLite vector storage, not Google's TurboQuant (ICLR 2026), which is a 3-bit compression algorithm (PolarQuant + QJL) designed for KV Cache during LLM GPU inference. The branding "TurboQuant Compression" in the UI is a nod to the philosophy of aggressive bit-reduction.

Full Tech Stack

  • Vector Compression: Int8 Scalar Quantization (~4x real compression)
  • Cache Management: SQLite + LRU Eviction (Cap: 10,000 entries)
  • Search Engine: Hybrid (70% Vector Similarity + 30% BM25)
  • Profile Manager: Automatic STATIC/DYNAMIC fact extraction
  • Fact Extraction: Background LLM calls run via asyncio.to_thread
  • Data Storage: 3x SQLite Databases (100% Local)
  • Desktop App: Tauri + Vue 3 + PyInstaller sidecar
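
As a rough illustration of the fact-extraction entry, the sketch below offloads a blocking call to a worker thread with `asyncio.to_thread` so the event loop stays responsive. The function `extract_facts_blocking` is a hypothetical stand-in that splits sentences instead of calling an LLM; the engine's real extraction logic is not shown in the article.

```python
import asyncio
import time


def extract_facts_blocking(text: str) -> list[str]:
    """Hypothetical stand-in for a blocking LLM request."""
    time.sleep(0.05)  # simulate model or network latency
    return [part.strip() for part in text.split(".") if part.strip()]


async def main() -> list[str]:
    # Run the blocking call in a worker thread; the event loop is not blocked
    task = asyncio.create_task(
        asyncio.to_thread(extract_facts_blocking, "User prefers Rust. User lives in Berlin.")
    )
    await asyncio.sleep(0)  # other coroutines could run here in the meantime
    return await task


facts = asyncio.run(main())
print(facts)  # ['User prefers Rust', 'User lives in Berlin']
```

`asyncio.to_thread` requires Python 3.9 or later.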

The engine is open source under the MIT License on GitHub: sunhonghua1/ninetails-memory-engine.

📖 Read the full source: r/LocalLLaMA
