Developer Seeks Architecture Advice for Serving Embed, Rerank, and Zero-Shot Models on 8GB VRAM

Problem Overview
A developer is building a unified Knowledge Graph/RAG service for a local coding agent. It runs in a single Docker container behind FastAPI. The system ran acceptably on Windows (WSL), but moving to native Linux exposed severe memory problems under stress tests.
Hardware and Model Constraints
Hardware:
- 8GB VRAM (Laptop GPU)
- ~16GB System RAM (Docker memory limits are hit quickly; usually only ~6GB is free once the models are loaded)
Model Stack:
- Embedding: nomic-ai/nomic-embed-text-v2-moe
- Reranking: BAAI/bge-reranker-base
- Classification: MoritzLaurer/ModernBERT-large-zeroshot-v2.0 (used to classify text pairs into 4 relations: dependency, expansion, contradiction, unrelated)
Technical Challenges
The developer cannot aggressively truncate text because they're feeding code chunks and natural text into these models and need to process variable, long sequences.
Specific issues encountered:
- Latency vs. OOM: Using torch.cuda.empty_cache() to keep the GPU clean causes latency spikes of 18-20 seconds per request due to driver syncs. Removing it causes the GPU to OOM immediately when concurrent requests hit.
- System RAM Explosion (Linux Exit 137): Using the Hugging Face pipeline("zero-shot-classification") caused massive CPU RAM bloat. Without truncation, the pipeline generates massive combination matrices in memory before sending them to the GPU, causing the Linux kernel to kill the container immediately.
- VRAM Spikes: cudnn.benchmark = True was caching workspaces for every unique sequence length, draining 3GB of free VRAM in seconds during stress tests.
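One common way to limit both problems, without truncating inputs, is to round sequence lengths up to a small set of padded sizes and to cap each forward pass by a token budget. The sketch below is illustrative only and is not the developer's code: BUCKETS, bucket_length, micro_batches, and the max_tokens value are hypothetical names and numbers, and the function works on token counts only, leaving tokenization and the model call to the caller.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical padded lengths (in tokens); tune these for the real workload.
BUCKETS: Tuple[int, ...] = (64, 128, 256, 512, 1024, 2048)


def bucket_length(n_tokens: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Round a token count up to the nearest bucket boundary."""
    for boundary in buckets:
        if n_tokens <= boundary:
            return boundary
    # Longer inputs must be split into chunks upstream rather than silently cut.
    raise ValueError(f"{n_tokens} tokens exceeds the largest bucket {buckets[-1]}")


def micro_batches(
    lengths: Iterable[int],
    max_tokens: int = 8192,
    buckets: Sequence[int] = BUCKETS,
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (padded_length, item_indices) groups.

    Each group holds at most max_tokens // padded_length items (minimum one),
    so the padded token count per forward pass stays bounded whenever
    padded_length <= max_tokens.
    """
    by_bucket: Dict[int, List[int]] = {}
    for index, n_tokens in enumerate(lengths):
        by_bucket.setdefault(bucket_length(n_tokens, buckets), []).append(index)

    for padded in sorted(by_bucket):
        per_batch = max(1, max_tokens // padded)
        indices = by_bucket[padded]
        for start in range(0, len(indices), per_batch):
            yield padded, indices[start:start + per_batch]
```

With this shape, the model only ever sees as many distinct sequence lengths as there are buckets, and the caller can tokenize and move one group to the GPU at a time instead of building every premise/hypothesis pair up front.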
Current Implementation
The developer has a pure Python/FastAPI setup with the following workarounds:
- Bypassed the HF pipeline and wrote a manual NLI inference loop for ModernBERT
- Using asyncio.Lock() to force serial execution (only one model touches the GPU at a time)
- Using deterministic deallocation (del inputs + gc.collect()) via FastAPI background tasks
This approach is better but still unstable under a 3-minute stress test.
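The serial-execution workaround can be sketched as a small gate around the GPU. This is a minimal illustration under stated assumptions, not the developer's implementation: GpuGate and fake_infer are hypothetical names, and the blocking model call is pushed to a worker thread with asyncio.to_thread (Python 3.9+) so the event loop can keep accepting requests while one inference runs.

```python
import asyncio
import time
from typing import Any, Callable


class GpuGate:
    """Lets only one blocking inference call run at a time (hypothetical helper)."""

    def __init__(self) -> None:
        # Create the gate inside the running event loop (e.g. at app startup).
        self._lock = asyncio.Lock()

    async def run(self, fn: Callable[..., Any], *args: Any) -> Any:
        async with self._lock:
            # The worker thread does the blocking work; waiting callers
            # queue on the lock instead of hitting the GPU concurrently.
            return await asyncio.to_thread(fn, *args)


def fake_infer(x: int) -> int:
    """Stand-in for a real embed / rerank / classify call."""
    time.sleep(0.01)
    return x * 2


async def demo() -> list:
    gate = GpuGate()
    # Five concurrent requests are executed one after another.
    return await asyncio.gather(*(gate.run(fake_infer, i) for i in range(5)))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A single shared gate for all three models matches the "only one model touches the GPU at a time" rule; it trades throughput for a predictable peak in VRAM use.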
Questions for the Community
The developer is seeking advice on:
- Model Alternatives: Smaller/faster models that maintain high accuracy for Zero-Shot NLI and Reranking that fit better in an 8GB envelope
- Prebuilt Architectures: The developer previously looked at infinity_emb but struggled to integrate custom 4-way NLI classification logic without double-loading models, and is now considering TEI (Text Embeddings Inference), TensorRT, or other solutions optimized for encoder models
- Serving Strategy: Standard design patterns for hosting 3 transformer models on a single consumer GPU without them stepping on each other's memory
📖 Read the full source: r/LocalLLaMA