ZSE: Open-source LLM inference engine with 3.9-second cold starts

What ZSE does
ZSE (Z Server Engine) is an open-source LLM inference engine focused on memory efficiency and fast cold starts. It addresses the problem where running a 32B model normally requires ~64GB VRAM, and cold starts with bitsandbytes NF4 take 2+ minutes on first load.
Key performance improvements
ZSE fits 32B models in 19.3GB VRAM (70% reduction vs FP16) and runs on a single A100-40GB. For 7B models, it uses 5.2GB VRAM (63% reduction) and runs on consumer GPUs.
The cold start improvements are significant: 3.9s for 7B models and 21.4s for 32B models with the .zse format, compared to 45s and 120s with bitsandbytes. These benchmarks were verified on Modal A100-80GB in February 2026.
Technical approach
The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors. This eliminates quantization at load time and weight conversion, using just mmap + GPU transfer. On NVMe SSDs, this gets under 4 seconds for 7B models.
Installation and usage
Install with: pip install zllm-zse
Basic server start: zse serve Qwen/Qwen2.5-7B-Instruct
For fast cold starts (one-time conversion):
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse zse serve qwen-7b.zse # 3.9s every time
Features
- OpenAI-compatible API server (drop-in replacement)
- Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
- Web dashboard with real-time GPU monitoring
- Continuous batching (3.45× throughput)
- GGUF support via llama.cpp CPU fallback — works without a GPU
- Rate limiting, audit logging, API key auth
Architecture components
- zAttention: Custom CUDA kernels for paged, flash, and sparse attention
- zQuantize: Per-tensor INT2-8 mixed precision quantization
- zKV: Quantized KV cache with sliding precision (4x memory savings)
- zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
- zOrchestrator: Smart recommendations based on FREE memory
Efficiency modes
- speed: Maximum throughput (production with ample GPU memory)
- balanced: Good throughput, moderate memory (standard deployment, default)
- memory: Low memory, reduced throughput (consumer GPUs)
- ultra: Extreme memory savings (4GB GPUs, laptops)
Supported models
Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices include Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, and Yi.
📖 Read the full source: HN LLM Tools
👀 See Also

Decision Passport: An Audit Layer for AI Agent Execution Governance
The Claude Code leak highlights a gap in AI agent governance. Decision Passport addresses this with append-only execution records, portable proof bundles, and offline verification for tamper-evident audit trails.

engram v3.4.0 Adds Anthropic Plugin to Keep Claude Code Running Under New Rate Limits
engram v3.4.0 introduces a dedicated Anthropic plugin for Claude Code, adding three skills to manage costs, query context, and surface errors. Install with `/plugin install engram` or `npm install -g engramx@latest`.

LocalSynapse MCP Server Adds macOS Support and Search Improvements
LocalSynapse, an offline MCP server for searching local documents, now supports macOS and includes fixes for multi-word search queries. The developer has implemented feedback-driven improvements including position-adjusted click boosting and time decay as promotion.

Runtime: Sandboxed Coding Agents for Every Team Member
Runtime (YC P26) provides sandboxed coding agent infrastructure that lets non-engineers use Claude Code, Codex, and other agents safely. It snapshots multi-service environments (Docker, Kafka, Redis, seeded DBs) that boot in milliseconds, with guardrails at the infrastructure level.