ZSE: Open-source LLM inference engine with 3.9-second cold starts

✍️ OpenClawRadar📅 Published: February 26, 2026🔗 Source
ZSE: Open-source LLM inference engine with 3.9-second cold starts
Ad

What ZSE does

ZSE (Z Server Engine) is an open-source LLM inference engine focused on memory efficiency and fast cold starts. It addresses the problem where running a 32B model normally requires ~64GB VRAM, and cold starts with bitsandbytes NF4 take 2+ minutes on first load.

Key performance improvements

ZSE fits 32B models in 19.3GB VRAM (70% reduction vs FP16) and runs on a single A100-40GB. For 7B models, it uses 5.2GB VRAM (63% reduction) and runs on consumer GPUs.

The cold start improvements are significant: 3.9s for 7B models and 21.4s for 32B models with the .zse format, compared to 45s and 120s with bitsandbytes. These benchmarks were verified on Modal A100-80GB in February 2026.

Technical approach

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors. This eliminates quantization at load time and weight conversion, using just mmap + GPU transfer. On NVMe SSDs, this gets under 4 seconds for 7B models.

Installation and usage

Install with: pip install zllm-zse

Basic server start: zse serve Qwen/Qwen2.5-7B-Instruct

For fast cold starts (one-time conversion):

zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
zse serve qwen-7b.zse  # 3.9s every time
Ad

Features

  • OpenAI-compatible API server (drop-in replacement)
  • Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
  • Web dashboard with real-time GPU monitoring
  • Continuous batching (3.45× throughput)
  • GGUF support via llama.cpp CPU fallback — works without a GPU
  • Rate limiting, audit logging, API key auth

Architecture components

  • zAttention: Custom CUDA kernels for paged, flash, and sparse attention
  • zQuantize: Per-tensor INT2-8 mixed precision quantization
  • zKV: Quantized KV cache with sliding precision (4x memory savings)
  • zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
  • zOrchestrator: Smart recommendations based on FREE memory

Efficiency modes

  • speed: Maximum throughput (production with ample GPU memory)
  • balanced: Good throughput, moderate memory (standard deployment, default)
  • memory: Low memory, reduced throughput (consumer GPUs)
  • ultra: Extreme memory savings (4GB GPUs, laptops)

Supported models

Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices include Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, and Yi.

📖 Read the full source: HN LLM Tools

Ad

👀 See Also