Atlas Inference Engine Goes Open Source: Pure Rust + CUDA, 100+ tok/s on DGX Spark

The Atlas inference engine, previously teased hitting 102 tok/s on Qwen3.5-35B on a DGX Spark, is now open source on GitHub. Written in pure Rust and CUDA with no PyTorch or Python runtime, Atlas delivers a ~2.5 GB Docker image and sub-2-minute cold start. The team rewrote the full stack from HTTP handler to kernel dispatch to eliminate the 20+ GB Python overhead that was bottlenecking the GPU.
Key Benchmarks on DGX Spark (GB10)
- Qwen3.5-35B (NVFP4, MTP K=2): 130 tok/s peak, ~111 tok/s sustained — 3.0–3.3× vLLM at testing time
- Qwen3.5-122B (NVFP4, EP=2): ~50 tok/s decode
- Qwen3-Next-80B-A3B (NVFP4, MTP): ~87 tok/s
- Nemotron-3 Nano 30B (FP8): ~88 tok/s
- Full model matrix including MiniMax2.7, Qwen3.6, Gemma available on the site
What Makes Atlas Different
- Hand-tuned CUDA kernels for Blackwell SM120/121: attention, MoE, GDN, Mamba-2 — no generic fallbacks
- Native NVFP4 + FP8 on tensor cores
- MTP (Multi-Token Prediction) speculative decoding for up to 3× throughput on decode
- OpenAI + Anthropic API compatibility on the same port — works with Claude Code, Cline, OpenCode, Open WebUI out of the box
Quick Start
docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas --network host --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest serve Qwen/Qwen3.6-35B-A3B-FP8 \
--port 8888 --speculative --enable-prefix-caching
Roadmap & Community
The team is working on a Strix Halo port with Spectral Compute (AMD-provided hardware), and an RTX 6000 Pro Blackwell port is planned. The roadmap is community-driven — MiniMax M2.7 support landed from a Discord request. Atlas targets four chips well rather than twenty poorly.
For non-Spark users, the current binary is DGX Spark only, but the code is open for adaptation.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Curated List of 260+ AI Agent Tools with Claude Ecosystem Highlights
A GitHub repository contains a curated list of 260+ AI agent tools, including specific Claude-related entries like Claude Code (80.9% SWE-bench), Claude Computer Use, and Claude in Chrome, plus tools that work well with Claude such as Cline and Cursor.

80-line Python script uses Claude to auto-generate internal link suggestions, cuts linking time from 2 hours to 8 minutes
A Reddit user built an 80-line Python script that feeds an article draft and sitemap to Claude, returning relevant internal link targets with suggested anchor text — reducing manual linking time from 2 hours to 8 minutes per article.

SimplePDF Copilot: Client-Side AI Tool Calling for PDF Form Filling
SimplePDF Copilot uses client-side tool calling to let an LLM fill fields, add fields, delete pages, and more in PDFs — without the PDF leaving the browser.

Vibeyard IDE adds embedded browser for direct web UI editing with AI agents
Vibeyard, an open-source IDE for AI coding agents, now includes a browser tab session type that lets users click elements in a web UI and instruct an AI agent to edit them directly, eliminating selector guessing and component hunting.