Unsloth and NVIDIA Collaborate to Speed Up LLM Training by ~25%

Unsloth and NVIDIA report a ~25% training speedup with no accuracy loss from three optimizations: caching packed-sequence metadata, double-buffered async gradient checkpointing, and faster MoE routing. They are enabled automatically after an Unsloth update on RTX laptops, data center GPUs, and DGX Spark.
Caching Packed-Sequence Metadata
Packed training concatenates short examples into one sequence to avoid padding waste. Previously, each transformer layer rebuilt the same sequence metadata (lengths, cu_seqlens, max_seqlen, mask structure) from scratch, and every rebuild added device-host synchronization overhead. Unsloth now builds the metadata once per batch and reuses it across all layers, which removes the repeated work.
Benchmarks on Qwen3-14B QLoRA SFT show:
- Forward pass: 43.3% faster
- Backward pass: 5.8% faster
- Overall per batch: 14.3% faster
A microbenchmark on NVIDIA Blackwell GPUs measured the dominant cost, mask construction, at ~13.7 ms per packed batch. Because that cost was previously paid in every layer, the saving grows with model depth: the reported figures are ~199 ms saved per step (11.5% lower) for Llama-3.2-1B (16 layers) and ~319 ms saved per step (14.8% lower) for Qwen3-0.6B (28 layers).
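The build-once, reuse-everywhere idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Unsloth's code: `PackedMeta`, `MetaBuilder`, `run_layers_uncached`, and `run_layers_cached` are hypothetical names, and the metadata is reduced to `cu_seqlens` and `max_seqlen`. In real GPU code, the expensive part is that reading values such as the maximum length from a device tensor typically forces a device-host sync, which is what makes rebuilding per layer costly.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class PackedMeta:
    """Hypothetical container for one packed batch's metadata."""
    cu_seqlens: tuple  # cumulative sequence boundaries, starting at 0
    max_seqlen: int    # longest individual sequence in the pack


class MetaBuilder:
    """Builds metadata and counts how often it had to do so."""

    def __init__(self):
        self.calls = 0

    def build(self, seq_lens):
        self.calls += 1
        return PackedMeta(
            cu_seqlens=(0, *accumulate(seq_lens)),
            max_seqlen=max(seq_lens),
        )


def run_layers_uncached(builder, seq_lens, num_layers):
    # Old behaviour: every layer rebuilds identical metadata.
    return [builder.build(seq_lens) for _ in range(num_layers)]


def run_layers_cached(builder, seq_lens, num_layers):
    # New behaviour: build once per batch, hand the same object to each layer.
    meta = builder.build(seq_lens)
    return [meta for _ in range(num_layers)]
```

With 16 layers, the uncached path calls `build` 16 times per batch and the cached path calls it once, while every layer still sees identical metadata.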
Double-Buffered Async Gradient Checkpointing
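Gradient checkpointing saves memory by not keeping every activation on the GPU during the forward pass, then restoring what the backward pass needs. The double-buffered async variant overlaps that restore work with ongoing computation, so the GPU spends less time waiting. The reported result is an 8% speedup with no impact on accuracy.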
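As a rough illustration of double buffering in general, the sketch below prefetches the next layer's saved activation on a background thread while the current layer's backward step runs. It is a generic pattern under assumptions, not Unsloth's implementation: `fetch_activation`, `backward_double_buffered`, and the dictionary `store` are hypothetical stand-ins, and a real system would overlap GPU work and memory copies rather than use Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_activation(store, layer):
    """Stand-in for restoring one layer's saved activation."""
    return store[layer]


def backward_double_buffered(store, num_layers, layer_backward):
    """Walk layers in reverse, prefetching the next activation while the
    current layer's backward step runs."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start filling the first buffer before the loop begins.
        pending = pool.submit(fetch_activation, store, num_layers - 1)
        for layer in range(num_layers - 1, -1, -1):
            current = pending.result()  # buffer A: ready to consume
            if layer > 0:
                # Buffer B: begin fetching the next layer's activation now,
                # so it overlaps with the compute below.
                pending = pool.submit(fetch_activation, store, layer - 1)
            results.append(layer_backward(layer, current))
    return results
```

Only two activations are in flight at any time, one being consumed and one being fetched, which is what keeps the memory cost bounded while hiding the fetch latency.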
MoE Routing: argsort + bincount
For MoE models, grouping tokens by expert with torch.argsort and torch.bincount instead of custom kernels speeds up gpt-oss training by 15%.
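The sort-and-count pattern behind this can be shown without any dependencies. The sketch below is a plain-Python stand-in, not the Unsloth kernel path: `group_tokens_by_expert` is a hypothetical helper, `sorted` plays the role of `torch.argsort` (a stable sort, which PyTorch exposes via `stable=True`), and the counting loop plays the role of `torch.bincount` (with `minlength` set to the number of experts).

```python
def group_tokens_by_expert(expert_ids, num_experts):
    """Return, for each expert, the indices of the tokens routed to it."""
    # argsort step: token indices ordered by expert id. sorted() is stable,
    # so tokens keep their original relative order within each expert.
    order = sorted(range(len(expert_ids)), key=expert_ids.__getitem__)

    # bincount step: number of tokens assigned to each expert.
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1

    # An exclusive prefix sum turns the counts into slice boundaries.
    offsets = [0]
    for count in counts:
        offsets.append(offsets[-1] + count)

    # Each expert's tokens are now one contiguous slice of `order`.
    return [order[offsets[e]:offsets[e + 1]] for e in range(num_experts)]
```

After the sort, each expert can process its tokens as one contiguous block, and the counts give the block sizes directly, so no per-expert search over the routing assignments is needed.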
All optimizations are auto-enabled on supported hardware. Update Unsloth to get them.
📖 Read the full source: HN LLM Tools