NVIDIA Releases Nemotron-3-Ultra-550B: 55B Active Parameters, 1M Context, LatentMoE Hybrid

✍️ OpenClawRadar📅 Published: June 4, 2026🔗 Source
NVIDIA Releases Nemotron-3-Ultra-550B: 55B Active Parameters, 1M Context, LatentMoE Hybrid
Ad

NVIDIA released Nemotron-3-Ultra-550B-A55B-BF16, a frontier-scale LLM with 550B total parameters and 55B active. The model uses a hybrid Latent Mixture-of-Experts (LatentMoE) architecture that interleaves Mamba-2, MoE, and attention layers, plus Multi-Token Prediction (MTP) for faster generation. Context length reaches up to 1M tokens.

Ad

Key Specs

  • Architecture: LatentMoE hybrid – Mamba-2 + MoE + Attention + MTP
  • Parameters: 550B total / 55B active
  • Context: Up to 1M tokens
  • Min GPU: 8x GB200/B200/GB300/B300, 16x H100, 8x H200
  • Languages: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese
  • Reasoning: Configurable on/off via chat template (enable_thinking=True/False)
  • License: OpenMDW License Agreement v1.1

The model is built for frontier reasoning, complex agentic workflows, long-context analysis, tool use, multilingual reasoning, and high-stakes RAG. It's trained with NVFP4 pre-training recipe for compute efficiency. Open weights, training data, and recipes are included under the OpenMDW license. For local inference, you'll need at least 8x H200 or equivalent.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also