Nvidia's Nemotron 3 Super: 120B Parameter Model with 12B Active Inference

Nvidia released Nemotron 3 Super, a 120 billion parameter model that activates only 12 billion parameters during inference. This challenges the assumption that bigger models always mean better results, because it offers the knowledge of a 120B model at roughly the compute cost of a 12B model. The model is not a compressed approximation of a larger one. It is a 120B model that learned to route efficiently, with the other 108 billion parameters available when relevant and idle when not.
Architectural Decisions
Three key architectural decisions make this possible:
- LatentMoE: Projects tokens into a compressed latent space before routing, making routing decisions cheaper. This allows activating 4x more experts for the same inference cost as standard MoE.
- Hybrid Mamba-Attention: Replaces quadratically expensive transformer attention with Mamba-2 for most sequence processing, making the 1 million token context window practical rather than theoretical. Achieves 91.75% accuracy on RULER at 1M tokens.
- Multi-Token Prediction: Generates multiple future tokens per forward pass, providing native speculative decoding with up to 3x faster wall-clock inference and no separate draft model. Overall, the model delivers 5x higher throughput than its predecessor and outperforms models that activate 3x more parameters per token.
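The routing idea in the first bullet can be sketched in a few lines. This is a toy illustration, not Nvidia's implementation: the function names (`route_in_latent_space`, `matvec`, `random_matrix`), the dimensions, and the random weights are all hypothetical, and the real LatentMoE design may differ in important ways. It shows only the general pattern of compressing a token to a latent vector and then scoring experts from that smaller vector.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_in_latent_space(token, down_proj, router, top_k):
    """Pick top_k experts for one token, scoring in the latent space.

    Assumed shapes (hypothetical): down_proj is d_latent x d_model,
    router is n_experts x d_latent.
    """
    latent = matvec(down_proj, token)   # compress: d_model -> d_latent
    scores = matvec(router, latent)     # one score per expert
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    weights = softmax([scores[i] for i in chosen])
    return list(zip(chosen, weights))


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of standard-normal values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration (not the model's real config).
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 64, 16, 32, 4
rng = random.Random(0)
token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
down_proj = random_matrix(D_LATENT, D_MODEL, rng)
router = random_matrix(N_EXPERTS, D_LATENT, rng)

assignments = route_in_latent_space(token, down_proj, router, TOP_K)

# Multiply-accumulate counts for the expert-scoring step alone.
direct_router_macs = D_MODEL * N_EXPERTS    # scoring straight from d_model
latent_router_macs = D_LATENT * N_EXPERTS   # scoring from the latent vector
# The down-projection is paid once per token; this sketch assumes
# its output is also reused by the experts.
projection_macs = D_MODEL * D_LATENT

for expert_id, weight in assignments:
    print(f"expert {expert_id}: gate weight {weight:.3f}")
```

With these toy sizes, scoring from the latent vector takes a quarter of the multiply-accumulates that scoring from the full hidden state would, which is the kind of saving that could be spent on activating more experts.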
Broader Trend
Nemotron 3 Super is the third independent confirmation of this sparse-activation approach. DeepSeek V3 first demonstrated it with 671B total parameters and 37B active, outperforming the dense Llama 3 405B. Qwen3-Coder-Next followed with 80B total parameters and only 3B active at inference, matching Claude Sonnet 4.5 on SWE-Bench Pro and outperforming DeepSeek V3, which activates 37B per token. The efficiency gains compound rather than trade off: each architectural decision benefits more from scale than dense attention does, so the gap between this architecture and dense transformers grows as models scale.
The key insight from these three independent releases is that the path to capability is not more activation but better routing. Leaderboards will keep publishing total parameter counts, but active parameters per token is becoming the more honest metric for comparing model efficiency and performance.
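The "active parameters per token" metric is simple arithmetic on the figures quoted above. The short sketch below computes the active fraction for the three models mentioned. The `Model` class and `active_fraction` function are illustrative names, and the parameter counts are the rounded figures from this article, not official specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters active per token, in billions


def active_fraction(model):
    """Share of parameters used per token (active / total)."""
    return model.active_b / model.total_b


# Rounded figures as quoted in the article.
MODELS = [
    Model("Nemotron 3 Super", 120, 12),
    Model("DeepSeek V3", 671, 37),
    Model("Qwen3-Coder-Next", 80, 3),
]

for m in MODELS:
    idle_b = m.total_b - m.active_b
    print(f"{m.name}: {active_fraction(m):.1%} active, {idle_b:g}B idle per token")
```

By this measure all three models use a tenth or less of their parameters for any given token, with Nemotron 3 Super at exactly one tenth.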
📖 Read the full source: r/LocalLLaMA