Benchmarking Nemotron 3 Super 120B with 1M token context on M1 Ultra

✍️ OpenClawRadar | 📅 Published: March 12, 2026 | 🔗 Source

Local 1M Token Context Test with Nemotron 3 Super

A Reddit user ran a benchmark to evaluate whether 1 million token contexts can be processed locally with Nemotron 3 Super 120B on an M1 Ultra system. The test leans on the model's hybrid Mamba-2 architecture: Mamba-2 layers carry a fixed-size state rather than a per-token cache, so memory use grows more slowly with context length than it would in a model built only from attention layers.

Hardware and Setup Details

The test was run on an M1 Ultra using llama.cpp with the following configuration:

  • Model: Nemotron-3-Super-120B-Q4_K.gguf (Q4_K_M quantization)
  • Context allocation: Full 1 million tokens
  • Memory usage: Approximately 90 GB (reported as VRAM; the M1 Ultra uses unified memory)
  • Backend: MTL,BLAS with 1 thread (-t 1)
  • Batch size and micro-batch size: 2048 each (-b 2048 -ub 2048)
  • Flash attention: Enabled (-fa 1)
  • GPU layers: 99 (-ngl 99)

Benchmark Command and Results

The user ran llama-bench with this command:

llama-bench -m ~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/Nemotron-3-Super-120B-Q4_K.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000,1000000
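
For scripted or repeated runs, the same invocation could be assembled programmatically. The following is a minimal Python sketch, not the original poster's tooling: the helper name build_llama_bench_args is hypothetical, the model path is a placeholder, and only flags that appear in the command above are used.

```python
from typing import List, Sequence


def build_llama_bench_args(model_path: str, depths: Sequence[int]) -> List[str]:
    """Return an argv list mirroring the llama-bench command quoted above."""
    return [
        "llama-bench",
        "-m", model_path,
        "-fa", "1",        # flash attention on
        "-t", "1",         # one CPU thread
        "-ngl", "99",      # offload up to 99 layers to the GPU
        "-b", "2048",      # batch size
        "-ub", "2048",     # micro-batch size
        "-d", ",".join(str(d) for d in depths),  # context depths to test
    ]


# Depths from the article: 0 to 100,000 in steps of 10,000, then four larger ones.
DEPTHS = list(range(0, 100_001, 10_000)) + [150_000, 200_000, 250_000, 1_000_000]

if __name__ == "__main__":
    # "model.gguf" is a placeholder, not the path used in the original test.
    args = build_llama_bench_args("model.gguf", DEPTHS)
    print(" ".join(args))
    # To launch it (assumes llama-bench is on PATH):
    # import subprocess; subprocess.run(args, check=True)
```

Keeping the depth list as data makes it easy to drop the largest depths on machines with less memory.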

Key performance results from the benchmark:

  • Prompt processing (pp512) at 0 context: 255.03 ± 0.36 tokens/second
  • Token generation (tg128) at 0 context: 26.72 ± 0.02 tokens/second
  • Prompt processing at 100,000 token context: 184.99 ± 0.19 tokens/second
  • Token generation at 100,000 token context: 22.37 ± 0.01 tokens/second
  • Prompt processing at 150,000 token context: 161.60 ± 0.22 tokens/second
  • Token generation at 150,000 token context: 20.58 ± 0.01 tokens/second
  • Prompt processing at 200,000 token context: 141.87 ± 0.19 tokens/second

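The results show throughput falling steadily as context grows. Prompt processing drops from 255 t/s at zero context to about 142 t/s at 200,000 tokens, a decline of roughly 44%. Token generation is less affected, falling from 26.7 t/s to 20.6 t/s at 150,000 tokens, or roughly 23%. The figures listed here stop at 200,000 tokens, although the command also requested depths of 250,000 and 1,000,000.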
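
The relative slowdown at each depth can be checked directly from the reported figures. This is a small Python sketch using only the numbers listed above; the function names are illustrative and not part of llama.cpp.

```python
from typing import Dict

# Throughput figures (tokens/second) copied from the results listed above.
PP: Dict[int, float] = {0: 255.03, 100_000: 184.99, 150_000: 161.60, 200_000: 141.87}  # pp512
TG: Dict[int, float] = {0: 26.72, 100_000: 22.37, 150_000: 20.58}                      # tg128


def percent_drop(baseline: float, value: float) -> float:
    """Percentage decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0


def slowdown_table(series: Dict[int, float]) -> Dict[int, float]:
    """Map each non-zero depth to its percentage drop from the depth-0 figure."""
    base = series[0]
    return {d: round(percent_drop(base, v), 1) for d, v in series.items() if d != 0}


if __name__ == "__main__":
    print("pp512:", slowdown_table(PP))
    print("tg128:", slowdown_table(TG))
```

The same helper can be reapplied if more rows from the original post are added to the two dictionaries.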

System Information

The Metal backend initialization showed:

  • GPU name: MTL0
  • GPU family: MTLGPUFamilyApple7 (1007)
  • Has unified memory: true
  • Has bfloat support: true
  • Recommended max working set size: 134,217.73 MB
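
To put the reported memory use in context, it can be compared against the working set limit from the Metal log. This is a rough Python sketch: it assumes the article's "90GB" means decimal gigabytes (1 GB = 1000 MB), which the source does not specify, and the function name is illustrative.

```python
RECOMMENDED_MAX_WORKING_SET_MB = 134_217.73  # from the Metal log above
REPORTED_USAGE_GB = 90.0                     # approximate figure from the article


def working_set_fraction(used_gb: float, limit_mb: float) -> float:
    """Fraction of the working-set limit in use, assuming 1 GB = 1000 MB."""
    return (used_gb * 1000.0) / limit_mb


if __name__ == "__main__":
    frac = working_set_fraction(REPORTED_USAGE_GB, RECOMMENDED_MAX_WORKING_SET_MB)
    print(f"About {frac:.0%} of the recommended working set")
```

Under that assumption the run uses roughly two thirds of the recommended working set.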

This test demonstrates that local processing of extremely large contexts (up to 1 million tokens) is technically possible with high-end Apple Silicon hardware and quantized models, though with significant memory requirements and performance trade-offs as context expands.
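
One way to make the performance trade-off concrete is to estimate how long it would take to ingest a long prompt from scratch. The Python sketch below is a back-of-the-envelope estimate, not a measurement: it assumes the cost per token changes linearly between the reported depths, it uses only the four pp512 figures quoted in this article, and it stops at 200,000 tokens because no figures beyond that depth are listed here.

```python
from typing import List, Tuple

# (depth in tokens, pp512 tokens/second) pairs reported in the article.
PP_POINTS: List[Tuple[int, float]] = [
    (0, 255.03),
    (100_000, 184.99),
    (150_000, 161.60),
    (200_000, 141.87),
]


def estimate_prefill_seconds(points: List[Tuple[int, float]]) -> float:
    """Trapezoidal integral of seconds-per-token over depth.

    Assumes per-token cost changes linearly between measured depths.
    """
    total = 0.0
    for (d0, s0), (d1, s1) in zip(points, points[1:]):
        avg_seconds_per_token = (1.0 / s0 + 1.0 / s1) / 2.0
        total += avg_seconds_per_token * (d1 - d0)
    return total


if __name__ == "__main__":
    seconds = estimate_prefill_seconds(PP_POINTS)
    print(f"Estimated time to process 200,000 prompt tokens: {seconds / 60:.1f} min")
```

With these inputs the estimate comes out at roughly 18 minutes for a 200,000 token prompt. Real runs may differ, since the interpolation is coarse and the benchmark measures 512-token chunks at each depth.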

📖 Read the full source: r/LocalLLaMA
