Benchmarking Nemotron 3 Super 120B with 1M token context on M1 Ultra

Local 1M Token Context Test with Nemotron 3 Super
A Reddit user benchmarked Nemotron 3 Super 120B on an M1 Ultra to see whether a 1 million token context can be processed locally. The test relied on the model's hybrid Mamba-2 architecture, whose memory use grows more slowly with context length than that of a pure-attention model.
Hardware and Setup Details
The test was run on an M1 Ultra using llama.cpp with the following configuration:
- Model: Nemotron-3-Super-120B-Q4_K.gguf (Q4_K_M quantization)
- Context allocation: Full 1 million tokens
- VRAM usage: Approximately 90GB
- Backend: MTL,BLAS with 1 thread
- Batch size and micro-batch size: 2048 each (-b 2048 -ub 2048)
- Flash attention: Enabled (fa 1)
- GPU layers: 99 (-ngl 99)
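As a rough sketch, the settings above can be assembled into a llama-bench invocation from a script. The flags and the model path are the ones quoted in the post; the helper names `depth_sweep` and `build_llama_bench_args` are hypothetical and exist only for this illustration.

```python
# Model path as quoted in the post; adjust to wherever the GGUF file lives.
MODEL = ("~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/"
         "Nemotron-3-Super-120B-Q4_K.gguf")


def depth_sweep():
    """Context depths from the post: 0 to 100k in 10k steps, then 150k, 200k, 250k, 1M."""
    return list(range(0, 100_001, 10_000)) + [150_000, 200_000, 250_000, 1_000_000]


def build_llama_bench_args(model=MODEL, depths=None):
    """Return the argument list for llama-bench (hypothetical helper)."""
    depths = depth_sweep() if depths is None else depths
    return [
        "llama-bench",
        "-m", model,
        "-fa", "1",      # flash attention on
        "-t", "1",       # one CPU thread
        "-ngl", "99",    # offload all layers to the GPU
        "-b", "2048",    # batch size
        "-ub", "2048",   # micro-batch size
        "-d", ",".join(str(d) for d in depths),
    ]


if __name__ == "__main__":
    # Plain join keeps the leading "~" unquoted so the shell can expand it.
    print(" ".join(build_llama_bench_args()))
```

Trimming the depth list (for example, dropping the 1,000,000 entry) is a cheap way to check the setup before committing to the full sweep.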
Benchmark Command and Results
The user ran llama-bench with this command:
llama-bench -m ~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/Nemotron-3-Super-120B-Q4_K.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000,1000000
Key performance results from the benchmark:
- Prompt processing (pp512) at 0 context: 255.03 ± 0.36 tokens/second
- Token generation (tg128) at 0 context: 26.72 ± 0.02 tokens/second
- Prompt processing at 100,000 token context: 184.99 ± 0.19 tokens/second
- Token generation at 100,000 token context: 22.37 ± 0.01 tokens/second
- Prompt processing at 150,000 token context: 161.60 ± 0.22 tokens/second
- Token generation at 150,000 token context: 20.58 ± 0.01 tokens/second
- Prompt processing at 200,000 token context: 141.87 ± 0.19 tokens/second
Throughput falls steadily as context grows. Prompt processing drops from 255 t/s at zero context to about 142 t/s at 200,000 tokens, a loss of roughly 44%. Token generation holds up better, falling from 26.7 t/s to 20.6 t/s at 150,000 tokens, a loss of about 23%.
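To make the slowdown concrete, the following sketch computes how much throughput is retained at each depth relative to zero context. It uses only the figures listed above; the function and variable names are illustrative, and `None` marks a value this article does not report.

```python
# (context depth, pp512 t/s, tg128 t/s); None = not reported in this article
RESULTS = [
    (0, 255.03, 26.72),
    (100_000, 184.99, 22.37),
    (150_000, 161.60, 20.58),
    (200_000, 141.87, None),
]


def retained_pct(baseline, value):
    """Percentage of the zero-context throughput still available."""
    return 100.0 * value / baseline


def summarize(results):
    """Return rows of (depth, pp retained %, tg retained % or None)."""
    _, base_pp, base_tg = results[0]
    rows = []
    for depth, pp, tg in results:
        pp_pct = retained_pct(base_pp, pp)
        tg_pct = retained_pct(base_tg, tg) if tg is not None else None
        rows.append((depth, pp_pct, tg_pct))
    return rows


if __name__ == "__main__":
    for depth, pp_pct, tg_pct in summarize(RESULTS):
        tg_text = f"{tg_pct:.1f}%" if tg_pct is not None else "n/a"
        print(f"depth {depth:>7}: pp {pp_pct:.1f}%  tg {tg_text}")
```

By this measure prompt processing loses ground faster than token generation, which matters most for workloads that ingest a long document once and then generate a short answer.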
System Information
The Metal backend initialization showed:
- GPU name: MTL0
- GPU family: MTLGPUFamilyApple7 (1007)
- Has unified memory: true
- Has bfloat support: true
- Recommended max working set size: 134,217.73 MB
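For a sense of memory headroom, this sketch compares the reported usage of roughly 90GB against the recommended max working set size. It assumes both figures use decimal units (1 GB = 1000 MB), which the post does not state, so treat the result as approximate.

```python
RECOMMENDED_MAX_WORKING_SET_MB = 134_217.73  # from the Metal backend log
REPORTED_USAGE_GB = 90.0                     # "approximately 90GB" in the post


def headroom_gb(limit_mb, used_gb):
    """Memory left under the recommended limit, assuming decimal units."""
    return limit_mb / 1000.0 - used_gb


def utilization_pct(limit_mb, used_gb):
    """Share of the recommended working set that is in use."""
    return 100.0 * used_gb / (limit_mb / 1000.0)


if __name__ == "__main__":
    free = headroom_gb(RECOMMENDED_MAX_WORKING_SET_MB, REPORTED_USAGE_GB)
    used = utilization_pct(RECOMMENDED_MAX_WORKING_SET_MB, REPORTED_USAGE_GB)
    print(f"headroom: {free:.1f} GB, utilization: {used:.1f}%")
```

Under these assumptions the run uses about two thirds of the recommended working set, leaving on the order of 44 GB for the rest of the system.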
This test demonstrates that local processing of extremely large contexts (up to 1 million tokens) is technically possible with high-end Apple Silicon hardware and quantized models, though with significant memory requirements and performance trade-offs as context expands.
📖 Read the full source: r/LocalLLaMA