Mercury 2: Diffusion-Based Model for Real-Time AI Coding

What Mercury 2 Is
Mercury 2 is a diffusion-based AI model that generates tokens in parallel rather than sequentially, using a process that refines output over multiple steps. This approach differs from traditional autoregressive models that decode tokens one by one.
Technical Specifications
- Generation method: Diffusion-based generation instead of sequential token-by-token decoding
- Processing approach: Generates tokens in parallel and refines them over a few steps
- Performance: Claims 1,009 tokens/sec on NVIDIA Blackwell GPUs
- Pricing: $0.25 per 1 million input tokens, $0.75 per 1 million output tokens
- Context window: 128K tokens
- Reasoning capability: Tunable reasoning
- Tool integration: Native tool use with schema-aligned JSON output
- API compatibility: OpenAI API compatible
Target Use Cases
The developers are positioning Mercury 2 for:
- Coding assistants
- Agentic loops (multi-step inference chains)
- Real-time voice systems
- RAG/search pipelines with multi-hop retrieval
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude Code 2.1.63 adds bundled slash commands, HTTP hooks, and memory leak fixes
Anthropic released Claude Code 2.1.63 with 26 CLI changes including new /simplify and /batch slash commands, HTTP hooks that POST JSON to URLs, and fixes for multiple memory leaks in long-running sessions.

Apple's AI Strategy and the Commoditization of Intelligence
The article argues that Apple's conservative approach to AI may be advantageous as intelligence becomes commoditized, with models like Gemma4 achieving 85.2% on MMLU Pro while running on phones, and OpenAI's Sora costing $15M daily against $2.1M revenue.

Qwen3.5-27B 8-bit vs 16-bit Performance Comparison
A Reddit user tested Qwen3.5-27B with vLLM comparing bf16 weights and 16-bit KV cache against Qwen's fp8 quantization with 8-bit KV cache, finding practically identical results on the Aider benchmark using an RTX 6000 Pro.

Tokenmaxxing Is the New Stopwatch: Why Your AI Policy Needs to Be Coherent
Brian Meeker argues against vanity metrics like tokenmaxxing and shares his team's four-point AI policy: no mandate, understand generated code, survive without AI tools, care about teammates and customers.