ATLAS: Adaptive Test-time Learning Framework Outperforms Claude Sonnet on Coding Benchmarks with $500 GPU

What ATLAS Does
ATLAS (Adaptive Test-time Learning and Autonomous Specialization) is a framework that wraps a frozen smaller model in intelligent infrastructure to compete with frontier API models. It uses structured generation, energy-based verification, and self-verified repair without fine-tuning, API calls, or cloud dependencies. The system is fully self-hosted with no data leaving the machine.
Benchmark Results
Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)
- LiveCodeBench v5: 74.6% pass@1-v(k=3) on 599 tasks
- GPQA Diamond: 47.0% (k=5) on 198 multiple-choice knowledge-reasoning tasks
- SciCode: 14.7% (k=1) on 341 cross-domain scientific coding tasks
Note: pass@1-v(k=3) means one solution is submitted per task, chosen from best-of-3 candidates by Lens selection, with iterative repair on failures. It is not single-shot generation.
V3 Pipeline Ablation Breakdown
- Baseline (no V3): 54.9%
- +Phase 1 (PlanSearch + BudgetForcing + DivSampling): 67.3% (+12.4pp)
- +Phase 1+2 (Lens routing): 67.3% (+0.0pp)
- +Phase 1+3 (self-verified refinement): 74.6% (+7.3pp)
Phase 3 uses self-generated test cases for internal verification, so the model never sees the answer key during repair. PR-CoT accounts for 36 of the 42 tasks rescued in Phase 3 (85.7%).
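The percentage-point gains and the PR-CoT share quoted above follow directly from the listed pass rates. A minimal check, using only numbers from this section (the variable names are illustrative):

```python
# Pass rates quoted in the ablation breakdown (percent).
baseline = 54.9
phase_1 = 67.3
phase_1_plus_3 = 74.6

# Percentage-point gains between successive configurations.
gain_phase_1 = round(phase_1 - baseline, 1)
gain_phase_3 = round(phase_1_plus_3 - phase_1, 1)

# Share of Phase 3 rescues attributed to PR-CoT (36 of 42 tasks).
pr_cot_share = round(100 * 36 / 42, 1)

print(gain_phase_1, gain_phase_3, pr_cot_share)
```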
Cost and Performance Comparison
- DeepSeek V3.2 Reasoning: 86.2% LCB pass@1, ~$0.002/task (API, single-shot)
- GPT-5 (high): 84.6%, ~$0.043/task (API, single-shot)
- ATLAS V3 (pass@1-v(k=3)): 74.6%, ~$0.004/task (local electricity only, best-of-3 + repair pipeline)
- Claude 4.5 Sonnet: 71.4%, ~$0.066/task (API, single-shot)
- Claude 4 Sonnet: 65.5%, ~$0.066/task (API, single-shot)
ATLAS cost calculation: electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost — the pipeline takes longer per task than a single API call.
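The electricity arithmetic can be reproduced from the inputs quoted above. This is a minimal sketch that assumes GPU draw is the only cost and that the ~165W figure holds for the entire run; the variable names are illustrative.

```python
# Inputs as quoted in the cost note.
gpu_watts = 165            # approximate GPU power draw
run_hours = 1 + 55 / 60    # ~1h 55m for the full benchmark
price_per_kwh = 0.12       # USD per kWh
tasks = 599

# Energy used, total electricity cost, and cost per task.
energy_kwh = gpu_watts / 1000 * run_hours
total_cost = energy_kwh * price_per_kwh
cost_per_task = total_cost / tasks

print(f"{energy_kwh:.3f} kWh, ${total_cost:.3f} total, ${cost_per_task:.6f} per task")
```

On these inputs alone the run uses roughly 0.32 kWh, or about $0.04 of electricity in total. The resulting per-task figure comes out below the ~$0.004/task listed in the comparison, which suggests the listed figure includes factors not spelled out here.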
How It Works
The V3 pipeline has three phases:
- Phase 1 (Generate): PlanSearch with constraint extraction and diverse plans; Budget Forcing with thinking-token control
- Phase 2 (Verify): Geometric Lens with energy scoring (5120-dim self-embeddings) and sandbox code execution
- Phase 3 (Repair): Self-Test Generation with model-generated I/O pairs; PR-CoT Repair with multi-perspective chain-of-thought
The workflow: PlanSearch → Budget Forcing → k=3 candidates → Geometric Lens → energy-sorted → Sandbox → if all fail → Self-Test Generation → PR-CoT Repair → repaired code → Sandbox.
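The control flow of that workflow can be sketched as follows. This is not the ATLAS implementation: `Candidate`, `solve_task`, and the four callables are hypothetical stand-ins, and the sketch assumes that lower energy marks a more promising candidate and that candidates are tried in energy order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Candidate:
    code: str
    energy: float  # hypothetical Lens score; lower is ASSUMED to be better

def solve_task(
    generate: Callable[[int], List[Candidate]],
    sandbox_passes: Callable[[str], bool],
    make_self_tests: Callable[[], List[Tuple[str, str]]],
    repair: Callable[[str, List[Tuple[str, str]]], str],
    k: int = 3,
) -> Optional[str]:
    """Return the single solution to submit, or None if nothing passes."""
    # Generate k candidates and order them by energy.
    candidates = sorted(generate(k), key=lambda c: c.energy)

    # Try candidates in energy order; submit the first that passes the sandbox.
    for cand in candidates:
        if sandbox_passes(cand.code):
            return cand.code

    # All candidates failed: build model-generated I/O pairs, repair, re-run.
    tests = make_self_tests()
    for cand in candidates:
        fixed = repair(cand.code, tests)
        if sandbox_passes(fixed):
            return fixed
    return None

# Toy stand-ins so the sketch runs end to end.
def toy_generate(k: int) -> List[Candidate]:
    pool = [Candidate("bad_a", 0.9), Candidate("bad_b", 0.2), Candidate("bad_c", 0.5)]
    return pool[:k]

def toy_tests() -> List[Tuple[str, str]]:
    return [("1 2", "3")]

def toy_repair(code: str, tests: List[Tuple[str, str]]) -> str:
    return code + "_fixed"

result = solve_task(toy_generate, lambda c: c.endswith("_fixed"), toy_tests, toy_repair)
print(result)
```

In the toy run every initial candidate fails, so the lowest-energy candidate is repaired first and its repaired version is the one submitted. Only one solution leaves the function per task, which is what distinguishes this setup from a plain pass@3 score.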
A single patched llama-server runs on K3s and serves both generation (with speculative execution) and embeddings.
📖 Read the full source: HN AI Agents