Methodology for Consistent Benchmarking of Local vs Cloud LLMs

A developer on r/LocalLLaMA has detailed a methodology for obtaining consistent benchmark numbers when comparing local LLMs with cloud APIs, addressing common frustrations with apples-to-oranges comparisons due to differing latencies, scoring, and methodologies.
The Core Problem with Benchmarking
Naive comparisons that fire requests at both local and cloud models measure different things. Cloud APIs involve queueing, load balancing, and routing. Local models involve warm-up, batching, and GPU contention. The solution implemented is to use sequential requests only. While slower—a 60-call benchmark takes ~3 minutes instead of 45 seconds—it ensures each measurement is clean, isolating inference time from queue time.
The Measurement Setup
The setup uses ZenMux as a unified endpoint, providing one base URL for four models: GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and a local Llama 4 quant. The approach works with any OpenAI-compatible endpoint, such as:
- llama.cpp server:
curl http://localhost:8080/v1/chat/completions ... - vLLM:
curl http://localhost:8000/v1/chat/completions ... - Ollama:
curl http://localhost:11434/v1/chat/completions ...
The key is using the same client code, timeout settings, and retry logic for everything.
How the Measurement Works
The system is structured into five modules: YAML Config → BenchRunner → AIClient → Analyzer → Reporter.
The YAML config defines tasks and models. Example:
suite: coding-benchmark
models:
- gpt-5.4
- claude-sonnet-4.6
- gemini-3.1-pro
- llama-4
runs_per_model: 3
tasks:
- name: fizzbuzz
prompt: "Write a Python function that prints FizzBuzz for numbers 1-100"
- name: refactor-suggestion
prompt: "Given this code, suggest improvements:\n\ndef calc(x):\n if x == 0: return 0\n if x == 1: return 1\n return calc(x-1) + calc(x-2)"The BenchRunner takes the Cartesian product of tasks × models × runs and calls the API sequentially, recording latency, prompt tokens, and completion tokens.
The Scoring Part
Quality scoring is rule-based, not LLM-as-judge, to avoid self-preference bias and ensure reproducibility. The _quality_score function uses three signals:
- Response length: 50–3000 characters scores 4.0, shorter scores 1.0, longer scores 3.0.
- Formatting: Presence of bullet points adds up to 3.0 points.
- Code presence: Detecting code blocks or function definitions adds 2.0 points.
Maximum score is 9.0. This reliably separates "good structured response" from "garbage/empty/hallucinated" for relative ranking. For latency, the 95th percentile response time (P95) is also calculated.
📖 Read the full source: r/LocalLLaMA
👀 See Also

How OpenCLAW Memory Actually Works: Fixing Agent 'Forgetting'
OpenCLAW agents don't have persistent memory between conversations - they reconstruct context from files like SOUL.md, USER.md, and MEMORY.md each time. Common 'forgetting' issues stem from old sessions, unstructured memory files, and storing important info in chat history instead of permanent files.

Using AI to Write Better Code More Slowly: A Bug-Finding Workflow
Nolan Lawson describes a workflow using multiple AI agents (Claude, Codex, Cursor Bugbot) to find and prioritize bugs in PRs, improving code quality over raw velocity.

OpenClaw Onboarding: How to Train Your AI Agent Right

Running OpenClaw Locally with Ollama to Avoid API Costs
A Reddit user shares their experience switching from API-based OpenClaw to running it locally with Ollama, eliminating API costs while maintaining workflows. They created a step-by-step installation video guide.