Benchmark Local vs Cloud LLMs: A Consistent Methodology

A developer on r/LocalLLaMA has detailed a methodology for obtaining consistent benchmark numbers when comparing local LLMs with cloud APIs, addressing common frustrations with apples-to-oranges comparisons due to differing latencies, scoring, and methodologies.

The Core Problem with Benchmarking

Naive comparisons that fire requests at both local and cloud models measure different things. Cloud APIs involve queueing, load balancing, and routing. Local models involve warm-up, batching, and GPU contention. The solution implemented is to use sequential requests only. While slower—a 60-call benchmark takes ~3 minutes instead of 45 seconds—it ensures each measurement is clean, isolating inference time from queue time.

The Measurement Setup

The setup uses ZenMux as a unified endpoint, providing one base URL for four models: GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and a local Llama 4 quant. The approach works with any OpenAI-compatible endpoint, such as:

llama.cpp server: curl http://localhost:8080/v1/chat/completions ...
vLLM: curl http://localhost:8000/v1/chat/completions ...
Ollama: curl http://localhost:11434/v1/chat/completions ...

The key is using the same client code, timeout settings, and retry logic for everything.

How the Measurement Works

The system is structured into five modules: YAML Config → BenchRunner → AIClient → Analyzer → Reporter.

The YAML config defines tasks and models. Example:

suite: coding-benchmark
models:
  - gpt-5.4
  - claude-sonnet-4.6
  - gemini-3.1-pro
  - llama-4
runs_per_model: 3
tasks:
  - name: fizzbuzz
    prompt: "Write a Python function that prints FizzBuzz for numbers 1-100"
  - name: refactor-suggestion
    prompt: "Given this code, suggest improvements:\n\ndef calc(x):\n if x == 0: return 0\n if x == 1: return 1\n return calc(x-1) + calc(x-2)"

The BenchRunner takes the Cartesian product of tasks × models × runs and calls the API sequentially, recording latency, prompt tokens, and completion tokens.

The Scoring Part

Quality scoring is rule-based, not LLM-as-judge, to avoid self-preference bias and ensure reproducibility. The _quality_score function uses three signals:

Response length: 50–3000 characters scores 4.0, shorter scores 1.0, longer scores 3.0.
Formatting: Presence of bullet points adds up to 3.0 points.
Code presence: Detecting code blocks or function definitions adds 2.0 points.

Maximum score is 9.0. This reliably separates "good structured response" from "garbage/empty/hallucinated" for relative ranking. For latency, the 95th percentile response time (P95) is also calculated.

📖 Read the full source: r/LocalLLaMA