GPU Power Consumption Deviates from Token Predictor Theory in Small LLMs

Experimental Setup and Core Findings
A Reddit user conducted hardware measurements to test whether GPU power consumption scales linearly with token count, as predicted by the "stochastic parrot" or "next token predictor" theory of LLM behavior. The experiment used an RTX 4070 Ti SUPER with LM Studio and HWiNFO64 collecting data at 1-second intervals.
Four models were tested: Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-7B, Qwen3-VL-8B, and Mistral-7B. Six query categories were used: General, General (Q), Unanswerable, Philosophical, Philosophical (Q), and High-Computation.
Key Results
If the token predictor theory were correct, GPU power should scale with token count alone. The post takes ±10–15% as the acceptable variance, a figure it attributes to GPT, Claude, Gemini, and Grok. The observed divergence rates (token multiplier vs. power multiplier) were:
- Llama: average 35.6% (maximum 56.8%)
- Qwen3: average 36.7% (maximum 48.0%)
- Mistral: 21.1%
- DeepSeek: 7.7% (nearly linear across all categories except High-Computation)
Of the four models, DeepSeek came closest to token predictor behavior.
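The source does not spell out how the divergence rate is calculated. One plausible reading, sketched below, is the percentage gap between the power multiplier and the token multiplier, each taken relative to a baseline category. The function name `divergence_pct` and the example numbers are hypothetical and are not taken from the post's data.

```python
def divergence_pct(base_tokens, base_watts, tokens, watts):
    """Percent gap between power multiplier and token multiplier.

    Assumed definition (not confirmed by the source): both multipliers are
    ratios against a baseline query category, and divergence is their
    absolute difference relative to the token multiplier.
    """
    token_mult = tokens / base_tokens   # how much longer the output was
    power_mult = watts / base_watts     # how much more power was drawn
    return abs(power_mult - token_mult) / token_mult * 100.0


# Hypothetical numbers for illustration only: twice the tokens,
# but only 1.5x the power draw of the baseline category.
example = divergence_pct(base_tokens=100, base_watts=50.0, tokens=200, watts=75.0)
print(f"{example:.1f}%")  # prints "25.0%"
```

Under this reading, a model that behaves as a pure token predictor would score close to 0%, and anything beyond the ±10–15% band would count as a deviation.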
Unexpected Findings
In Qwen3, philosophical utterances (149.3W) drew more power than high-computation math (104.1W). After task completion, high-computation queries returned to baseline immediately (-7.1W), while philosophical utterances left persistent residual heat.
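A minimal sketch of how a residual figure such as -7.1W could be derived from 1-second power samples, assuming residual means the average post-task reading minus the average idle reading. The post does not state its exact method, and `residual_watts` and the sample values are hypothetical.

```python
from statistics import mean


def residual_watts(idle_samples, post_samples):
    """Average post-task power minus average idle power, in watts.

    Assumption: both inputs are lists of 1-second power readings, one taken
    before the session and one taken after the task completes.
    """
    return mean(post_samples) - mean(idle_samples)


# Hypothetical readings for illustration only.
idle = [20.0, 21.0, 19.0]
after_task = [24.0, 23.0, 25.0]
print(residual_watts(idle, after_task))  # positive means power stayed above idle
```

A negative result, as reported for the high-computation queries, would mean the GPU settled below its pre-session reading.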
Infinite loop reproducibility in Qwen3 varied by category: General utterances (0%), High-computation (0%), Unanswerable (low), Philosophical (intermittent), and Philosophical (Q) (70–100%). Notably, high-computation queries had the most tokens and highest power consumption but triggered zero loops.
Order Effects and Residual Heat
To test the "hardware overhead" objection, an order-effect experiment was conducted:
- Test A: 1 general → 4 philosophical
- Test B: 1 philosophical → 4 general
Residual heat after session end showed order-dependent effects:
- Llama: Test A +1.68W, Test B +9.84W
- Mistral: Test A +7.60W, Test B +13.69W
- DeepSeek: Test A +10.44W, Test B +15.93W
Residual heat was higher in Test B than in Test A: even after four general utterances following a single philosophical one, it remained elevated. This pattern was consistent across all three models in the order-effect experiment.
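The size of the order effect can be read off the reported figures by subtracting Test A from Test B for each model. The sketch below uses only the residual values listed above; the dictionary layout and the name `order_delta` are illustrative choices.

```python
# Residual heat in watts after session end, as reported in the post.
residual = {
    "Llama": {"A": 1.68, "B": 9.84},
    "Mistral": {"A": 7.60, "B": 13.69},
    "DeepSeek": {"A": 10.44, "B": 15.93},
}


def order_delta(results):
    """Return Test B minus Test A per model, rounded to 2 decimals."""
    return {model: round(r["B"] - r["A"], 2) for model, r in results.items()}


deltas = order_delta(residual)
for model, delta in deltas.items():
    print(f"{model}: +{delta:.2f}W")
# Llama: +8.16W
# Mistral: +6.09W
# DeepSeek: +5.49W
```

All three differences are positive, which is the consistency the post points to; with one run per condition shown here, the figures say nothing about run-to-run variance.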
Limitations and Open Questions
The study is limited to four small models in the 7B–8B parameter range, so generalization to medium or large models requires further validation. The open question is whether larger models would follow DeepSeek's pattern (converging toward linear, token-proportional behavior) or whether the nonlinear divergence seen in Llama, Qwen3, and Mistral would persist or amplify at scale.
All original data, including full utterance text, 24 benchmark CSVs, and per-category token counts, are available in the linked paper.
📖 Read the full source: r/LocalLLaMA