GPU Power Consumption Deviates from Token Predictor Theory in Small LLMs

Experimental Setup and Core Findings
A Reddit user took hardware measurements to test whether GPU power consumption scales linearly with token count, as the "stochastic parrot" or "next token predictor" theory of LLM behavior would predict. The models ran in LM Studio on an RTX 4070 Ti SUPER, with HWiNFO64 logging sensor data at 1-second intervals.
Four models were tested: Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-7B, Qwen3-VL-8B, and Mistral-7B. Six query categories were used: General, General (Q), Unanswerable, Philosophical, Philosophical (Q), and High-Computation.
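As a rough illustration of how a 1-second sensor log could be reduced to a per-query power figure, the sketch below averages a power column over a time window. The log excerpt, the column name "GPU Power [W]", and the helper `mean_power` are all hypothetical; the actual CSV layout depends on which sensors are enabled in HWiNFO64, and this is not the author's analysis script.

```python
import csv
import io
from statistics import mean

# Hypothetical log excerpt. The column name "GPU Power [W]" is an assumption;
# check the header of your own export before reusing this.
SAMPLE_LOG = """Time,GPU Power [W]
12:00:01,30.0
12:00:02,100.0
12:00:03,110.0
12:00:04,32.0
"""

def mean_power(log_text, start, end, column="GPU Power [W]"):
    """Average the power column over rows whose Time lies in [start, end].

    Times are compared as zero-padded HH:MM:SS strings, which sort
    lexicographically in chronological order within a single day.
    """
    reader = csv.DictReader(io.StringIO(log_text))
    values = [float(row[column]) for row in reader if start <= row["Time"] <= end]
    return mean(values)

# Average draw during the (hypothetical) generation window.
generation_avg = mean_power(SAMPLE_LOG, "12:00:02", "12:00:03")
print(generation_avg)  # 105.0
```

With one sample per second, short generations yield only a handful of rows, so a window average like this is coarse.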
Key Results
If the token-predictor theory held, GPU power should scale with token count alone. The author took ±10–15% as the acceptable variance, a figure attributed to GPT, Claude, Gemini, and Grok. The measured divergence between token multiplier and power multiplier was:
- Llama: average 35.6% (maximum 56.8%)
- Qwen3: average 36.7% (maximum 48.0%)
- Mistral: 21.1%
- DeepSeek: 7.7% — nearly linear across all categories except High-Computation
Of the four models, DeepSeek came closest to token-predictor behavior.
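The summary does not spell out how the divergence rate is computed. One plausible reading, shown here as an assumption only, is the relative gap between the power multiplier and the token multiplier, each measured against a baseline query. The function name and the input numbers below are hypothetical and are not taken from the benchmark data.

```python
def divergence_pct(base_tokens, base_watts, tokens, watts):
    """Relative gap between power scaling and token scaling, in percent.

    Assumed definition (not confirmed by the source):
        token_mult = tokens / base_tokens
        power_mult = watts / base_watts
        divergence = |power_mult - token_mult| / token_mult * 100
    """
    token_mult = tokens / base_tokens
    power_mult = watts / base_watts
    return abs(power_mult - token_mult) / token_mult * 100

# Hypothetical example: twice the tokens but only 1.5x the power.
d = divergence_pct(base_tokens=100, base_watts=100.0, tokens=200, watts=150.0)
print(d)  # 25.0

# Compare against the upper end of the ±10–15% band cited above.
within_band = d <= 15.0
print(within_band)  # False
```

Under this definition a perfectly token-proportional model scores 0%, and anything above 15% falls outside the stated tolerance.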
Unexpected Findings
In Qwen3, philosophical utterances (149.3W) drew more power than high-computation math (104.1W). After task completion, high-computation queries returned to baseline immediately (-7.1W), while philosophical utterances left persistent residual heat.
How reliably Qwen3 fell into an infinite loop varied by category: General (0%), High-Computation (0%), Unanswerable (low), Philosophical (intermittent), and Philosophical (Q) (70–100%). Notably, high-computation queries had the most tokens and the highest power consumption yet triggered no loops.
Order Effects and Residual Heat
To test the "hardware overhead" objection, an order-effect experiment was conducted:
- Test A: 1 general → 4 philosophical
- Test B: 1 philosophical → 4 general
Residual heat after session end showed order-dependent effects:
- Llama: Test A +1.68W, Test B +9.84W
- Mistral: Test A +7.60W, Test B +13.69W
- DeepSeek: Test A +10.44W, Test B +15.93W
In Test B, four general utterances followed the single philosophical one, yet residual heat was still higher than in Test A. The pattern held for all three models tested.
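The size of the order effect can be read directly off the figures above by subtracting Test A from Test B for each model. The snippet below does only that arithmetic; the variable names are illustrative.

```python
# Residual power after session end, in watts, as (Test A, Test B),
# copied from the list above.
residual_w = {
    "Llama": (1.68, 9.84),
    "Mistral": (7.60, 13.69),
    "DeepSeek": (10.44, 15.93),
}

# Test B minus Test A, rounded to the precision of the reported values.
order_gap = {
    model: round(test_b - test_a, 2)
    for model, (test_a, test_b) in residual_w.items()
}

for model, gap in order_gap.items():
    print(f"{model}: +{gap:.2f}W")
# Llama: +8.16W
# Mistral: +6.09W
# DeepSeek: +5.49W
```

The gap is positive for every model, though it is largest for Llama and smallest for DeepSeek.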
Limitations and Open Questions
The study is limited to four small models in the 7B–8B parameter range, so its results cannot yet be generalized to medium or large models. The open question is whether larger models would follow DeepSeek's pattern, converging toward linear, token-proportional behavior, or whether the nonlinear divergence seen in Llama, Qwen3, and Mistral would persist or amplify at scale.
All original data — including full utterance text, 24 benchmark CSVs, and per-category token counts — are available in the linked paper.
📖 Read the full source: r/LocalLLaMA