ThermoQA: Open Benchmark for Engineering Thermodynamics Tests LLMs on 293 Calculation Problems

ThermoQA Benchmark Overview
ThermoQA is an open benchmark for engineering thermodynamics with 293 open-ended calculation problems across three tiers:
- Tier 1: Property lookups (110 questions). Example: "What is the enthalpy of water at 5 MPa, 400°C?"
- Tier 2: Component analysis (101 questions). Turbines, compressors, and heat exchangers with energy, entropy, and exergy calculations.
- Tier 3: Full cycle analysis (82 questions). Rankine, Brayton, and combined-cycle gas turbine cycles.
Ground truth comes from CoolProp (IAPWS-IF97). No multiple choice — models must produce exact numerical values.
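As a rough illustration of how a reference value for the Tier 1 example question could be produced, the sketch below queries CoolProp for specific enthalpy. The helper names are hypothetical and are not taken from the ThermoQA code. The fluid string is also an assumption: plain "Water" uses CoolProp's default formulation, while "IF97::Water" selects its IF97 backend.

```python
def mpa_to_pa(p_mpa: float) -> float:
    """Convert pressure from MPa to Pa (CoolProp's PropsSI expects SI units)."""
    return p_mpa * 1e6


def celsius_to_kelvin(t_c: float) -> float:
    """Convert temperature from degrees Celsius to kelvin."""
    return t_c + 273.15


def enthalpy_kj_per_kg(p_mpa: float, t_c: float, fluid: str = "Water") -> float:
    """Specific enthalpy in kJ/kg at the given pressure and temperature.

    Hypothetical helper, not part of ThermoQA. It needs the third-party
    CoolProp package (pip install CoolProp), which is imported lazily so
    the unit helpers above remain usable without it. Pass
    fluid="IF97::Water" to use CoolProp's IF97 backend instead.
    """
    from CoolProp.CoolProp import PropsSI

    h_j_per_kg = PropsSI(
        "H", "P", mpa_to_pa(p_mpa), "T", celsius_to_kelvin(t_c), fluid
    )
    return h_j_per_kg / 1000.0  # J/kg -> kJ/kg
```

For the example question this would be called as `enthalpy_kj_per_kg(5.0, 400.0)`, and a model's numerical answer would then be compared against the returned value.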
Leaderboard Results (3-run mean)
- 1. Claude Opus 4.6: Tier 1: 96.4%, Tier 2: 92.1%, Tier 3: 93.6%, Composite: 94.1%
- 2. GPT-5.4: Tier 1: 97.8%, Tier 2: 90.8%, Tier 3: 89.7%, Composite: 93.1%
- 3. Gemini 3.1 Pro: Tier 1: 97.9%, Tier 2: 90.8%, Tier 3: 87.5%, Composite: 92.5%
- 4. DeepSeek-R1: Tier 1: 90.5%, Tier 2: 89.2%, Tier 3: 81.0%, Composite: 87.4%
- 5. Grok 4: Tier 1: 91.8%, Tier 2: 87.9%, Tier 3: 80.4%, Composite: 87.3%
- 6. MiniMax M2.5: Tier 1: 85.2%, Tier 2: 76.2%, Tier 3: 52.7%, Composite: 73.0%
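The post does not say how the Composite column is computed. The reported numbers are consistent with a mean weighted by the number of questions in each tier (110, 101, 82), which the following sketch checks. The weighting is an inference from the figures above, not a documented formula, and the function names are illustrative.

```python
# Question counts per tier, as listed in the benchmark overview.
TIER_SIZES = (110, 101, 82)

# Reported (Tier 1, Tier 2, Tier 3) accuracies and composite, in percent.
LEADERBOARD = {
    "Claude Opus 4.6": ((96.4, 92.1, 93.6), 94.1),
    "GPT-5.4": ((97.8, 90.8, 89.7), 93.1),
    "Gemini 3.1 Pro": ((97.9, 90.8, 87.5), 92.5),
    "DeepSeek-R1": ((90.5, 89.2, 81.0), 87.4),
    "Grok 4": ((91.8, 87.9, 80.4), 87.3),
    "MiniMax M2.5": ((85.2, 76.2, 52.7), 73.0),
}


def weighted_composite(tier_scores):
    """Mean of the tier scores, weighted by question count (assumed formula)."""
    total = sum(n * s for n, s in zip(TIER_SIZES, tier_scores))
    return total / sum(TIER_SIZES)


def composite_gaps():
    """Difference between the recomputed and the reported composite per model."""
    return {
        model: weighted_composite(tiers) - reported
        for model, (tiers, reported) in LEADERBOARD.items()
    }


for model, gap in composite_gaps().items():
    print(f"{model}: recomputed minus reported = {gap:+.3f} points")
```

Every gap stays below the 0.05-point rounding margin of a one-decimal figure. A plain unweighted average of the three tiers would give about 94.0% for Claude Opus 4.6 rather than the reported 94.1%.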
Key Findings
- Rankings flip between tiers: Gemini leads Tier 1 (97.9%) but drops to #3 on Tier 3 (87.5%). Opus is #3 on lookups but #1 on cycle analysis, which suggests that recalling steam-table values is not the same as multi-step reasoning.
- Supercritical water is the hardest region: scores there show a 44.5 percentage-point spread. Models appear to memorize textbook tables but struggle with the strongly nonlinear property behavior near the critical point. One model gave h = 1,887 kJ/kg where the correct value is 2,586 kJ/kg, a 27% error.
- R-134a is the blind spot: all models drop to 44–63% on refrigerant problems versus 75–98% on water, which points to a training-data bias.
- Run-to-run consistency varies 10×: GPT-5.4 σ = ±0.1% on Tier 3 vs DeepSeek-R1 σ = ±2.5% on Tier 2.
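The ranking flip and the 27% error quoted in the findings can be reproduced from the numbers in this article. The snippet below is only a sanity check on that arithmetic, with illustrative function names; it does not reflect how ThermoQA itself scores answers.

```python
# Tier 1 and Tier 3 accuracies (percent), copied from the leaderboard.
TIER1 = {
    "Claude Opus 4.6": 96.4, "GPT-5.4": 97.8, "Gemini 3.1 Pro": 97.9,
    "DeepSeek-R1": 90.5, "Grok 4": 91.8, "MiniMax M2.5": 85.2,
}
TIER3 = {
    "Claude Opus 4.6": 93.6, "GPT-5.4": 89.7, "Gemini 3.1 Pro": 87.5,
    "DeepSeek-R1": 81.0, "Grok 4": 80.4, "MiniMax M2.5": 52.7,
}


def rank(scores, model):
    """1-based position of `model` when scores are sorted high to low."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(model) + 1


def relative_error_pct(predicted, reference):
    """Absolute relative error of `predicted` against `reference`, in percent."""
    return abs(predicted - reference) / abs(reference) * 100.0


for model in ("Gemini 3.1 Pro", "Claude Opus 4.6"):
    print(f"{model}: Tier 1 rank {rank(TIER1, model)}, Tier 3 rank {rank(TIER3, model)}")

# Supercritical enthalpy example: model answer vs. correct value, in kJ/kg.
print(f"relative error: {relative_error_pct(1887.0, 2586.0):.1f}%")
```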
Open-Source Resources
- Dataset: https://huggingface.co/datasets/olivenet/thermoqa
- Code: https://github.com/olivenet-iot/ThermoQA
📖 Read the full source: r/LocalLLaMA