ThermoQA: Open Benchmark for Engineering Thermodynamics Tests LLMs on 293 Calculation Problems

✍️ OpenClawRadar · 📅 Published: March 21, 2026

ThermoQA Benchmark Overview

ThermoQA is an open benchmark for engineering thermodynamics with 293 open-ended calculation problems across three tiers:

  • Tier 1: Property lookups (110 questions). Example: "What is the enthalpy of water at 5 MPa and 400°C?"
  • Tier 2: Component analysis (101 questions). Turbines, compressors, and heat exchangers with energy, entropy, and exergy calculations
  • Tier 3: Full cycle analysis (82 questions). Rankine, Brayton, and combined-cycle gas turbines

Ground truth comes from CoolProp (IAPWS-IF97). There is no multiple choice: models must produce exact numerical values.
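
To make open-ended numeric grading concrete, here is a minimal Python sketch of a relative-error check against a reference value. The source does not say how answers are matched, so the function names and the 2% tolerance are purely illustrative assumptions, not ThermoQA's actual scoring rule. The sample numbers are the supercritical enthalpy pair quoted under Key Findings.

```python
def relative_error(predicted: float, reference: float) -> float:
    """Absolute error as a fraction of the reference value."""
    return abs(predicted - reference) / abs(reference)


def is_correct(predicted: float, reference: float, rel_tol: float = 0.02) -> bool:
    """Hypothetical pass/fail rule: accept answers within rel_tol of the reference.

    The 2% default is an assumption for illustration only.
    """
    return relative_error(predicted, reference) <= rel_tol


if __name__ == "__main__":
    # Enthalpy pair quoted in the article, in kJ/kg.
    model_h, reference_h = 1887.0, 2586.0
    print(f"relative error: {relative_error(model_h, reference_h):.1%}")
    print(f"accepted at 2% tolerance: {is_correct(model_h, reference_h)}")
```

Under any reasonable tolerance, an answer that far from the reference would be marked wrong, which is the point of dropping multiple choice: a model cannot score by recognizing the right option.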

Leaderboard Results (3-run mean)

  • 1. Claude Opus 4.6: Tier 1: 96.4%, Tier 2: 92.1%, Tier 3: 93.6%, Composite: 94.1%
  • 2. GPT-5.4: Tier 1: 97.8%, Tier 2: 90.8%, Tier 3: 89.7%, Composite: 93.1%
  • 3. Gemini 3.1 Pro: Tier 1: 97.9%, Tier 2: 90.8%, Tier 3: 87.5%, Composite: 92.5%
  • 4. DeepSeek-R1: Tier 1: 90.5%, Tier 2: 89.2%, Tier 3: 81.0%, Composite: 87.4%
  • 5. Grok 4: Tier 1: 91.8%, Tier 2: 87.9%, Tier 3: 80.4%, Composite: 87.3%
  • 6. MiniMax M2.5: Tier 1: 85.2%, Tier 2: 76.2%, Tier 3: 52.7%, Composite: 73.0%
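
The article does not say how the composite is computed. The reported figures are consistent with a mean weighted by the number of questions in each tier (110, 101, 82), and the short Python sketch below checks that reading against the leaderboard. Treat the formula as an inference from the published numbers, not a documented definition.

```python
# Question counts per tier, as listed in the overview.
TIER_SIZES = (110, 101, 82)

# (Tier 1, Tier 2, Tier 3, reported composite), copied from the leaderboard.
LEADERBOARD = {
    "Claude Opus 4.6": (96.4, 92.1, 93.6, 94.1),
    "GPT-5.4": (97.8, 90.8, 89.7, 93.1),
    "Gemini 3.1 Pro": (97.9, 90.8, 87.5, 92.5),
    "DeepSeek-R1": (90.5, 89.2, 81.0, 87.4),
    "Grok 4": (91.8, 87.9, 80.4, 87.3),
    "MiniMax M2.5": (85.2, 76.2, 52.7, 73.0),
}


def weighted_composite(tier_scores, sizes=TIER_SIZES):
    """Question-count-weighted mean of per-tier accuracies (assumed formula)."""
    total = sum(sizes)
    return sum(score * n for score, n in zip(tier_scores, sizes)) / total


if __name__ == "__main__":
    for model, (*tiers, reported) in LEADERBOARD.items():
        computed = weighted_composite(tiers)
        print(f"{model}: computed {computed:.1f}, reported {reported}")
```

A plain unweighted average of the three tiers does not reproduce every reported composite (for GPT-5.4 it gives about 92.8 rather than 93.1), which is why the weighted form is used here.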

Key Findings

  • Rankings flip between tiers: Gemini leads Tier 1 (97.9%) but drops to #3 on Tier 3 (87.5%). Opus is #3 on lookups but #1 on cycle analysis, which suggests that memorizing steam tables is not the same as reasoning.
  • Supercritical water breaks everything: scores show a 44.5 percentage point spread. Models appear to memorize textbook tables but struggle in the nonlinear region near the critical point. One model gave h = 1,887 kJ/kg where the correct value is 2,586 kJ/kg, a 27% error.
  • R-134a is the blind spot: All models collapse to 44–63% on refrigerant problems vs 75–98% on water, which points to training data bias.
  • Run-to-run consistency varies by more than 10×: GPT-5.4 σ = ±0.1% on Tier 3 vs DeepSeek-R1 σ = ±2.5% on Tier 2.
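
The ranking flip in the first finding can be reproduced directly from the leaderboard. The Python sketch below ranks models within a single tier; the helper name `rank_in_tier` is hypothetical, and the scores are copied from the table above.

```python
# Per-tier accuracies (Tier 1, Tier 2, Tier 3), copied from the leaderboard.
SCORES = {
    "Claude Opus 4.6": (96.4, 92.1, 93.6),
    "GPT-5.4": (97.8, 90.8, 89.7),
    "Gemini 3.1 Pro": (97.9, 90.8, 87.5),
    "DeepSeek-R1": (90.5, 89.2, 81.0),
    "Grok 4": (91.8, 87.9, 80.4),
    "MiniMax M2.5": (85.2, 76.2, 52.7),
}


def rank_in_tier(model: str, tier: int) -> int:
    """Return the 1-based rank of a model within a tier (tier is 1, 2, or 3).

    Ties keep leaderboard order, because Python's sort is stable.
    """
    ordered = sorted(SCORES, key=lambda name: SCORES[name][tier - 1], reverse=True)
    return ordered.index(model) + 1


if __name__ == "__main__":
    for model in ("Gemini 3.1 Pro", "Claude Opus 4.6"):
        print(f"{model}: Tier 1 rank {rank_in_tier(model, 1)}, "
              f"Tier 3 rank {rank_in_tier(model, 3)}")
```

Sorting by a single tier rather than by composite is what exposes the swap: the two models trade first and third place between property lookups and full cycle analysis.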

Open-Source Resources

📖 Read the full source: r/LocalLLaMA
