TL;DR
ThermoQA is a benchmark of 293 open-ended engineering thermodynamics problems that evaluates large language models across three tiers (property lookups, component analysis, and full cycle analysis). It measures reasoning consistency over repeated runs and identifies the problem types that best discriminate between models.
Contribution
The paper introduces ThermoQA, a three-tier benchmark for assessing thermodynamic reasoning in large language models, released with an open-source dataset and code.
Findings
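Claude Opus 4.6 leads the composite leaderboard at 94.1%, followed by GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%).
Cross-tier performance degradation ranges from 2.8 percentage points (Opus) to 32.5 percentage points (MiniMax).
Multi-run sigma, used as a measure of reasoning consistency, ranges from +/-0.1% to +/-2.5% across models.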
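The consistency and degradation figures above can be illustrated with a minimal sketch. It assumes that sigma is the sample standard deviation over a model's independent runs and that degradation is Tier 1 accuracy minus Tier 3 accuracy; the summary does not spell out either definition, so both are assumptions. The function names and the input numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run accuracies in percent.

    Assumption: the benchmark's sigma is a sample standard deviation;
    a population standard deviation would give a slightly smaller value.
    """
    return mean(scores), stdev(scores)


def cross_tier_degradation(tier1_acc, tier3_acc):
    """Accuracy drop in percentage points from Tier 1 to Tier 3 (assumed definition)."""
    return tier1_acc - tier3_acc


# Placeholder values for illustration only; NOT results from the paper.
runs = [90.0, 91.0, 92.0]
avg, sigma = summarize_runs(runs)
print(f"accuracy: {avg:.1f}% +/- {sigma:.1f}%")
print(f"degradation: {cross_tier_degradation(95.0, 85.0):.1f} pp")
```

Reporting the spread alongside the mean is what lets consistency be read as its own axis: two models with the same mean accuracy can differ in how much they vary from run to run.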
Abstract
We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code…
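As a rough illustration of programmatic ground truth, the sketch below looks up a specific enthalpy with CoolProp's `PropsSI` function and compares a model's numeric answer against it. The helper names, the example state point, and the 2% relative tolerance are assumptions made for this sketch; the benchmark's actual grading rules are not given in this summary.

```python
def ground_truth_enthalpy(temperature_k, pressure_pa, fluid="Water"):
    """Specific enthalpy in kJ/kg at the given temperature (K) and pressure (Pa).

    CoolProp is imported lazily so the grading helper below works without it.
    Requires `pip install CoolProp`.
    """
    from CoolProp.CoolProp import PropsSI

    # PropsSI returns SI units (J/kg for enthalpy); convert to kJ/kg.
    return PropsSI("H", "T", temperature_k, "P", pressure_pa, fluid) / 1000.0


def within_tolerance(answer, truth, rel_tol=0.02):
    """True if `answer` is within `rel_tol` of `truth` (hypothetical 2% default)."""
    return abs(answer - truth) <= rel_tol * abs(truth)


if __name__ == "__main__":
    # Hypothetical Tier 1 style query: water at 500 degrees C and 10 MPa.
    h_true = ground_truth_enthalpy(773.15, 10e6, "Water")
    model_answer = h_true * 1.01  # stand-in for a parsed model response
    print(within_tolerance(model_answer, h_true))
```

The same pattern extends to other fluids that CoolProp knows by name, such as `"R134a"`. A relative tolerance is a natural choice for open-ended numeric answers, though it needs an absolute fallback when the true value is near zero.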
