ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

Kemal Düzkar

arXiv:2604.19758·cs.AI·April 23, 2026

TL;DR

ThermoQA is a benchmark of 293 open-ended engineering thermodynamics problems that evaluates large language models across three tiers of increasing difficulty: property lookups, component analysis, and full cycle analysis. It also measures run-to-run reasoning consistency and identifies the problem types that best discriminate between models.

Contribution

The paper introduces ThermoQA, a three-tier benchmark for assessing thermodynamic reasoning in large language models, and releases the dataset and code openly.

Findings

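01 Claude Opus 4.6 leads the composite leaderboard at 94.1%, ahead of GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%).

02 Cross-tier performance degradation ranges from 2.8 percentage points (Opus) to 32.5 percentage points (MiniMax).

03 Reasoning consistency, measured as multi-run σ, ranges from ±0.1% to ±2.5% across models.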

Abstract

We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run σ ranges from ±0.1% to ±2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code…
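The two summary statistics named in the abstract, cross-tier degradation and multi-run σ, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the per-run accuracies are invented and are not results from the paper, degradation is taken as Tier 1 mean minus Tier 3 mean, the per-run overall score is an unweighted mean of the three tiers, and σ is the sample standard deviation. The paper's own definitions may differ.

```python
from statistics import mean, stdev

# Hypothetical per-run accuracies (%) for one made-up model.
# These numbers are NOT taken from the paper.
runs = {
    "tier1": [96.0, 95.0, 97.0],  # property lookups
    "tier2": [90.0, 91.0, 89.0],  # component analysis
    "tier3": [84.0, 86.0, 85.0],  # full cycle analysis
}


def tier_means(runs):
    """Average each tier's accuracy over its independent runs."""
    return {tier: mean(scores) for tier, scores in runs.items()}


def cross_tier_degradation(runs, first="tier1", last="tier3"):
    """Drop in mean accuracy, in percentage points, from `first` to `last`.

    Assumption: degradation is measured between the easiest and hardest tier.
    """
    means = tier_means(runs)
    return means[first] - means[last]


def run_sigma(runs):
    """Sample standard deviation of the per-run overall score.

    Assumption: each run's overall score is the unweighted mean of its tiers.
    """
    n_runs = len(next(iter(runs.values())))
    per_run = [mean(scores[i] for scores in runs.values()) for i in range(n_runs)]
    return stdev(per_run)


degradation_pp = cross_tier_degradation(runs)
sigma = run_sigma(runs)
print(f"cross-tier degradation: {degradation_pp:.1f} pp")
print(f"multi-run sigma: +/-{sigma:.2f}%")
```

Treating the two numbers separately mirrors the abstract's framing: degradation describes how accuracy falls as problems move from lookup to full cycle analysis, while σ describes how stable a model's score is across repeated runs of the same problems.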

Code & Models

Repositories

https://huggingface.co/datasets/olivenet/thermoqa