TL;DR
SCICONVBENCH is a benchmark designed to evaluate large language models' ability to clarify and refine ill-posed scientific questions through multi-turn dialogue in computational science.
Contribution
It introduces a structured benchmark with a rubric-based evaluation for assessing LLMs' clarification and correction capabilities in scientific task formulation.
Findings
Frontier models resolve only 52.7% of disambiguation cases in fluid mechanics.
Models perform relatively well on inconsistency resolution.
LLMs often make implicit assumptions that are not grounded in the conversation.
Abstract
Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous…
