EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Ayesha Gull; Muhammad Usman Safder; Rania Elbadry; Fan Zhang; Veselin Stoyanov; Preslav Nakov; Zhuohan Xie

arXiv:2511.01650 · cs.CL · January 8, 2026

Open Access

TL;DR

EngTrace is a symbolic benchmark for evaluating the verifiable reasoning of large language models in engineering. It validates intermediate reasoning traces rather than final answers alone, and it targets domain-specific challenges.

Contribution

The paper presents EngTrace, a novel benchmark with a two-stage evaluation framework for assessing LLMs' reasoning in engineering, focusing on trace verification and domain-aware testing.

Findings

01 Identifies a trade-off between numeric precision and trace fidelity in LLMs.

02 Reveals a complexity cliff where mathematical pre-training does not ensure integrative reasoning.

03 Provides a diverse, contamination-resistant set of test cases for robust evaluation.

Abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark comprising 90 templates across three major engineering branches, nine core domains and 20 distinct areas. Through domain-aware parameterization, we generate 1,350 unique, contamination-resistant test cases to stress-test generalization. Moving beyond outcome matching, we introduce a…
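
The template-based generation described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the benchmark's actual code: the axial-bar template, the parameter ranges, the `instantiate` function, and the `TestCase` fields are hypothetical. The figure of 15 instances per template is inferred from 1,350 / 90 and assumes the cases are spread evenly across templates, which the abstract does not state.

```python
import random
from dataclasses import dataclass

# Hypothetical "domain-aware" ranges: values a practicing engineer would find
# plausible for a steel bar in tension. These are illustrative, not from the paper.
PARAM_RANGES = {
    "force_kN": (10.0, 100.0),
    "length_m": (0.5, 3.0),
    "area_mm2": (100.0, 1000.0),
}
E_STEEL_GPA = 200.0  # typical Young's modulus for structural steel


@dataclass(frozen=True)
class TestCase:
    question: str
    params: dict   # sampled inputs
    trace: dict    # intermediate quantities a verifier could check step by step
    answer: float  # final answer (elongation in mm)


def instantiate(seed: int) -> TestCase:
    """Fill one symbolic template with seeded, range-constrained parameters."""
    rng = random.Random(seed)  # seeded so every case is reproducible
    p = {name: round(rng.uniform(lo, hi), 1)
         for name, (lo, hi) in PARAM_RANGES.items()}

    # Ground truth is computed from the formula, not stored as text,
    # so each sampled instance has a verifiable solution path.
    force_N = p["force_kN"] * 1e3
    area_m2 = p["area_mm2"] * 1e-6
    stress_Pa = force_N / area_m2                # sigma = F / A
    strain = stress_Pa / (E_STEEL_GPA * 1e9)     # epsilon = sigma / E
    elongation_mm = strain * p["length_m"] * 1e3  # delta = epsilon * L

    question = (
        f"A steel bar (E = {E_STEEL_GPA} GPa) of length {p['length_m']} m and "
        f"cross-section {p['area_mm2']} mm^2 carries an axial load of "
        f"{p['force_kN']} kN. Find the elongation in mm."
    )
    trace = {"stress_MPa": stress_Pa / 1e6, "strain": strain}
    return TestCase(question, p, trace, elongation_mm)


# Assumed even split: 15 instances per template, 90 templates -> 1,350 cases.
CASES_PER_TEMPLATE = 15
cases = [instantiate(seed) for seed in range(CASES_PER_TEMPLATE)]
total_cases = 90 * len(cases)
```

Because the numeric values are resampled for every instance, a model cannot answer by recalling a memorized problem, and the stored `trace` gives a verifier intermediate values to compare against, not only the final number.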

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Explainable Artificial Intelligence (XAI)