Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

MZ Naser; Ahmad Bani Awwad; Zoie McCreery; Radwa Eissa; Ahmad Naser; Gianluca Cusatis; Andrew Metcalf; Kapil Madathil; Jamal Abdalla; Venkatesh Kodur; Mohammad Reza Saeb

arXiv:2603.02239 · cs.AI · March 4, 2026

Open Access

TL;DR

The ERI benchmark provides a taxonomy-driven dataset for evaluating engineering reasoning in large language models across nine engineering fields, three difficulty tiers, and seven intent types, enabling standardized assessment and comparison.

Contribution

This paper introduces ERI, a 57,750-record engineering reasoning dataset with a validation protocol, addressing evaluation challenges and supporting reproducible benchmarking of LLMs on engineering tasks.

Findings

01. Frontier models significantly outperform smaller models.

02. Mid-tier models lose performance on graduate-level questions.

03. Validation indicates a low hallucination risk in the benchmark.

Abstract

The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. The dataset spans nine engineering fields (civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, crossed with seven intent types (definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We evaluated seven LLMs on ERI and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a…
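The taxonomy cross described in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the `Cell` class, the integer subdomain IDs, and the constant names are hypothetical (the abstract does not list the 55 subdomain names or the dataset schema), and the even split of records across cells is an inference from the counts, not something the abstract states.

```python
from dataclasses import dataclass
from itertools import product

# Category names as listed in the abstract.
FIELDS = ["civil", "mechanical", "electrical", "chemical", "environmental",
          "aerospace", "materials", "fire", "industrial"]
INTENTS = ["definition", "explanation", "calculation", "comparison",
           "design/synthesis", "troubleshooting", "code-related"]
DIFFICULTIES = ["undergraduate", "graduate", "professional"]

N_SUBDOMAINS = 55   # spread over the nine fields; names not given in the abstract
N_RECORDS = 57_750  # total record count reported in the abstract


@dataclass(frozen=True)
class Cell:
    """One taxonomy cell (hypothetical structure, for illustration only)."""
    subdomain_id: int  # placeholder ID standing in for a subdomain name
    intent: str
    difficulty: str


def taxonomy_cells() -> list[Cell]:
    """Enumerate every subdomain x intent x difficulty combination."""
    return [Cell(s, i, d)
            for s, i, d in product(range(N_SUBDOMAINS), INTENTS, DIFFICULTIES)]


cells = taxonomy_cells()
records_per_cell, remainder = divmod(N_RECORDS, len(cells))

print(f"{len(cells)} cells, {records_per_cell} records per cell, remainder {remainder}")
```

The reported total divides evenly over the 55 × 7 × 3 cells, which is consistent with a uniform allocation per cell, though the paper itself would need to confirm how records are actually distributed.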

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification