DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Devin Yasith De Silva; Dhaval Patel; Christodoulos Constantinides; Shuxin Lin; Nianjun Zhou; Paul J Adams; Sal Rosato; Nicolas Constantinides; Deborah L. McGuinness; Jayant Kalagnanam

arXiv:2605.08614·cs.AI·May 12, 2026


TL;DR

DiagnosticIQ is a benchmark of 6,690 expert-validated multiple-choice questions that tests whether LLMs can translate symbolic industrial maintenance rules into corrective actions, and it documents where current models succeed and where they fail.

Contribution

The paper presents a new benchmark, a symbolic-to-MCQA pipeline, and an analysis of LLM performance on industrial maintenance decision support tasks.

Findings

1. Top LLMs are within one Macro point of each other.

2. Models lose 13-60% accuracy with distractor expansion.

3. Models often rely on pattern-matching and break under structural perturbation.

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms…
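The two pipeline steps named in the abstract, normalizing a rule to Disjunctive Normal Form and sampling distractors by embedding similarity, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The tuple-based rule encoding, the sensor-condition names, the toy two-dimensional embeddings, and the choice to take the most similar candidate actions as distractors are all assumptions made for illustration; the paper's actual rule format, embedding model, and sampling strategy may differ.

```python
import math
from itertools import product

# Hypothetical rule encoding: ("var", name), ("not", e), ("and", a, b), ("or", a, b).

def to_nnf(expr, negate=False):
    """Push negations down to the atoms (negation normal form)."""
    op = expr[0]
    if op == "var":
        return ("not", expr) if negate else expr
    if op == "not":
        return to_nnf(expr[1], not negate)
    if negate:
        # De Morgan: negating a conjunction gives a disjunction and vice versa.
        op = "or" if op == "and" else "and"
    return (op, to_nnf(expr[1], negate), to_nnf(expr[2], negate))

def _dnf(expr):
    """DNF of an NNF expression: a set of conjunctions (frozensets of literals)."""
    op = expr[0]
    if op == "var":
        return {frozenset([expr[1]])}
    if op == "not":
        # In NNF a negation always wraps a variable; mark it with "!".
        return {frozenset(["!" + expr[1][1]])}
    left, right = _dnf(expr[1]), _dnf(expr[2])
    if op == "or":
        return left | right
    # "and": distribute over the disjuncts on each side.
    return {a | b for a, b in product(left, right)}

def to_dnf(expr):
    """Normalize a rule to DNF (contradictory conjunctions are not pruned)."""
    return _dnf(to_nnf(expr))

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sample_distractors(correct, candidates, embeddings, k=3):
    """Assumed strategy: pick the k actions most similar to the correct one."""
    pool = [c for c in candidates if c != correct]
    pool.sort(key=lambda c: cosine(embeddings[correct], embeddings[c]), reverse=True)
    return pool[:k]

# Toy rule: high_temp AND (low_flow OR NOT pump_on)
rule = ("and", ("var", "high_temp"),
        ("or", ("var", "low_flow"), ("not", ("var", "pump_on"))))
dnf = to_dnf(rule)

# Toy embeddings; a real pipeline would obtain these from an embedding model.
toy_embeddings = {
    "replace filter": [1.0, 0.0],
    "clean filter": [0.9, 0.1],
    "inspect filter": [0.8, 0.3],
    "reset breaker": [0.0, 1.0],
}
distractors = sample_distractors(
    "replace filter", list(toy_embeddings), toy_embeddings, k=2
)
```

In this sketch the toy rule normalizes to two conjunctions, (high_temp AND low_flow) and (high_temp AND NOT pump_on), each of which could be paired with the correct action to form one question. Ranking distractors by similarity to the correct action yields options that are hard to rule out by surface cues alone.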
