PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools

Tianjun Feng; Yunfeng Chen; Chun-Yi Tsai; Yihan Sun; Ayan Das; Kaoutar El Maghraoui; Shuxin Lin; Dhaval Patel

arXiv:2604.01532 · cs.AI · May 12, 2026

TL;DR

PHMForge is a comprehensive evaluation environment for testing LLM agents on industrial prognostics, revealing limitations in tool orchestration and the impact of retrieval methods on prognostic accuracy.

Contribution

Introduces PHMForge, a new benchmark with diverse scenarios and tools, and analyzes the structural limits of static retrieval versus dynamic orchestration in LLM-based prognostics.

Findings

01

Krippendorff's alpha falls between 0.74 and 0.82, indicating high agreement among raters.

02

Strong LLM configurations achieve up to 80.8% pass@1.

03

Replacing MCP with RAG reduces prognosis success from 100% to 20%.
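The pass@1 figure in the findings above can be made concrete with a short sketch. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), which for k = 1 reduces to the fraction of correct attempts. The page does not say how PHMForge computes its score, so the estimator choice and the names `pass_at_k` and `benchmark_pass_at_k` are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one scenario: n attempts sampled, c of them correct.

    Assumption: this is the common estimator 1 - C(n-c, k) / C(n, k);
    the source does not confirm that PHMForge scores runs this way.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(outcomes: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over scenarios; outcomes holds (n_attempts, n_correct) pairs."""
    if not outcomes:
        raise ValueError("need at least one scenario")
    return sum(pass_at_k(n, c, k) for n, c in outcomes) / len(outcomes)


if __name__ == "__main__":
    # Hypothetical outcomes for three scenarios, one attempt each.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1)], k=1))
```

With a single attempt per scenario, pass@1 is simply the share of scenarios solved on the first try. As pure arithmetic, 80 passes out of 99 scenarios rounds to 80.8%, but the page does not state the counts behind the reported figure.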

Abstract

LLM agents are beginning to invoke industrial asset-management tools through the Model Context Protocol (MCP), yet whether they can act reliably on this substrate for safety-critical Prognostics and Health Management (PHM) is unanswered. Prior benchmarks conflate protocol fluency with reasoning, instrumentation failures with agent failures, and tool use with tool retrieval. We introduce PHMForge, an evaluation environment that closes each conflation. PHMForge ships 99 SME-authored scenarios across eight industrial asset classes spanning rotating equipment, aero-engines, and lithium-ion cells, on public datasets including NASA PCoE, served through 39 MCP-native tools wrapping published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius capacity-fade models, time-series foundation models). Krippendorff's α ∈ [0.74, 0.82] on a 30-scenario stratified…
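To show what the Krippendorff's α range in the abstract measures, here is a minimal sketch of the statistic for nominal labels, computed from the coincidence matrix as α = 1 - D_o/D_e. The abstract does not say which variant of α (nominal, ordinal, interval) or which rating scheme was used, so the nominal metric, the function name `krippendorff_alpha_nominal`, and the toy labels are all assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels.

    units: one inner list per rated item, holding the labels the raters assigned.
    Assumption: nominal distance (labels either match or differ); the source
    does not state which distance metric the authors used.
    """
    # Coincidence matrix: each ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings on that item.
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating forms no pairs
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals: Counter = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - (n - 1) * observed / expected


if __name__ == "__main__":
    # Hypothetical ratings: two raters, four items, one disagreement.
    toy = [["pass", "pass"], ["pass", "pass"], ["fail", "fail"], ["pass", "fail"]]
    print(krippendorff_alpha_nominal(toy))
```

A value of 1 means the raters never disagree, and a value near 0 means their agreement is no better than chance given the label frequencies.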
