PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
Tianjun Feng, Yunfeng Chen, Chun-Yi Tsai, Yihan Sun, Ayan Das, Kaoutar El Maghraoui, Shuxin Lin, Dhaval Patel

TL;DR
PHMForge is a comprehensive evaluation environment for testing LLM agents on industrial prognostics, revealing limitations in tool orchestration and the impact of retrieval methods on prognostic accuracy.
Contribution
Introduces PHMForge, a new benchmark with diverse scenarios and tools, and analyzes the structural limits of static retrieval versus dynamic orchestration in LLM-based prognostics.
Findings
Krippendorff's alpha indicates high agreement among raters.
Strong LLM configurations achieve up to 80.8% pass@1.
Replacing MCP with RAG reduces prognosis success from 100% to 20%.
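For readers unfamiliar with the pass@1 metric cited above, the following is a minimal sketch of the widely used unbiased pass@k estimator, which reduces to the plain success fraction when k = 1. It is an illustration only: the summary does not say how PHMForge computes pass@1, and the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` result format are assumptions made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one scenario with n attempts, c of them successful.

    Hypothetical helper for illustration; not taken from the PHMForge code.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every k-subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random k-subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over scenarios; results holds (n, c) pairs (assumed format)."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the per-scenario value is simply c / n, so a benchmark-level pass@1 is the mean first-attempt success rate across scenarios.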
Abstract
LLM agents are beginning to invoke industrial asset-management tools through the Model Context Protocol (MCP), yet whether they can act reliably on this substrate for safety-critical Prognostics and Health Management (PHM) is unanswered. Prior benchmarks conflate protocol fluency with reasoning, instrumentation failures with agent failures, and tool use with tool retrieval. We introduce PHMForge, an evaluation environment that closes each conflation. PHMForge ships 99 SME-authored scenarios across eight industrial asset classes spanning rotating equipment, aero-engines, and lithium-ion cells, on public datasets including NASA PCoE, served through 39 MCP-native tools wrapping published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius capacity-fade models, time-series foundation models). Krippendorff's alpha on a 30-scenario stratified…
