FactoryBench: Evaluating Industrial Machine Understanding

Yanis Merzouki, Coral Izquierdo, Matei Ignuta-Ciuncanu, Marcos Gomez-Bracamonte, Riccardo Maggioni, Alessandro Lombardi, Camilla Mazzoleni, Federico Martelli, Balazs Gunther, Jonas Petersen, and Philipp Petersen

arXiv:2605.07675·cs.AI·May 11, 2026


TL;DR

FactoryBench is a comprehensive benchmark for assessing machine understanding in industrial robotics using time-series data, structured Q&A, and LLM evaluation, revealing significant gaps in current model capabilities.

Contribution

The paper introduces FactoryBench, a large-scale, causally structured benchmark for industrial machine understanding, together with a new sensor dataset (FactoryWave) and an evaluation framework.

Findings

01. No model exceeds 50% on structured causal levels.

02. Models score below 18% on decision-making tasks.

03. FactoryBench reveals significant gaps in current AI capabilities.

Abstract

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on…
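The scoring split described in the abstract (deterministic scoring for structured formats, a voting protocol over LLM-as-judge verdicts for free-form answers) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the normalized exact-match rule, and the strict-majority threshold are hypothetical and are not taken from the paper, which may define its formats and voting rule differently.

```python
from collections import Counter

# The four causal levels named in the abstract.
CAUSAL_LEVELS = ("state", "intervention", "counterfactual", "decision")


def score_structured(predicted: str, expected: str) -> float:
    """Deterministic score for a structured answer.

    Assumption: exact match after trimming whitespace and lowercasing.
    """
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def majority_vote(verdicts: list[bool]) -> bool:
    """Aggregate judge verdicts for a free-form answer.

    Assumption: the answer is accepted only when strictly more than
    half of the verdicts are True (ties count as rejection).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) * 2 > len(verdicts)


def accuracy_by_level(items):
    """Mean score per causal level from (level, score) pairs."""
    totals, counts = Counter(), Counter()
    for level, score in items:
        totals[level] += score
        counts[level] += 1
    return {level: totals[level] / counts[level] for level in counts}
```

Under these assumptions, per-level accuracy would be obtained by scoring each item with `score_structured` or `float(majority_vote(...))`, depending on its answer format, and passing the resulting `(level, score)` pairs to `accuracy_by_level`.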
