FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Christodoulos Constantinides; Dhaval Patel; Shuxin Lin; Claudio Guerrero; Sunil Dagajirao Patil; Jayant Kalagnanam

arXiv:2506.03278 · cs.CL · June 5, 2025

TL;DR

FailureSensorIQ introduces a multi-choice QA benchmark to evaluate large language models' ability to understand and reason about sensor data and failure modes in industrial settings, revealing strengths and weaknesses of current models.

Contribution

This work presents a novel MCQA benchmark, FailureSensorIQ, for assessing LLMs' reasoning on industrial sensor data and failure modes, along with analysis tools and a feature selection pipeline.

Findings

01. Closed-source models like GPT-4 approach expert-level performance.

02. Models show performance drops when faced with perturbations and distractions.

03. Current LLMs show significant knowledge gaps and fragile reasoning.
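
One common way to probe the kind of fragility described in the findings is to reorder the answer options of a multiple-choice item and check whether accuracy changes. The sketch below is only an illustration under stated assumptions: the `MCQItem` structure, the example questions, and the `always_first` predictor are hypothetical and are not taken from the FailureSensorIQ dataset or its evaluation code, and option shuffling is just one possible perturbation, not necessarily the one the authors use.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def shuffle_options(item: MCQItem, rng: random.Random) -> MCQItem:
    """Return a copy of `item` with its options reordered.

    The correct answer text is unchanged; only its position moves.
    """
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = tuple(item.options[i] for i in order)
    # The correct option now sits wherever its old index landed in `order`.
    new_answer = order.index(item.answer_index)
    return MCQItem(item.question, new_options, new_answer)


def accuracy(items, predict) -> float:
    """Fraction of items for which `predict(item)` returns the correct index."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


def always_first(item: MCQItem) -> int:
    """Toy stand-in for a position-biased model: always picks option 0."""
    return 0


# Made-up items for illustration only (not from the dataset).
ITEMS = [
    MCQItem(
        "Which sensor is most relevant for detecting bearing wear in an electric motor?",
        ("Vibration", "Ambient humidity", "Supply frequency", "Paint thickness"),
        0,
    ),
    MCQItem(
        "Which sensor is most relevant for detecting overheating of transformer windings?",
        ("Temperature", "Acoustic noise", "Ambient light", "Floor vibration"),
        0,
    ),
]

if __name__ == "__main__":
    rng = random.Random(0)
    perturbed = [shuffle_options(item, rng) for item in ITEMS]
    print("original accuracy: ", accuracy(ITEMS, always_first))
    print("perturbed accuracy:", accuracy(perturbed, always_first))
```

A predictor that relies on option position rather than content scores perfectly on the original toy items here but has no such guarantee once the options are shuffled, which is the type of gap a perturbation analysis is meant to expose. In practice `always_first` would be replaced by a call to the model under evaluation.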

Abstract

We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason about and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven, using statistical tools like correlation analysis and significance tests, but also domain-driven, by specialized LLMs that can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through different lenses using…

Code & Models

Repositories

ibm/failuresensoriq (official)

Datasets

ibm-research/FailureSensorIQ
dataset · 204 downloads

Videos

FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes (SlidesLive)

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Machine Learning in Materials Science

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · GPT-4 · Feature Selection