SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts
Qingsong Zou, Zhi Yan, Zhiyao Xu, Kuofeng Gao, Jingyu Xiao, Yong Jiang

TL;DR
SmartBench introduces a novel dataset to evaluate large language models' ability to detect anomalies in smart home environments, revealing current models' limitations in accurately identifying abnormal device states and transitions.
Contribution
This paper presents the first smart home dataset designed for LLMs, focusing on normal and anomalous device states and transitions, and evaluates 13 models' anomaly detection performance.
Findings
Most LLMs perform poorly on anomaly detection tasks.
Claude-Sonnet-4.5 achieves 66.1% accuracy on context-independent anomalies.
Models struggle with context-dependent anomaly detection.
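The findings above report accuracy separately for context-independent and context-dependent anomalies. A minimal sketch of how such a per-category accuracy could be computed is shown below. The record layout, the category names, and the function `accuracy_by_category` are assumptions made for illustration, not the paper's evaluation code, and the toy records are invented rather than drawn from SmartBench.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return the fraction of correct predictions per anomaly category.

    `records` is an iterable of (category, true_label, predicted_label)
    tuples. This layout is hypothetical, not SmartBench's actual format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, truth, prediction in records:
        total[category] += 1
        if truth == prediction:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Invented toy records for illustration only (not SmartBench data).
toy_records = [
    ("context_independent", "anomalous", "anomalous"),
    ("context_independent", "normal", "normal"),
    ("context_independent", "anomalous", "normal"),
    ("context_dependent", "anomalous", "normal"),
    ("context_dependent", "normal", "normal"),
]

scores = accuracy_by_category(toy_records)
for category, score in scores.items():
    print(f"{category}: {score:.1%}")
```

Reporting the two categories separately, rather than one pooled score, keeps a gap between them visible.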
Abstract
Due to the strong context-awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next-generation LLM-based smart home assistants, we introduce SmartBench, which is the first smart…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Software System Performance and Reliability · Context-Aware Activity Recognition Systems
