HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices

Silin Li; Yuhang Guo; Jiashu Yao; Zeming Liu; Haifeng Wang

arXiv:2505.19628·cs.CL·May 28, 2025

TL;DR

HomeBench is a dataset for evaluating LLMs in smart home scenarios with valid and invalid instructions across single and multiple devices; it reveals the limitations of current state-of-the-art models.

Contribution

This paper presents HomeBench, the first dataset for assessing LLM performance on complex smart home tasks that involve invalid instructions and multi-device control, and highlights the shortcomings of existing models.

Findings

01

GPT-4o achieves a 0.0% success rate on invalid multi-device instructions

02

State-of-the-art LLMs struggle with error detection and multi-device control

03

Existing methods do not significantly improve performance in complex scenarios

Abstract

Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These scenarios pose two main challenges: LLMs must accurately identify and rectify errors in user instructions, and they must execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid…

Code & Models

Repositories

bithlp/homebench
PyTorch · Official

Videos

HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices · Underline
