TL;DR
HomeBench introduces a comprehensive dataset for evaluating LLMs in smart home scenarios involving valid and invalid instructions across single and multiple devices, revealing current limitations of state-of-the-art models.
Contribution
This paper presents the first dataset, HomeBench, for assessing LLM performance in complex smart home tasks involving errors and multi-device control, highlighting existing models' shortcomings.
Findings
GPT-4o achieves a 0.0% success rate on invalid multi-device instructions
State-of-the-art LLMs struggle with error detection and multi-device control
Existing methods do not significantly improve performance in complex scenarios
Abstract
Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These scenarios pose two main challenges: LLMs must accurately identify and rectify errors in user instructions, and they must execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid…
