MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Maximillian Chen; Xuanming Zhang; Michael Peng; Zhou Yu; Alexandros Papangelis; Yohan Jo

arXiv:2605.06897·cs.CL·May 11, 2026

TL;DR

MIST introduces a synthetic dataset for voice-driven IoT device control, highlighting the challenges and gaps in current multimodal language models for real-world smart home applications.

Contribution

The paper presents a new dataset and framework for research on multimodal voice assistants that handle physical constraints and dynamic interactions in smart homes.

Findings

01 Significant performance gap between open- and closed-weight multimodal LLMs on MIST.

02 Even advanced closed-weight LLMs have substantial room for improvement.

03 Release of MIST dataset and data generation framework to facilitate further research.

Abstract

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to…
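To make the task description more concrete, the following is a minimal sketch of what tool calling over stateful IoT devices might look like. It is not the MIST schema or the authors' framework: the `Home` and `Light` classes, the `set_light` tool, the dictionary-shaped tool call, and the `execute` helper are all hypothetical names chosen for illustration. The sketch shows three of the ingredients the abstract names: device state that persists across turns, a physical constraint on a device parameter, and a mixed-initiative clarifying question when the request is underspecified.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Light:
    on: bool = False
    brightness: int = 0  # percent, 0-100


@dataclass
class Home:
    # Hypothetical device registry keyed by room name (not from the paper).
    lights: dict = field(default_factory=dict)

    def set_light(self, room: Optional[str], on: bool,
                  brightness: Optional[int] = None) -> str:
        # Mixed-initiative: ask a clarifying question instead of guessing
        # when the room is missing or unknown.
        if room not in self.lights:
            known = ", ".join(sorted(self.lights))
            return f"Which room? I know: {known}"
        light = self.lights[room]
        light.on = on
        if brightness is not None:
            # Physical constraint: clamp brightness to the valid range.
            light.brightness = max(0, min(100, brightness))
        return "ok"


def execute(home: Home, call: dict) -> str:
    # Dispatch a model-generated tool call (assumed dict format) to the home.
    tools = {"set_light": home.set_light}
    return tools[call["tool"]](**call["args"])


home = Home(lights={"kitchen": Light(), "bedroom": Light()})

# Turn 1: the user says "turn on the light" without naming a room.
turn1 = execute(home, {"tool": "set_light",
                       "args": {"room": None, "on": True}})

# Turn 2: the user clarifies; the requested brightness is out of range.
turn2 = execute(home, {"tool": "set_light",
                       "args": {"room": "kitchen", "on": True,
                                "brightness": 140}})
```

After the second turn the kitchen light is on at the clamped brightness while the bedroom light is unchanged, which is the kind of cross-turn state an assistant would need to track. A real system would also have to recover the tool call from speech input, which this sketch leaves out.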
