MUG: Interactive Multimodal Grounding on User Interfaces
Tao Li, Gang Li, Jingjie Zheng, Purple Wang, Yang Li

TL;DR
MUG introduces an interactive multimodal grounding task for user interfaces in which a user and an agent collaborate over multiple rounds. On a new mobile interface dataset, this iterative interaction significantly improves task completion rates.
Contribution
The paper presents a new interactive task and dataset for multimodal UI grounding, emphasizing multi-round interactions to enhance real-world applicability.
Findings
Iterative interaction improves task completion by 18% overall.
Multi-round interactions increase success by 31% on challenging cases.
Benchmarking with various models demonstrates the importance of iterative dialogue.
Abstract
We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces, of which 20% involve multiple rounds of interactions. To establish our benchmark, we…
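The multi-round protocol described in the abstract can be illustrated with a small sketch. This is not the paper's model or data format: the `Turn` record, `run_session`, the keyword-overlap `toy_agent`, and the scripted `toy_user` are all hypothetical names chosen for illustration, and the screen is reduced to a list of element labels. The sketch only shows the control flow in which the user sees the agent's response and issues a follow-up command to correct it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    command: str   # what the user said in this round
    selected: int  # index of the UI element the agent picked


def run_session(
    agent: Callable[[List[str], List[Turn], str], int],
    user: Callable[[List[str], List[Turn], int], Optional[str]],
    screen: List[str],
    target: int,
    max_rounds: int = 5,
) -> List[Turn]:
    """Alternate user commands and agent selections until the target is hit."""
    history: List[Turn] = []
    for _ in range(max_rounds):
        command = user(screen, history, target)
        if command is None:  # user is satisfied, nothing more to say
            break
        selected = agent(screen, history, command)
        history.append(Turn(command, selected))
        if selected == target:
            break
    return history


def toy_user(screen: List[str], history: List[Turn], target: int) -> Optional[str]:
    """Scripted user (assumption): vague first command, explicit correction after."""
    if not history:
        return "tap the button"  # deliberately ambiguous
    if history[-1].selected == target:
        return None
    return f"no, the {screen[target]} one"


def toy_agent(screen: List[str], history: List[Turn], command: str) -> int:
    """Keyword-overlap agent (assumption): stands in for a learned grounding model."""
    words = set(command.lower().split())
    for turn in history:
        words |= set(turn.command.lower().split())
    # A follow-up command implies earlier picks were wrong, so skip them.
    rejected = {turn.selected for turn in history}
    candidates = [i for i in range(len(screen)) if i not in rejected]
    if not candidates:
        candidates = list(range(len(screen)))
    return max(candidates, key=lambda i: len(words & set(screen[i].lower().split())))


if __name__ == "__main__":
    screen = ["Cancel button", "Save button", "Share icon"]
    for turn in run_session(toy_agent, toy_user, screen, target=1):
        print(f"user: {turn.command!r} -> agent picks {screen[turn.selected]!r}")
```

In this toy run the first command matches two elements equally, so the agent picks the wrong one, and the second round resolves the ambiguity. A single-round setting corresponds to `max_rounds=1`, where the same session would end in failure.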
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling
Methods: Test
