Grounding Task Assistance with Multimodal Cues from a Single Demonstration
Gabriel Sarch, Balasaravanan Thoravi Kumaravel, Sahithya Ravi, Vibhav Vineet, and Andrew D. Wilson

TL;DR
This paper introduces MICA, a multimodal framework that enhances task assistance by integrating eye gaze and speech cues to better understand fine-grained user intent and context from demonstrations.
Contribution
MICA is the first framework to combine eye gaze and speech cues for improved contextual understanding in task assistance, surpassing frame-based methods.
Findings
Gaze cues alone achieve 93% of the performance obtained with speech cues.
Multimodal cues significantly improve response accuracy.
Task type influences the effectiveness of implicit vs. explicit cues.
Abstract
A person's demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question…
Taxonomy
Topics: Speech and dialogue systems
