Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, Kevin Carlberg, Joseph Tighe, Karl Ridgeway

TL;DR
This paper introduces WAGIBench, a benchmark and dataset for evaluating how well vision-language models infer a user's goal from multimodal context captured by assistive wearable devices, and it highlights the limitations of current models on this task.
Contribution
The work provides a novel multimodal dataset, a benchmark for goal inference, and insights into model performance and modality importance in assistive wearable contexts.
Findings
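Human accuracy exceeds model performance on the multiple-choice task (93% vs. 84% for the best-performing VLM).
Larger models perform better on the generative task but are still not practically reliable, producing relevant goals only 55% of the time.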
Models benefit from relevant modalities with minimal impact from irrelevant ones.
Abstract
There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multimodal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate…
