Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Vijay Veerabadran; Fanyi Xiao; Nitin Kamra; Pedro Matias; Joy Chen; Caley Drooff; Brett D Roads; Riley Williams; Ethan Henderson; Xuanyi Zhao; Kevin Carlberg; Joseph Tighe; Karl Ridgeway

arXiv:2510.22443·cs.CV·October 28, 2025

TL;DR

This paper introduces WAGIBench, a new benchmark and dataset for evaluating vision-language models on the task of inferring user goals from multimodal data in assistive wearable devices, highlighting current model limitations.

Contribution

The work provides a novel multimodal dataset, a benchmark for goal inference, and insights into model performance and modality importance in assistive wearable contexts.

Findings

01

Humans outperform the best-performing VLM on multiple-choice accuracy (93% vs. 84%).

02

Larger models perform better but remain far from practically useful, producing relevant goals only 55% of the time.

03

Models benefit from extra information in relevant modalities, while irrelevant modalities degrade performance only minimally.
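
As a rough illustration of how a modality comparison like the one in the third finding could be scored, the sketch below computes accuracy for two runs and reports the difference. It assumes a multiple-choice setup in which each prediction is an option index. The function name `mcq_accuracy`, the variable names, and all of the toy data are hypothetical and are not taken from the paper or its benchmark code.

```python
from typing import Sequence


def mcq_accuracy(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index matches the key."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        raise ValueError("cannot compute accuracy over zero questions")
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)


# Hypothetical toy data: five questions, option indices 0-3.
answer_key = [0, 2, 1, 3, 2]
vision_only = [0, 1, 1, 3, 0]   # made-up predictions from vision alone
with_audio = [0, 2, 1, 3, 0]    # made-up predictions with audio added

acc_vision = mcq_accuracy(vision_only, answer_key)
acc_audio = mcq_accuracy(with_audio, answer_key)
gain = acc_audio - acc_vision

print(f"vision only: {acc_vision:.2f}")
print(f"vision + audio: {acc_audio:.2f}")
print(f"gain from audio: {gain:+.2f}")
```

A positive gain on questions where the added modality is relevant, and a near-zero change where it is not, would correspond to the pattern the finding describes.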

Abstract

There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multimodal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate…
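
For a sense of scale, the dataset figures quoted in the abstract (29 hours, 348 participants, 3,477 recordings) can be turned into per-recording and per-participant averages. The sketch below does only that arithmetic. The resulting averages are derived here from rounded totals and are not figures reported by the authors; the function name `dataset_averages` is illustrative.

```python
def dataset_averages(total_hours: float, participants: int, recordings: int) -> dict:
    """Derive simple averages from the dataset totals stated in the abstract."""
    total_seconds = total_hours * 3600
    return {
        "seconds_per_recording": total_seconds / recordings,
        "recordings_per_participant": recordings / participants,
        "minutes_per_participant": total_hours * 60 / participants,
    }


# Totals as stated in the abstract; "29 hours" is presumably rounded,
# so the derived values are approximate.
stats = dataset_averages(total_hours=29, participants=348, recordings=3477)

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these totals, a recording averages about 30 seconds, and each participant contributes about 10 recordings, or about 5 minutes of data.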

Videos

Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents · SlidesLive