Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video
Zachary Chavis, Hyun Soo Park, Stephen J. Guy

TL;DR
This paper introduces a spatial extension to Vision-Language Models that uses spatially localized egocentric video to predict where a task can take place and where that location lies relative to the viewer, improving task localization and affordance understanding.
Contribution
The authors develop a method that enhances VLMs with spatial reasoning from egocentric videos, enabling better localization and understanding of task affordances in physical space.
Findings
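Outperforms a baseline that uses a VLM to score a task description against a set of location-tagged images
Has lower error both in predicting where a task may take place and in predicting which tasks are likely at the current location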
Enables robots to navigate to task-relevant regions
Abstract
Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM, which leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways -- through understanding spatial task-affordances, i.e. where an agent must be for the task to physically take place, and the localization of that task relative to the egocentric viewer. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting…
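The baseline named in the abstract, mapping the similarity of a task description over location-tagged images, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the short vectors stand in for embeddings that a VLM text and image encoder would produce, and the function names `localize_task` and `likely_tasks`, the toy data, and the use of cosine similarity are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def localize_task(task_embedding, tagged_images):
    """Return the location tag of the image most similar to the task text.

    tagged_images: list of (image_embedding, (x, y)) pairs (hypothetical layout).
    """
    best = max(tagged_images, key=lambda item: cosine(task_embedding, item[0]))
    return best[1]

def likely_tasks(current_xy, task_embeddings, tagged_images, k=1):
    """Rank tasks by similarity to the image tagged nearest the current location."""
    nearest = min(tagged_images, key=lambda item: math.dist(current_xy, item[1]))
    ranked = sorted(
        task_embeddings,
        key=lambda name: cosine(task_embeddings[name], nearest[0]),
        reverse=True,
    )
    return ranked[:k]

# Toy stand-ins for encoder outputs (assumed values, not real VLM embeddings).
tagged_images = [
    ([1.0, 0.0, 0.1], (0.0, 0.0)),  # image taken near a sink
    ([0.0, 1.0, 0.1], (4.0, 2.0)),  # image taken near a coffee machine
]
tasks = {
    "wash dishes": [0.9, 0.1, 0.0],
    "make coffee": [0.1, 0.9, 0.0],
}

where = localize_task(tasks["wash dishes"], tagged_images)
what = likely_tasks((3.5, 2.0), tasks, tagged_images)
print(where, what)
```

A baseline of this kind can only score what is visible in each tagged image, which is the limitation the abstract attributes to VLMs that reason over the image plane.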
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training
