Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video

Zachary Chavis; Hyun Soo Park; Stephen J. Guy

arXiv:2407.13856 · cs.RO · June 13, 2025 · 1 citation

Open Access

TL;DR

This paper introduces a spatial extension to Vision-Language Models (VLMs) that uses spatially-localized egocentric video to predict where a task can physically take place and where that place lies relative to the viewer.

Contribution

The authors augment VLMs with spatially-localized egocentric video demonstrations, so the model can reason about task affordances and task locations in physical space rather than only about what is visible on the image plane.

Findings

01

Outperforms a baseline that maps VLM similarity of a task description over location-tagged images when predicting where a task takes place

02

Reduces error in predicting which tasks are likely to happen at the current location

03

Enables robots to navigate to task-relevant regions

Abstract

Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM, which leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways: through understanding spatial task-affordances, i.e., where an agent must be for the task to physically take place, and through the localization of that task relative to the egocentric viewer. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting…
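
The baseline named in the abstract, scoring a task description against a set of location-tagged images, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that some VLM (for example a CLIP-style encoder) has already mapped the task description and each image into a shared embedding space, and it uses tiny hand-made vectors in place of real embeddings. The function names `cosine_similarity`, `localize_task`, and `rank_tasks_at_location` are hypothetical, and picking the single best-matching image or nearest image is an assumed simplification.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localize_task(task_embedding, tagged_images):
    """Return (location, score) of the image most similar to the task description.

    tagged_images is a list of (image_embedding, (x, y)) pairs. The embeddings
    are assumed to come from a VLM's image encoder; here they are toy vectors.
    """
    scored = [(cosine_similarity(task_embedding, emb), loc)
              for emb, loc in tagged_images]
    best_score, best_location = max(scored, key=lambda pair: pair[0])
    return best_location, best_score


def rank_tasks_at_location(current_location, tagged_images, task_embeddings):
    """Rank task names by similarity to the image tagged nearest current_location.

    task_embeddings maps a task name to its (assumed) text embedding.
    """
    nearest_emb, _ = min(
        tagged_images,
        key=lambda pair: math.dist(pair[1], current_location),
    )
    scores = {name: cosine_similarity(emb, nearest_emb)
              for name, emb in task_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: three location-tagged "images" and two task descriptions.
tagged_images = [
    ([1.0, 0.0, 0.0], (0.0, 0.0)),
    ([0.0, 1.0, 0.0], (5.0, 2.0)),
    ([0.6, 0.8, 0.0], (3.0, 1.0)),
]
task_embeddings = {
    "wash dishes": [0.0, 1.0, 0.0],
    "open door": [1.0, 0.0, 0.0],
}

location, score = localize_task(task_embeddings["wash dishes"], tagged_images)
ranking = rank_tasks_at_location((0.2, 0.1), tagged_images, task_embeddings)
```

In this toy setup the query for "wash dishes" resolves to the location of the second image, and the ranking at a point near the origin places "open door" first. A lookup of this kind can only return places whose images resemble the task description, which is the kind of image-plane limitation the abstract describes.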

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition

Methods: Sparse Evolutionary Training