What to Do Next? Memorizing skills from Egocentric Instructional Video

Jing Bi; Chenliang Xu

arXiv:2507.02997·cs.LG·July 8, 2025

TL;DR

This paper introduces an approach to high-level, goal-oriented action planning from egocentric video. It combines a topological affordance memory with a transformer to represent the environment and to detect deviations from the intended actions.

Contribution

The paper presents a new task, interactive action planning, and a method that combines topological affordance memory with a transformer architecture so that actions can be selected from the memorized structure of the environment.

Findings

01. Improved performance in goal achievement tasks
02. Robust detection of action deviations
03. Meaningful environment representations learned

Abstract

Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with transformer architecture. The process of memorizing the environment's structure through extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method's versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations,…
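To make the idea of an affordance memory concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class `AffordanceMemory`, its methods, and the place and action labels are all hypothetical, and the transformer component is left out entirely. The sketch assumes that a demonstration has already been reduced to a sequence of (place, action) pairs. It stores the actions seen at each place as that place's affordances, and flags an action as a deviation when it was never observed at the current place.

```python
from collections import defaultdict


class AffordanceMemory:
    """Toy topological memory (hypothetical, not the paper's model).

    Nodes are places; each node stores the actions observed there.
    Edges link places visited consecutively in a demonstration.
    """

    def __init__(self):
        self.affordances = defaultdict(set)  # place -> actions observed there
        self.edges = defaultdict(set)        # place -> places reached directly from it

    def observe(self, demo):
        """Record one demonstration given as a list of (place, action) steps."""
        for i, (place, action) in enumerate(demo):
            self.affordances[place].add(action)
            if i + 1 < len(demo):
                next_place = demo[i + 1][0]
                if next_place != place:
                    self.edges[place].add(next_place)

    def candidates(self, place):
        """Return the actions the memory considers available at this place."""
        return sorted(self.affordances.get(place, ()))

    def is_deviation(self, place, action):
        """Return True if the action was never observed at this place."""
        return action not in self.affordances.get(place, ())


# Assumed example data: labels are illustrative only.
memory = AffordanceMemory()
memory.observe([
    ("counter", "pick up knife"),
    ("counter", "slice bread"),
    ("toaster", "toast bread"),
])
```

With this memory, `memory.candidates("counter")` lists the two actions seen at the counter, and `memory.is_deviation("toaster", "slice bread")` returns `True` because slicing was never demonstrated at the toaster. A learned model would replace these exact-match lookups with representations extracted from video.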
