What to Do Next? Memorizing skills from Egocentric Instructional Video
Jing Bi, Chenliang Xu

TL;DR
This paper introduces an approach to high-level, goal-oriented action planning from egocentric video. It combines a topological affordance memory with a transformer to improve environment understanding and to detect action deviations.
Contribution
The paper defines a new task, interactive action planning, and proposes a method that pairs a topological affordance memory with a transformer to represent the environment and select context-appropriate actions.
Findings
Improved performance in goal achievement tasks
Robust detection of action deviations
Meaningful environment representations learned
Abstract
Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with a transformer architecture. Memorizing the environment's structure by extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method's versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations,…
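To make the idea of an affordance memory concrete, the toy sketch below stores demonstrated transitions in a graph, looks up which actions are available in a given state, flags actions that the memory has never seen there, and plans a path to a goal. It is an illustrative assumption only and not the authors' implementation: the class `AffordanceMemory`, its methods, and the state and action names are all hypothetical, the graph is built from hand-written symbolic triples rather than video, and the transformer component is not modeled at all.

```python
from collections import defaultdict


class AffordanceMemory:
    """Toy topological memory (hypothetical): nodes are states, edges are demonstrated actions."""

    def __init__(self):
        # state -> {action -> next state}, filled from demonstrations
        self.edges = defaultdict(dict)

    def memorize(self, demo):
        """Store one demonstration given as (state, action, next_state) triples."""
        for state, action, next_state in demo:
            self.edges[state][action] = next_state

    def affordances(self, state):
        """Return the set of actions that were demonstrated in this state."""
        return set(self.edges.get(state, {}))

    def is_deviation(self, state, action):
        """Flag an action when the memory holds no such affordance for the state."""
        return action not in self.affordances(state)

    def plan(self, start, goal):
        """Breadth-first search over memorized edges; returns a list of actions or None."""
        frontier = [(start, [])]
        visited = {start}
        while frontier:
            state, actions = frontier.pop(0)
            if state == goal:
                return actions
            for action, next_state in self.edges.get(state, {}).items():
                if next_state not in visited:
                    visited.add(next_state)
                    frontier.append((next_state, actions + [action]))
        return None


# Hypothetical demonstration: state and action names are made up for illustration.
memory = AffordanceMemory()
memory.memorize([
    ("at_counter", "pick_up_mug", "holding_mug"),
    ("holding_mug", "fill_at_sink", "mug_filled"),
])

plan = memory.plan("at_counter", "mug_filled")
deviates = memory.is_deviation("at_counter", "open_fridge")
```

In this sketch a deviation is simply an action absent from the memorized affordances of the current state. The paper's memory model presumably works on learned representations rather than symbolic labels, so the sketch should be read only as an intuition for how a stored environment structure can support both action selection and deviation checks.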
