RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-Fu Chang

TL;DR
This paper introduces RAP, a retrieval-augmented model for adaptive procedure planning in instructional videos. It addresses three challenges (variable sequence length, temporal relations between steps, and limited annotations) and reports strong experimental results.
Contribution
RAP is the first model to handle variable-length procedure planning using retrieval and weak supervision, improving adaptability and reducing annotation costs.
Findings
RAP outperforms fixed-length models on benchmarks.
Retrieval improves temporal relation understanding.
Weak supervision expands training data effectively.
Abstract
Procedure planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: Prior works make the unrealistic assumption that the number of action steps is known and fixed, which yields models that do not generalize to real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, which limits scaling to large datasets. In this work, we propose a new and practical setting, called adaptive…
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Educational Games and Gamification
