Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning
Zhiheng Li, Wenjia Geng, Muheng Li, Lei Chen, Yansong Tang, Jiwen Lu, Jie Zhou

TL;DR
Skip-Plan is a procedure planning method for instructional videos that breaks long action sequences into shorter, more reliable sub-chains. It avoids high-dimensional state supervision and reduces error accumulation along the action sequence.
Contribution
It proposes a chain model with a skipping strategy that condenses the action space, aiming for more reliable and efficient procedure planning in instructional videos.
Findings
Achieves state-of-the-art results on CrossTask and COIN benchmarks.
Effectively reduces error propagation in long action sequences.
Demonstrates robustness by skipping unreliable intermediate actions.
Abstract
In this paper, we propose Skip-Plan, a condensed action space learning method for procedure planning in instructional videos. Current procedure planning methods all stick to the state-action pair prediction at every timestep and generate actions adjacently. Although it coincides with human intuition, such a methodology consistently struggles with high-dimensional state supervision and error accumulation on action sequences. In this work, we abstract the procedure planning problem as a mathematical chain model. By skipping uncertain nodes and edges in action chains, we transfer long and complex sequence functions into short but reliable ones in two ways. First, we skip all the intermediate state supervision and only focus on action predictions. Second, we decompose relatively long chains into multiple short sub-chains by skipping unreliable intermediate actions. By this means, our model…
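The second idea in the abstract, decomposing a long action chain into short sub-chains by skipping intermediate actions, can be illustrated with a small sketch. This is not the authors' implementation. The function name `skip_subchains`, the example action labels, and the specific rule used here (each sub-chain keeps the first action, the last action, and exactly one intermediate action) are assumptions made for illustration only; the abstract does not specify how sub-chains are chosen.

```python
from typing import List, Sequence, Tuple


def skip_subchains(actions: Sequence[str]) -> List[Tuple[str, ...]]:
    """Split one action chain into short sub-chains that keep both endpoints.

    Assumption (not from the source): every sub-chain keeps the first and
    last action plus exactly one intermediate action, skipping the others.
    """
    if len(actions) < 2:
        raise ValueError("need at least a start and an end action")
    first, last = actions[0], actions[-1]
    middle = actions[1:-1]
    if not middle:
        # Nothing to skip: the chain is already as short as it can be.
        return [(first, last)]
    # One sub-chain per intermediate action; all other intermediates are skipped.
    return [(first, m, last) for m in middle]


# Hypothetical four-step plan, used only to show the decomposition.
plan = ["pour water", "add coffee", "stir", "serve"]
subchains = skip_subchains(plan)
for chain in subchains:
    print(" -> ".join(chain))
```

Under this assumed rule, a chain of length T yields T - 2 sub-chains of length 3, so each predictor would only have to model a short sequence anchored at the two endpoints.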
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Focus
