PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
Hanlin Wang, Yilu Wu, Sheng Guo, Limin Wang

TL;DR
This paper introduces PDPP, a diffusion-based framework for procedure planning in instructional videos that models entire action sequences directly from task labels, reducing annotation costs and addressing uncertainty.
Contribution
It proposes a novel diffusion-based approach for procedure planning that eliminates the need for intermediate supervision and autoregressive modeling, with joint training for variable horizon lengths.
Findings
Achieves state-of-the-art performance on multiple datasets.
Effectively models uncertainty in procedure planning.
Demonstrates strong generalization across different tasks.
Abstract
In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e. a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision to plan autoregressively, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and the error accumulation caused by planning autoregressively, we propose a diffusion-based framework, coined as PDPP, to directly model the whole action sequence distribution with the task label as supervision instead. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transforming the planning problem into a sampling process from this distribution…
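The idea of planning as sampling can be illustrated with a toy sketch. It is not the authors' implementation: `toy_denoiser`, `sample_plan`, the linear noise schedule, and every parameter value are hypothetical, and the stand-in denoiser simply returns a fixed one-hot plan derived from the task label where a trained network would sit. The loop is a standard DDPM-style reverse process over a whole (horizon x actions) matrix, so all actions are produced jointly rather than one at a time.

```python
import math
import random


def toy_denoiser(x_t, t, task_id):
    """Hypothetical stand-in for a trained network that predicts the clean
    action matrix x0. It ignores x_t and t and returns a fixed one-hot plan
    derived from the task label (an assumption made for illustration)."""
    horizon, num_actions = len(x_t), len(x_t[0])
    return [
        [1.0 if a == (task_id + i) % num_actions else 0.0
         for a in range(num_actions)]
        for i in range(horizon)
    ]


def sample_plan(task_id, horizon=3, num_actions=5, num_steps=50, seed=0):
    """Sample a whole action sequence by DDPM-style reverse diffusion."""
    rng = random.Random(seed)

    # Linear beta schedule and cumulative products of alpha = 1 - beta.
    betas = [1e-4 + (0.02 - 1e-4) * k / (num_steps - 1)
             for k in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    # Start from pure Gaussian noise over the full sequence.
    x = [[rng.gauss(0.0, 1.0) for _ in range(num_actions)]
         for _ in range(horizon)]

    for t in reversed(range(num_steps)):
        x0_pred = toy_denoiser(x, t, task_id)
        abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        denom = 1.0 - alpha_bars[t]
        # Gaussian posterior q(x_{t-1} | x_t, x0) of the forward process.
        coef_x0 = math.sqrt(abar_prev) * betas[t] / denom
        coef_xt = math.sqrt(1.0 - betas[t]) * (1.0 - abar_prev) / denom
        std = math.sqrt(betas[t] * (1.0 - abar_prev) / denom)  # 0 at t == 0
        x = [
            [coef_x0 * x0_pred[i][a] + coef_xt * x[i][a]
             + std * rng.gauss(0.0, 1.0)
             for a in range(num_actions)]
            for i in range(horizon)
        ]

    # Decode each step of the plan as the highest-scoring action index.
    return [max(range(num_actions), key=row.__getitem__) for row in x]


plan = sample_plan(task_id=2)
```

In this sketch every step of the plan is denoised jointly and the task label is the only conditioning signal. Conditioning on the start and goal observations is left out here.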
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Convolution · Concatenated Skip Connection · Max Pooling · U-Net · Diffusion
