RECIPE: Procedural Planning via Grounding in Instructional Video
Luigi Seminara, Antonino Furnari, Lorenzo Torresani

TL;DR
RECIPE introduces a scalable, grounding-based reward framework for visual procedural planning that leverages large instructional video datasets without relying on costly annotations.
Contribution
It proposes a novel grounding quality reward for procedural planning, enabling learning from noisy video data and improving performance across multiple benchmarks.
Findings
RECIPE-RL outperforms base models at all scales and benchmarks.
It gains 7 to 8 macro-accuracy points in-domain and up to 16 points in the zero-shot setting.
It surpasses supervised fine-tuning and maintains diversity in generation.
Abstract
Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a…
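The asymmetry between labeling and verifying can be illustrated with a minimal sketch. This is not the paper's reward, whose exact form is not given in the text above. It assumes that each generated step and each ASR segment already has a precomputed embedding (plain lists of floats here), and it scores a plan by the best order-preserving alignment of steps to segments using cosine similarity. The names `cosine` and `grounding_reward`, and the choice of a dynamic-programming alignment, are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def grounding_reward(step_embs, segment_embs):
    """Hypothetical grounding score in [-1, 1].

    Aligns every generated step to a distinct ASR segment so that segment
    indices strictly increase with step order, maximizing total cosine
    similarity, and returns the mean similarity of that alignment.
    Returns 0.0 if there are no steps or more steps than segments.
    """
    n, m = len(step_embs), len(segment_embs)
    if n == 0 or n > m:
        return 0.0

    sim = [[cosine(s, t) for t in segment_embs] for s in step_embs]

    # best[i][j]: max total similarity aligning the first i steps
    # to segments drawn from the first j, in increasing order.
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0.0

    for i in range(1, n + 1):
        for j in range(i, m + 1):
            skip = best[i][j - 1]  # leave segment j unused
            match = best[i - 1][j - 1] + sim[i - 1][j - 1]  # step i -> segment j
            best[i][j] = max(skip, match)

    return best[n][m] / n


# Toy embeddings: two transcript segments with orthogonal content.
segments = [[1.0, 0.0], [0.0, 1.0]]
in_order = [[1.0, 0.0], [0.0, 1.0]]   # steps follow the transcript order
reversed_ = [[0.0, 1.0], [1.0, 0.0]]  # same steps, wrong order

print(grounding_reward(in_order, segments))   # 1.0
print(grounding_reward(reversed_, segments))  # 0.0
```

In this toy case the correctly ordered plan scores 1.0 and the reversed plan scores 0.0, because the alignment is forced to respect temporal order. The check needs only similarity lookups over stored embeddings, which is what makes verification of this kind cheap compared with extracting clean step labels.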
