LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

Lei Shi; Victor Aregbede; Andreas Persson; Martin Längkvist; Amy Loutfi; Stephanie Lowry

arXiv:2603.09743 · cs.CV · March 11, 2026

Open Access

TL;DR

LAP introduces a novel language-aware approach for procedure planning in instructional videos, leveraging language representations to improve action sequence prediction and achieve state-of-the-art results.

Contribution

The paper proposes a new method that uses language descriptions and a vision-language model to enhance procedure planning in videos, outperforming existing visual-only methods.

Findings

01

LAP achieves state-of-the-art performance on multiple benchmarks.

02

Language embeddings provide more distinctive features than visual ones.

03

The approach significantly improves planning accuracy across various time horizons.

Abstract

Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions and to predict actions and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We…
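The claim that text embeddings are "more distinctive" than visual ones can be made concrete with a small sketch. The metric below (mean pairwise cosine similarity between action embeddings, where lower means easier to tell apart) is an assumption chosen for illustration, not necessarily the measure used in the paper. The embedding values and the helper names `cosine` and `mean_pairwise_similarity` are hypothetical; they are not outputs of LAP or of any VLM.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all distinct pairs of action embeddings.

    Lower values mean the actions are easier to separate in the latent space.
    """
    pairs = list(combinations(embeddings.values(), 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy embeddings, invented for illustration only.
# Visually similar actions sit close together in the "visual" space.
visual = {
    "pour milk": [0.9, 0.1, 0.1],
    "pour water": [0.88, 0.12, 0.1],
    "stir": [0.8, 0.3, 0.1],
}
# The same actions are well separated in the "text" space.
text = {
    "pour milk": [1.0, 0.0, 0.0],
    "pour water": [0.0, 1.0, 0.0],
    "stir": [0.0, 0.0, 1.0],
}

visual_score = mean_pairwise_similarity(visual)
text_score = mean_pairwise_similarity(text)
print(f"visual: {visual_score:.3f}, text: {text_score:.3f}")
```

In this toy setup the text space scores lower than the visual space, which is the kind of separation the abstract argues makes language a better conditioning signal for the planner.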

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning