Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Yufan Zhou; Zhaobo Qi; Lingshuai Lin; Junqi Jing; Tingting Chai; Beichen Zhang; Shuhui Wang; Weigang Zhang

arXiv:2507.03393·cs.CV·July 8, 2025

TL;DR

This paper introduces the Masked Temporal Interpolation Diffusion (MTID) model, which enhances procedure planning in instructional videos by generating coherent action sequences through a novel latent space interpolation and action-aware mechanisms.

Contribution

The paper presents a new diffusion-based model with a latent space temporal interpolation module and action-aware mask projection for improved task-specific procedure planning.

Findings

1. Reports higher action planning accuracy than prior methods on the CrossTask, COIN, and NIV benchmarks.

2. Captures temporal relationships among actions that text-level supervision alone struggles to model.

3. Enables end-to-end training with richer mid-state supervision from interpolated latent features.

Abstract

In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the…
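The interpolation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the intermediate latents are a per-step, per-dimension blend of the start and goal latent features, with the blend weights obtained by passing a matrix of logits through a sigmoid. In a real model those logits would be trained parameters; here they are fixed values. The function name `interpolate_latents`, the shapes, and the sigmoid parameterization are all hypothetical choices made for illustration.

```python
import numpy as np


def interpolate_latents(z_start, z_goal, logits):
    """Blend start and goal latents into intermediate latent features.

    z_start, z_goal: arrays of shape (D,), the observed latent features.
    logits: array of shape (T, D); stands in for a learnable interpolation
        matrix (an assumption, the paper's parameterization may differ).
    Returns an array of shape (T, D), one latent per intermediate step.
    """
    # Sigmoid keeps every weight strictly between 0 and 1, so each
    # intermediate latent stays between the start and goal values.
    weights = 1.0 / (1.0 + np.exp(-logits))
    return (1.0 - weights) * z_start + weights * z_goal


# Toy example: 3 intermediate steps, 8-dimensional latents.
rng = np.random.default_rng(0)
z_start = rng.normal(size=8)
z_goal = rng.normal(size=8)

# Fixed logits for illustration: step 0 leans toward the start,
# step 1 is an even blend, step 2 leans toward the goal.
logits = np.linspace(-4.0, 4.0, 3)[:, None] * np.ones((3, 8))

mid_states = interpolate_latents(z_start, z_goal, logits)
```

With these toy logits, the middle row of `mid_states` is the exact midpoint of the two latents, while the first and last rows sit close to the start and goal respectively. In training, gradients flowing into `logits` would let the model choose a different blend for each step and feature dimension.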

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- This paper addresses the limitation of the previous diffusion model approach, i.e. the temporal dependencies of actions, by using generated intermediate visual supervision.
- The Latent Space Temporal Logical Interpolation Module seems to be helpful as the intermediate visual supervision.
- The task-adaptive masked proximity loss seems to contribute to the increase of performance.

Weaknesses

- The Latent Space Temporal Logical Interpolation module does not have anything to do with 'logic'. It only has temporal dependencies between actions. Linear interpolation between start and goal visual observations with weights is not logic.
- The experiment setting needs to be specified clearly in the main manuscript. The results on the COIN and NIV datasets are misleading. This paper follows the setting of PDPP, however the results of baseline models (KEPP, SCHEMA) follow a different experiment

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The proposed MTID model is a novel application of diffusion models for procedural planning, introducing an innovative interpolation mechanism to enrich visual supervision, which is rarely addressed in the field. The paper provides strong experimental evidence demonstrating that MTID outperforms existing models on multiple benchmark datasets. The results on CrossTask, COIN, and NIV datasets suggest significant improvements in most evaluation metrics. Plus the ablation studies are detailed and in

Weaknesses

The model's architecture, especially the use of latent space interpolation combined with diffusion processes, may be difficult for readers to grasp fully. More intuitive visualizations or a simplified explanation could aid in better understanding.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. Based on previous works, this paper additionally uses the interpolated mid-state latent visual feature to supervise the model.
2. The proposed MTID model achieves SOTA performance on several datasets.

Weaknesses

1. The core concept is using the interpolated visual feature to provide visual-level mid-state supervision. However, the motivation for this operation is unclear. Using linear interpolation to produce the mid-state visual feature from the start and goal visual features is challenging. Additionally, an experiment that directly uses the visual features of the mid-states for supervision (as in prior work such as DDN) should be added, which can be seen as an upper bound.
2. In this paper, the task cl

Code & Models

Datasets

WiserZhou/ProcedurePlanning
dataset· 14 dl
