Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

Fen Fang; Yun Liu; Ali Koksal; Qianli Xu; Joo-Hwee Lim

arXiv:2309.07409·cs.CV·September 15, 2023

TL;DR

This paper introduces a masked diffusion model with task-aware attention for improved procedure planning in instructional videos, leveraging joint visual-text embeddings to focus on relevant actions and achieve state-of-the-art results.

Contribution

The paper proposes a novel masked diffusion approach that uses task-aware attention and enhanced visual-text embeddings to better handle large decision spaces in instructional video analysis.

Findings

01. Achieved state-of-the-art performance on multiple datasets.

02. Effectively concentrates on relevant action types during diffusion.

03. Improved task classification accuracy with joint visual-text embeddings.

Abstract

A key challenge with procedure planning in instructional videos lies in how to handle a large decision space consisting of a multitude of action types that belong to various tasks. To understand real-world video content, an AI agent must proficiently discern these action types (e.g., pour milk, pour water, open lid, close lid, etc.) based on brief visual observation. Moreover, it must adeptly capture the intricate semantic relation of the action types and task goals, along with the variable action sequences. Recently, notable progress has been made via the integration of diffusion models and visual representation learning to address the challenge. However, existing models employ rudimentary mechanisms to utilize task information to manage the decision space. To overcome this limitation, we introduce a simple yet effective enhancement: a masked diffusion model. The introduced mask acts…
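The abstract is cut off before it says what the mask does, so the following is only an illustrative sketch and not the authors' formulation. It assumes the mask is a binary vector over action types, derived from a task label, and that it is applied to the Gaussian noise of a forward diffusion step so that only task-relevant action dimensions are perturbed. The task names, the `TASK_TO_ACTIONS` mapping, and all function names are hypothetical.

```python
import random

# Hypothetical toy mapping from a task to the indices of its action types.
TASK_TO_ACTIONS = {
    "make_coffee": [0, 1, 2],  # e.g. pour milk, pour water, open lid
    "change_tire": [3, 4],
}
NUM_ACTIONS = 5


def build_mask(task, num_actions=NUM_ACTIONS):
    """Return a 0/1 list with 1.0 at action types assumed relevant to the task."""
    allowed = set(TASK_TO_ACTIONS[task])
    return [1.0 if i in allowed else 0.0 for i in range(num_actions)]


def masked_noise(horizon, mask, rng):
    """Sample Gaussian noise per plan step, zeroed outside the masked-in actions."""
    return [[rng.gauss(0.0, 1.0) * m for m in mask] for _ in range(horizon)]


def forward_noise(x0, mask, alpha_bar, rng):
    """Assumed forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * masked noise."""
    eps = masked_noise(len(x0), mask, rng)
    a, b = alpha_bar ** 0.5, (1.0 - alpha_bar) ** 0.5
    return [[a * x + b * e for x, e in zip(row, erow)]
            for row, erow in zip(x0, eps)]


rng = random.Random(0)
mask = build_mask("make_coffee")
# One-hot plan with two steps: action 0, then action 2.
x0 = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]
xt = forward_noise(x0, mask, alpha_bar=0.5, rng=rng)
```

In this toy setup the action dimensions outside the task (indices 3 and 4) receive no noise and stay at zero, so any denoiser operating on `xt` would only have to choose among the three task-relevant action types. In practice the task label would itself have to be predicted from the visual observations, which the sketch does not model.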

Code & Models

Repositories

ffzzy840304/masked-pdpp (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Focus · Diffusion