Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Yuxiao Chen; Kai Li; Wentao Bao; Deep Patel; Yu Kong; Martin Renqiang Min; Dimitris N. Metaxas

arXiv:2409.16145·cs.CV·September 25, 2024

Open Access

TL;DR

This paper introduces a multi-pathway text-video alignment framework that uses Large Language Models to improve the localization of procedure steps in instructional videos, addressing task-irrelevant narrations and unreliable narration timestamps.

Contribution

The work proposes a multi-pathway alignment strategy and LLM-based filtering to enhance step localization accuracy in instructional videos, outperforming existing methods.

Findings

01 Surpasses state-of-the-art in step grounding, localization, and narration grounding.

02 Effectively filters task-irrelevant narration using LLMs.

03 Improves pseudo-matching reliability through multi-pathway alignment.

Abstract

Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for alignment noise, i.e., narrations that are irrelevant to the instructional task in the video and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the…

Taxonomy

Topics: Video Analysis and Summarization · Educational Tools and Methods
