Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
Yuxiao Chen, Kai Li, Wentao Bao, Deep Patel, Yu Kong, Martin Renqiang Min, Dimitris N. Metaxas

TL;DR
This paper introduces a novel multi-pathway text-video alignment framework leveraging Large Language Models to improve the localization of procedure steps in instructional videos, addressing noise and unreliability in narration data.
Contribution
The work proposes a multi-pathway alignment strategy and LLM-based filtering to enhance step localization accuracy in instructional videos, outperforming existing methods.
Findings
Surpasses state-of-the-art methods on step grounding, step localization, and narration grounding.
Effectively filters task-irrelevant narration using LLMs.
Improves pseudo-matching reliability through multi-pathway alignment.
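
To make the multi-pathway idea concrete, here is a minimal sketch of how alignment scores from several pathways could be fused into pseudo-matches. This summary does not say what the pathways are or how they are combined, so the element-wise averaging, the threshold, and the names `fuse_pathways` and `pseudo_matches` are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# Each matrix holds alignment scores: rows are LLM-steps, columns are video segments.
Matrix = List[List[float]]


def fuse_pathways(pathways: List[Matrix]) -> Matrix:
    """Average the score matrices element-wise (an assumed fusion rule)."""
    n_steps = len(pathways[0])
    n_segments = len(pathways[0][0])
    return [
        [sum(p[i][j] for p in pathways) / len(pathways) for j in range(n_segments)]
        for i in range(n_steps)
    ]


def pseudo_matches(fused: Matrix, threshold: float) -> List[Tuple[int, int]]:
    """Keep each step's best segment only if its fused score reaches the threshold."""
    matches = []
    for step, row in enumerate(fused):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:  # hypothetical cutoff for dropping unreliable pairs
            matches.append((step, best))
    return matches


# Two hypothetical pathways scoring 2 steps against 3 video segments.
pathway_a = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.4]]
pathway_b = [[0.7, 0.3, 0.0], [0.1, 0.2, 0.2]]

matches = pseudo_matches(fuse_pathways([pathway_a, pathway_b]), threshold=0.5)
print(matches)  # [(0, 0)]
```

In this toy input both pathways agree that step 0 belongs to segment 0, so that pair is kept, while step 1 never scores high enough on average and is dropped as unreliable.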
Abstract
Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for alignment noise, i.e., narrations that are irrelevant to the instructional task in the videos and unreliable timestamps in the narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the…
Taxonomy
Topics: Video Analysis and Summarization · Educational Tools and Methods
