Context-Enhanced Video Moment Retrieval with Large Language Models
Weijia Liu, Bo Miao, Jiuxin Cao, Xuelin Zhu, Bo Liu, Mehwish Nasim, Ajmal Mian

TL;DR
This paper introduces a Large Language Model-guided approach for Video Moment Retrieval that leverages LLMs to enhance context understanding and improve localization accuracy, especially for complex queries.
Contribution
The paper presents a novel LLM-guided method that enhances video context representation and cross-modal alignment for more accurate video moment retrieval.
Findings
Achieves state-of-the-art results on QVHighlights and Charades-STA benchmarks.
Outperforms previous methods by up to 3.28% on QVHighlights and 4.06% on Charades-STA.
Significantly improves localization of complex queries.
Abstract
Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives. To tackle this issue, we propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation as well as cross-modal alignment, facilitating accurate localization of target moments. Specifically, LMR introduces a context enhancement technique with LLMs to generate crucial target-related context semantics. These semantics are integrated with visual features for producing discriminative video representations. Finally, a language-conditioned transformer is designed to decode free-form language queries, on the fly, using aligned video representations for moment retrieval. Extensive experiments demonstrate that…
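The pipeline described in the abstract (context semantics fused with visual features, then matched against a language query) can be illustrated with a toy sketch. This is not the authors' LMR implementation. The weighted-sum fusion, the cosine-similarity scoring, the threshold-based span expansion, and every name and number below (`fuse`, `retrieve_moment`, `alpha`, `threshold`, the 3-dimensional vectors) are assumptions chosen for illustration. In particular, the language-conditioned transformer is replaced by a plain similarity score, and the context vectors stand in for embeddings that an LLM-based step might produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(visual, context, alpha=0.5):
    """Blend per-clip visual features with per-clip context features.

    A weighted sum is an assumed stand-in for the paper's integration step.
    """
    return [
        [(1 - alpha) * v + alpha * c for v, c in zip(v_clip, c_clip)]
        for v_clip, c_clip in zip(visual, context)
    ]


def retrieve_moment(clip_feats, query, threshold=0.5):
    """Return (start, end, scores): the contiguous clip span around the
    best-scoring clip whose neighbours all score at least `threshold`."""
    scores = [cosine(feat, query) for feat in clip_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    start = end = best
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end, scores


# Toy data: 5 clips with 3-dimensional features (hypothetical values).
visual = [[1, 0, 0], [1, 0, 0], [0.5, 0.5, 0], [0.5, 0.5, 0], [1, 0, 0]]
context = [[0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
query = [0, 1, 0]

fused = fuse(visual, context)
start, end, scores = retrieve_moment(fused, query)
print(f"predicted moment: clips {start}..{end}")
```

In this toy setup, clips 2 and 3 are the ones whose context vectors agree with the query, so adding context raises their similarity to the query relative to the visual features alone. That mirrors, in a very reduced form, the abstract's claim that context semantics make video representations more discriminative.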
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: ALIGN
