Context-Enhanced Video Moment Retrieval with Large Language Models

Weijia Liu; Bo Miao; Jiuxin Cao; Xuelin Zhu; Bo Liu; Mehwish Nasim; Ajmal Mian

arXiv:2405.12540 · cs.CV · May 22, 2024

TL;DR

This paper introduces a Large Language Model-guided approach for Video Moment Retrieval that leverages LLMs to enhance context understanding and improve localization accuracy, especially for complex queries.

Contribution

The paper presents a novel LLM-guided method that enhances video context representation and cross-modal alignment for more accurate video moment retrieval.

Findings

01

Achieves state-of-the-art results on QVHighlights and Charades-STA benchmarks.

02

Outperforms previous methods by up to 3.28% on QVHighlights and 4.06% on Charades-STA.

03

Significantly improves localization of complex queries.
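The findings above do not state which metric the gains are measured in. Moment retrieval benchmarks commonly score a predicted span against the annotated span with temporal intersection-over-union (IoU) and report recall of the top-1 prediction at a fixed IoU threshold. The sketch below illustrates that convention only; the function names, the 0.5 threshold, and the sample spans are illustrative assumptions, not values taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) spans, in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Hypothetical top-1 predictions and annotations for three queries.
preds = [(2.0, 8.0), (10.0, 20.0), (0.0, 5.0)]
gts = [(3.0, 9.0), (18.0, 30.0), (0.0, 5.0)]
score = recall_at_1(preds, gts, threshold=0.5)
```

Under this convention, a model that localizes complex queries better pushes more top-1 spans over the threshold, which is one way an improvement of a few percentage points can be read.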

Abstract

Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives. To tackle this issue, we propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation as well as cross-modal alignment, facilitating accurate localization of target moments. Specifically, LMR introduces a context enhancement technique with LLMs to generate crucial target-related context semantics. These semantics are integrated with visual features for producing discriminative video representations. Finally, a language-conditioned transformer is designed to decode free-form language queries, on the fly, using aligned video representations for moment retrieval. Extensive experiments demonstrate that…
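The abstract says the LLM-generated context semantics are integrated with visual features, but it does not say how. As a rough illustration of one plausible option, the sketch below lets each clip feature attend over a set of context embeddings and adds the attention-weighted mix back as a residual. The cross-attention choice, the helper names, and the toy 2-d features are all assumptions made for this example; this is not the authors' LMR implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def enhance_with_context(clip_feats, context_feats):
    """Add to each clip feature an attention-weighted mix of context embeddings.

    Hypothetical fusion step: scaled dot-product attention from clips
    (queries) to context embeddings (keys and values), plus a residual.
    """
    scale = math.sqrt(len(context_feats[0]))
    enhanced = []
    for clip in clip_feats:
        weights = softmax([dot(clip, ctx) / scale for ctx in context_feats])
        mixed = [
            sum(w * ctx[d] for w, ctx in zip(weights, context_feats))
            for d in range(len(clip))
        ]
        enhanced.append([c + m for c, m in zip(clip, mixed)])  # residual add
    return enhanced


# Toy features: two video clips and two context embeddings (made up).
clips = [[1.0, 0.0], [0.0, 1.0]]
context = [[4.0, 0.0], [0.0, 4.0]]
enhanced = enhance_with_context(clips, context)
```

In this toy setup each clip draws mostly on the context embedding it is most similar to, which is the intuition behind making video representations more discriminative for a target moment.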

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: ALIGN