TL;DR
CONQUER is a model for video corpus moment retrieval that uses query context in both multi-modal fusion and joint video-query representation learning, improving moment localization in large video corpora.
Contribution
The paper presents a new model, CONQUER, that enhances video moment retrieval by integrating query context into multi-modal fusion and joint video-query representation learning.
Findings
Improved moment retrieval performance on the TVR and DiDeMo datasets.
Effective query-aware multi-modal fusion enhances localization.
Joint representation captures query-specific multi-modal signals.
Abstract
This paper tackles a recently proposed Video Corpus Moment Retrieval task. This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus. We propose a novel CONtextual QUery-awarE Ranking (CONQUER) model for effective moment localization and ranking. CONQUER explores query context for multi-modal fusion and representation learning in two different steps. The first step derives fusion weights for the adaptive combination of multi-modal video content. The second step performs bi-directional attention to tightly couple video and query as a single joint representation for moment localization. As query context is fully engaged in video representation learning, from feature fusion to transformation, the resulting feature is user-centered and has a larger capacity in capturing multi-modal signals specific to…
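The two steps described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes a softmax over query-modality scores for the fusion weights and a BiDAF-style similarity matrix for the bi-directional attention. All function names, the `probes` vectors, and the toy features are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fusion_weights(query_vec, modality_probes):
    """Step 1 (assumed form): one weight per modality, derived from the query."""
    return softmax([dot(query_vec, p) for p in modality_probes])

def fuse(feats_by_modality, weights):
    """Weighted sum of per-modality clip features, clip by clip."""
    n_clips = len(feats_by_modality[0])
    dim = len(feats_by_modality[0][0])
    return [
        [sum(w * feats[t][d] for w, feats in zip(weights, feats_by_modality))
         for d in range(dim)]
        for t in range(n_clips)
    ]

def bidirectional_attention(clips, words):
    """Step 2 (assumed BiDAF-style): couple clips and query words."""
    sim = [[dot(c, w) for w in words] for c in clips]
    # Video-to-query: each clip attends over the query words.
    v2q = []
    for row in sim:
        a = softmax(row)
        v2q.append([sum(a[j] * words[j][d] for j in range(len(words)))
                    for d in range(len(words[0]))])
    # Query-to-video: clips weighted by their best-matching word.
    b = softmax([max(row) for row in sim])
    q2v = [sum(b[t] * clips[t][d] for t in range(len(clips)))
           for d in range(len(clips[0]))]
    # Joint feature per clip: [clip; v2q; clip*v2q; clip*q2v].
    return [
        c + u + [x * y for x, y in zip(c, u)] + [x * y for x, y in zip(c, q2v)]
        for c, u in zip(clips, v2q)
    ]

# Toy inputs: 2 clips, 2 modalities (visual, subtitle), 2 query words.
query_vec = [1.0, 0.0]
probes = [[2.0, 0.0], [0.0, 2.0]]  # hypothetical learned vectors
visual = [[1.0, 0.0], [0.0, 1.0]]
subtitle = [[0.0, 1.0], [1.0, 0.0]]
words = [[1.0, 0.0], [0.5, 0.5]]

weights = fusion_weights(query_vec, probes)
fused = fuse([visual, subtitle], weights)
joint = bidirectional_attention(fused, words)
```

In this toy setup the query aligns with the visual probe, so the visual stream receives the larger fusion weight. Each clip's joint feature concatenates four 2-dimensional parts, which a moment localization head could then score.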
