Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

Toan Le Ngo Thanh; Phat Ha Huu; Tan Nguyen Dang Duy; Thong Nguyen Le Minh; Anh Nguyen Nhu Tinh

arXiv:2512.12935 · cs.CV · December 16, 2025

TL;DR

This paper introduces a unified multimodal video moment retrieval system that combines a cascaded embedding-and-reranking pipeline, temporal-aware scoring, and GPT-4o-guided query decomposition to improve retrieval accuracy and handle ambiguous queries.

Contribution

The proposed system integrates a cascaded dual-embedding pipeline, temporal-aware score fusion, and GPT-4o-based query decomposition, addressing three key challenges in multimodal moment retrieval: brittle fixed-weight fusion, weak temporal modeling, and manual modality selection.
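To make the cascade idea concrete, here is a minimal sketch of a two-stage retrieve-then-rerank flow. It is not the authors' implementation. The function name `cascade_retrieve`, the precomputed score dictionaries, the `rerank_fn` callback, and the fixed `weight_a` are all assumptions for illustration. In the paper the two first-stage scores would come from BEIT-3 and SigLIP and the reranker from BLIP-2, and the fusion is described as dynamic rather than the fixed weight used here.

```python
def cascade_retrieve(scores_a, scores_b, rerank_fn, top_k=3, weight_a=0.5):
    """Two-stage retrieval sketch.

    scores_a, scores_b: dict mapping frame id -> similarity score from two
        (hypothetical, precomputed) embedding models.
    rerank_fn: callable mapping frame id -> score from a slower, more
        precise model; it is only called on the shortlist.
    weight_a: fixed fusion weight (a simplification; an assumption here).
    """
    # Stage 1: fuse the two embedding scores for broad, recall-oriented retrieval.
    ids = set(scores_a) | set(scores_b)
    fused = {
        i: weight_a * scores_a.get(i, 0.0) + (1.0 - weight_a) * scores_b.get(i, 0.0)
        for i in ids
    }
    shortlist = sorted(fused, key=fused.get, reverse=True)[:top_k]

    # Stage 2: rerank only the shortlist with the expensive scorer.
    return sorted(shortlist, key=rerank_fn, reverse=True)
```

The design point is that the expensive scorer never sees the full collection, only the `top_k` candidates that survive the cheaper first stage.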

Findings

1. Effective handling of ambiguous queries with GPT-4o
2. Improved retrieval of temporally coherent event sequences
3. Dynamic fusion strategy adapts to different modalities

Abstract

The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies break down under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, Agent-guided query…
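The temporal-aware scoring idea can be illustrated with a small beam search over per-event candidates. This is a sketch under stated assumptions, not the paper's algorithm. The `Candidate` type, the function `beam_search_sequence`, the penalty form `score * exp(-decay * gap)`, and the default `decay` and `beam_width` values are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    time: float   # timestamp of the frame, in seconds
    score: float  # similarity of the frame to one sub-event of the query


def beam_search_sequence(steps, beam_width=3, decay=0.1):
    """Build time-ordered frame sequences, one frame per sub-event.

    steps: list of candidate lists, one list per sub-event in query order.
    Returns up to beam_width (total_score, path) tuples, best first,
    or an empty list if no time-ordered sequence exists.
    """
    if not steps:
        return []
    beams = sorted(((c.score, [c]) for c in steps[0]),
                   key=lambda b: b[0], reverse=True)[:beam_width]
    for candidates in steps[1:]:
        expanded = []
        for total, path in beams:
            last = path[-1]
            for c in candidates:
                gap = c.time - last.time
                if gap <= 0:
                    continue  # enforce forward temporal order
                # Assumed penalty: the larger the gap, the less the frame adds.
                expanded.append((total + c.score * math.exp(-decay * gap),
                                 path + [c]))
        if not expanded:
            return []
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams
```

With this penalty, a slightly weaker frame that follows closely after the previous event can outrank a stronger frame several minutes later, which is the behavior the abstract describes as favoring coherent sequences over isolated frames.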

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition