Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval

Van-Thinh Vo; Minh-Khoi Nguyen; Minh-Huy Tran; Anh-Quan Nguyen-Tran; Duy-Tan Nguyen; Khanh-Loi Nguyen; Anh-Minh Phan

arXiv:2512.06334 · cs.IR · December 9, 2025

Open Access

TL;DR

This paper introduces a novel multimodal video retrieval system that uses cross-modal scene descriptions, adaptive thresholding for scene detection, and LLM-based query expansion to improve accuracy and robustness in complex video sequences.

Contribution

It presents a new framework combining cross-modal retrieval, KDE-GMM thresholding, and LLM-driven query refinement for enhanced video search performance.

Findings

1. Achieved strong results in the Ho Chi Minh AI Challenge 2025.

2. Improved retrieval precision and efficiency with keyframe-based scene representation.

3. Enhanced robustness in complex temporal video contexts.

Abstract

Multimedia information retrieval from videos remains a challenging problem. Recent systems have advanced multimodal search through semantic, object, and OCR queries, and can retrieve temporally consecutive scenes, but they often rely on a single query modality for an entire sequence, which limits robustness in complex temporal contexts. To overcome this, we propose a cross-modal temporal event retrieval framework that enables different query modalities to describe distinct scenes within a sequence. To adaptively determine decision thresholds for scene transitions and slide changes, we build the Kernel Density Gaussian Mixture Thresholding (KDE-GMM) algorithm, ensuring optimal keyframe selection. The extracted keyframes act as compact, high-quality visual exemplars that retain each segment's semantic essence, improving retrieval precision and efficiency. Additionally, the system incorporates…
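To make the idea of cross-modal temporal event retrieval concrete, the following is a minimal sketch, not the authors' implementation. It assumes each keyframe already has one relevance score per query modality (the modality names `text`, `ocr`, and `object`, the function `best_event_sequence`, and the `max_gap` window are all hypothetical), and it uses a simple dynamic program to pick one keyframe per scene, in temporal order, so that each scene is scored by its own modality.

```python
from typing import Dict, List, Tuple

# Hypothetical input: modality name -> one relevance score per keyframe,
# with keyframes listed in temporal order.
Scores = Dict[str, List[float]]


def best_event_sequence(
    scores: Scores, query: List[str], max_gap: int
) -> Tuple[float, List[int]]:
    """Pick one keyframe per scene so that the summed score is maximal.

    query[k] names the modality that describes scene k. Chosen keyframes
    must be strictly increasing in time, and consecutive picks may be at
    most max_gap keyframes apart (an assumed constraint for this sketch).
    Returns (total score, keyframe indices), or (-inf, []) if no valid
    sequence exists.
    """
    n = len(next(iter(scores.values())))
    neg = float("-inf")
    # best[k][i]: best total for scenes 0..k with scene k placed at keyframe i.
    best = [[neg] * n for _ in query]
    back = [[-1] * n for _ in query]

    for i in range(n):
        best[0][i] = scores[query[0]][i]

    for k in range(1, len(query)):
        for i in range(n):
            for j in range(max(0, i - max_gap), i):
                if best[k - 1][j] == neg:
                    continue
                cand = best[k - 1][j] + scores[query[k]][i]
                if cand > best[k][i]:
                    best[k][i] = cand
                    back[k][i] = j

    last = len(query) - 1
    end = max(range(n), key=lambda i: best[last][i])
    if best[last][end] == neg:
        return neg, []

    # Walk the back-pointers to recover the chosen keyframes.
    path = [end]
    for k in range(last, 0, -1):
        path.append(back[k][path[-1]])
    path.reverse()
    return best[last][end], path


# Toy scores for five keyframes (made-up numbers for illustration only).
example_scores: Scores = {
    "text": [0.1, 0.9, 0.2, 0.1, 0.3],
    "ocr": [0.0, 0.1, 0.2, 0.8, 0.1],
    "object": [0.5, 0.0, 0.1, 0.2, 0.7],
}
total, frames = best_event_sequence(example_scores, ["text", "ocr", "object"], max_gap=3)
```

With these toy scores, the query "a scene matched by text, then one matched by OCR, then one matched by object detection" selects keyframes 1, 3, and 4. A single-modality system would have to score all three scenes with the same list, which is the limitation the abstract describes.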

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization