Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval
Van-Thinh Vo, Minh-Khoi Nguyen, Minh-Huy Tran, Anh-Quan Nguyen-Tran, Duy-Tan Nguyen, Khanh-Loi Nguyen, Anh-Minh Phan

TL;DR
This paper introduces a novel multimodal video retrieval system that uses cross-modal scene descriptions, adaptive thresholding for scene detection, and LLM-based query expansion to improve accuracy and robustness in complex video sequences.
Contribution
It presents a new framework combining cross-modal retrieval, KDE-GMM thresholding, and LLM-driven query refinement for enhanced video search performance.
Findings
Achieved strong results in the Ho Chi Minh AI Challenge 2025.
Improved retrieval precision and efficiency with keyframe-based scene representation.
Enhanced robustness in complex temporal video contexts.
Abstract
Multimedia information retrieval from videos remains a challenging problem. While recent systems have advanced multimodal search through semantic, object, and OCR queries - and can retrieve temporally consecutive scenes - they often rely on a single query modality for an entire sequence, limiting robustness in complex temporal contexts. To overcome this, we propose a cross-modal temporal event retrieval framework that enables different query modalities to describe distinct scenes within a sequence. To adaptively determine decision thresholds for scene transitions and slide changes, we build a Kernel Density Gaussian Mixture Thresholding (KDE-GMM) algorithm, ensuring optimal keyframe selection. The extracted keyframes act as compact, high-quality visual exemplars that retain each segment's semantic essence, improving retrieval precision and efficiency. Additionally, the system incorporates…
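The abstract names KDE-GMM thresholding but does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each pair of consecutive frames has a scalar difference score and that the scores are roughly bimodal (many small within-scene values, a few large transition values). A Gaussian kernel density estimate locates the two dominant modes, these initialise a two-component Gaussian mixture fitted with EM, and the threshold is placed where the two weighted components cross. Every function name, the bandwidth rule, and the synthetic scores are assumptions made for illustration.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log density of a 1-D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
            for g in grid]

def fit_two_gaussians(scores, mu_init, iters=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    mu = list(mu_init)
    spread = max((max(scores) - min(scores)) / 4.0, 1e-3)
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [weight[k] * math.exp(log_normal_pdf(x, mu[k], sigma[k])) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, scores)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
            weight[k] = nk / len(scores)
    return weight, mu, sigma

def adaptive_threshold(scores, grid_size=256):
    """KDE-initialised two-Gaussian threshold (illustrative, hypothetical helper)."""
    lo, hi = min(scores), max(scores)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    # Silverman-style rule of thumb; an assumption, not taken from the paper.
    bandwidth = max(1.06 * std * len(scores) ** -0.2, 1e-3)
    density = gaussian_kde(scores, grid, bandwidth)
    peaks = [i for i in range(1, grid_size - 1)
             if density[i - 1] < density[i] >= density[i + 1]]
    top = sorted(sorted(peaks, key=lambda i: density[i], reverse=True)[:2])
    init = (grid[top[0]], grid[top[1]]) if len(top) == 2 else (lo, hi)
    weight, mu, sigma = fit_two_gaussians(scores, init)
    low, high = (0, 1) if mu[0] <= mu[1] else (1, 0)
    # First grid point between the means where the high component dominates.
    for x in grid:
        if mu[low] < x < mu[high]:
            high_score = math.log(weight[high]) + log_normal_pdf(x, mu[high], sigma[high])
            low_score = math.log(weight[low]) + log_normal_pdf(x, mu[low], sigma[low])
            if high_score >= low_score:
                return x
    return 0.5 * (mu[low] + mu[high])

# Synthetic frame-difference scores: 200 within-scene values, 20 transition values.
random.seed(0)
scores = ([random.gauss(0.10, 0.03) for _ in range(200)]
          + [random.gauss(0.80, 0.05) for _ in range(20)])
threshold = adaptive_threshold(scores)
cuts = [i for i, s in enumerate(scores) if s > threshold]
print(f"threshold={threshold:.3f}, detected transitions={len(cuts)}")
```

In this reading, frames whose score exceeds the threshold mark candidate scene or slide boundaries, and a keyframe could then be picked from each resulting segment. The appeal of a data-driven cut point over a fixed constant is that it adapts to each video's own score distribution.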
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
