MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos
Anh-Tai Pham-Nguyen, Tung-Duong Le-Duc, Anh-Duy Le, Trung-Hieu Truong-Le

TL;DR
MERVIN is a comprehensive multimodal framework designed for effective event retrieval in Vietnamese news videos, integrating visual, textual, and summarization features with enhanced transcript quality.
Contribution
The paper introduces MERVIN, a novel unified framework that combines multimodal data and improved transcript processing for Vietnamese news video retrieval.
Findings
Achieved 79/88 points in the qualification phase of AI Challenge HCMC 2025.
Successfully retrieved all results for every query in the final round.
Demonstrated effectiveness of multimodal integration in Vietnamese news videos.
Abstract
The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all…
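The abstract describes indexing visual and textual embeddings for similarity-based retrieval. As a rough illustration of that idea only, the sketch below runs a brute-force cosine-similarity search over two small in-memory indexes and combines the per-modality scores with a weighted sum. It does not use Milvus, and it is not the authors' implementation: the function names (`search`, `fuse`), the weighted-sum fusion, the item identifiers, and the two-dimensional toy vectors are all hypothetical choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index, query, k=3):
    """Return the k items most similar to `query` as (item_id, score) pairs.

    `index` maps an item id to its embedding. This is an exhaustive scan;
    a vector database would replace it with an approximate index.
    """
    scored = [(item_id, cosine(vec, query)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def fuse(visual_scores, text_scores, w_visual=0.5):
    """Weighted-sum fusion of two {item_id: score} dicts (an assumed scheme).

    Items missing from one modality contribute a score of 0.0 for it.
    """
    ids = set(visual_scores) | set(text_scores)
    fused = {
        i: w_visual * visual_scores.get(i, 0.0)
        + (1.0 - w_visual) * text_scores.get(i, 0.0)
        for i in ids
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical keyframe ids with made-up 2-D embeddings.
keyframe_index = {"v1_f10": [1.0, 0.0], "v1_f20": [0.0, 1.0], "v2_f05": [1.0, 1.0]}
transcript_index = {"v1_f10": [0.0, 1.0], "v1_f20": [1.0, 0.0], "v2_f05": [1.0, 1.0]}

visual_hits = search(keyframe_index, [1.0, 0.0], k=3)
text_hits = search(transcript_index, [1.0, 0.0], k=3)
fused_hits = fuse(dict(visual_hits), dict(text_hits), w_visual=0.5)
```

In this toy setup, an item that matches moderately well in both modalities can outrank one that matches perfectly in only one; the weight `w_visual` controls that trade-off and would need tuning on real data.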
