MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos

Anh-Tai Pham-Nguyen; Tung-Duong Le-Duc; Anh-Duy Le; Trung-Hieu Truong-Le

arXiv:2605.16120 · cs.IR · May 18, 2026

TL;DR

MERVIN is a multimodal framework for event retrieval in Vietnamese news videos. It combines keyframe, transcript, and video-summary features, with transcripts cleaned by Gemini 1.5 Flash.

Contribution

The paper introduces MERVIN, a novel unified framework that combines multimodal data and improved transcript processing for Vietnamese news video retrieval.

Findings

1. Achieved 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase.
2. Retrieved all results for every query in the final round.
3. Demonstrated the effectiveness of multimodal integration for Vietnamese news videos.

Abstract

The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all…
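
The similarity-based retrieval step in the abstract can be illustrated with a small sketch. It is not the authors' implementation: a plain in-memory dictionary stands in for a Milvus collection, the toy 2-dimensional vectors stand in for Perception Encoder and Vietnamese language-model embeddings, and the weighted-sum fusion in `fuse` is an assumption made here for illustration (the abstract does not say how scores from different modalities are combined). All names, including the segment ids, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query, k):
    """Return the top-k (segment_id, score) pairs from an in-memory index.

    `index` maps a segment id to its embedding; it is a stand-in for a
    vector database collection, not a real Milvus call.
    """
    scored = [(seg_id, cosine(vec, query)) for seg_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def fuse(hits_by_modality, weights):
    """Weighted-sum late fusion of per-modality scores (an assumption)."""
    fused = {}
    for modality, hits in hits_by_modality.items():
        for seg_id, score in hits:
            fused[seg_id] = fused.get(seg_id, 0.0) + weights[modality] * score
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Toy indexes: one embedding per video segment and modality (hypothetical ids).
keyframe_index = {"v1_f3": [1.0, 0.0], "v2_f7": [0.0, 1.0]}
transcript_index = {"v1_f3": [0.6, 0.8], "v2_f7": [1.0, 0.0]}

# A query is embedded once per modality; these vectors are made up.
visual_query = [1.0, 0.0]
text_query = [0.6, 0.8]

hits = {
    "keyframe": search(keyframe_index, visual_query, k=2),
    "transcript": search(transcript_index, text_query, k=2),
}
ranked = fuse(hits, weights={"keyframe": 0.5, "transcript": 0.5})
print(ranked[0][0])  # id of the best-matching segment
```

In a real deployment the `search` function would be replaced by a query against the vector database, and the fusion weights would be something a user could adjust while refining a query across modalities.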
