Expertized Caption Auto-Enhancement for Video-Text Retrieval

Baoyao Yang; Junxiang Chen; Wanyun Li; Wenbin Yao; Yang Zhou

arXiv:2502.02885 · cs.CV · April 9, 2025

Open Access · 1 Repo

TL;DR

This paper introduces an automatic caption enhancement approach for video-text retrieval that improves cross-modal alignment by generating and selecting high-quality, personalized captions, leading to state-of-the-art retrieval performance.

Contribution

It proposes a novel self-learning caption enhancement and expertized caption selection mechanism to improve video-text matching without heavy data requirements.

Findings

01 Achieved Top-1 recall of 68.5% on MSR-VTT

02 Improved retrieval accuracy on multiple benchmarks

03 Enhanced caption quality through self-learning and expertized selection
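
For readers unfamiliar with the metric in the first finding, Top-1 recall (Recall@1) is the fraction of text queries whose correct video is ranked first. The snippet below is a minimal sketch of that standard metric, not the authors' evaluation code. The function name `recall_at_k`, the similarity-matrix layout, and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of text queries whose ground-truth video ranks in the top k.

    Assumed layout (hypothetical): similarity[i][j] scores text query i
    against video j, and video i is the single correct match for query i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: how many videos score strictly higher than the
        # true match. Ties are resolved in favor of the true match.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy text-to-video scores for three queries (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
print(f"R@1 = {recall_at_k(scores, k=1):.3f}")
print(f"R@2 = {recall_at_k(scores, k=2):.3f}")
```

In this toy case two of the three queries retrieve the right video at rank one, so Recall@1 is 2/3, and all three succeed within the top two.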

Abstract

Video-text retrieval has long been hampered by the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, because the text representation space is hardly expanded when semantic enrichment is insufficient. Instead, this paper turns to enhancing the visual presentation, bringing video expression closer to textual representation via caption generation and thereby facilitating video-text matching. While multimodal large language models (mLLMs) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the…

Code & Models

Repositories

caryxiang/eca4vtr
PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media