Expertized Caption Auto-Enhancement for Video-Text Retrieval
Baoyao Yang, Junxiang Chen, Wanyun Li, Wenbin Yao, Yang Zhou

TL;DR
This paper introduces an automatic caption enhancement approach for video-text retrieval that improves cross-modal alignment by generating and selecting high-quality, personalized captions, leading to state-of-the-art retrieval performance.
Contribution
It proposes a novel self-learning caption enhancement and expertized caption selection mechanism to improve video-text matching without heavy data requirements.
Findings
Achieved Top-1 recall of 68.5% on MSR-VTT
Improved retrieval accuracy on multiple benchmarks
Enhanced caption quality through self-learning and expertized selection
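For readers unfamiliar with the metric, Top-1 recall (often written R@1) is the fraction of text queries whose correct video is ranked first. The sketch below is a minimal illustration and not the authors' evaluation code. It assumes a square similarity matrix in which the matching video for query i sits at index i; the function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int = 1) -> float:
    """Fraction of text queries whose ground-truth video ranks in the top k.

    Assumption of this sketch: similarity[i][j] scores text query i against
    video j, and the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        correct = row[i]
        # Zero-based rank: how many videos score strictly higher than the correct one.
        rank = sum(1 for score in row if score > correct)
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores (made up for illustration): 3 text queries against 3 videos.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

On these toy scores, two of the three queries retrieve their video at rank one, so `r1` is 2/3, while `r2` is 1.0 because every correct video falls within the top two.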
Abstract
Video-text retrieval has long been hampered by the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, because insufficient semantic enrichment barely expands the text representation space. Instead, this paper turns to enhancing visual presentation, bringing video expression closer to textual representation via caption generation and thereby facilitating video-text matching. While multimodal large language models (mLLMs) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
