Beyond Coarse-Grained Matching in Video-Text Retrieval
Aozhu Chen, Hazel Doughty, Xirong Li, Cees G. M. Snoek

TL;DR
This paper introduces a fine-grained evaluation method for video-text retrieval models that uses automatically generated hard negative captions differing from the original by a single word. The evaluation exposes limitations of current benchmarks, and a proposed baseline improves models' fine-grained understanding.
Contribution
The paper proposes a novel fine-grained evaluation approach with hard negatives and a new baseline to better assess and enhance models' ability to detect subtle caption differences.
Findings
Current benchmarks inadequately evaluate subtle difference detection.
Models struggle with distinguishing fine-grained caption variations.
The proposed baseline improves fine-grained understanding in models.
Abstract
Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model's ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face…
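To make the idea of single-word hard negatives concrete, here is a minimal Python sketch. It is not the authors' generation pipeline, which the abstract does not detail: the `SWAPS` lexicon, the `hard_negatives` and `fine_grained_accuracy` helpers, and the pairwise "original must outscore the negative" criterion are all illustrative assumptions. A real setup would presumably use automatic part-of-speech tagging and a learned video-text similarity model in place of the toy pieces below.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical hand-written substitution table, keyed by part of speech.
# Assumption for illustration only: the paper generates its negatives
# automatically, and this toy lexicon does not reproduce that process.
SWAPS: Dict[str, Dict[str, str]] = {
    "noun": {"door": "window"},
    "verb": {"opens": "closes"},
    "adjective": {"red": "blue"},
    "adverb": {"slowly": "quickly"},
    "preposition": {"under": "over"},
}


def hard_negatives(caption: str) -> List[Tuple[str, str]]:
    """Return (part_of_speech, negative_caption) pairs.

    Each negative differs from the input caption in exactly one word.
    """
    words = caption.split()
    negatives = []
    for pos, table in SWAPS.items():
        for i, word in enumerate(words):
            if word in table:
                changed = words[:i] + [table[word]] + words[i + 1:]
                negatives.append((pos, " ".join(changed)))
    return negatives


def fine_grained_accuracy(
    score_fn: Callable[[str, str], float],
    items: List[Tuple[str, str]],
) -> float:
    """Fraction of (caption, negative) pairs where the original caption
    scores strictly higher than its hard negative for the same video.

    `score_fn(video, text)` stands in for any video-text similarity model.
    """
    wins = 0
    total = 0
    for video, caption in items:
        positive_score = score_fn(video, caption)
        for _pos, negative in hard_negatives(caption):
            total += 1
            wins += positive_score > score_fn(video, negative)
    return wins / total if total else 0.0


if __name__ == "__main__":
    caption = "a man slowly opens the red door"
    for pos, negative in hard_negatives(caption):
        print(f"{pos:12s} {negative}")
```

A model that scores the original caption and a negative such as "a man slowly closes the red door" identically would get no credit under this pairwise criterion, which is the kind of insensitivity the fine-grained evaluation is meant to expose.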
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
