Beyond Coarse-Grained Matching in Video-Text Retrieval

Aozhu Chen; Hazel Doughty; Xirong Li; Cees G. M. Snoek

arXiv:2410.12407·cs.CV·October 18, 2024

Open Access

TL;DR

This paper introduces a fine-grained evaluation method for video-text retrieval models. It automatically generates hard negative captions that differ from the original by a single word, which shows that current benchmarks do not measure sensitivity to such differences. A proposed baseline improves models' fine-grained understanding.

Contribution

The paper proposes a fine-grained evaluation approach based on hard negatives, which assesses how well models detect subtle caption differences, and a new baseline that improves this ability.

Findings

01 Current benchmarks do not adequately evaluate whether models detect subtle caption differences.

02 Models struggle to distinguish fine-grained caption variations.

03 The proposed baseline improves models' fine-grained understanding.

Abstract

Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model's ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face…
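To make the idea of single-word hard negatives concrete, here is a minimal Python sketch. It is not the authors' generation pipeline, which this page does not describe. The substitution lexicon `SWAPS`, the functions `hard_negatives` and `pair_accuracy`, and the word-overlap scorer `toy_score` are all hypothetical names introduced for illustration. A real evaluation would replace `toy_score` with a video-text retrieval model's similarity score.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical substitution lexicon: part of speech -> {word: replacement}.
# The entries are illustrative only and are not taken from the paper.
SWAPS: Dict[str, Dict[str, str]] = {
    "noun": {"dog": "cat"},
    "verb": {"jumps": "sits"},
    "adjective": {"small": "large"},
    "adverb": {"slowly": "quickly"},
    "preposition": {"on": "under"},
}


def hard_negatives(caption: str) -> List[Tuple[str, str]]:
    """Return (part_of_speech, negative_caption) pairs.

    Each negative differs from the input caption in exactly one word.
    """
    words = caption.split()
    negatives = []
    for i, word in enumerate(words):
        for pos, table in SWAPS.items():
            if word in table:
                changed = words[:i] + [table[word]] + words[i + 1:]
                negatives.append((pos, " ".join(changed)))
    return negatives


def toy_score(video_tags: Set[str], caption: str) -> int:
    """Stand-in for a model: count caption words that appear in the video tags."""
    return sum(1 for w in caption.split() if w in video_tags)


def pair_accuracy(
    score_fn: Callable[[Set[str], str], float],
    video_tags: Set[str],
    caption: str,
) -> float:
    """Fraction of hard negatives that score strictly below the original caption.

    This is one possible metric, assumed here for illustration.
    """
    negatives = hard_negatives(caption)
    if not negatives:
        return 0.0
    original = score_fn(video_tags, caption)
    wins = sum(1 for _, neg in negatives if score_fn(video_tags, neg) < original)
    return wins / len(negatives)


caption = "a small dog slowly jumps on the couch"
video_tags = set(caption.split())  # pretend the video matches the caption exactly
negatives = hard_negatives(caption)
accuracy = pair_accuracy(toy_score, video_tags, caption)
```

Because the toy scorer sees every word, it ranks the original above each negative here. The paper's finding is that actual retrieval models often fail to make this distinction.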

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization