Benchmark Granularity and Model Robustness for Image-Text Retrieval

Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, and Maarten de Rijke

arXiv:2407.15239 · cs.CV · June 10, 2025

Open Access

TL;DR

This paper investigates how dataset granularity and query perturbations influence the performance and robustness of Vision-Language Models in image-text retrieval, revealing that richer captions improve accuracy and that word order critically affects model behavior.

Contribution

The paper analyzes how dataset granularity and query perturbations affect the robustness and performance of Vision-Language Models on image-text retrieval.

Findings

1. Richer captions improve retrieval accuracy, especially in text-to-image retrieval (an average improvement of 16.23%, compared to 6.44% in image-to-text).

2. Perturbations generally degrade performance but can sometimes unexpectedly improve retrieval.

3. Word order is a critical factor affecting model robustness, contrary to prior assumptions.

Abstract

Image-Text Retrieval (ITR) systems are central to multimodal information access, with Vision-Language Models (VLMs) showing strong performance on standard benchmarks. However, these benchmarks predominantly rely on coarse-grained annotations, limiting their ability to reveal how models perform under real-world conditions, where query granularity varies. Motivated by this gap, we examine how dataset granularity and query perturbations affect retrieval performance and robustness across four architecturally diverse VLMs (ALIGN, AltCLIP, CLIP, and GroupViT). Using both standard benchmarks (MS-COCO, Flickr30k) and their fine-grained variants, we show that richer captions consistently enhance retrieval, especially in text-to-image tasks, where we observe an average improvement of 16.23%, compared to 6.44% in image-to-text. To assess robustness, we introduce a taxonomy of perturbations and…
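To make the evaluation setup described in the abstract more concrete, the following is a minimal sketch of how one might measure the effect of a word-order perturbation on text-to-image retrieval. It is not the paper's taxonomy or evaluation code: the function names `shuffle_word_order` and `recall_at_k`, the use of Recall@K as the metric, and the toy similarity scores are all assumptions made for illustration. In practice the scores would come from a VLM such as CLIP rather than being written by hand.

```python
import random


def shuffle_word_order(caption: str, seed: int = 0) -> str:
    """Hypothetical perturbation: randomly permute the whitespace-separated words."""
    words = caption.split()
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


def recall_at_k(similarity, gold, k):
    """Fraction of queries whose correct candidate is among the k top-scoring ones.

    similarity[i][j] is the score of query i against candidate j;
    gold[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, gold):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(gold)


# Toy, hand-written scores (assumed values, not results from the paper).
# Rows are text queries, columns are candidate images.
gold = [0, 1, 2]
clean_scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]
# Scores after perturbing the queries: the first query now prefers the wrong image.
perturbed_scores = [
    [0.5, 0.6, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]

if __name__ == "__main__":
    print(shuffle_word_order("a dog runs on the beach"))
    print(recall_at_k(clean_scores, gold, k=1))
    print(recall_at_k(perturbed_scores, gold, k=1))
```

Comparing Recall@K on the clean and perturbed score matrices gives a simple robustness signal: a drop indicates sensitivity to the perturbation, while an unchanged or higher value corresponds to the cases where perturbations do not hurt, or even help, retrieval.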

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training · Focus · AltCLIP · ALIGN