Entity Image and Mixed-Modal Image Retrieval Datasets

Cristian-Ioan Blaga; Paul Suganthan; Sahil Dua; Krishna Srinivasan; Enrique Alfonseca; Peter Dornbach; Tom Duerig; Imed Zitouni; Zhe Dong

arXiv:2506.02291 · cs.CV · June 4, 2025

Open Access

TL;DR

This paper introduces two new datasets, EI and MMIR, to evaluate and improve mixed-modal image retrieval models that combine visual and textual information, emphasizing deep cross-modal understanding.

Contribution

The paper presents novel datasets for mixed-modal image retrieval, enabling rigorous benchmarking and training of models that understand complex visual-textual relationships.

Findings

01 Datasets validated through crowd-sourced annotations

02 Benchmark effectively evaluates deep cross-modal understanding

03 Datasets support training and evaluation of retrieval models

Abstract

Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval.…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications