Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Mariya Hendriksen, Svitlana Vakulenko, Ernst Kuiper, Maarten de Rijke

TL;DR
This study evaluates the reproducibility and generalizability of two state-of-the-art cross-modal retrieval models, CLIP and X-VLM, across object-centric and scene-centric datasets. It finds that the results are not fully reproducible and that performance differs between the two dataset types.
Contribution
It systematically assesses the reproducibility of leading cross-modal retrieval (CMR) models on object-centric and scene-centric datasets, highlighting reproducibility issues and performance differences between the two dataset types.
Findings
Experiments are not fully reproducible or replicable.
Performance partially generalizes across dataset types.
Object-centric datasets yield lower scores than scene-centric datasets.
Abstract
Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each document depicts or describes a single object, or on scene-centric datasets, meaning that each image depicts or describes a complex scene that involves multiple objects and relations between them. We posit that a robust CMR model should generalize well across both dataset types. Despite recent advances in CMR, the reproducibility of the results and their generalizability across different dataset types have not been studied before. We address this gap and focus on the reproducibility of the state-of-the-art CMR results when evaluated on object-centric and scene-centric datasets. We select two state-of-the-art CMR models with different architectures: (i) CLIP; and (ii) X-VLM. Additionally, we select two scene-centric datasets and three object-centric datasets, and determine the…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
