StacMR: Scene-Text Aware Cross-Modal Retrieval

Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

arXiv:2012.04329 · cs.CV · December 9, 2020

TL;DR

This paper introduces a new dataset and methods for cross-modal retrieval that incorporate scene text in images, improving matching accuracy between visual and textual data.

Contribution

It presents a novel dataset and a scene-text aware retrieval approach that effectively integrates scene text into cross-modal representations.

Findings

01 Scene text improves retrieval performance

02 Proposed method outperforms existing models

03 Dataset enables new research directions

Abstract

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to mention a few. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space.…
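The idea of reconciling caption text and scene text in a common embedding space can be illustrated with a toy sketch. This is not the authors' architecture: the function names (`embed_images`, `embed_captions`, `rank_images`), the additive fusion of visual and scene-text features, the identity projection matrices, and the hand-written feature vectors are all assumptions made for illustration. It only shows the general pattern of projecting each input into one space and ranking images by cosine similarity to a caption.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def embed_images(visual_feats, scene_text_feats, w_visual, w_scene):
    """Hypothetical fusion: project both image-side inputs and sum them."""
    joint = visual_feats @ w_visual + scene_text_feats @ w_scene
    return l2_normalize(joint)


def embed_captions(caption_feats, w_caption):
    """Project caption features into the same space as the images."""
    return l2_normalize(caption_feats @ w_caption)


def rank_images(caption_emb, image_embs):
    """Return image indices sorted from most to least similar, plus scores."""
    scores = image_embs @ caption_emb
    return np.argsort(-scores), scores


# Toy data (assumed): 3 images, 4-dimensional features, identity projections.
dim = 4
w_visual = np.eye(dim)
w_scene = np.eye(dim)
w_caption = np.eye(dim)

visual = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
# Images 0 and 1 contain scene text (last dimension); image 2 does not.
scene_text = np.array([[0.0, 0.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0, 0.0]])

image_embs = embed_images(visual, scene_text, w_visual, w_scene)

# A caption that mentions both the object of image 1 and its scene text.
caption = embed_captions(np.array([0.0, 1.0, 0.0, 1.0]), w_caption)

order, scores = rank_images(caption, image_embs)
print(order.tolist())  # [1, 0, 2]
```

In this toy setting, image 0 ranks above image 2 only because it shares the scene-text component with the caption, which is the kind of signal the abstract argues purely visual representations overlook.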

Code & Models

Repositories

AndresPMD/StacMR (PyTorch)

Taxonomy

Methods: Attentive Walk-Aggregating Graph Neural Network