Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Nicola Messina; Giuseppe Amato; Fabrizio Falchi; Claudio Gennaro; Stéphane Marchand-Maillet

arXiv:2106.00358·cs.CV·June 2, 2021

TL;DR

This paper investigates efficient cross-modal image-sentence retrieval by using sparsified deep features from the TERN architecture, aiming to improve scalability and indexing in large datasets.

Contribution

It introduces preprocessing techniques to produce sparse multi-modal features from TERN, facilitating scalable indexing for image-text retrieval tasks.

Findings

01. Preliminary results indicate that sparsified features are promising.

02. The approach enables more efficient retrieval than sequential scanning.

03. Further research directions are suggested for improved performance.

Abstract

Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. These works evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing number of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features…
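To make the contrast between a sequential scan and an index over sparsified features concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's method: the abstract above does not say which preprocessing techniques are used, so the top-k magnitude sparsification, the inverted index, the random vectors standing in for TERN features, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def normalize(v):
    """Scale a dense vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sequential_scan(query, items, top_n):
    """Baseline: score every item against the query; cost grows linearly with the dataset."""
    scores = [(sum(q * x for q, x in zip(query, item)), idx)
              for idx, item in enumerate(items)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:top_n]]


def sparsify(v, k):
    """ASSUMPTION: keep the k largest-magnitude components as {dimension: value}."""
    top = sorted(range(len(v)), key=lambda i: -abs(v[i]))[:k]
    return {i: v[i] for i in top}


def build_inverted_index(sparse_items):
    """Map each dimension to the (item id, value) pairs that are non-zero in it."""
    index = defaultdict(list)
    for idx, sparse_vec in enumerate(sparse_items):
        for dim, val in sparse_vec.items():
            index[dim].append((idx, val))
    return index


def index_search(sparse_query, index, top_n):
    """Score only items that share at least one non-zero dimension with the query."""
    scores = defaultdict(float)
    for dim, qval in sparse_query.items():
        for idx, val in index.get(dim, ()):
            scores[idx] += qval * val
    ranked = sorted(scores.items(), key=lambda s: (-s[1], s[0]))
    return [idx for idx, _ in ranked[:top_n]]


# Random unit vectors stand in for image/sentence features (hypothetical data).
random.seed(0)
dim, n_items, k = 32, 200, 8
items = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n_items)]
query = items[42]  # reuse an item as the query so the exact top hit is known

exact = sequential_scan(query, items, top_n=5)
index = build_inverted_index([sparsify(v, k) for v in items])
approx = index_search(sparsify(query, k), index, top_n=5)
print("sequential scan:", exact)
print("inverted index: ", approx)
```

The indexed search touches only the posting lists of the query's non-zero dimensions, which is where the potential saving over a full scan comes from. Its ranking is approximate because the discarded components no longer contribute to the score.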
