Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features
Nicola Messina, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet

TL;DR
This paper investigates efficient cross-modal image-sentence retrieval by using sparsified deep features from the TERN architecture, aiming to improve scalability and indexing in large datasets.
Contribution
It introduces preprocessing techniques to produce sparse multi-modal features from TERN, facilitating scalable indexing for image-text retrieval tasks.
Findings
Preliminary results indicate that the sparsified features are promising.
The approach enables more efficient retrieval than sequential scanning.
Further research directions are suggested for improved performance.
Abstract
Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. These works evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with a growing number of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features…
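To make the contrast between sequential scanning and indexing sparse features concrete, here is a minimal Python sketch. It is not the authors' pipeline: the truncated abstract does not say which sparsification technique is used, so keeping the k largest-magnitude components is an assumption made purely for illustration, and the names `sparsify_top_k`, `build_inverted_index`, `search`, and `sequential_scan` are hypothetical. The feature vectors are toy values, not TERN outputs.

```python
from collections import defaultdict


def sparsify_top_k(vec, k):
    """Assumed sparsifier: keep the k largest-magnitude components as {dim: value}."""
    top = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in top if vec[i] != 0.0}


def build_inverted_index(sparse_items):
    """Map each dimension to the (item_id, value) pairs that are non-zero in it."""
    index = defaultdict(list)
    for item_id, sparse_vec in enumerate(sparse_items):
        for dim, value in sparse_vec.items():
            index[dim].append((item_id, value))
    return index


def search(index, sparse_query, top_n):
    """Dot-product scoring that only visits posting lists of the query's dimensions."""
    scores = defaultdict(float)
    for dim, q_value in sparse_query.items():
        for item_id, value in index.get(dim, ()):
            scores[item_id] += q_value * value
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


def sequential_scan(sparse_items, sparse_query, top_n):
    """Baseline: score every item in the collection, whether or not it overlaps."""
    scored = []
    for item_id, sparse_vec in enumerate(sparse_items):
        score = sum(q * sparse_vec.get(dim, 0.0) for dim, q in sparse_query.items())
        scored.append((item_id, score))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy "image" features and one toy "sentence" query in a shared 4-d space.
items = [
    [0.9, 0.1, 0.0, -0.2],
    [0.0, 0.8, 0.7, 0.1],
    [0.5, 0.0, 0.0, 0.6],
]
query = [1.0, 0.0, 0.0, 0.2]

sparse_items = [sparsify_top_k(v, k=2) for v in items]
sparse_query = sparsify_top_k(query, k=2)
index = build_inverted_index(sparse_items)

print(search(index, sparse_query, top_n=2))
print(sequential_scan(sparse_items, sparse_query, top_n=2))
```

Both functions rank the overlapping items identically; the difference is that `search` never touches items sharing no non-zero dimension with the query, which is why sparser features shorten the posting lists that must be read.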
