VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

Yan Gong, Georgina Cosma, and Axel Finke

arXiv:2302.06350·cs.CV·July 31, 2023

Open Access

TL;DR

VITR enhances Vision Transformers by incorporating relation-focused reasoning on image regions, significantly improving cross-modal retrieval accuracy across multiple datasets.

Contribution

The paper introduces VITR, a novel network that extends ViT with relation reasoning and a fusion module for improved cross-modal retrieval.

Findings

01. VITR outperforms state-of-the-art models on RefCOCOg, CLEVR, and Flickr30K datasets.

02. VITR effectively models image region relations for better retrieval accuracy.

03. The approach improves both image-to-text and text-to-image retrieval tasks.

Abstract

The relations expressed in user queries are vital for cross-modal information retrieval. Relation-focused cross-modal retrieval aims to retrieve information that corresponds to these relations, enabling effective retrieval across different modalities. Pre-trained networks, such as Contrastive Language-Image Pre-training (CLIP), have gained significant attention and acclaim for their exceptional performance in various cross-modal learning tasks. However, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a local encoder. VITR is…
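To make the "global level" matching mentioned in the abstract concrete, here is a minimal sketch of retrieval by a single similarity score between one image embedding and each caption embedding. It illustrates only the baseline behaviour the abstract describes as a limitation, not VITR itself. The function names and the three-dimensional toy vectors are hypothetical, and real CLIP embeddings are much higher-dimensional.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_global_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar.

    Each item is represented by ONE global vector, so region-level
    relations (e.g. "the cup left of the laptop") never enter the score.
    """
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings, chosen only for illustration.
image_embedding = [1.0, 0.0, 1.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.0, 0.9],  # closely matching caption
    [1.0, 1.0, 1.0],  # partially matching caption
]

ranking = rank_by_global_similarity(image_embedding, caption_embeddings)
print(ranking)
```

Because the score depends only on the two global vectors, two captions that mention the same objects in different spatial arrangements can receive nearly identical scores. That is the gap a relation-focused model aims to close.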

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Label Smoothing · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Softmax · Linear Layer