VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
Yan Gong, Georgina Cosma, and Axel Finke

TL;DR
VITR enhances Vision Transformers by incorporating relation-focused reasoning on image regions, significantly improving cross-modal retrieval accuracy across multiple datasets.
Contribution
The paper introduces VITR, a novel network that extends ViT with relation reasoning and a fusion module for improved cross-modal retrieval.
Findings
VITR outperforms state-of-the-art models on RefCOCOg, CLEVR, and Flickr30K datasets.
VITR effectively models image region relations for better retrieval accuracy.
The approach improves both image-to-text and text-to-image retrieval tasks.
Abstract
The relations expressed in user queries are vital for cross-modal information retrieval. Relation-focused cross-modal retrieval aims to retrieve information that corresponds to these relations, enabling effective retrieval across different modalities. Pre-trained networks, such as Contrastive Language-Image Pre-training (CLIP), have gained significant attention and acclaim for their exceptional performance in various cross-modal learning tasks. However, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a local encoder. VITR is…
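As a rough illustration of the global-level matching the abstract attributes to ViT in CLIP-style networks, the sketch below ranks candidates by cosine similarity between one embedding per image and one per description. It is not VITR and not the authors' code: the function names and toy embeddings are hypothetical, and real embeddings would come from pre-trained encoders. The point is that a single vector per image gives the score no view of individual regions or the relations between them.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    # sim[i][j] is the cosine similarity between image i and text j.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def rank_texts_for_image(sim, image_idx):
    # Image-to-text retrieval: text indices sorted by descending similarity.
    row = sim[image_idx]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)

def rank_images_for_text(sim, text_idx):
    # Text-to-image retrieval: image indices sorted by descending similarity.
    col = [row[text_idx] for row in sim]
    return sorted(range(len(col)), key=lambda i: col[i], reverse=True)

# Toy global embeddings (hypothetical values, one vector per item).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

sim = similarity_matrix(image_embs, text_embs)
best_text_for_image_0 = rank_texts_for_image(sim, 0)[0]
best_image_for_text_1 = rank_images_for_text(sim, 1)[0]
```

Both retrieval directions read from the same similarity matrix: rows give image-to-text rankings and columns give text-to-image rankings.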
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Label Smoothing · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Softmax · Linear Layer
