Composing Object Relations and Attributes for Image-Text Matching

Khoi Pham; Chuong Huynh; Ser-Nam Lim; Abhinav Shrivastava

arXiv:2406.11820 · cs.CV · June 18, 2024 · 1 citation

TL;DR

This paper introduces CORA, a dual-encoder model for image-text matching that represents each caption as a scene graph and encodes it with a graph attention network to capture object relations and attributes. It is cheaper to run than cross-attention methods and still outperforms them.

Contribution

The paper presents a dual-encoder approach to image-text matching that uses scene graphs and graph attention networks, improving both efficiency and accuracy over existing cross-attention methods.

Findings

01  CORA outperforms state-of-the-art cross-attention models on Flickr30K and MSCOCO.

02  The model computes matches faster than cross-attention models while maintaining high recall scores.

03  Incorporating object relations and attributes improves matching performance.
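
The efficiency claim follows from how a dual encoder is used at retrieval time: images and captions are embedded independently, so matching reduces to a similarity lookup over precomputed vectors. The sketch below illustrates that idea and the Recall@K metric on toy data. It is not taken from the CORA repository; the function names, the 2-D embeddings, and the ground-truth pairing are all hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_scores(image_embs, text_embs):
    """Return a matrix with one row per caption and one column per image."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(t, i)) for i in imgs] for t in txts]

def recall_at_k(scores, gt_index, k):
    """Fraction of captions whose ground-truth image is ranked in the top k."""
    hits = 0
    for q, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if gt_index[q] in ranked[:k]:
            hits += 1
    return hits / len(scores)

# Toy embeddings (hypothetical). In a real system these would come from the
# image encoder and the caption encoder, each run once per item.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
gt = [0, 1, 2]  # caption q is paired with image gt[q]

scores = cosine_scores(image_embs, text_embs)
r1 = recall_at_k(scores, gt, 1)
r2 = recall_at_k(scores, gt, 2)
print(f"R@1={r1:.3f}  R@2={r2:.3f}")
```

A cross-attention model would instead need a joint forward pass for every image-caption pair, which is why the dual-encoder design scales better to large galleries.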

Abstract

We study the visual semantic embedding problem for image-text matching. Most existing work uses a tailored cross-attention mechanism to perform local alignment across the image and text modalities. This is computationally expensive, even though it is more powerful than the unimodal dual-encoder approach. This work introduces a dual-encoder image-text matching model that uses a scene graph to represent each caption, with nodes for objects and attributes interconnected by relational edges. Using a graph attention network, our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system. Representing a caption as a scene graph lets the model exploit the strong relational inductive bias of graph neural networks to learn object-attribute and object-object relations effectively. To train the model, we propose…
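
To make the text-side idea concrete, the following is a minimal sketch of a caption scene graph passed through one single-head graph attention layer in the standard GAT form (attention logits from a LeakyReLU over concatenated projected features, softmax over each node's neighborhood). It is an illustration under stated assumptions, not the authors' implementation: the caption, node names, and `gat_layer` helper are hypothetical, random vectors stand in for learned embeddings and weights, relation labels on edges are ignored, and mean pooling into a caption embedding is an assumed choice.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer (hypothetical helper).

    Each node attends over itself and its graph neighbors only.
    """
    projected = {n: matvec(W, h) for n, h in features.items()}
    out, attn = {}, {}
    for i in features:
        nbrs = [i] + sorted(neighbors.get(i, ()))
        logits = [
            leaky_relu(sum(ak * v for ak, v in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        alphas = softmax(logits)
        attn[i] = dict(zip(nbrs, alphas))
        dim = len(projected[i])
        out[i] = [sum(al * projected[j][d] for al, j in zip(alphas, nbrs))
                  for d in range(dim)]
    return out, attn

# Hypothetical scene graph for the caption "a black dog chases a red ball".
nodes = ["dog", "ball", "black", "red"]
edges = [("black", "dog", "attribute"),   # object-attribute relation
         ("red", "ball", "attribute"),    # object-attribute relation
         ("dog", "ball", "chases")]       # object-object relation
neighbors = {n: set() for n in nodes}
for src, dst, _label in edges:            # edge labels are ignored in this sketch
    neighbors[src].add(dst)
    neighbors[dst].add(src)

rng = random.Random(0)
in_dim, out_dim = 4, 3
features = {n: [rng.uniform(-1, 1) for _ in range(in_dim)] for n in nodes}
W = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
a = [rng.uniform(-1, 1) for _ in range(2 * out_dim)]

out, attn = gat_layer(features, neighbors, W, a)

# Assumed pooling: average all node outputs into one caption embedding.
caption_emb = [sum(out[n][d] for n in nodes) / len(nodes) for d in range(out_dim)]
print(attn["dog"])
```

Because attention is restricted to graph neighbors, "dog" aggregates information from "black" and "ball" but never directly from "red", which is the kind of relational inductive bias the abstract refers to.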

Code & Models

Repositories

vkhoi/cora_cvpr24 (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN