RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

Jun Chen; Aniket Agarwal; Sherif Abdelkarim; Deyao Zhu; Mohamed Elhoseiny

arXiv:2104.11934 · cs.CV · March 30, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

RelTransformer introduces a message-passing attention mechanism with a learnable memory to effectively recognize long-tail visual relationships in images, significantly improving performance on large-scale VRR benchmarks.

Contribution

It proposes a novel attention-based scene graph model with a learnable memory to address long-tail distribution challenges in visual relationship recognition.

Findings

01. Outperforms the state of the art on VG8K-LT by 2.0% in accuracy

02. Improves accuracy on GQA-LT by 26.0%

03. Shows strong results on VG200 relation detection

Abstract

The visual relationship recognition (VRR) task aims at understanding the pairwise visual relationships between interacting objects in an image. These relationships typically have a long-tail distribution due to their compositional nature. This problem gets more severe when the vocabulary becomes large, rendering this task very challenging. This paper shows that modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in VRR. The method, called RelTransformer, represents each image as a fully-connected scene graph and restructures the whole scene into the relation-triplet and global-scene contexts. It directly passes the message from each element in the relation-triplet and global-scene contexts to the target relation via self-attention. We also design a learnable memory to augment the long-tail…
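The message-passing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, random placeholder features, and memory slots that are simply appended to the context as extra keys and values (the abstract's description of the memory is cut off, so that part is a guess). All names (`attend`, `scene`, `memory`, `DIM`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, context):
    """Scaled dot-product attention of one query vector over context vectors.

    Keys and values are the context vectors themselves (no learned
    projections). Returns (updated_vector, attention_weights).
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, vec)) / math.sqrt(dim)
        for vec in context
    ]
    weights = softmax(scores)
    updated = [
        sum(w * vec[i] for w, vec in zip(weights, context))
        for i in range(dim)
    ]
    return updated, weights


def random_vector(rng, dim):
    """Placeholder feature vector; real features would come from a detector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
DIM = 8

# Relation-triplet context: hypothetical subject, relation, object features.
subject, relation, obj = (random_vector(rng, DIM) for _ in range(3))
# Global-scene context: hypothetical features of other scene-graph elements.
scene = [random_vector(rng, DIM) for _ in range(4)]
# Memory slots (assumption: treated as extra context entries, fixed here
# rather than learned).
memory = [random_vector(rng, DIM) for _ in range(2)]

# Every context element passes its message directly to the target relation.
context = [subject, relation, obj] + scene + memory
relation_updated, weights = attend(relation, context)

print(len(relation_updated), len(weights))
```

In this sketch the target relation receives a weighted mix of all nine context vectors in a single step, which is the "direct" message passing the abstract describes; a trained model would add learned query, key, and value projections and multiple heads.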

Code & Models

Repositories

Vision-CAIR/RelTransformer
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques