Relation Transformer Network

Rajat Koner; Suprosanna Shit; Volker Tresp

arXiv:2004.06193 · cs.CV · July 22, 2021 · 5 cites

TL;DR

The paper introduces Relation Transformer Network (RTN), a novel transformer-based model for scene graph generation that improves relation detection by modeling rich interactions among objects and their relationships.

Contribution

The paper proposes an encoder-decoder transformer architecture for relation prediction, with a positional embedding designed to handle edges in the decoder, and reports state-of-the-art results on the Visual Genome and GQA datasets.

Findings

01

Achieves improvements of 4.85% and 3.1% over state-of-the-art methods on Visual Genome and GQA, respectively.

02

Models context effectively across small-, medium-, and large-scale relation classification settings.

03

Uses encoder self-attention to model node-to-node interaction and decoder cross-attention to model edge-to-node interaction.

Abstract

The extraction of a scene graph with objects as nodes and mutual relationships as edges is the basis for a deep understanding of image content. Despite recent advances, such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the mutual interaction among the visual objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model the node-to-node interaction with the self-attention of the transformer encoder and the edge-to-node interaction with the cross-attention of the transformer decoder. Further, we introduce a novel positional embedding suitable to handle edges in the decoder. Finally, our relation…
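The two attention roles described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the toy node features, the sum-based edge initialisation, and the helper names (`softmax`, `attention`) are all assumptions made for illustration, and learned projections, multiple heads, feed-forward layers, and the paper's edge positional embedding are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights

# Hypothetical toy features for 3 detected objects (nodes), 4 dimensions each.
nodes = [[1.0, 0.0, 0.5, 0.2],
         [0.0, 1.0, 0.3, 0.1],
         [0.5, 0.5, 0.0, 1.0]]

# Encoder role (node-to-node): self-attention, nodes attend to all nodes.
node_ctx, node_w = attention(nodes, nodes, nodes)

# One candidate edge per ordered (subject, object) pair of distinct nodes.
pairs = [(i, j) for i in range(len(nodes)) for j in range(len(nodes)) if i != j]

# Assumed edge initialisation: element-wise sum of the two node embeddings.
edges = [[a + b for a, b in zip(node_ctx[i], node_ctx[j])] for i, j in pairs]

# Decoder role (edge-to-node): cross-attention, edges are the queries and
# the contextualised nodes supply the keys and values.
edge_ctx, edge_w = attention(edges, node_ctx, node_ctx)
```

In this sketch each of the six directed edges ends up with an embedding that mixes information from all three nodes; a relation classifier would then operate on the node and edge embeddings together.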

Code & Models

Repositories

rajatkoner08/rtn (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax