Relation Transformer Network
Rajat Koner, Suprosanna Shit, Volker Tresp

TL;DR
The paper introduces the Relation Transformer Network (RTN), a transformer-based model for scene graph generation that improves relation detection by modeling the interactions among objects and their relationships.
Contribution
It proposes a transformer architecture with specialized positional embeddings for relation prediction, achieving state-of-the-art results on Visual Genome and GQA datasets.
Findings
Achieves improvements of 4.85% and 3.1% over the prior state of the art on Visual Genome and GQA, respectively.
Models context effectively for relation classification at small, medium, and large scales.
Uses the encoder's self-attention to model node-to-node interaction and the decoder's cross-attention to model edge-to-node interaction.
Abstract
The extraction of a scene graph with objects as nodes and mutual relationships as edges is the basis for a deep understanding of image content. Despite recent advances, such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the mutual interaction among the visual objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model the node-to-node interaction with the self-attention of the transformer encoder and the edge-to-node interaction with the cross-attention of the transformer decoder. Further, we introduce a novel positional embedding suitable to handle edges in the decoder. Finally, our relation…
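The two attention steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it omits learned projections, multiple heads, feed-forward layers, residual connections, and the edge positional embedding, and the names `attention`, `nodes`, `edges`, `DIM`, and `NUM_OBJECTS` are hypothetical. The edge initialization (a sum of the two node vectors) is a placeholder assumption chosen only to keep the example short.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights), with one output vector and one row of
    attention weights per query.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append([
            sum(w * v[c] for w, v in zip(row, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


random.seed(0)
DIM = 4          # toy feature size (assumption)
NUM_OBJECTS = 3  # toy number of detected objects (assumption)

# Hypothetical node features, standing in for object-detector outputs.
nodes = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_OBJECTS)]

# Encoder-style step: node-to-node interaction via self-attention.
node_ctx, node_weights = attention(nodes, nodes, nodes)

# One edge query per ordered pair (subject, object). Summing the two
# contextualized node vectors is a placeholder initialization, not the paper's.
pairs = [(i, j) for i in range(NUM_OBJECTS) for j in range(NUM_OBJECTS) if i != j]
edges = [[a + b for a, b in zip(node_ctx[i], node_ctx[j])] for i, j in pairs]

# Decoder-style step: edge-to-node interaction via cross-attention.
edge_ctx, edge_weights = attention(edges, node_ctx, node_ctx)

for (i, j), row in zip(pairs, edge_weights):
    print(f"edge {i}->{j} attends to nodes with weights", [round(w, 3) for w in row])
```

Each edge query ends up as a weighted mixture of the contextualized node vectors, which is the sense in which the decoder lets edges gather information from nodes. A relation classifier would then operate on these node and edge embeddings.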
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
