SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation

Yuxiang Zhang; Zhenbo Liu; Shuai Wang

arXiv:2212.09329 · cs.CV · December 20, 2022 · 1 cites

TL;DR

SrTR introduces a self-reasoning transformer model that integrates visual and linguistic knowledge to enhance scene graph generation, addressing previous limitations in relation inference and reasoning capabilities.

Contribution

The paper proposes SrTR, a novel encoder-decoder architecture with a self-reasoning decoder and visual-linguistic alignment, enabling comprehensive triplet inference and improved reasoning in scene graph generation.

Findings

01. Outperforms existing methods on the Visual Genome dataset.

02. Demonstrates faster inference speed.

03. Enhances relation inference accuracy.

Abstract

Objects in a scene are not always related. One-stage scene graph generation approaches are highly efficient: they infer the effective relations between entity pairs using sparse proposal sets and a few queries. However, they focus only on the relation between subject and object in the triplet set <subject entity, predicate entity, object entity>, ignoring the relations between subject and predicate or between predicate and object, and they lack self-reasoning ability. In addition, the linguistic modality has been neglected in one-stage methods. Mining linguistic knowledge is necessary to improve the model's reasoning ability. To address these shortcomings, a Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model. An encoder-decoder architecture is adopted in SrTR, and a…
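To make the triplet terminology in the abstract concrete, here is a minimal Python sketch of the three pairwise relations inside one triplet. It is illustrative only and is not the SrTR model: the `Triplet` class and `pairwise_relations` helper are hypothetical names introduced here, and the example labels ("man", "riding", "horse") are made up. According to the abstract, earlier one-stage methods consider only the subject-object pair, while SrTR also reasons over the subject-predicate and predicate-object pairs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Tuple


@dataclass(frozen=True)
class Triplet:
    """One scene-graph edge (hypothetical container, not from the paper)."""
    subject: str
    predicate: str
    object: str


def pairwise_relations(t: Triplet) -> Dict[Tuple[str, str], Tuple[str, str]]:
    """Map each pair of roles in the triplet to the pair of labels filling them."""
    parts = {"subject": t.subject, "predicate": t.predicate, "object": t.object}
    # combinations() yields the three unordered role pairs in insertion order:
    # (subject, predicate), (subject, object), (predicate, object)
    return {(a, b): (parts[a], parts[b]) for a, b in combinations(parts, 2)}


if __name__ == "__main__":
    example = Triplet(subject="man", predicate="riding", object="horse")
    for roles, labels in pairwise_relations(example).items():
        print(roles, labels)
```

Only the `("subject", "object")` entry corresponds to the relation that, per the abstract, prior one-stage approaches model; the other two entries are the pairs they ignore.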

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings