SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation
Yuxiang Zhang, Zhenbo Liu, Shuai Wang

TL;DR
SrTR introduces a self-reasoning transformer model that integrates visual and linguistic knowledge to enhance scene graph generation, addressing previous limitations in relation inference and reasoning capabilities.
Contribution
The paper proposes SrTR, a novel encoder-decoder architecture with a self-reasoning decoder and visual-linguistic alignment, enabling comprehensive triplet inference and improved reasoning in scene graph generation.
Findings
Outperforms existing methods on the Visual Genome dataset.
Demonstrates faster inference speed.
Enhances relation inference accuracy.
Abstract
Objects in a scene are not always related. One-stage scene graph generation approaches are highly efficient: they infer the effective relations between entity pairs using sparse proposal sets and a few queries. However, they focus only on the relation between subject and object in the triplet (subject entity, predicate entity, object entity), ignoring the relations between subject and predicate or between predicate and object, and they lack self-reasoning ability. In addition, the linguistic modality has been neglected in one-stage methods. Mining linguistic knowledge is necessary to improve the model's reasoning ability. To address these shortcomings, a Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model. An encoder-decoder architecture is adopted in SrTR, and a…
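To make the triplet structure in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Triplet` and `pairwise_relations` are hypothetical and the example labels are invented for illustration. It only enumerates the three element pairs inside one triplet. According to the abstract, earlier one-stage methods consider just the subject-object pair, while SrTR also aims to reason over the subject-predicate and predicate-object pairs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One scene graph edge (hypothetical type, for illustration only)."""
    subject: str
    predicate: str
    object: str


def pairwise_relations(t: Triplet) -> List[Tuple[Tuple[str, str], Tuple[str, str]]]:
    """Return every unordered pair of roles in a triplet, with their values."""
    roles = ("subject", "predicate", "object")
    return [
        ((a, b), (getattr(t, a), getattr(t, b)))
        for a, b in combinations(roles, 2)
    ]


# Invented example labels; any (subject, predicate, object) labels would do.
triplet = Triplet("person", "riding", "horse")
pairs = pairwise_relations(triplet)

for role_pair, value_pair in pairs:
    print(role_pair, value_pair)
```

The sketch covers only the bookkeeping of which pairs exist. How SrTR actually scores or reasons over them is not described in the text available here.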
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings
