Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries
Qi Dong, Zhuowen Tu, Haofu Liao, Yuting Zhang, Vijay Mahadevan, Stefano Soatto

TL;DR
This paper introduces PST, a Transformer-based method that models hierarchical part-and-sum relationships for visual relationship detection and human-object interaction, achieving state-of-the-art results among single-stage models.
Contribution
The paper proposes a new Part-and-Sum detection Transformer (PST) that explicitly models hierarchical part and sum hypotheses with composite queries and attention modules.
Findings
Achieves state-of-the-art results in visual relationship detection among single-stage models.
Nearly matches the performance of two-stage models.
Introduces composite queries that pair tensor-based part queries with vector-based sum queries.
Abstract
Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Different from existing Transformers in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in the standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint…
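The two query levels named in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes each part query holds one d-dimensional entry per part (subject, object, predicate) and each sum query is a single d-dimensional vector, and it uses plain scaled dot-product attention of a sum query over its own parts as a stand-in for the paper's attention modules. The function names, the random initialization, and the sizes (4 queries, dimension 8) are all hypothetical.

```python
import math
import random

# The three parts named in the abstract; the ordering here is arbitrary.
PARTS = ("subject", "object", "predicate")


def make_composite_queries(num_queries, dim, seed=0):
    """Build hypothetical composite queries with random initial values.

    part_queries: num_queries x len(PARTS) x dim (tensor-based, one row per part)
    sum_queries:  num_queries x dim              (vector-based, one per triplet)
    """
    rng = random.Random(seed)
    part_queries = [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in PARTS]
        for _ in range(num_queries)
    ]
    sum_queries = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)
    ]
    return part_queries, sum_queries


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sum_attends_to_parts(sum_query, parts):
    """Scaled dot-product attention of one sum query over its own part entries.

    Assumption: this stands in for the part-and-sum interaction; the paper's
    actual attention modules may differ.
    """
    dim = len(sum_query)
    scores = [
        sum(q * k for q, k in zip(sum_query, part)) / math.sqrt(dim)
        for part in parts
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * part[i] for w, part in zip(weights, parts)) for i in range(dim)
    ]
    return weights, mixed


part_q, sum_q = make_composite_queries(num_queries=4, dim=8)
weights, updated = sum_attends_to_parts(sum_q[0], part_q[0])
```

Here `weights` has one entry per part and `updated` is a d-dimensional mix of the part entries, which shows how a triplet-level hypothesis could read from its subject, object, and predicate hypotheses.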
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Residual Connection
