Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries

Qi Dong; Zhuowen Tu; Haofu Liao; Yuting Zhang; Vijay Mahadevan; Stefano Soatto

arXiv:2105.02170·cs.CV·August 23, 2021

Open Access

TL;DR

This paper introduces PST, a Transformer-based method that models hierarchical part-and-sum relationships for visual relationship detection and human-object interaction detection, achieving state-of-the-art results among single-stage models.

Contribution

The paper proposes a new Part-and-Sum detection Transformer (PST) that explicitly models hierarchical part and sum hypotheses with composite queries and attention modules.

Findings

01. Achieves state-of-the-art results among single-stage models in visual relationship detection.

02. Nearly matches the performance of two-stage models.

03. Introduces tensor-based part queries and vector-based sum queries.

Abstract

Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Unlike existing Transformers, in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint…
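To make the query layout in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each triplet hypothesis gets one vector-valued sum query plus a 3-slot part query (subject, predicate, object), and it uses an ordinary scaled dot-product attention step as a stand-in for the paper's attention modules. The function names, the sizes, and the random initialization are all hypothetical.

```python
import math
import random


def make_composite_queries(num_queries, dim, seed=0):
    """Build hypothetical composite queries.

    part_queries: num_queries x 3 x dim (subject, predicate, object slots)
    sum_queries:  num_queries x dim     (one vector per whole triplet)
    """
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    part_queries = [[vec() for _ in range(3)] for _ in range(num_queries)]
    sum_queries = [vec() for _ in range(num_queries)]
    return part_queries, sum_queries


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sum_attends_to_parts(sum_query, parts):
    """Toy scaled dot-product attention of one sum query over its three parts.

    Assumption: this only illustrates a part-sum interaction; it is not the
    attention design described in the paper.
    """
    dim = len(sum_query)
    scores = [
        sum(q * k for q, k in zip(sum_query, part)) / math.sqrt(dim)
        for part in parts
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * part[i] for w, part in zip(weights, parts))
        for i in range(dim)
    ]
    return weights, mixed


# Four triplet hypotheses with 8-dimensional embeddings (arbitrary sizes).
part_q, sum_q = make_composite_queries(num_queries=4, dim=8)
weights, updated = sum_attends_to_parts(sum_q[0], part_q[0])
```

Here `weights` holds one attention weight per part slot and `updated` is the resulting mixture of the three part vectors, so the sum query for a triplet is informed by its own subject, predicate, and object hypotheses.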

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Residual Connection