Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer
Frederic Z. Zhang, Dylan Campbell, Stephen Gould

TL;DR
This paper introduces the Unary-Pairwise Transformer, a two-stage human-object interaction detector that leverages unary and pairwise representations, outperforming existing methods while being more efficient and approaching real-time performance.
Contribution
The paper presents a novel two-stage transformer-based HOI detection model that exploits unary and pairwise features, demonstrating superior performance and efficiency over existing one-stage methods.
Findings
Outperforms state-of-the-art on HICO-DET and V-COCO datasets.
Achieves near real-time inference with ResNet50.
More memory-efficient and faster to train than comparable one-stage models.
Abstract
Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of…
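To make the two-stage idea in the abstract concrete, the following is a minimal sketch of the pairing step that generally follows object detection in two-stage HOI pipelines: each detection yields one unary item, and each (human, object) combination yields one pairwise item. It is not the authors' implementation. The names `HUMAN`, `enumerate_pairs`, and `box_iou`, and the choice of IoU as a toy pairwise descriptor, are assumptions made purely for illustration.

```python
from itertools import product

HUMAN = 0  # hypothetical class index for "person"; real label maps differ


def enumerate_pairs(labels):
    """Return (human_idx, object_idx) index pairs for a list of class labels.

    Every detection labelled HUMAN is paired with every other detection,
    including other humans, but never with itself.
    """
    humans = [i for i, c in enumerate(labels) if c == HUMAN]
    return [(h, o) for h, o in product(humans, range(len(labels))) if h != o]


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Three detections: two people and one object of an arbitrary class (17).
labels = [HUMAN, HUMAN, 17]
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]

pairs = enumerate_pairs(labels)
# One toy pairwise descriptor per (human, object) pair.
pair_ious = [box_iou(boxes[h], boxes[o]) for h, o in pairs]
```

In this sketch the number of unary items grows linearly with the number of detections, while the number of pairwise items grows with the product of humans and detections, which is why two-stage pipelines typically filter detections before pairing.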
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Adam · Softmax · Residual Connection · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Multi-Head Attention
