Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer

Frederic Z. Zhang; Dylan Campbell; Stephen Gould

arXiv:2112.01838 · cs.CV · March 29, 2022 · 5 cites

TL;DR

This paper introduces the Unary-Pairwise Transformer, a two-stage human-object interaction detector that leverages unary and pairwise representations, outperforming existing methods while being more efficient and approaching real-time performance.

Contribution

The paper presents a novel two-stage transformer-based HOI detection model that exploits unary and pairwise features, demonstrating superior performance and efficiency over existing one-stage methods.
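As a rough illustration of what "two-stage" means here: a detector first produces object instances, and a second stage then reasons over candidate human-object pairs. The sketch below shows only that generic pair-enumeration step. It is not the authors' implementation; the function name `pair_detections`, the `HUMAN_LABEL` index, and the score threshold are assumptions made for the example.

```python
from itertools import product

HUMAN_LABEL = 0  # assumption: class index 0 denotes "person"


def pair_detections(labels, scores, score_thresh=0.2):
    """Enumerate candidate (human_idx, object_idx) pairs from detections.

    labels: predicted class index per detection
    scores: detection confidence per detection
    score_thresh: hypothetical cut-off for discarding weak detections
    """
    # Stage one output is filtered by confidence.
    keep = [i for i, s in enumerate(scores) if s >= score_thresh]
    humans = [i for i in keep if labels[i] == HUMAN_LABEL]
    # Any kept detection may act as the object, including another human,
    # but an instance is never paired with itself.
    return [(h, o) for h, o in product(humans, keep) if h != o]


# Four detections: two people, one confident object, one weak object.
labels = [0, 17, 0, 41]
scores = [0.9, 0.8, 0.6, 0.1]
pairs = pair_detections(labels, scores)
print(pairs)
```

Each resulting pair would then be scored for interactions by the second stage; how the paper builds unary (per-instance) and pairwise (per-pair) representations for that scoring is not covered by this sketch.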

Findings

01. Outperforms state-of-the-art methods on the HICO-DET and V-COCO datasets.

02. Achieves near real-time inference with a ResNet50 backbone.

03. More memory-efficient and faster to train than comparable models.

Abstract

Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of…

Code & Models

Repositories

fredzzhang/upt (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Adam · Softmax · Residual Connection · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Multi-Head Attention