Agglomerative Transformer for Human-Object Interaction Detection
Danyang Tu, Wei Sun, Guangtao Zhai, Wei Shen

TL;DR
This paper introduces AGER, an agglomerative Transformer that enhances human-object interaction detection by dynamically clustering features to create comprehensive instance tokens, achieving state-of-the-art results efficiently in a single-stage, end-to-end framework.
Contribution
The paper presents a novel agglomerative Transformer that dynamically clusters patch tokens into instance tokens guided by text, enabling efficient, end-to-end HOI detection without extra detectors.
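The clustering idea can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a plain soft-assignment step (dot-product similarity followed by a softmax over cluster centers), leaves out the textual guidance entirely, and uses hypothetical names (`cluster_step`, `patches`, `centers`) chosen only for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_step(patch_tokens, centers):
    """One soft clustering step (assumed scheme, not AGER's exact mechanism).

    Each patch token is softly assigned to the cluster centers, then each
    center is replaced by the assignment-weighted mean of the patch tokens.
    """
    # assign[i][k]: how strongly patch i belongs to center k (rows sum to 1).
    assign = [softmax([dot(p, c) for c in centers]) for p in patch_tokens]
    dim = len(patch_tokens[0])
    new_centers = []
    for k in range(len(centers)):
        weight = sum(row[k] for row in assign)  # always > 0 after softmax
        merged = [
            sum(row[k] * p[d] for row, p in zip(assign, patch_tokens)) / weight
            for d in range(dim)
        ]
        new_centers.append(merged)
    return assign, new_centers

# Toy input: four 2-D "patch tokens" forming two obvious groups,
# and two initial centers standing in for instance tokens.
patches = [[4.0, 0.0], [5.0, 0.5], [0.0, 4.0], [0.5, 5.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
assign, instance_tokens = cluster_step(patches, centers)
```

In this toy setup the first two patches end up almost entirely in the first cluster and the last two in the second, so each resulting token summarizes one group of patches. A real model would learn the centers and run this kind of step inside the Transformer encoder layers.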
Findings
Achieves 36.75 mAP on HICO-Det, setting a new state-of-the-art.
Reduces GFLOPs by 8.5% and increases FPS by 36%, even relative to a vanilla DETR-like pipeline without extra cue extraction.
Operates in a single-stage, end-to-end manner without additional detectors.
Abstract
We propose an agglomerative Transformer (AGER) that enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage and end-to-end manner for the first time. AGER acquires instance tokens by dynamically clustering patch tokens and aligning cluster centers to instances with textual guidance, thus enjoying two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which demonstrates a significant improvement in the extraction of different instance-level cues and subsequently leads to a new state-of-the-art performance of HOI detection with 36.75 mAP on HICO-Det. 2) Efficiency: the dynamical clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need of an additional object…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Multi-Head Attention · Absolute Position Encodings · Residual Connection · Dense Connections · Dropout
