Agglomerative Transformer for Human-Object Interaction Detection

Danyang Tu; Wei Sun; Guangtao Zhai; Wei Shen

arXiv:2308.08370·cs.CV·August 17, 2023

TL;DR

This paper introduces AGER, an agglomerative Transformer that enhances human-object interaction detection by dynamically clustering features to create comprehensive instance tokens, achieving state-of-the-art results efficiently in a single-stage, end-to-end framework.

Contribution

The paper presents a novel agglomerative Transformer that dynamically clusters patch tokens into instance tokens guided by text, enabling efficient, end-to-end HOI detection without extra detectors.
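To make "guided by text" concrete, here is a minimal sketch of one way cluster centers could be matched to instance categories: each center is compared with a text embedding per category by cosine similarity and takes the best match. This is an illustration under assumptions, not the paper's alignment procedure. The function names, the two-dimensional vectors, and the category embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def align_to_text(centers, text_embeddings):
    """Assign each cluster center the category whose text embedding is closest.

    Hypothetical helper: the source only says centers are aligned to
    instances 'with textual guidance'; nearest-neighbour matching is assumed.
    """
    labels = []
    for center in centers:
        best = max(text_embeddings, key=lambda name: cosine(center, text_embeddings[name]))
        labels.append(best)
    return labels

# Toy, made-up embeddings for two categories.
text_embeddings = {"person": [1.0, 0.1], "bicycle": [0.1, 1.0]}
centers = [[0.9, 0.2], [0.0, 2.0]]
labels = align_to_text(centers, text_embeddings)
print(labels)
```

In a real detector the text embeddings would come from a pretrained language or vision-language encoder, and the matching would typically enter training as a loss rather than a hard argmax.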

Findings

01. Achieves 36.75 mAP on HICO-Det, setting a new state-of-the-art.

02. Reduces GFLOPs by 8.5% and increases FPS by 36%.

03. Operates in a single-stage, end-to-end manner without additional detectors.

Abstract

We propose an agglomerative Transformer (AGER) that enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage and end-to-end manner for the first time. AGER acquires instance tokens by dynamically clustering patch tokens and aligning cluster centers to instances with textual guidance, thus enjoying two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which demonstrates a significant improvement in the extraction of different instance-level cues and subsequently leads to a new state-of-the-art performance of HOI detection with 36.75 mAP on HICO-Det. 2) Efficiency: the dynamical clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need of an additional object…
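The clustering idea in the abstract can be illustrated with a minimal sketch: patch tokens are softly assigned to a small set of cluster centers, and each center is then updated to the weighted mean of its patches, so that a center gathers features from every region it attracts. This is a generic soft-clustering step written under assumptions, not AGER's actual layer. The dot-product similarity, the single update step, and the names `cluster_step`, `patches`, and `centers` are all hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_step(patch_tokens, centers):
    """One soft-assignment step (assumed form, not the paper's exact rule).

    Returns the updated centers and the assignment matrix, where
    assign[i][k] is the weight of patch i for center k.
    """
    # Each patch distributes a total weight of 1 across the centers.
    assign = [softmax([dot(p, c) for c in centers]) for p in patch_tokens]
    dim = len(patch_tokens[0])
    new_centers = []
    for k in range(len(centers)):
        weights = [row[k] for row in assign]
        total = sum(weights)  # strictly positive because softmax outputs are positive
        new_centers.append([
            sum(w * p[d] for w, p in zip(weights, patch_tokens)) / total
            for d in range(dim)
        ])
    return new_centers, assign

# Toy input: two patches along the first axis, two along the second.
patches = [[4.0, 0.0], [5.0, 0.0], [0.0, 4.0], [0.0, 5.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
instance_tokens, assign = cluster_step(patches, centers)
```

Because the assignment is differentiable, a step like this could in principle sit inside an encoder and be trained jointly with it, which is the kind of efficiency argument the abstract makes.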

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Multi-Head Attention · Absolute Position Encodings · Residual Connection · Dense Connections · Dropout