Human-Object Interaction Detection via Disentangled Transformer
Desen Zhou, Zhichao Liu, Jian Wang, Leshan Wang, Tao Hu, Errui Ding, Jingdong Wang

TL;DR
This paper introduces the Disentangled Transformer for Human-Object Interaction (HOI) detection, which splits triplet prediction into human-object pair detection and interaction classification so that each sub-task can learn representations focused on the regions it needs.
Contribution
It proposes a novel disentangled transformer architecture that decouples the prediction of human-object pairs and interactions, enhancing task-specific feature learning.
Findings
Outperforms previous methods on two public HOI benchmarks
Achieves significant accuracy improvements
Demonstrates effective disentanglement of sub-tasks
Abstract
Human-Object Interaction (HOI) detection tackles the problem of jointly localizing and classifying human-object interactions. Existing HOI transformers either adopt a single decoder for triplet prediction, or use two parallel decoders to detect individual objects and interactions separately and then compose triplets through a matching process. In contrast, we decouple the triplet prediction into human-object pair detection and interaction classification. Our main motivation is that accurately detecting human-object instances and classifying interactions require learning representations that focus on different regions. To this end, we present Disentangled Transformer, where both encoder and decoder are disentangled to facilitate learning of the two sub-tasks. To associate the predictions of disentangled decoders, we first generate a unified representation for HOI triplets with a base decoder,…
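As a rough illustration of the decoupling described in the abstract, the sketch below composes HOI triplets from two per-query outputs: one for the human-object pair and one for the interaction scores. It is not the authors' implementation. The names PairPrediction, Triplet, and compose_triplets are hypothetical, the predictions are assumed to be aligned by query index (so no matching step is needed), and multiplying the object score by the interaction score is an assumed scoring convention, not something the abstract states.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class PairPrediction:
    """Hypothetical output of the pair-detection branch for one query."""
    human_box: Box
    object_box: Box
    object_label: str
    object_score: float


@dataclass
class Triplet:
    """A composed <human, interaction, object> prediction."""
    human_box: Box
    object_box: Box
    object_label: str
    interaction: str
    score: float


def compose_triplets(
    pairs: List[PairPrediction],
    interaction_scores: List[Dict[str, float]],
    threshold: float = 0.5,
) -> List[Triplet]:
    """Combine per-query pair and interaction predictions into triplets.

    Assumption: pairs[i] and interaction_scores[i] come from the same
    query, so they are associated by index rather than by matching.
    """
    if len(pairs) != len(interaction_scores):
        raise ValueError("expected one interaction score dict per pair")

    triplets: List[Triplet] = []
    for pair, scores in zip(pairs, interaction_scores):
        for interaction, interaction_score in scores.items():
            # Assumed convention: final score is the product of the two.
            score = pair.object_score * interaction_score
            if score >= threshold:
                triplets.append(
                    Triplet(
                        human_box=pair.human_box,
                        object_box=pair.object_box,
                        object_label=pair.object_label,
                        interaction=interaction,
                        score=score,
                    )
                )
    # Highest-scoring triplets first.
    return sorted(triplets, key=lambda t: t.score, reverse=True)
```

Because both branches are indexed by the same query in this sketch, each interaction score attaches directly to its human-object pair, in contrast to the two-parallel-decoder designs mentioned in the abstract, which need a separate matching process.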
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Multi-Head Attention · Absolute Position Encodings · Byte Pair Encoding · Balanced Selection · Position-Wise Feed-Forward Layer · Dense Connections
