Human-Object Interaction Detection via Disentangled Transformer

Desen Zhou; Zhichao Liu; Jian Wang; Leshan Wang; Tao Hu; Errui Ding; Jingdong Wang

arXiv:2204.09290·cs.CV·April 21, 2022·5 cites

Open Access

TL;DR

This paper introduces a Disentangled Transformer for Human-Object Interaction (HOI) detection that separates human-object pair detection from interaction classification to improve detection accuracy.

Contribution

It proposes a novel disentangled transformer architecture that decouples the prediction of human-object pairs and interactions, enhancing task-specific feature learning.

Findings

01. Outperforms previous methods on two public HOI benchmarks
02. Achieves significant accuracy improvements
03. Demonstrates effective disentanglement of sub-tasks

Abstract

Human-Object Interaction (HOI) detection tackles the joint localization and classification of human-object interactions. Existing HOI transformers either adopt a single decoder for triplet prediction or use two parallel decoders to detect individual objects and interactions separately, then compose triplets through a matching process. In contrast, we decouple triplet prediction into human-object pair detection and interaction classification. Our main motivation is that accurately detecting human-object instances and classifying interactions requires learning representations that focus on different regions. To this end, we present the Disentangled Transformer, in which both the encoder and the decoder are disentangled to facilitate learning of the two sub-tasks. To associate the predictions of the disentangled decoders, we first generate a unified representation for HOI triplets with a base decoder,…
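As a rough illustration of the decoupling described in the abstract, the sketch below composes HOI triplets from two prediction lists that share a query index, so no separate matching step is needed. It is not the authors' implementation: the class names, the `compose_triplets` helper, the score product, and the threshold are all hypothetical, and it assumes the two sub-task outputs are aligned one-to-one by query.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class PairPrediction:
    """Hypothetical output of the pair-detection sub-task for one query."""
    human_box: Box
    object_box: Box
    object_label: str
    score: float  # confidence that this human-object pair exists


@dataclass
class InteractionPrediction:
    """Hypothetical output of the interaction sub-task for one query."""
    verb_scores: Dict[str, float]  # verb name -> probability


@dataclass
class HOITriplet:
    human_box: Box
    object_box: Box
    object_label: str
    verb: str
    score: float


def compose_triplets(
    pairs: List[PairPrediction],
    interactions: List[InteractionPrediction],
    threshold: float = 0.5,
) -> List[HOITriplet]:
    """Join the two sub-task outputs by query index (assumed alignment)."""
    if len(pairs) != len(interactions):
        raise ValueError("both sub-tasks must emit one prediction per query")
    triplets = []
    for pair, inter in zip(pairs, interactions):
        for verb, prob in inter.verb_scores.items():
            # Assumption: triplet confidence is the product of the pair
            # score and the verb probability.
            score = pair.score * prob
            if score >= threshold:
                triplets.append(
                    HOITriplet(pair.human_box, pair.object_box,
                               pair.object_label, verb, score)
                )
    return triplets
```

Because entry i of each list refers to the same query, the join is a plain `zip`. A design with two independent parallel decoders would instead need an explicit matching step between detected objects and detected interactions.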

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Multi-Head Attention · Absolute Position Encodings · Byte Pair Encoding · Balanced Selection · Position-Wise Feed-Forward Layer · Dense Connections