Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection

Xubin Zhong; Changxing Ding; Yupeng Hu; Dacheng Tao

arXiv:2312.01713 · cs.CV · December 5, 2023 · 1 citation

Open Access

TL;DR

This paper improves one-stage human-object interaction (HOI) detection by extracting disentangled, interpretable interaction representations with two modules, Shunted Cross-Attention and Interaction-aware Pose Estimation, and reports state-of-the-art results.

Contribution

The paper proposes Shunted Cross-Attention and Interaction-aware Pose Estimation modules to extract disentangled interaction features, enhancing interpretability and performance of one-stage HOI detectors.

Findings

1. Achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks.

2. Effectively disentangles human appearance, object appearance, and pose features.

3. Improves interpretability of interaction representations in one-stage detectors.

Abstract

Human-Object Interaction (HOI) detection is a core task for human-centric image understanding. Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction; however, the interaction representations obtained using this method are entangled and lack interpretability. In contrast, traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner. In this paper, we improve the performance of one-stage methods by enabling them to extract disentangled interaction representations. First, we propose Shunted Cross-Attention (SCA) to extract human appearance, object appearance, and global context features using different cross-attention heads. This is achieved by imposing different masks on the cross-attention maps produced by the different heads. Second, we…
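
As a rough illustration of the head-wise masking idea described in the abstract, the following minimal Python sketch applies a different boolean mask to each head's cross-attention map. It is not the authors' implementation. The function names, the choice to apply the mask before softmax normalization, the split of feature dimensions across heads, and the toy layout (two human locations, one object location, one background location) are all assumptions made for illustration.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, restricted to positions where `mask` is True."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        return [0.0] * len(scores)
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def shunted_cross_attention(query, keys, values, head_masks):
    """Hypothetical sketch: one query attends to N image locations.

    Each head uses its own slice of the feature dimensions and its own
    boolean mask over locations, so it can only read from permitted regions.
    """
    num_heads = len(head_masks)
    head_dim = len(query) // num_heads
    output, attn_maps = [], []
    for h, mask in enumerate(head_masks):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        # Scaled dot-product scores between this head's query and each key.
        scores = [
            sum(a * b for a, b in zip(q, k[lo:hi])) / math.sqrt(head_dim)
            for k in keys
        ]
        weights = masked_softmax(scores, mask)
        attn_maps.append(weights)
        # Weighted sum of this head's slice of the values.
        for d in range(lo, hi):
            output.append(sum(w * v[d] for w, v in zip(weights, values)))
    return output, attn_maps


# Toy setup (assumed): locations 0-1 lie in the human box, location 2 in the
# object box, and location 3 is background.
head_masks = [
    [True, True, False, False],  # human-appearance head
    [False, False, True, False],  # object-appearance head
    [True, True, True, True],  # global-context head (no restriction)
]
query = [1.0, 0.0, 0.5, 0.5, 0.0, 1.0]
keys = [
    [0.2, 0.1, 0.0, 0.3, 0.5, 0.1],
    [0.9, 0.4, 0.1, 0.0, 0.2, 0.7],
    [0.1, 0.3, 0.8, 0.6, 0.0, 0.2],
    [0.4, 0.4, 0.2, 0.1, 0.3, 0.3],
]
values = [
    [1.0, 0.0, 0.0, 1.0, 0.5, 0.5],
    [0.0, 1.0, 1.0, 0.0, 0.2, 0.8],
    [0.5, 0.5, 0.3, 0.7, 1.0, 0.0],
    [0.1, 0.9, 0.6, 0.4, 0.0, 1.0],
]
output, attn_maps = shunted_cross_attention(query, keys, values, head_masks)
```

In this sketch the first two output dimensions depend only on the human locations, the next two only on the object location, and the last two on the whole image, so each slice of the resulting feature can be traced back to one source.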

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications