Improving Human-Object Interaction Detection via Phrase Learning and Label Composition

Zhimin Li; Cheng Zou; Yu Zhao; Boxun Li; Sheng Zhong

arXiv:2112.07383·cs.CV·January 19, 2022

TL;DR

This paper introduces PhraseHOI, an approach to human-object interaction detection that uses language priors, semantic embeddings, and label composition to improve relation expression and to address the long-tailed distribution of HOI labels. It reports state-of-the-art results on the HICO-DET benchmark.

Contribution

The paper proposes a new phrase branch supervised by semantic embeddings and a label composition method to enhance HOI detection and handle long-tailed data distributions.
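To make phrase-level supervision concrete, here is a minimal sketch. It assumes the target for each annotated (verb, object) pair is the L2-normalized mean of two word vectors, which is how such targets can be derived from existing HOI annotations with no extra labeling. The toy vectors in `WORD_VECS` and the helpers `phrase_embedding` and `phrase_targets` are hypothetical. The paper's actual embedding source and pooling may differ.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy word vectors (not from the paper); a real setup would load
# pretrained word embeddings instead.
WORD_VECS: Dict[str, List[float]] = {
    "ride": [1.0, 0.0, 0.0],
    "feed": [0.0, 1.0, 0.0],
    "horse": [0.0, 0.0, 1.0],
    "elephant": [0.0, 0.2, 0.9],
}


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def phrase_embedding(verb: str, obj: str) -> List[float]:
    """Assumed pooling: average the verb and object vectors, then normalize."""
    v, o = WORD_VECS[verb], WORD_VECS[obj]
    return l2_normalize([(a + b) / 2 for a, b in zip(v, o)])


def phrase_targets(annotations: List[Tuple[int, int, str, str]]) -> List[List[float]]:
    """Convert (human_id, object_id, verb, object_category) HOI annotations
    into phrase-branch regression targets."""
    return [phrase_embedding(verb, obj) for _, _, verb, obj in annotations]
```

For example, `phrase_targets([(0, 1, "ride", "horse")])` yields a single unit vector of roughly `[0.707, 0.0, 0.707]`, which the phrase branch would be trained to predict for that human-object pair.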

Findings

01

Significant improvement over baseline methods.

02

Outperforms previous state-of-the-art on HICO-DET benchmark.

03

Effective handling of long-tailed HOI data.

Abstract

Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding. We propose PhraseHOI, which contains an HOI branch and a novel phrase branch, to leverage language priors and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings whose ground truths are automatically converted from the original HOI annotations without extra human effort. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI; it composes novel phrase labels from semantic neighbors. Further, to optimize the phrase branch, a loss composed of a distilling loss and a balanced triplet loss is proposed. Extensive experiments demonstrate the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods…
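Label composition from semantic neighbors can be illustrated with a small sketch. It assumes that "composing" means pairing an observed verb with objects that lie close to the original object in embedding space, so that rare phrases gain related synthetic labels. The object vectors, the similarity threshold `min_sim`, and the helpers `semantic_neighbors` and `compose_labels` are hypothetical and are not the authors' procedure.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy object vectors; real ones would come from pretrained embeddings.
OBJECT_VECS: Dict[str, List[float]] = {
    "horse": [0.9, 0.1, 0.0],
    "elephant": [0.8, 0.3, 0.0],
    "cup": [0.0, 0.1, 1.0],
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def semantic_neighbors(obj: str, k: int, min_sim: float) -> List[str]:
    """Return up to k other objects whose similarity to obj is at least min_sim."""
    scored = [(cosine(OBJECT_VECS[obj], vec), name)
              for name, vec in OBJECT_VECS.items() if name != obj]
    scored.sort(reverse=True)  # most similar first
    return [name for sim, name in scored[:k] if sim >= min_sim]


def compose_labels(phrases: List[Tuple[str, str]], k: int = 2,
                   min_sim: float = 0.8) -> List[Tuple[str, str]]:
    """For each (verb, object) phrase, compose (verb, neighbor) labels that
    are not already in the label set."""
    existing = set(phrases)
    composed: List[Tuple[str, str]] = []
    for verb, obj in phrases:
        for nb in semantic_neighbors(obj, k, min_sim):
            candidate = (verb, nb)
            if candidate not in existing and candidate not in composed:
                composed.append(candidate)
    return composed
```

With these toy vectors, `compose_labels([("ride", "horse")])` returns `[("ride", "elephant")]`, while "cup" is too dissimilar to be used. A practical version would likely also filter out implausible combinations rather than rely on a similarity threshold alone.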

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Triplet Loss
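The phrase-branch objective combines a distilling loss with a balanced triplet loss. The sketch below shows one way such a combination can be written, under several assumptions. The distilling term is taken to be one minus the cosine similarity to the target embedding. The triplet term is a margin hinge on cosine distances. "Balanced" is taken to mean per-class weights proportional to inverse frequency, normalized to mean 1. These forms, the defaults for `margin` and `lam`, and all function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed balancing: weight each phrase class by 1/count, normalized to mean 1."""
    inv = {label: 1.0 / n for label, n in counts.items()}
    mean = sum(inv.values()) / len(inv)
    return {label: w / mean for label, w in inv.items()}


def distilling_loss(pred: Vector, target: Vector) -> float:
    """Assumed form: 1 - cosine similarity to the semantic-embedding target."""
    return 1.0 - cosine(pred, target)


def balanced_triplet_loss(pred: Vector, pos: Vector, neg: Vector,
                          weight: float, margin: float) -> float:
    """Hinge on cosine distances: pull pred toward pos and keep it at least
    `margin` farther from neg, scaled by the class weight."""
    d_pos = 1.0 - cosine(pred, pos)
    d_neg = 1.0 - cosine(pred, neg)
    return weight * max(0.0, d_pos - d_neg + margin)


def phrase_loss(pred: Vector, pos: Vector, neg: Vector, weight: float,
                margin: float = 0.2, lam: float = 1.0) -> float:
    """Total phrase-branch loss: distilling term + lam * balanced triplet term."""
    return distilling_loss(pred, pos) + lam * balanced_triplet_loss(pred, pos, neg, weight, margin)
```

With class counts of 90 and 10, the weights come out to 0.2 and 1.8, so a misclassified sample from the rarer class contributes nine times as much triplet loss as one from the frequent class.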