Improving Human-Object Interaction Detection via Phrase Learning and Label Composition
Zhimin Li, Cheng Zou, Yu Zhao, Boxun Li, Sheng Zhong

TL;DR
This paper introduces PhraseHOI, a novel approach for human-object interaction detection that leverages language priors, semantic embeddings, and label composition to improve relation expression and address data imbalance, achieving state-of-the-art results.
Contribution
The paper proposes a new phrase branch supervised by semantic embeddings and a label composition method to enhance HOI detection and handle long-tailed data distributions.
Findings
PhraseHOI achieves significant improvement over its baseline.
It surpasses previous state-of-the-art methods on the HICO-DET benchmark.
The label composition method helps address the long-tailed distribution of HOI data.
Abstract
Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding. We propose PhraseHOI, containing an HOI branch and a novel phrase branch, to leverage language priors and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings, whose ground truths are automatically converted from the original HOI annotations without extra human effort. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI, which composes novel phrase labels from semantic neighbors. Further, to optimize the phrase branch, a loss composed of a distilling loss and a balanced triplet loss is proposed. Extensive experiments demonstrate the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods…
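To make the phrase-branch objective more concrete, here is a minimal sketch of how a distilling term and a triplet term might be combined. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the cosine-distance form of the distilling term, the plain (not "balanced") margin-based triplet term, and the hypothetical names `distill_loss`, `triplet_loss`, `phrase_loss`, `weight`, and `margin`. It should not be read as the authors' implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distill_loss(pred, target):
    # Assumed form: pull the predicted phrase embedding toward the
    # semantic embedding derived from the HOI annotation.
    return cosine_distance(pred, target)


def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard margin-based triplet loss. The paper's "balanced" variant
    # is not specified in the abstract, so it is not reproduced here.
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


def phrase_loss(pred, target, negative, weight=1.0, margin=0.2):
    # Hypothetical combination: distilling term plus a weighted triplet term.
    return distill_loss(pred, target) + weight * triplet_loss(
        pred, target, negative, margin
    )


if __name__ == "__main__":
    # Toy 2-D embeddings: the prediction matches the target exactly and is
    # orthogonal to the negative, so both terms vanish.
    print(phrase_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

In this sketch the triplet term only contributes when the predicted embedding is not at least `margin` closer to its target than to the negative phrase embedding.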
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Triplet Loss
