Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection

Yichao Cao; Qingfei Tang; Feng Yang; Xiu Su; Shan You; Xiaobo Lu; Chang Xu

arXiv:2307.13529 · cs.CV · September 19, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a unified framework that enhances human-object interaction detection by leveraging structured text knowledge and cross-modal semantic correlations, significantly improving accuracy on benchmark datasets.

Contribution

The paper proposes a re-mining strategy and fine-grained alignment techniques to better utilize textual knowledge and address many-to-many matching issues in HOI detection.

Findings

01. Achieves state-of-the-art performance on public benchmarks.

02. Effectively alleviates matching confusion in multi-interaction scenarios.

03. Demonstrates the benefit of integrating textual knowledge into visual HOI detection.

Abstract

Human-Object Interaction (HOI) detection is a challenging computer vision task that requires visual models to address the complex interactive relationships between humans and objects and to predict HOI triplets. Although the numerous interaction combinations pose challenges, they also offer opportunities for multimodal learning of visual texts. In this paper, we present a systematic and unified framework (RmLR) that enhances HOI detection by incorporating structured text knowledge. First, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate a more comprehensive visual representation. Second, we design more fine-grained sentence- and word-level alignment and knowledge transfer strategies to effectively address the many-to-many matching problem between multiple interactions and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques