Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection
Yichao Cao, Qingfei Tang, Feng Yang, Xiu Su, Shan You, Xiaobo Lu and, Chang Xu

TL;DR
This paper introduces a unified framework that enhances human-object interaction detection by leveraging structured text knowledge and cross-modal semantic correlations, significantly improving accuracy on benchmark datasets.
Contribution
The paper proposes a re-mining strategy and fine-grained alignment techniques to better utilize textual knowledge and address many-to-many matching issues in HOI detection.
Findings
Achieves state-of-the-art performance on public benchmarks.
Effectively alleviates matching confusion in multi-interaction scenarios.
Demonstrates the benefit of integrating textual knowledge into visual HOI detection.
Abstract
Human-Object Interaction (HOI) detection is a challenging computer vision task that requires visual models to address the complex interactive relationship between humans and objects and predict HOI triplets. Despite the challenges posed by the numerous interaction combinations, they also offer opportunities for multimodal learning of visual texts. In this paper, we present a systematic and unified framework (RmLR) that enhances HOI detection by incorporating structured text knowledge. Firstly, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate more comprehensive visual representation.Secondly, we design more fine-grained sentence- and word-level alignment and knowledge transfer strategies to effectively address the many-to-many matching problem between multiple interactions and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
