Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model

Jihao Dong; Renjie Pan; Hua Yang

arXiv:2404.12678·cs.CV·May 27, 2024

TL;DR

This paper introduces ISA-HOI, a novel HOI detection method that leverages CLIP's vision-language alignment to improve interaction understanding, especially in zero-shot scenarios, with fewer training epochs.

Contribution

The paper proposes a new HOI detector that uses CLIP for semantic alignment, incorporating global and local features and a verb semantic module, advancing zero-shot HOI detection.

Findings

1. Achieves competitive results on the HICO-DET and V-COCO benchmarks.

2. Outperforms state-of-the-art methods in zero-shot HOI detection.

3. Requires fewer training epochs for effective performance.

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. Meanwhile, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Based on these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. We first extract the global context of the image and the local features of objects to Improve interaction Features in images (IF). In addition, we propose a Verb Semantic Improvement (VSI) module to enhance textual features of verb labels via cross-modal…
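
The general idea behind CLIP-style zero-shot interaction recognition can be illustrated with a small sketch. This is not the ISA-HOI implementation: the function names, the fixed fusion weight `alpha`, the temperature, and the three-dimensional toy vectors are all assumptions made for illustration. The sketch blends a global image feature with a local human-object pair feature, then scores the result against verb text embeddings by cosine similarity, so a verb can be ranked without any verb-specific classifier weights.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fuse(global_feat, local_feat, alpha=0.5):
    """Blend global context with a local pair feature (alpha is a hypothetical weight)."""
    return [alpha * g + (1.0 - alpha) * l for g, l in zip(global_feat, local_feat)]


def verb_probabilities(visual_feat, verb_text_embeddings, temperature=0.07):
    """Softmax over cosine similarities between one visual feature and each verb embedding."""
    v = l2_normalize(visual_feat)
    logits = {
        verb: sum(a * b for a, b in zip(v, l2_normalize(t))) / temperature
        for verb, t in verb_text_embeddings.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {verb: math.exp(z - peak) for verb, z in logits.items()}
    total = sum(exps.values())
    return {verb: e / total for verb, e in exps.items()}


# Toy stand-ins for encoder outputs; real CLIP features have hundreds of dimensions.
verb_text_embeddings = {
    "ride": [1.0, 0.0, 0.0],
    "hold": [0.0, 1.0, 0.0],
    "feed": [0.0, 0.0, 1.0],
}
global_feat = [0.9, 0.2, 0.1]  # hypothetical whole-image feature
local_feat = [0.7, 0.4, 0.0]   # hypothetical human-object pair feature

probs = verb_probabilities(fuse(global_feat, local_feat), verb_text_embeddings)
predicted = max(probs, key=probs.get)
print(predicted)
```

Because the verb labels enter only through their text embeddings, adding an unseen verb in this sketch means adding one more entry to `verb_text_embeddings`, which is the property that makes vision-language alignment attractive for zero-shot settings.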

Taxonomy

Topics: Robotics and Automated Systems

Methods: Focus · Contrastive Language-Image Pre-training