HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models
Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He

TL;DR
HOICLIP introduces a novel framework that leverages CLIP's visual and textual knowledge for improved human-object interaction detection, especially in few/zero-shot scenarios, with a new interaction decoder and knowledge integration techniques.
Contribution
The paper proposes a new HOI detection method that efficiently extracts and integrates CLIP's visual and textual knowledge, improving generalization and performance over existing approaches.
Findings
Outperforms the prior state of the art by 4.04 mAP on HICO-Det.
Effective knowledge extraction from CLIP enhances HOI detection.
Achieves better generalization in few/zero-shot scenarios.
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in CLIP text encoder is leveraged to…
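The cross-attention step mentioned in the abstract, where a set of queries attends over a visual feature map to pick out informative regions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head formulation, the function name `cross_attention`, the random projection matrices, and all shapes (a 7x7 map, 4 queries, 16-dim features) are assumptions chosen only for illustration, and a random array stands in for CLIP's visual features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, feature_map, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, illustration only).

    queries:     (num_queries, d)  e.g. one query per candidate interaction
    feature_map: (h, w, d)         spatial visual features
    Returns the attended features (num_queries, d_head) and the attention
    weights (num_queries, h * w) over spatial locations.
    """
    h, w, d = feature_map.shape
    tokens = feature_map.reshape(h * w, d)   # flatten the spatial grid
    q = queries @ w_q                        # project queries
    k = tokens @ w_k                         # project keys
    v = tokens @ w_v                         # project values
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


# Toy inputs: random stand-ins, not real CLIP features.
rng = np.random.default_rng(0)
d, d_head = 16, 8
feature_map = rng.normal(size=(7, 7, d))
queries = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d_head)) for _ in range(3))

attended, weights = cross_attention(queries, feature_map, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the 49 spatial locations, so reshaping a row to 7x7 shows which regions a given query draws on. How the paper fuses the attended features with the detection backbone is not specified in the text above, so it is left out of this sketch.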
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
