HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Shan Ning; Longtian Qiu; Yongfei Liu; Xuming He

arXiv:2303.15786 · cs.CV · July 27, 2023 · 6 cites

TL;DR

HOICLIP introduces a novel framework that leverages CLIP's visual and textual knowledge for improved human-object interaction detection, especially in few/zero-shot scenarios, with a new interaction decoder and knowledge integration techniques.

Contribution

The paper proposes a new HOI detection method that efficiently extracts and integrates CLIP's visual and textual knowledge, improving generalization and performance over existing approaches.

Findings

01 Outperforms state-of-the-art by +4.04 mAP on HICO-Det.

02 Effective knowledge extraction from CLIP enhances HOI detection.

03 Achieves better generalization in few/zero-shot scenarios.

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in CLIP text encoder is leveraged to…
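
The cross-attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-head attention in plain Python with no learned projections, and the names `cross_attention`, `queries`, and `feature_map` are hypothetical. It assumes each query stands for a candidate human-object pair and each feature-map entry stands for one spatial region of a CLIP visual feature map; the actual decoder in the paper would likely use learned projections and multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, feature_map):
    """Single-head cross-attention without learned projections (illustrative only).

    queries:     list of N query vectors (assumed: one per candidate human-object pair)
    feature_map: list of M region vectors (assumed: flattened spatial visual features)
    Returns (outputs, weights): N attended vectors and an N x M list of attention weights.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every region.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in feature_map]
        attn = softmax(scores)
        # Weighted sum of region vectors: regions with higher scores contribute more.
        attended = [sum(a * v[d] for a, v in zip(attn, feature_map)) for d in range(dim)]
        outputs.append(attended)
        weights.append(attn)
    return outputs, weights


# Toy data: 2 queries attending over 3 regions of dimension 4.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
feature_map = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
outputs, weights = cross_attention(queries, feature_map)
```

In this toy setup each query places most of its attention weight on the region it aligns with, which is the sense in which cross-attention can pick out informative regions of a feature map.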

Code & Models

Repositories

artanic30/hoiclip · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training