Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Bo Wan; Tinne Tuytelaars

arXiv:2309.05069 · cs.CV · September 12, 2023

TL;DR

This paper presents a novel zero-shot human-object interaction detection method using CLIP for multi-level knowledge distillation, enabling effective HOI recognition without task-specific annotations.

Contribution

It introduces a multi-branch neural network leveraging CLIP for multi-level HOI representation learning in zero-shot settings.

Findings

01. Achieves competitive performance on the HICO-DET benchmark.

02. Demonstrates the effectiveness of multi-level CLIP knowledge integration.

03. Outperforms some fully-supervised and weakly-supervised methods.

Abstract

In this paper, we investigate the task of zero-shot human-object interaction (HOI) detection, a novel paradigm for identifying HOIs without the need for task-specific annotations. To address this challenging task, we employ CLIP, a large-scale pre-trained vision-language model (VLM), for knowledge distillation on multiple levels. Specifically, we design a multi-branch neural network that leverages CLIP for learning HOI representations at various levels, including global images, local union regions encompassing human-object pairs, and individual instances of humans or objects. To train our model, CLIP is utilized to generate HOI scores for both global images and local union regions that serve as supervision signals. The extensive experiments demonstrate the effectiveness of our novel multi-level CLIP knowledge integration strategy. Notably, the model achieves strong performance, which is…
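The supervision idea in the abstract (CLIP producing HOI scores for global images and for union regions around human-object pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional embeddings, the prompt strings, and the softmax with temperature 0.01 are all assumptions made for illustration. In practice the embeddings would come from CLIP's image and text encoders, and the paper may normalize scores differently.

```python
import math

def union_box(human_box, object_box):
    """Smallest (x1, y1, x2, y2) box enclosing both a human box and an object box."""
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    return (min(hx1, ox1), min(hy1, oy1), max(hx2, ox2), max(hy2, oy2))

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def hoi_scores(region_embedding, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (CLIP-style zero-shot scoring).

    Assumption: softmax normalization is used here only for illustration.
    """
    img = l2_normalize(region_embedding)
    logits = []
    for text in text_embeddings:
        txt = l2_normalize(text)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical HOI prompts and toy embeddings standing in for CLIP encoder outputs.
prompts = ["a person riding a bicycle", "a person holding a cup"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embedding = [0.9, 0.1, 0.0]

box = union_box((10, 20, 50, 120), (40, 60, 110, 130))
scores = hoi_scores(region_embedding, text_embeddings)
```

In this sketch the same `hoi_scores` function would be applied once to the whole-image embedding and once to the embedding of each cropped union region, and the resulting score vectors would act as soft targets for the corresponding branches of the student network.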

Code & Models

Repositories

bobwan1995/zeroshot-hoi-with-clip (Official)

Videos

Exploiting CLIP for Zero-Shot HOI Detection Requires Knowledge Distillation at Multiple Levels · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Contrastive Language-Image Pre-training · Knowledge Distillation