Unseen No More: Unlocking the Potential of CLIP for Generative Zero-shot HOI Detection

Yixin Guo; Yu Liu; Jianghao Li; Weimin Wang; Qi Jia

arXiv:2408.05974 · cs.CV · August 13, 2024

TL;DR

This paper introduces HOIGen, a generation-based model that uses CLIP for zero-shot human-object interaction (HOI) detection. It improves recognition of unseen classes by generating features for them and by scoring interactions with prototype banks.

Contribution

It presents the first generation-based approach with CLIP for zero-shot HOI detection, addressing seen-unseen confusion and enhancing generalization.

Findings

1. HOIGen outperforms existing methods on the HICO-DET benchmark.
2. It generates realistic features for both seen and unseen classes.
3. It uses prototype banks to improve HOI scoring.
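The findings only name prototype banks. As a rough intuition for what prototype-based scoring can look like, the sketch that follows keeps one prototype per HOI class (the normalized mean of that class's features), scores a query feature by cosine similarity to each prototype, and blends that with a base score. The function names, the mean-based prototypes, the toy 2-D features and the weighted-sum fusion are all assumptions for illustration; this is not HOIGen's prototype-bank design.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_prototype_bank(features_by_class: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Hypothetical bank: one prototype per HOI class, the normalized mean of its features."""
    bank: Dict[str, Vector] = {}
    for label, feats in features_by_class.items():
        dim = len(feats[0])
        mean = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        bank[label] = l2_normalize(mean)
    return bank


def prototype_scores(query: Vector, bank: Dict[str, Vector]) -> Dict[str, float]:
    """Cosine similarity between a query feature and every prototype in the bank."""
    q = l2_normalize(query)
    return {label: sum(a * b for a, b in zip(q, proto)) for label, proto in bank.items()}


def fuse_scores(base: Dict[str, float], proto: Dict[str, float],
                weight: float = 0.5) -> Dict[str, float]:
    """Blend a base score (e.g. from a text-embedding classifier) with prototype similarity."""
    return {label: (1.0 - weight) * base[label] + weight * proto[label] for label in base}


# Hypothetical toy 2-D features; real HOI features would come from an image encoder.
bank = build_prototype_bank({
    "ride bicycle": [[1.0, 0.0], [0.8, 0.2]],
    "hold cup": [[0.0, 1.0], [0.1, 0.9]],
})
scores = prototype_scores([0.9, 0.1], bank)
best = max(scores, key=scores.get)  # "ride bicycle"
```

Because both the query and the prototypes are unit vectors, each prototype score lies in [-1, 1], which makes it easy to mix with other similarity-based scores; the fusion `weight` is a free hyperparameter in this sketch.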

Abstract

A zero-shot human-object interaction (HOI) detector can generalize to HOI categories never encountered during training. Inspired by the impressive zero-shot capabilities of CLIP, recent methods strive to leverage CLIP embeddings to improve zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, which inevitably leads to seen-unseen confusion during inference. Besides, we find that using prompt-tuning and adapters further widens the gap between seen and unseen accuracy. To tackle this challenge, we present HOIGen, the first generation-based model using CLIP for zero-shot HOI detection. It unlocks the potential of CLIP for feature generation rather than feature extraction only. To achieve this, we develop a CLIP-injected feature generator in accordance with the generation of human,…
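The abstract contrasts training a classifier on seen classes only with generating features. A toy version of that generic generative zero-shot recipe, in pure Python, can make the difference concrete: 2-D vectors stand in for CLIP text embeddings, a noise-perturbation function stands in for a learned conditional generator, and a linear softmax classifier is trained either on real seen features only or on real seen plus synthetic unseen features. Everything here (class names, embeddings, `toy_generator`, `train_softmax`, the hyperparameters) is hypothetical and is not HOIGen's CLIP-injected generator.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Weights = Dict[str, Vector]
Biases = Dict[str, float]

# Hypothetical 2-D stand-ins for CLIP text embeddings of HOI class prompts.
TEXT_EMB: Dict[str, Vector] = {
    "ride bicycle": [1.0, 0.0],
    "hold cup": [0.0, 1.0],
    "repair bicycle": [-1.0, 0.0],
}
SEEN = ["ride bicycle", "hold cup"]   # classes with real training features
UNSEEN = ["repair bicycle"]           # no real training features


def jitter(v: Vector, rng: random.Random, sigma: float = 0.1) -> Vector:
    """Add isotropic Gaussian noise to a vector."""
    return [x + rng.gauss(0.0, sigma) for x in v]


def toy_generator(text_emb: Vector, rng: random.Random) -> Vector:
    """Hypothetical stand-in for a learned conditional generator G(text_emb, noise)."""
    return jitter(text_emb, rng)


def predict_proba(x: Vector, W: Weights, b: Biases) -> Dict[str, float]:
    """Numerically stable softmax over linear class logits."""
    logits = {c: sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in W}
    m = max(logits.values())
    exp = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


def predict(x: Vector, W: Weights, b: Biases) -> str:
    probs = predict_proba(x, W, b)
    return max(probs, key=probs.get)


def train_softmax(data: List[Tuple[Vector, str]], labels: List[str],
                  lr: float = 0.5, epochs: int = 200) -> Tuple[Weights, Biases]:
    """Full-batch gradient descent on cross-entropy for a linear softmax classifier."""
    dim = len(data[0][0])
    W = {c: [0.0] * dim for c in labels}
    b = {c: 0.0 for c in labels}
    n = len(data)
    for _ in range(epochs):
        gW = {c: [0.0] * dim for c in labels}
        gb = {c: 0.0 for c in labels}
        for x, y in data:
            probs = predict_proba(x, W, b)
            for c in labels:
                err = probs[c] - (1.0 if c == y else 0.0)  # dLoss/dlogit_c
                gb[c] += err / n
                for i in range(dim):
                    gW[c][i] += err * x[i] / n
        for c in labels:
            b[c] -= lr * gb[c]
            W[c] = [w - lr * g for w, g in zip(W[c], gW[c])]
    return W, b


rng = random.Random(0)
# Simulated "real" features: only seen classes have them.
real_seen = [(jitter(TEXT_EMB[c], rng), c) for c in SEEN for _ in range(10)]
# Synthetic features for the unseen class, produced from its text embedding.
synthetic_unseen = [(toy_generator(TEXT_EMB[c], rng), c) for c in UNSEEN for _ in range(10)]

query = [-0.9, 0.05]  # a feature near the unseen class embedding
labels = SEEN + UNSEEN

# Seen-only training: the unseen class is in the label set but has no examples.
W_seen, b_seen = train_softmax(real_seen, labels)
seen_only_pred = predict(query, W_seen, b_seen)

# Generative training: real seen + synthetic unseen features, trained jointly.
W_all, b_all = train_softmax(real_seen + synthetic_unseen, labels)
joint_pred = predict(query, W_all, b_all)
print(seen_only_pred, "->", joint_pred)
```

With seen-only training, the unseen class never receives a positive example, so gradient descent only pushes its logit down and the query is assigned to a seen class, a toy form of seen-unseen confusion. Adding synthetic unseen features gives that class positive examples, and the same classifier then predicts "repair bicycle" for the query.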

Code & Models

Repositories

soberguo/hoigen (PyTorch, official implementation)

Taxonomy

Methods: Contrastive Language-Image Pre-training