TL;DR
This paper introduces HOIGen, a generation-based model that uses CLIP for zero-shot human-object interaction (HOI) detection. It improves recognition of unseen classes by generating features for them and by scoring HOIs against prototype banks.
Contribution
The paper presents the first generation-based approach using CLIP for zero-shot HOI detection, addressing seen-unseen confusion and improving generalization to unseen classes.
Findings
HOIGen outperforms existing methods on the HICO-DET benchmark.
Generates realistic features for seen and unseen classes.
Utilizes prototype banks to improve HOI scoring.
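The summary does not spell out how the prototype banks enter the HOI score, so the following is only a generic nearest-prototype sketch, not HOIGen's actual scoring rule. It assumes each class keeps a small bank of feature vectors (real features for seen classes, generated features for unseen ones) and that a query feature is scored by its best cosine similarity to each bank. The function names, class labels, and toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_against_banks(query, banks):
    """Map each class name to the query's best similarity within that class's bank."""
    return {
        name: max(cosine(query, proto) for proto in protos)
        for name, protos in banks.items()
    }


# Hypothetical toy banks (assumption): the seen class stores features from
# training samples, the unseen class stores features produced by a generator.
banks = {
    "ride bicycle (seen)": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "wash bicycle (unseen)": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}

query = [0.0, 1.0, 0.1]  # hypothetical feature of a detected human-object pair
scores = score_against_banks(query, banks)
best = max(scores, key=scores.get)
```

Because the unseen class has its own prototypes, the query can be matched to it directly instead of being forced onto the closest seen class, which is the intuition behind using generated features to reduce seen-unseen confusion.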
Abstract
A zero-shot human-object interaction (HOI) detector can generalize to HOI categories that were not encountered during training. Inspired by the impressive zero-shot capabilities of CLIP, recent methods leverage CLIP embeddings to improve zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, which inevitably leads to seen-unseen confusion during inference. Moreover, we find that prompt-tuning and adapters further widen the gap between seen and unseen accuracy. To tackle this challenge, we present the first generation-based model using CLIP for zero-shot HOI detection, coined HOIGen. It unlocks the potential of CLIP for feature generation rather than feature extraction only. To achieve this, we develop a CLIP-injected feature generator in accordance with the generation of human,…
Taxonomy
Methods: Contrastive Language-Image Pre-training
