Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim

TL;DR
This paper introduces VDRP, a prompt learning framework for zero-shot human-object interaction (HOI) detection. It addresses intra-class visual diversity and inter-class visual entanglement through diversity-aware and region-aware prompts, and reports state-of-the-art results on HICO-DET.
Contribution
The paper proposes a visual diversity and region-aware prompt learning method that improves zero-shot HOI detection by modeling intra-class visual diversity and reducing inter-class visual entanglement.
Findings
Achieves state-of-the-art performance on the HICO-DET benchmark.
Effectively handles intra-class visual diversity and inter-class entanglement.
Enhances verb-level discrimination with region-specific prompts.
Abstract
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise…
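The prompt-matching idea the abstract builds on can be illustrated with a toy sketch. This is not the VDRP method or CLIP itself: the three-dimensional vectors, the verb names, and the helper functions `l2_normalize` and `cosine_scores` are hypothetical stand-ins for embeddings that a pretrained vision-language model would produce. The sketch only shows how a visual feature is scored against per-verb prompt embeddings by cosine similarity in a shared space, and how two distinct verbs with similar embeddings end up with nearly identical scores, which is the inter-class entanglement problem described above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(visual_feature, prompt_embeddings):
    """Score one visual feature against every verb prompt embedding."""
    v = l2_normalize(visual_feature)
    scores = {}
    for verb, embedding in prompt_embeddings.items():
        t = l2_normalize(embedding)
        scores[verb] = sum(a * b for a, b in zip(v, t))
    return scores


# Hypothetical prompt embeddings (assumption: real ones would come from a
# text encoder). "ride" and "sit on" are deliberately close to each other.
prompt_embeddings = {
    "ride": [1.0, 0.0, 0.0],
    "sit on": [0.9, 0.1, 0.0],
    "hold": [0.0, 1.0, 0.0],
}

# Hypothetical visual feature for one human-object pair.
visual_feature = [1.0, 0.05, 0.0]

scores = cosine_scores(visual_feature, prompt_embeddings)
best_verb = max(scores, key=scores.get)
margin = scores["ride"] - scores["sit on"]

print(best_verb)
print({verb: round(score, 4) for verb, score in scores.items()})
```

In this toy setup the top two verbs are separated by only a very small margin, while the unrelated verb scores far lower. A small margin of this kind is what makes visually similar verbs hard to tell apart with a single global prompt per class.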
