Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Chanhyeong Yang; Taehoon Song; Jihwan Park; Hyunwoo J. Kim

arXiv:2510.25094 · cs.CV · October 30, 2025

TL;DR

This paper introduces VDRP, a prompt learning framework for zero-shot human-object interaction (HOI) detection. It addresses intra-class visual diversity and inter-class visual entanglement through diversity-aware and region-aware prompts, and reports state-of-the-art results on HICO-DET.

Contribution

The paper proposes a new visual diversity and region-aware prompt learning method that improves zero-shot HOI detection by capturing intra-class diversity and inter-class entanglement.

Findings

01. Achieves state-of-the-art performance on the HICO-DET benchmark.

02. Effectively handles intra-class visual diversity and inter-class entanglement.

03. Enhances verb-level discrimination with region-specific prompts.

Abstract

Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise…
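
The shared-embedding-space idea the abstract refers to can be illustrated with a small sketch. This is not VDRP itself, and not CLIP's actual API: it only shows the generic step of scoring one visual feature against per-verb prompt embeddings by cosine similarity. The three-dimensional vectors, the verb names, the `union_region_feature` variable, and the temperature value are all made-up placeholders; in a real system the embeddings would come from a pretrained vision-language model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(visual_feature, prompt_embeddings):
    """Cosine similarity between one visual feature and each verb prompt embedding."""
    v = l2_normalize(visual_feature)
    scores = {}
    for verb, emb in prompt_embeddings.items():
        t = l2_normalize(emb)
        scores[verb] = sum(a * b for a, b in zip(v, t))
    return scores


def softmax(scores, temperature=0.01):
    """Turn similarities into a distribution; the temperature is an arbitrary choice here."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp((s - top) / temperature) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Toy stand-ins for learned prompt embeddings (hypothetical values).
prompt_embeddings = {
    "ride": [0.9, 0.1, 0.0],
    "hold": [0.1, 0.9, 0.1],
    "wash": [0.0, 0.2, 0.9],
}

# Toy stand-in for the visual feature of one human-object pair (hypothetical values).
union_region_feature = [0.8, 0.3, 0.1]

scores = cosine_scores(union_region_feature, prompt_embeddings)
probs = softmax(scores)
predicted_verb = max(probs, key=probs.get)
print(predicted_verb, scores)
```

Because the verb prompts are just vectors in the same space as the visual feature, a verb-object pair that never appeared in training can still be scored, which is what makes the zero-shot setting possible. The two failure modes named in the abstract map onto this picture directly: a single prompt vector per verb struggles when instances of that verb are visually spread out, and two verbs with nearby prompt vectors are hard to separate.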
