TL;DR
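SL-HOI is a streamlined open-vocabulary human-object interaction (HOI) detection framework built solely on DINOv3. Because localization and classification both draw on the same frozen model, it avoids the cross-model representation gap of detector-plus-VLM pipelines, and it reports state-of-the-art results while adding only a small number of trainable parameters.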
Contribution
The paper introduces SL-HOI, a framework that performs open-vocabulary HOI detection with a frozen DINOv3 model and only a small number of additional parameters.
Findings
Achieves state-of-the-art performance on SWiG-HOI and HICO-DET benchmarks.
Effectively bridges representation gaps between localization and classification components.
Uses a simple yet effective architecture with frozen DINOv3 parameters.
Abstract
Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we…
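To make the open-vocabulary classification idea concrete, here is a minimal sketch of how text-aligned features allow scoring against arbitrary category names: an interaction embedding is compared to the text embeddings of candidate HOI phrases by cosine similarity, and the scores are turned into probabilities. This is the generic zero-shot recipe for text-aligned vision models, not the SL-HOI implementation. The function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions made for illustration; a real system would obtain the embeddings from the model's vision head and text encoder.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_open_vocab(query_emb, text_embs, temperature=0.07):
    # Cosine similarity between one interaction embedding and each text
    # embedding, divided by a temperature (hypothetical value) and normalized.
    q = l2_normalize(query_emb)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(t))) for t in text_embs]
    return softmax([s / temperature for s in sims])

# Toy data (assumed): each label has a hand-written stand-in "text embedding".
labels = ["riding a bicycle", "holding a cup", "feeding a horse"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Stand-in "interaction query embedding" that lies closest to the third label.
query_emb = [0.1, 0.2, 0.9]

probs = classify_open_vocab(query_emb, text_embs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best, [round(p, 3) for p in probs])
```

Because the category set enters only through its text embeddings, a phrase unseen during training can be scored by appending it to `labels` and supplying its embedding, with no change to the classifier itself.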
