CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Qinqian Lei, Bo Wang, Robby T. Tan

TL;DR
This paper introduces CrossHOI-Bench, a benchmark for evaluating human-object interaction (HOI) detection across vision-language models (VLMs) and specialized HOI methods. It addresses evaluation fairness and reveals the complementary strengths and weaknesses of the two paradigms.
Contribution
It proposes a unified, multiple-choice evaluation benchmark for HOI detection that fairly compares VLMs and HOI-specific models, especially in complex scenarios.
Findings
Large VLMs achieve competitive zero-shot HOI detection performance.
VLMs struggle with multi-person scenes and action assignment.
HOI-specific methods excel in multi-action recognition and person-action association.
Abstract
HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for…
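To make the contrast in the abstract concrete, the following is a minimal sketch of exact-match scoring under incomplete annotations versus multiple-choice scoring with explicit positives and curated negatives. It is an illustration only, not the benchmark's actual protocol or metric: the function names, the label strings, and the TP/FP/FN summary are all assumptions introduced here.

```python
from typing import Dict, Iterable, Set


def exact_match_correct(prediction: str, annotated: Set[str]) -> bool:
    """Exact-match scoring: any prediction missing from the (possibly
    incomplete) annotation set is counted as wrong."""
    return prediction in annotated


def multiple_choice_counts(
    selected: Iterable[str], positives: Set[str], negatives: Set[str]
) -> Dict[str, int]:
    """Multiple-choice scoring (hypothetical): every option is labeled
    either positive or negative, so a selection is never penalized just
    because an annotator omitted it."""
    chosen = set(selected)
    unknown = chosen - positives - negatives
    if unknown:
        raise ValueError(f"selection outside the option set: {sorted(unknown)}")
    return {
        "tp": len(chosen & positives),   # correct options selected
        "fp": len(chosen & negatives),   # curated negatives selected
        "fn": len(positives - chosen),   # correct options missed
    }


# Hypothetical image: a person riding (and therefore sitting on) a bicycle,
# where the original annotation lists only "ride bicycle".
annotated = {"ride bicycle"}
positives = {"ride bicycle", "sit_on bicycle"}
negatives = {"repair bicycle", "wash bicycle"}

exact = exact_match_correct("sit_on bicycle", annotated)
counts = multiple_choice_counts(["sit_on bicycle"], positives, negatives)
```

In this toy case the valid answer "sit_on bicycle" is marked wrong under exact matching, while the multiple-choice setup credits it as a true positive and records the unselected "ride bicycle" as a miss.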
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems
