CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

Qinqian Lei; Bo Wang; Robby T. Tan

arXiv:2508.18753·cs.CV·March 20, 2026

TL;DR

This paper introduces CrossHOI-Bench, a multiple-choice benchmark for evaluating human-object interaction (HOI) detection across vision-language models (VLMs) and HOI-specific methods. It addresses the unfair penalties of exact-match evaluation and reveals the complementary strengths and weaknesses of the two paradigms.

Contribution

It proposes a unified, multiple-choice evaluation benchmark for HOI detection that fairly compares VLMs and HOI-specific models, especially in complex scenarios.

Findings

01 Large VLMs achieve competitive zero-shot HOI detection performance.

02 VLMs struggle with multi-person scenes and action assignment.

03 HOI-specific methods excel in multi-action recognition and person-action association.

Abstract

HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for…
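
The evaluation gap described in the abstract can be illustrated with a toy sketch. Everything here is a hypothetical illustration: the function names, the example labels, and the scoring rule (a question counts as correct when the model selects at least one option and every selected option is a listed positive) are assumptions made for this example, not the benchmark's actual data or metric.

```python
# Toy comparison of exact-match scoring against a multiple-choice question.
# All labels and both scoring rules are illustrative assumptions, not the
# actual CrossHOI-Bench protocol.


def exact_match_precision(predicted, annotated):
    """Fraction of predictions found in the (possibly incomplete) annotation set."""
    if not predicted:
        return 0.0
    return sum(p in annotated for p in predicted) / len(predicted)


def multiple_choice_correct(selected, positives):
    """Assumed rule: correct if something is selected and all selections are positives."""
    selected = set(selected)
    return bool(selected) and selected <= set(positives)


# Incomplete annotation: the person also sits on the bicycle, but only
# "ride bicycle" was labeled.
annotated = {"ride bicycle"}
prediction = ["ride bicycle", "sit on bicycle"]

# Exact matching treats the valid but unannotated label as an error.
precision = exact_match_precision(prediction, annotated)

# A multiple-choice question lists explicit positives and curated negatives,
# so the same prediction is scored only against options known to be right or wrong.
positives = {"ride bicycle", "sit on bicycle"}
negatives = {"repair bicycle", "wash bicycle"}
options = positives | negatives

is_correct = multiple_choice_correct(prediction, positives)
is_wrong = multiple_choice_correct(["ride bicycle", "wash bicycle"], positives)

print(precision, is_correct, is_wrong)
```

Under these assumptions the exact-match precision drops to 0.5 even though both predicted interactions are valid, while the multiple-choice question accepts the prediction and still rejects an answer that includes a curated negative.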

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems