SHOE: Semantic HOI Open-Vocabulary Evaluation Metric

Maja Noack; Qinqian Lei; Taipeng Tian; Bihan Dong; Robby T. Tan; Yixin Chen; John Young; Saijun Zhang; Bo Wang

arXiv:2604.01586 · cs.CV · April 3, 2026

TL;DR

SHOE introduces a semantic similarity-based evaluation metric for open-vocabulary human-object interaction detection, enabling more human-aligned assessment beyond exact label matching.

Contribution

SHOE proposes an evaluation framework that uses large language models to estimate the semantic similarity between predicted and ground-truth HOI labels.

Findings

01

SHOE scores align more closely with human judgments than existing metrics.

02

SHOE achieves 85.73% agreement with human ratings on HICO-DET.

03

Enables evaluation of open-ended models beyond fixed HOI labels.

Abstract

Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity…
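The contrast between exact label matching and component-wise semantic scoring can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated before it says how verb and object similarities are combined, so the product used here is an assumption, and the similarity tables `VERB_SIM` and `OBJECT_SIM` hold invented values standing in for the LLM-based estimates. All function names are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

HOI = Tuple[str, str]  # (verb, object), e.g. ("sit on", "couch")

# Hypothetical similarity scores in [0, 1]. In SHOE these would come from an
# LLM-based estimate; the numbers below are invented for illustration only.
VERB_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"lean on", "sit on"}): 0.8,
    frozenset({"sit on", "ride"}): 0.3,
}
OBJECT_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"couch", "sofa"}): 0.9,
}


def lookup(table: Dict[FrozenSet[str], float], a: str, b: str) -> float:
    """Return the similarity of two labels; identical labels score 1.0,
    and pairs missing from the table score 0.0."""
    if a == b:
        return 1.0
    return table.get(frozenset({a, b}), 0.0)


def exact_match(pred: HOI, gt: HOI) -> float:
    """Categorical credit: 1.0 only when verb and object both match exactly."""
    return 1.0 if pred == gt else 0.0


def semantic_score(pred: HOI, gt: HOI) -> float:
    """Decompose each HOI into verb and object, score each component,
    and combine them. ASSUMPTION: the product is used as the combination
    rule; the source does not specify it."""
    verb_sim = lookup(VERB_SIM, pred[0], gt[0])
    object_sim = lookup(OBJECT_SIM, pred[1], gt[1])
    return verb_sim * object_sim


if __name__ == "__main__":
    gt = ("sit on", "couch")
    for pred in [("sit on", "couch"), ("lean on", "couch"), ("ride", "horse")]:
        print(pred, exact_match(pred, gt), semantic_score(pred, gt))
```

Under these made-up values, "lean on couch" receives no credit from exact matching against "sit on couch" but partial credit from the semantic score, which is the behavior the abstract motivates.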
