SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan, Yixin Chen, John Young, Saijun Zhang, Bo Wang

TL;DR
SHOE introduces a semantic similarity-based evaluation metric for open-vocabulary human-object interaction detection, enabling more human-aligned assessment beyond exact label matching.
Contribution
The paper proposes an evaluation framework that uses large language models to measure semantic similarity between predicted and ground-truth HOI labels.
Findings
SHOE scores align more closely with human judgments than existing metrics.
SHOE achieves 85.73% agreement with human ratings on HICO-DET.
SHOE enables evaluation of open-ended models whose predictions go beyond a fixed set of HOI labels.
Abstract
Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity…
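The contrast between exact label matching and semantic scoring can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it explains how similarities are estimated or combined, so everything below is an assumption made for illustration. The similarity tables (`VERB_SIM`, `OBJECT_SIM`) and their values are made up, the rule of multiplying verb and object similarity is a placeholder, and `split_hoi` assumes the object is the last word of the label.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical similarity tables with made-up values in [0, 1].
# In a real system these scores would come from a language model.
VERB_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"lean on", "sit on"}): 0.7,
}
OBJECT_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"couch", "sofa"}): 0.95,
}


def split_hoi(label: str) -> Tuple[str, str]:
    """Split an HOI label into (verb, object).

    Assumption: the object is the final word, e.g.
    "lean on couch" -> ("lean on", "couch").
    """
    verb, obj = label.rsplit(" ", 1)
    return verb, obj


def lookup(table: Dict[FrozenSet[str], float], a: str, b: str) -> float:
    """Return 1.0 for identical terms, the table value if present, else 0.0."""
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)


def exact_match(pred: str, gt: str) -> float:
    """Categorical matching: credit only when the labels are identical."""
    return 1.0 if pred == gt else 0.0


def semantic_score(pred: str, gt: str) -> float:
    """Illustrative semantic score: verb similarity times object similarity.

    The product is an assumed combination rule, not the one from the paper.
    """
    pred_verb, pred_obj = split_hoi(pred)
    gt_verb, gt_obj = split_hoi(gt)
    return lookup(VERB_SIM, pred_verb, gt_verb) * lookup(OBJECT_SIM, pred_obj, gt_obj)


if __name__ == "__main__":
    pred, gt = "lean on couch", "sit on couch"
    print("exact match:", exact_match(pred, gt))
    print("semantic score:", semantic_score(pred, gt))
```

With these made-up tables, the abstract's example pair ("lean on couch" vs. "sit on couch") receives no credit under exact matching but partial credit under the semantic score, which is the gap the abstract describes.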
