Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering

Louie Hong Yao; Nicholas Jarvis; Tianyu Jiang

arXiv:2508.04945 · cs.CL · January 27, 2026

TL;DR

This paper introduces a sense clustering framework for evaluating visual activity recognition, addressing verb ambiguity and multiple perspectives to provide a more human-aligned and robust assessment of model performance.

Contribution

The paper proposes a sense clustering approach to evaluation that credits synonymous verbs and alternative perspectives on the same image, which standard exact-match metrics, relying on a single gold answer, fail to capture.

Findings

01 Each image maps to around four sense clusters.

02 Cluster-based evaluation aligns better with human judgments.

03 The framework enhances robustness in activity recognition assessment.

Abstract

Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to around four sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our…
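The contrast between exact-match and cluster-based scoring can be illustrated with a minimal sketch. The truncated abstract does not give the paper's actual scoring rule, so the rule below (a prediction counts as correct if it falls in any accepted sense cluster for the image) is an assumption, and the function names, image IDs, and toy clusters are hypothetical; the verb "flying" is invented for the example and nothing here is drawn from imSitu.

```python
from typing import Dict, List, Set


def exact_match_accuracy(predictions: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of images whose predicted verb equals the single gold verb."""
    hits = sum(predictions[img] == gold[img] for img in gold)
    return hits / len(gold)


def cluster_accuracy(
    predictions: Dict[str, str], clusters: Dict[str, List[Set[str]]]
) -> float:
    """Fraction of images whose predicted verb lies in any accepted sense cluster.

    Assumed scoring rule for illustration; not confirmed by the source.
    """
    hits = sum(
        any(predictions[img] in cluster for cluster in clusters[img])
        for img in clusters
    )
    return hits / len(clusters)


# Toy data, invented for illustration (hypothetical image IDs and clusters).
gold = {"img1": "brushing", "img2": "piloting"}
clusters = {
    "img1": [{"brushing", "grooming"}],               # synonymous verbs
    "img2": [{"piloting", "flying"}, {"operating"}],  # two perspectives
}
predictions = {"img1": "grooming", "img2": "operating"}

em = exact_match_accuracy(predictions, gold)
cl = cluster_accuracy(predictions, clusters)
print(f"exact match: {em:.2f}, cluster-based: {cl:.2f}")
```

In this toy case both predictions are reasonable descriptions of the image, yet exact match scores them as errors while the cluster-based score accepts them, which is the gap the abstract describes.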
