TL;DR
This paper introduces a sense clustering framework for evaluating visual activity recognition, addressing verb ambiguity and multiple perspectives to provide a more human-aligned and robust assessment of model performance.
Contribution
The paper proposes a vision-language sense clustering approach to evaluation that accounts for synonymous verbs and multiple valid perspectives on the same image, improving over standard exact-match metrics that rely on a single gold answer.
Findings
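In the imSitu dataset, each image maps to around four sense clusters, each representing a distinct perspective of the image.
Cluster-based evaluation aligns better with human judgments than standard exact-match evaluation.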
The framework enhances robustness in activity recognition assessment.
Abstract
Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to around four sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our…
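The contrast between exact-match and cluster-based scoring can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the cluster names, the verbs placed in each cluster, and the scoring rule (a prediction counts as correct if it belongs to any sense cluster judged valid for the image) are hypothetical and are not taken from the paper, which builds its clusters with a vision-language framework.

```python
from typing import Dict, List, Set

# Hypothetical toy sense clusters (hand-written here purely for illustration).
SENSE_CLUSTERS: Dict[str, Set[str]] = {
    "grooming": {"brushing", "grooming", "combing"},
    "controlling_vehicle": {"piloting", "operating", "flying"},
    "eating": {"eating", "dining"},
}


def exact_match_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that equal the single gold verb."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)


def cluster_accuracy(
    predictions: List[str],
    valid_clusters: List[Set[str]],
    clusters: Dict[str, Set[str]],
) -> float:
    """Fraction of predictions that fall in any sense cluster valid for the image.

    Assumed scoring rule: a predicted verb is accepted if it is a member of
    at least one cluster listed as valid for that image.
    """
    if len(predictions) != len(valid_clusters):
        raise ValueError("predictions and valid_clusters must have the same length")
    hits = 0
    for verb, cluster_ids in zip(predictions, valid_clusters):
        if any(verb in clusters[cid] for cid in cluster_ids):
            hits += 1
    return hits / len(predictions)


# Three toy images: one gold verb each, plus the clusters assumed valid for each image.
gold_verbs = ["brushing", "piloting", "eating"]
image_clusters = [{"grooming"}, {"controlling_vehicle"}, {"eating"}]
model_predictions = ["grooming", "piloting", "sleeping"]

exact = exact_match_accuracy(model_predictions, gold_verbs)
clustered = cluster_accuracy(model_predictions, image_clusters, SENSE_CLUSTERS)
```

In this toy data, exact match credits only the second prediction (one of three), while the cluster-based score also accepts "grooming" for the image whose gold verb is "brushing" (two of three). The third prediction is wrong under both schemes.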