EgoNormia: Benchmarking Physical Social Norm Understanding
MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang

TL;DR
This paper introduces EGONORMIA, a benchmark of 1,853 multiple-choice questions grounded in egocentric videos and spanning seven norm categories, for evaluating and improving vision-language models' understanding of physically and socially grounded norms.
Contribution
It presents a novel large-scale dataset and a pipeline for generating grounded MCQs from egocentric videos, revealing current models' limitations and proposing methods to enhance normative reasoning.
Findings
State-of-the-art VLMs score at most 54% on EGONORMIA and 65% on EGONORMIA-verified.
Models show significant risks in safety and privacy norms.
Retrieval-based methods improve normative reasoning.
Abstract
Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically or socially grounded. We thus present EGONORMIA, comprising 1,853 (200 for EGONORMIA-verified) multiple-choice questions (MCQs) grounded within egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline to generate grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified, with performance across norm…
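The scores quoted in the abstract are multiple-choice accuracies, which can be broken down by norm category. The snippet below is a minimal sketch of that bookkeeping, not the authors' evaluation code: the record fields (`category`, `answer`, `predicted`), the function name, and the toy records are all assumptions made for illustration, since the dataset's actual schema is not described here.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) for MCQ predictions.

    Each record is a dict with hypothetical keys:
      "category"  - norm category of the question (e.g. "safety")
      "answer"    - the gold option label
      "predicted" - the option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if rec["predicted"] == rec["answer"]:
            correct[rec["category"]] += 1

    per_category = {cat: correct[cat] / total[cat] for cat in total}
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return overall, per_category


# Toy records, invented purely to exercise the function.
toy_records = [
    {"category": "safety", "answer": "B", "predicted": "B"},
    {"category": "safety", "answer": "A", "predicted": "C"},
    {"category": "privacy", "answer": "D", "predicted": "D"},
    {"category": "proxemics", "answer": "A", "predicted": "B"},
]

overall, per_category = accuracy_by_category(toy_records)
print(f"overall accuracy: {overall:.2f}")
for category, acc in sorted(per_category.items()):
    print(f"{category}: {acc:.2f}")
```

Reporting per-category accuracy alongside the overall figure makes it visible when a model does well on average but poorly on a single category such as safety or privacy.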
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Explainable Artificial Intelligence (XAI)
