TL;DR
This paper introduces SVBench, a comprehensive benchmark for evaluating social reasoning in video generation models, revealing current models' limitations in producing socially coherent behavior.
Contribution
It presents the first benchmark grounded in social psychology paradigms, with a training-free pipeline and evaluation framework for assessing social reasoning in video generation.
Findings
Seven state-of-the-art models show limited social reasoning capabilities.
The benchmark reveals a gap between visual plausibility and social understanding.
The framework enables large-scale evaluation of social cognition in generated videos.
Abstract
Recent text-to-video generation models have made remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they still struggle to produce socially coherent behavior. Unlike humans, who readily infer intentions, beliefs, emotions, and social norms from brief visual cues, current models often generate literal scenes without capturing the underlying causal and psychological dynamics. To systematically assess this limitation, we introduce the first benchmark for social reasoning in video generation. Grounded in developmental and social psychology, the benchmark covers thirty classic social cognition paradigms spanning seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we build a fully training-free agent-based…
