Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi, Kaiwei Wang, Lin Wang

TL;DR
This paper introduces EgoScreen-Emotion, a new egocentric screen-view movie emotion dataset, and a multimodal long-context reasoning framework. Together they aim to improve emotion understanding in realistic viewing scenarios by closing the domain gap between native cinematic footage and movies seen through a screen.
Contribution
The paper presents the first egocentric screen-view movie emotion dataset and a multimodal long-context reasoning model to enhance cross-domain emotion understanding.
Findings
Models trained on cinematic footage degrade sharply when evaluated on egocentric screen-view data, with Macro-F1 dropping from 27.99 to 16.69.
Training on EgoScreen-Emotion (ESE) improves model robustness under realistic egocentric viewing conditions.
The proposed approach achieves results competitive with strong closed-source models.
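To put the reported Macro-F1 drop in context, here is a minimal sketch of how Macro-F1 can be computed for multi-label predictions and how the relative size of the drop follows from the two reported scores. The helper names `macro_f1` and `relative_drop` and the toy label matrices are illustrative assumptions, not the paper's evaluation code, and the exact averaging protocol the authors use is not stated in this summary.

```python
def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of per-label F1 over binary indicator rows (assumed protocol)."""
    scores = []
    for k in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # A label with no positives and no predictions contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


def relative_drop(before, after):
    """Fraction of the original score lost when moving from `before` to `after`."""
    return (before - after) / before


# Toy multi-label data: 4 frames, 3 hypothetical emotion labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 1, 1]]
toy_score = macro_f1(y_true, y_pred, num_labels=3)

# Scores reported in the findings above (cinematic vs. egocentric evaluation).
drop = relative_drop(27.99, 16.69)
print(f"toy Macro-F1: {toy_score:.4f}")
print(f"relative drop: {drop:.1%}")
```

Because Macro-F1 weights every label equally, rare emotion categories affect the score as much as frequent ones, so a drop of 11.30 points (roughly 40% of the original score) suggests the degradation is not confined to a few dominant classes.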
Abstract
Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework…
