Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

Ze Dong; Hao Shi; Zejia Gao; Zhonghua Yi; Kaiwei Wang; Lin Wang

arXiv:2604.15823·cs.CV·April 20, 2026

TL;DR

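This paper introduces EgoScreen-Emotion (ESE), a new egocentric screen-view movie emotion dataset, and a multimodal long-context reasoning framework. Together they target emotion understanding in realistic viewing scenarios, where an agent watches a movie on a screen rather than receiving the native cinematic footage, and they address the resulting domain gap.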

Contribution

The paper presents the first egocentric screen-view movie emotion dataset and a multimodal long-context reasoning model to enhance cross-domain emotion understanding.

Findings

01

Models trained on cinematic footage perform poorly on egocentric data, with Macro-F1 dropping from 27.99 to 16.69.

02

Training on EgoScreen-Emotion (ESE) improves model robustness in realistic egocentric viewing conditions.

03

The proposed approach achieves results competitive with strong closed-source models.
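
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how Macro-F1 is commonly computed for multi-label predictions, and of what the reported drop from 27.99 to 16.69 amounts to in relative terms. The function names and the toy label matrices are illustrative assumptions; the paper's own evaluation code is not shown on this page.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over two equally shaped 0/1 matrices.

    Rows are samples, columns are labels. A label with no true and no
    predicted positives contributes an F1 of 0.0 (one common convention).
    """
    num_labels = len(y_true[0])
    f1_scores = []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels


def relative_drop(before, after):
    """Fraction of the original score that was lost."""
    return (before - after) / before


# Toy data (hypothetical): 4 samples, 3 emotion labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 0]]

toy_score = macro_f1(y_true, y_pred)

# Scores reported in the first finding (Macro-F1, in percentage points).
absolute_drop = 27.99 - 16.69
fraction_lost = relative_drop(27.99, 16.69)
```

With the figures from the first finding, the absolute drop is about 11.3 points, which is roughly 40% of the original score.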

Abstract

Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework…
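
The abstract mentions a confidence-aware multi-label protocol with multiple raters but this excerpt does not describe it. As a purely hypothetical illustration of what such a protocol could look like, the sketch below averages per-rater confidences into a soft score per emotion and thresholds the result into a multi-label set. The emotion names, the confidence scale, the averaging rule, and the threshold are all assumptions, not details taken from the paper.

```python
# Hypothetical label set; the dataset's actual emotion taxonomy is not given here.
EMOTIONS = ["joy", "fear", "sadness"]


def aggregate_ratings(ratings, threshold=0.5):
    """Combine per-rater confidences into soft and hard multi-label targets.

    ratings: list of dicts mapping an emotion name to a confidence in [0, 1].
             An emotion a rater did not select counts as confidence 0.0.
    Returns (soft, hard): mean confidence per emotion, and the emotions
    whose mean confidence reaches the threshold.
    """
    n = len(ratings)
    soft = {e: sum(r.get(e, 0.0) for r in ratings) / n for e in EMOTIONS}
    hard = [e for e in EMOTIONS if soft[e] >= threshold]
    return soft, hard


# Three hypothetical raters annotating one key-frame.
ratings = [
    {"fear": 1.0},
    {"fear": 0.5, "sadness": 0.5},
    {"fear": 1.0, "sadness": 0.5},
]

soft, hard = aggregate_ratings(ratings)
```

In this toy case the raters agree strongly on "fear" and only weakly on "sadness", so only "fear" survives the threshold while the soft scores keep the ambiguity visible.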
