Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing
Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Yaduan Ruan

TL;DR
This paper introduces EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a GRPO-based reinforcement learning framework for immersive video role-playing. It explicitly separates perception, reasoning, and response generation, aiming for dialogue that is more authentic to the character and more consistent with the scene's atmosphere.
Contribution
The paper proposes a decoupled, reward-driven reinforcement learning approach that improves visual and character authenticity in immersive video role-playing applications.
Findings
EBM-RL outperforms text-only baselines and larger vision-language models in immersive role-playing tasks.
The framework achieves better visual-atmosphere consistency and character authenticity.
It demonstrates strong zero-shot generalization on VideoQA benchmarks.
Abstract
Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the…
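The three-stage output structure and the GRPO-style training signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract names the [perception], [think], and [answer] markers but does not give their exact syntax, so treating them as opening-only delimiters is an assumption. The helper names `split_sections` and `group_relative_advantages` are hypothetical. The second function shows only the standard group-relative normalization used in GRPO, applied to already-computed scalar rewards; the paper's individual reward terms are not reproduced here.

```python
import re
from statistics import mean, pstdev

# Section markers named in the abstract. Treating them as opening-only
# delimiters is an assumption; the exact tag syntax is not given.
TAGS = ("perception", "think", "answer")
_PATTERN = re.compile(
    r"\[perception\](.*?)\[think\](.*?)\[answer\](.*)", re.DOTALL
)


def split_sections(response):
    """Split a response into its three sections, or return None if malformed."""
    match = _PATTERN.search(response)
    if match is None:
        return None
    parts = [part.strip() for part in match.groups()]
    if not all(parts):
        # Every section must contain some text.
        return None
    return dict(zip(TAGS, parts))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    sample = "[perception] dim hall [think] she is afraid [answer] Stay close."
    print(split_sections(sample))
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

A response that skips or reorders a section yields `None`, which a training loop could map to a zero format reward. That mapping is a design choice of this sketch, not something stated in the abstract.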
