Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Miao Wang; Yuling Shi; Yijiang Li; Yeheng Chen; Xiaodong Gu; Bin Li; Bo Gao; Yaduan Ruan

arXiv:2605.04733 · cs.AI · May 7, 2026

TL;DR

This paper introduces EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a GRPO-based framework for video-grounded role-playing dialogue. It separates perception, reasoning, and response generation so that replies reflect both the character and the scene's atmosphere.

Contribution

The paper proposes a decoupled, reward-driven reinforcement learning approach that improves visual and character authenticity in immersive video role-playing applications.

Findings

01. EBM-RL outperforms text-only baselines and larger vision-language models in immersive role-playing tasks.

02. The framework achieves better visual-atmosphere consistency and character authenticity.

03. It demonstrates strong zero-shot generalization on VideoQA benchmarks.

Abstract

Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the…
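
The staged output format and the GRPO-style training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a response is plain text in which the literal markers `[perception]`, `[think]`, and `[answer]` appear in that order (the abstract names the stages but not the exact delimiter syntax), and it uses the standard group-relative normalization from GRPO, where each sampled response's reward is centered and scaled by the statistics of its own sampling group. The function names `split_sections` and `group_relative_advantages` are hypothetical, and the use of the population standard deviation is an assumption.

```python
import re
from statistics import mean, pstdev

# Stage names taken from the abstract; the bracket syntax is an assumption.
SECTION_TAGS = ("perception", "think", "answer")


def split_sections(text):
    """Split a response into its three stages.

    Returns a dict {tag: content}, or None unless the text starts with
    [perception] and then contains [think] and [answer] in that order.
    """
    pattern = r"\[perception\](.*?)\[think\](.*?)\[answer\](.*)"
    match = re.fullmatch(pattern, text.strip(), flags=re.DOTALL)
    if match is None:
        return None
    return {tag: part.strip() for tag, part in zip(SECTION_TAGS, match.groups())}


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within one sampling group (GRPO-style).

    Each reward is centered on the group mean and divided by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In such a setup, a response that fails `split_sections` could be given a low format score before the remaining rewards are combined, and the combined scalar per response would then be passed to `group_relative_advantages`. How EBM-RL actually weights and combines its four rewards is not stated in the text above.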
