Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Yihong Tang; Kehai Chen; Xuefeng Bai; Min Zhang

arXiv:2605.09443·cs.CV·May 12, 2026


TL;DR

This paper introduces CAVI, a training-free framework that improves multimodal role-playing agents by focusing visual perception on character-relevant information, reducing interference from generic visual noise.

Contribution

CAVI is a novel, training-free approach that systematically enhances character consistency in multimodal agents by targeting visual grounding and feature alignment.

Findings

01. CAVI significantly reduces Modality-Role Interference in RPAs.

02. Agents with CAVI show improved character consistency in visual tasks.

03. CAVI enhances multimodal interaction quality in experiments.

Abstract

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a…
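
To make the token-pruning idea concrete, here is a minimal sketch in Python. The abstract does not say how CTP scores or selects tokens, so everything here is an assumption for illustration only: the function names `cosine` and `prune_visual_tokens`, the use of cosine similarity against a single character embedding, and the `keep_ratio` parameter are all hypothetical and not taken from the paper. OFM is not sketched because its description is cut off in the abstract above.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_visual_tokens(
    tokens: Sequence[Sequence[float]],
    character_vec: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of the visual tokens to keep, in their original order.

    Assumption (not from the paper): relevance to the character is
    approximated by cosine similarity to one character embedding, and the
    top `keep_ratio` fraction of tokens is retained.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    # Always keep at least one token so the image is never dropped entirely.
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    scores = [cosine(t, character_vec) for t in tokens]
    # Rank token indices from most to least character-relevant.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so downstream layers see a consistent layout.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D "visual tokens" and a toy character embedding.
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    toy_character = [1.0, 0.0]
    print(prune_visual_tokens(toy_tokens, toy_character, keep_ratio=0.5))
```

Because this kind of selection only filters tokens at inference time and updates no weights, it is compatible with the training-free framing in the abstract. A real system would operate on the MLLM's own visual token embeddings rather than toy vectors.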
