EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

Umar Khalid; Kashif Munir; Hasan Iqbal; Azib Farooq; Jing Hua; Nazanin Rahnavard; Chen Chen; Victor Zhu; Zhengping Ji

arXiv:2412.10566 · cs.CV · November 11, 2025

Open Access

TL;DR

EVLM introduces a reflective reasoning framework for multimodal visual editing. It interprets ambiguous instructions by aligning model outputs with human rationales, which improves editing accuracy across image, video, 3D, and 4D content.

Contribution

The paper presents EVLM, a model that combines Chain-of-Thought (CoT) reasoning with Reflection-Aware KL-Divergence Target Optimization (RKTO) to better interpret user intent in visual editing tasks.

Findings

01

EVLM outperforms baseline models in alignment with human intent.

02

EVLM achieves significant improvements on image, video, 3D, and 4D editing tasks.

03

EVLM is trained on 30,000 CoT examples with human-annotated rationales.

Abstract

Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on…

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Speech and dialogue systems