MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

Zichen Liu; Yue Yu; Hao Ouyang; Qiuyu Wang; Shuailei Ma; Ka Leong Cheng; Wen Wang; Qingyan Bai; Yuxuan Zhang; Yanhong Zeng; Yixuan Li; Xing Zhu; Yujun Shen; Qifeng Chen

arXiv:2512.03046·cs.CV·December 3, 2025

TL;DR

MagicQuillV2 introduces a layered visual cue system for precise, interactive image editing, combining diffusion models' semantic capabilities with granular control akin to traditional graphics software.

Contribution

It presents a novel layered composition paradigm, including a new data pipeline, control module, and spatial editing branch for improved user control in generative image editing.

Findings

01 Effective disentanglement of user intentions into visual layers

02 Enhanced control over content, position, shape, and color in generated images

03 Validated through extensive experiments demonstrating improved editing precision

Abstract

We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise…
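The four-layer stack described in the abstract can be pictured as a small container with one optional slot per cue. The following is only an illustrative sketch, not the authors' implementation: the class name `LayeredCues`, its field types, and the helper `mask_bbox` are all hypothetical, and plain nested lists stand in for image arrays.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical stand-in for a 2D image/mask array (rows of integer pixels).
Grid = List[List[int]]


@dataclass
class LayeredCues:
    """Hypothetical container for the four cue layers named in the abstract."""
    content: Optional[str] = None      # what to create (assumed: path to a cut-out image)
    spatial: Optional[Grid] = None     # where to place it (assumed: binary mask)
    structural: Optional[Grid] = None  # how it is shaped (assumed: edge/sketch map)
    color: Optional[Grid] = None       # its palette (assumed: color-stroke map)

    def active(self) -> List[str]:
        """Return the names of the layers the user actually supplied."""
        names = ("content", "spatial", "structural", "color")
        return [n for n in names if getattr(self, n) is not None]


def mask_bbox(mask: Grid) -> Optional[Tuple[int, int, int, int]]:
    """Inclusive (top, left, bottom, right) box of nonzero pixels, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Usage: a content cue plus a spatial mask; the other layers are left unset.
cues = LayeredCues(content="cutout.png", spatial=[[0, 0, 0], [0, 1, 1], [0, 0, 0]])
print(cues.active())
print(mask_bbox(cues.spatial))
```

Keeping each layer optional mirrors the idea that content, position, shape, and color are separate intentions that a user can specify independently.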

Code & Models

Models

Hugging Face: LiuZichen/MagicQuillV2-models (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Computer Graphics and Visualization Techniques