MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing
Zihao Lin, Wanrong Zhu, Jiuxiang Gu, Jihyung Kil, Christopher Tensmeyer, Lin Zhang, Shilong Liu, Ruiyi Zhang, Lifu Huang, Vlad I. Morariu, Tong Sun

TL;DR
MiLDEdit introduces a reasoning-based framework for multi-layer design document editing, addressing the challenge of layer-aware modifications guided by natural language instructions, and establishes a new benchmark for this task.
Contribution
The paper presents MiLDEAgent, a novel multi-layer document editing framework combining reasoning and editing, along with MiLDEBench, a comprehensive benchmark dataset for evaluation.
Findings
MiLDEAgent outperforms open-source models in multi-layer editing tasks.
Existing models struggle with multi-layer reasoning and format adherence.
MiLDEAgent achieves performance comparable to closed-source models.
Abstract
Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing…
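The two-stage design described in the abstract (a reasoner that decides which layers to touch, followed by an editor that modifies only those layers) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the `Layer` and `LayerEdit` types, the function names, and the toy `keyword_reasoner` and `tagging_editor` are hypothetical stand-ins for the RL-trained multimodal reasoner and the image editor, and strings stand in for pixel data.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Layer:
    """Hypothetical layer record; `content` is a string stand-in for pixels."""
    name: str
    kind: str      # e.g. "decoration", "text", or "image"
    content: str


@dataclass(frozen=True)
class LayerEdit:
    """A planned edit: which layer to change and the layer-level instruction."""
    layer_name: str
    instruction: str


# Assumed interfaces: the reasoner returns a layer-level instruction or None
# (leave the layer untouched); the editor returns a modified copy of the layer.
Reasoner = Callable[[str, Layer], Optional[str]]
Editor = Callable[[Layer, str], Layer]


def plan_edits(instruction: str, layers: List[Layer], reasoner: Reasoner) -> List[LayerEdit]:
    """Stage 1: ask the reasoner about each layer and collect the planned edits."""
    edits = []
    for layer in layers:
        layer_instruction = reasoner(instruction, layer)
        if layer_instruction is not None:
            edits.append(LayerEdit(layer.name, layer_instruction))
    return edits


def apply_edits(layers: List[Layer], edits: List[LayerEdit], editor: Editor) -> List[Layer]:
    """Stage 2: run the editor only on planned layers, preserving layer order."""
    by_name = {edit.layer_name: edit.instruction for edit in edits}
    return [
        editor(layer, by_name[layer.name]) if layer.name in by_name else layer
        for layer in layers
    ]


def keyword_reasoner(instruction: str, layer: Layer) -> Optional[str]:
    """Toy reasoner (assumption): select a layer if its kind is named in the instruction."""
    if layer.kind in instruction.lower():
        return f"{instruction} (layer: {layer.name})"
    return None


def tagging_editor(layer: Layer, layer_instruction: str) -> Layer:
    """Toy editor (assumption): record the instruction instead of editing pixels."""
    return replace(layer, content=f"{layer.content} [edited: {layer_instruction}]")


def edit_document(instruction: str, layers: List[Layer],
                  reasoner: Reasoner = keyword_reasoner,
                  editor: Editor = tagging_editor) -> List[Layer]:
    """Plan layer-wise edits, then apply them."""
    return apply_edits(layers, plan_edits(instruction, layers, reasoner), editor)


poster = [
    Layer("bg", "decoration", "blue gradient"),
    Layer("title", "text", "Summer Sale"),
    Layer("photo", "image", "beach photo"),
]
result = edit_document("Make the text bolder", poster)
```

In this sketch only the `title` layer changes, while `bg` and `photo` pass through untouched. Note that `plan_edits` consults the reasoner once per layer with no shared state, so edits planned for different layers cannot be coordinated; a real system would need some cross-layer context to avoid conflicting changes.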
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. The paper introduces a new benchmark targeting layered document editing, which is overlooked by prior works. 2. The proposed GRPO fine-tuning brings some improvements over the open-source baseline models on the proposed evaluation metrics.
1. Overclaimed performance improvement. The authors claim "Delivering over 50% improvements compared to all open-source baselines and attaining performance comparable to closed-source models" (L051). Importantly, the proposed method is not directly compared to existing methods in Table 2, which is not a standard practice in a benchmark paper. - If we take the 3B model result in Figure 4 (a) and compare it with open-source models in Table 2, Instruction Following (13.29 vs 14.23 BAGEL) / Layou
1. Multi-layer design document editing is a challenging and important problem to investigate. 2. The dataset construction pipeline is well designed, and the evaluation metrics are well defined. The resulting benchmark (MiLDEBench) and evaluation protocol (MiLDEEval) could encourage and benefit future works on design document editing. 3. The proposed reasoning-based editing method (MiLDEAgent) is well developed and shows promising results.
1. The experimental setup in Section 4.2 is questionable. All the existing models observe the input design document through only a rendered image. Thus, they are not aware of the design’s multi-layer structure, and are not given element attributes explicitly. A better option could be to provide the structure and attribute information directly to some of these models. For example, it is possible to create a structured representation (e.g., in JSON format) that contains the attributes (e.g., categ
1. The paper addresses a genuine gap in current vision-language research by formalizing multi-layer design document editing, which has clear practical applications in real-world creative workflows. 2. MiLDEBench is carefully curated with human-in-the-loop validation, incorporating both document-level and layer-aligned instructions. The data generation pipeline using persona-conditioned and document-conditioned prompts ensures diversity and realism.
1. MiLDEAgent is only trained and evaluated on content editing, not layout editing, which represents only half of the proposed benchmark. This significantly limits the contribution's completeness and practical applicability. 2. The reasoner makes per-layer decisions independently, which can lead to conflicting edits when multiple layers interact. This is a fundamental architectural limitation that could cause cascading errors in complex documents. 3. While Figure 2 shows performance degradation
This paper provides a comprehensive benchmark for the multi-layer design document editing task. Multi-dimensional metrics from content editing and layout editing aspects have been considered. The proposed benchmark and evaluation protocol could be useful for future research in the community. The proposed method is also a reasonable and effective solution for design document editing.
1. The statement of the proposed benchmark with two complementary axes (i.e., content editing and layout editing) is a little bit misleading. The main experimental analysis focuses almost exclusively on content editing, with layout editing results relegated to Appendix B.2. Moreover, based on the layout editing evaluation in Appendix B.2, the current evaluation metrics are not quite reliable (L768-L763). 2. The choice of Crello over synthetically generated datasets like PrismLayers (Chen et al.,
Taxonomy
Topics: Digital Humanities and Scholarship · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics
