VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Hongzhu Yi; Yujia Yang; Yuanxiang Wang; Tong Li; Zhenyu Guan; Tianyu Zong; Jiahuan Chen; Chenxi Bao; Tiankun Yang; Haopeng Jin; Yixuan Yuan; Xinming Wang; Tao Yu; Ruilin Gao; Ruiwen Tao; Haijin Liang; Jin Ma; Jinwen Luo; Yeshani; Xinyu Zuo; Jungang Xu

arXiv:2602.00122 · cs.CV · May 22, 2026

TL;DR

VDE Bench is a new benchmark dataset and evaluation framework designed to assess image editing models' ability to modify dense, bilingual Chinese-English visual documents while preserving text style and background.

Contribution

It introduces a high-quality dataset of 942 instruction-based samples and a novel OCR-based evaluation framework for complex visual document editing tasks.
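
To make the idea of OCR-based evaluation concrete, here is a minimal sketch of one way such a score could be computed: compare the text an OCR engine reads from the edited image with the text the instruction asked for, using a normalized character-level edit distance. This is an illustration under stated assumptions, not the metric defined in the paper. The function names `edit_distance` and `text_accuracy` are hypothetical, and the OCR step itself is assumed to have already produced a plain string.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points.

    Counting code points means a Chinese character and a Latin letter
    each cost one edit, which suits bilingual text.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_accuracy(ocr_text: str, target_text: str) -> float:
    """Hypothetical score: 1 minus the edit distance normalized by the longer string."""
    longest = max(len(ocr_text), len(target_text))
    if longest == 0:
        return 1.0  # two empty strings match trivially
    return 1.0 - edit_distance(ocr_text, target_text) / longest
```

A score of 1.0 means the OCR output matches the target text exactly, and lower values mean more character-level errors. A score like this only covers text correctness; preservation of text style and background would need separate measures.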

Findings

1. Human verification aligns well with automated metrics.
2. Existing models struggle with dense, bilingual documents.
3. VDE Bench is the first systematic benchmark for this task.

Abstract

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human-annotated and human-evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques