UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

Chengyu Bai; Jintao Chen; Xiang Bai; Yilong Chen; Qi She; Ming Lu; Shanghang Zhang

arXiv:2508.03142·cs.CV·December 4, 2025

TL;DR

UniEdit-I introduces a training-free, closed-loop image editing method that operates within the semantic latent space of a unified vision-language model, enabling dynamic, self-correcting edits without fine-tuning.

Contribution

It presents the first training-free, semantics-driven image editing framework with an iterative understanding, editing, and verifying loop within a unified VLM.
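The control flow of such an understand, edit, verify loop can be illustrated with a small toy sketch. This is not the paper's implementation: the function `closed_loop_edit`, its callables `understand`, `edit`, and `verify`, the `threshold` and `max_iters` parameters, and the float-valued "latent" are all hypothetical stand-ins chosen only to show how verification feedback can steer the next edit step and decide when to stop.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")  # hypothetical latent-state type
T = TypeVar("T")  # hypothetical target type produced by the understanding step


def closed_loop_edit(
    state: S,
    instruction: str,
    understand: Callable[[S, str], T],
    edit: Callable[[S, T, float], S],
    verify: Callable[[S, T], float],
    threshold: float = 0.9,
    max_iters: int = 10,
) -> Tuple[S, int, float]:
    """Generic understand -> edit -> verify loop with feedback.

    Every callable here is an illustrative placeholder (an assumption),
    not a component taken from the paper.
    """
    target = understand(state, instruction)  # interpret the request
    score = verify(state, target)            # how well does the input already match?
    steps = 0
    while score < threshold and steps < max_iters:
        # Feedback: a lower verification score yields a stronger edit step.
        state = edit(state, target, 1.0 - score)
        score = verify(state, target)
        steps += 1
    return state, steps, score


# Toy stand-ins: the "latent" is a single float and the instruction names a target value.
def toy_understand(state: float, instruction: str) -> float:
    return float(instruction)


def toy_edit(state: float, target: float, strength: float) -> float:
    # Move a fraction of the remaining distance toward the target.
    return state + strength * (target - state)


def toy_verify(state: float, target: float) -> float:
    # Score in (0, 1]; equals 1.0 only when the state matches the target exactly.
    return 1.0 / (1.0 + abs(target - state))


final, steps, score = closed_loop_edit(0.0, "4.0", toy_understand, toy_edit, toy_verify)
print(final, steps, score)
```

The loop exits either when the verification score reaches the threshold or when the iteration budget runs out, which is the difference between a closed-loop procedure and a single fixed transformation.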

Findings

01. Achieves state-of-the-art performance on GEdit-Bench

02. Surpasses several large-scale pretrained editors

03. Operates without fine-tuning or architectural changes

Abstract

While Unified Vision-Language Models promise to synergistically combine the high-level semantic understanding of vision-language models with the generative fidelity of diffusion models, current editing methodologies remain fundamentally decoupled and open-loop, performing static, predefined transformations without dynamic feedback between semantic interpretation and visual generation. A central limitation stems from the representation gap: understanding typically leverages high-level, language-aligned encoders, whereas generation relies on low-level, pixel-space autoencoders, resulting in misaligned feature spaces. To bridge this gap, recent advances such as Representation Autoencoders and BLIP3-o advocate performing diffusion-based modeling directly on high-level features from pretrained semantic encoders. We find that editing in the semantic latent space modifies conceptual representations…
