UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang

TL;DR
UniEdit-I introduces a training-free, closed-loop image editing method that operates within the semantic latent space of a unified vision-language model, enabling dynamic, self-correcting edits without fine-tuning.
Contribution
It presents the first training-free, semantics-driven image editing framework with an iterative understanding, editing, and verifying loop within a unified VLM.
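The understand, edit, verify loop named above can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions: the names `closed_loop_edit`, `understand`, `edit`, `verify`, `max_iters`, and `threshold` are hypothetical, the latent is a plain list of floats, and the toy stand-in functions are not the authors' components. This listing does not describe how the paper implements each stage.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # assumption: a stand-in for a semantic latent


def closed_loop_edit(
    latent: Latent,
    instruction: str,
    understand: Callable[[Latent, str], Latent],
    edit: Callable[[Latent, Latent, float], Latent],
    verify: Callable[[Latent, Latent], float],
    max_iters: int = 10,
    threshold: float = 0.95,
) -> Tuple[Latent, float, int]:
    """Generic closed loop: derive a target, then edit until verification passes."""
    # Understanding: turn the source latent and the instruction into a target.
    target = understand(latent, instruction)
    score = verify(latent, target)
    iters = 0
    # Editing and verifying alternate until the score is high enough
    # or the iteration budget runs out.
    while score < threshold and iters < max_iters:
        latent = edit(latent, target, score)
        score = verify(latent, target)
        iters += 1
    return latent, score, iters


# Toy stand-ins (hypothetical, for illustration only).
def toy_understand(latent: Latent, instruction: str) -> Latent:
    # Pretend the instruction maps to an all-ones target latent.
    return [1.0] * len(latent)


def toy_edit(latent: Latent, target: Latent, score: float) -> Latent:
    # Move each coordinate halfway toward the target.
    return [x + 0.5 * (t - x) for x, t in zip(latent, target)]


def toy_verify(latent: Latent, target: Latent) -> float:
    # Score in [0, 1]: one minus the largest coordinate error, clipped at 1.
    err = max(abs(t - x) for x, t in zip(latent, target))
    return 1.0 - min(err, 1.0)


final_latent, final_score, n_iters = closed_loop_edit(
    [0.0, 0.0, 0.0], "toy instruction", toy_understand, toy_edit, toy_verify
)
```

The point of the sketch is the feedback path: the verification score decides whether another edit step runs, in contrast to an open-loop editor that applies one fixed transformation.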
Findings
Achieves state-of-the-art performance on GEdit-Bench
Surpasses several large-scale pretrained editors
Operates without fine-tuning or architectural changes
Abstract
While Unified Vision-Language Models promise to synergistically combine the high-level semantic understanding of vision-language models with the generative fidelity of diffusion models, current editing methodologies remain fundamentally decoupled and open-loop, performing static, pre-defined transformations without dynamic feedback between semantic interpretation and visual generation. A central limitation stems from the representation gap: understanding typically leverages high-level, language-aligned encoders, whereas generation relies on low-level, pixel-space autoencoders, resulting in misaligned feature spaces. To bridge this gap, recent advances such as Representation Autoencoders and BLIP3-o advocate performing diffusion-based modeling directly on high-level features from pretrained semantic encoders. We find editing in the semantic latent space modifies conceptual representations…
