# Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent

**Authors:** En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, and Ying Tai

arXiv: 2508.20505 · 2025-08-29

## TL;DR

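DescriptiveEdit is a semantic image editing framework that takes a reference image and a descriptive prompt as input and treats editing as text-to-image generation. It needs neither inversion nor architectural modifications to the underlying text-to-image model, which avoids the reconstruction errors of inversion-based methods and the dataset limits of instruction-based models.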

## Contribution

DescriptiveEdit re-frames instruction-based image editing as reference-image-based text-to-image generation. Its Cross-Attentive UNet adds attention bridges that inject reference image features into the prompt-driven generation process.

## Key findings

- Improves editing accuracy and consistency on the Emu Edit benchmark.
- Sidesteps the limited quality and scale of instruction datasets that constrain instruction-based models.
- Integrates with existing text-to-image extensions such as ControlNet and IP-Adapter.

## Abstract

Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.
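
The abstract describes attention bridges that inject reference image features into the generation process, but this summary does not give their exact formulation. The following is a minimal, hypothetical sketch of the general idea in plain Python: tokens from the generation branch act as queries over reference-image tokens, and the attended result is added back as a residual. The function names, the scalar `gate`, and the absence of learned query/key/value projections are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_bridge(edit_tokens, ref_tokens, gate=1.0):
    """Hypothetical bridge: generation-branch tokens attend to reference tokens.

    edit_tokens: list of d-dimensional feature vectors from the generation branch
    ref_tokens:  list of d-dimensional feature vectors from the reference image
    gate:        assumed scalar controlling how strongly reference features are injected
    """
    dim = len(edit_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in edit_tokens:
        # Scaled dot-product score between this query and every reference token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in ref_tokens]
        weights = softmax(scores)
        # Weighted average of reference tokens (they serve as both keys and values here).
        context = [
            sum(w * value[i] for w, value in zip(weights, ref_tokens))
            for i in range(dim)
        ]
        # Residual injection: gate = 0 leaves the generation features unchanged.
        fused.append([q + gate * c for q, c in zip(query, context)])
    return fused


# Toy features: 2 generation tokens and 3 reference tokens, each 2-dimensional.
edit = [[1.0, 0.0], [0.0, 1.0]]
ref = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = attention_bridge(edit, ref, gate=0.5)
print(len(fused), len(fused[0]))
```

With `gate=0.0` the function returns the generation features unchanged. That is one simple way to picture how an added bridge could leave a pretrained text-to-image model's behavior intact while still allowing reference features to be mixed in. A real implementation would operate on batched tensors with learned projections inside the UNet's attention layers.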

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20505/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20505/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/2508.20505/full.md

---
Source: https://tomesphere.com/paper/2508.20505