Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing
Yangyang Xu, Wenqi Shao, Yong Du, Haiming Zhu, Yang Zhou, Ping Luo, Shengfeng He

TL;DR
This paper presents TODInv, a novel diffusion inversion framework that achieves high-fidelity image reconstruction and precise, task-specific editing by optimizing prompt embeddings across U-Net layers, outperforming existing methods.
Contribution
Introducing TODInv, a hierarchical, task-oriented diffusion inversion method that balances reconstruction fidelity and editability through reciprocal optimization of prompt embeddings.
Findings
Outperforms existing inversion and editing methods quantitatively.
Achieves high-fidelity reconstructions with precise, task-specific edits.
Demonstrates versatility with few-step diffusion models.
Abstract
Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks by optimizing prompt embeddings within the extended \(\mathcal{P}^*\) space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. This hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, optimizing only those embeddings unaffected by the current editing task. Extensive…
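The task-dependent selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the layer names, the resolution threshold, and the association of low-resolution layers with structure and high-resolution layers with appearance are all assumptions made for illustration. How the paper handles global edits is not stated in the text here; under this two-group assumption the sketch simply leaves no embeddings to optimize for them.

```python
from typing import Dict, List, Set

# Hypothetical U-Net cross-attention layers and their feature resolutions.
# These names and values are illustrative, not taken from the paper.
LAYER_RESOLUTION: Dict[str, int] = {
    "down_64": 64, "down_32": 32, "down_16": 16, "mid_8": 8,
    "up_16": 16, "up_32": 32, "up_64": 64,
}

# Assumption: layers at or below this resolution mostly govern structure,
# layers above it mostly govern appearance.
STRUCTURE_MAX_RES = 16

# Which embedding groups each editing task is assumed to affect.
AFFECTED: Dict[str, Set[str]] = {
    "structure": {"structure"},
    "appearance": {"appearance"},
    "global": {"structure", "appearance"},
}


def layer_group(name: str) -> str:
    """Assign a layer to the 'structure' or 'appearance' group by resolution."""
    if LAYER_RESOLUTION[name] <= STRUCTURE_MAX_RES:
        return "structure"
    return "appearance"


def trainable_layers(task: str) -> List[str]:
    """Return the layers whose prompt embeddings are optimized during inversion.

    Only embeddings in groups NOT affected by the editing task are returned,
    so the embeddings the edit relies on are left untouched.
    """
    if task not in AFFECTED:
        raise ValueError(f"unknown editing task: {task!r}")
    return sorted(
        name for name in LAYER_RESOLUTION
        if layer_group(name) not in AFFECTED[task]
    )


if __name__ == "__main__":
    for task in ("structure", "appearance", "global"):
        print(task, "->", trainable_layers(task))
```

In a real pipeline each returned layer name would index a separate learnable prompt embedding (one per layer and time step, per the abstract), and only those would receive gradient updates during inversion.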
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
The authors propose layer-wise prompt optimization for inversion.
1. The paper lacks novelty. The idea of layer-wise optimization comes from the P+: Extended prompt optimization paper. Although P+ is an extension of textual inversion, I think the proposed inversion is just a simple modification of Null-text inversion with P+. 2. The method lacks efficiency. Its optimization takes over 20 seconds. Since many editing methods enable editing within 5 seconds, or even in a single step, this proposed method has no clear adv
- The inversion method seems to perform well on multiple editing models. - The method is clearly described.
- I have some concerns about the technical contribution of the paper. The main contribution claimed is the joint optimization process of the inversion and editing framework. However, my understanding is that it only contains the inversion process that tries to optimize $P^*$. The second claimed contribution is the idea of classifying the prompt embeddings of different layers into multiple groups according to resolution. This is adopted from a personalized editing work, NeTI [1]. In general, I don’t thin
1. The quantitative experiments are dense, which is good. 2. The network structure figures are drawn in a professional way. 3. The method is explained clearly and the paper is easy to follow.
1. The quality of the edited images is very poor compared with SoTA methods nowadays, considering that it is the end of 2024 now. The techniques in this paper are still from almost two years ago, and they have been improved a lot in recent years. 2. The method requires optimization, yet its editing results are far worse than Imagic from CVPR 2022 and its later improved versions. The authors did not compare with Imagic or the later methods that improve on it. Especially, Imagic and its improved
- The idea is straightforward. - The proposed TODInv performs well on certain single-object editing examples; editing flexibility and background preservation are balanced. - By further categorizing the editing types into structure/appearance/structure & appearance, more fine-grained qualitative and quantitative analyses are carried out on PIE-bench.
- The idea of extended textual inversion is not new (e.g. NeTI); the extension proposed in this paper seems less flexible and heuristic. - The inversion and editing paradigm in Fig. 3 is not clearly explained by the caption. A dedicated paragraph explaining the process, as well as its relation to previous editing pipelines, should be added. - The paper lacks evaluations in more complex scenarios such as multiple objects and larger-area editing. - Requires more manual setting which could lead to less stab
1. The paper is well-motivated: Even within text-based image editing tasks, different editing prompts require distinct editing capabilities, making it challenging to faithfully execute all editing prompts with a single unified approach. This paper effectively identifies this limitation and proposes a new method based on this observation. 2. The method is simple and the paper is easy to follow. 3. The paper provides extensive quantitative and qualitative experimental results across various editin
1. Poor presentation * The paper contains several errors in highlighting the best values in bold within tables. Most of these errors occur when the proposed method's values are incorrectly marked in bold despite not being the best performing results. These errors could potentially mislead readers into overestimating the performance of the proposed method and significantly diminish the credibility of the reported experimental results. * There is a significant formatting error in [page 7,
Taxonomy
Topics: Algorithms and Data Compression
Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Diffusion
