TL;DR
OmniText is a training-free, versatile framework for controllable text image manipulation, including text removal, style control, and editing. It works by manipulating the attention mechanisms of a pre-trained model, and the work also introduces a new benchmark dataset.
Contribution
We introduce OmniText, a training-free generalist for diverse text image manipulation tasks, utilizing attention inversion and redistribution, along with novel loss functions and a comprehensive benchmark dataset.
Findings
Achieves state-of-the-art results across multiple text image manipulation (TIM) tasks
Effectively removes and edits text with controlled styles
Comparable to specialist methods in performance
Abstract
Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on…
Peer Reviews
Decision: ICLR 2026 Poster
1. **Novel Generalist Design for TIM Tasks**: OmniText is the first training-free framework to jointly support diverse TIM tasks (removal, editing, insertion, rescaling, repositioning, style-based manipulation), addressing the long-standing task specialization issue in existing works. Its modular design (text removal + controllable inpainting) enables flexible adaptation to different tasks without retraining, filling a critical gap in the TIM field.
2. **Well-Grounded Attention Manipulation**: T
1. **Insufficient Qualitative Results for Edge Cases**: The paper provides qualitative results for typical TIM tasks but lacks side-by-side comparisons for edge cases, such as long-text manipulation (over 10 characters), text on complex textures (e.g., patterned fabrics), or low-resolution input images. This makes it difficult to evaluate the framework’s robustness in challenging practical scenarios.
2. **Vague Explanation of Hyperparameter Tuning**: OmniText introduces hyperparameters (e.g., we
1) This is the first training-free generalist method that supports text insertion, editing, removal, repositioning, and rescaling, with explicit control over style fidelity, text content, and text removal, which makes it more applicable in the real world.
2) OmniText-Bench is proposed for mockup-based evaluation. It consists of 150 sets of input images, target texts with masks, reference images, and ground truth, and covers five distinct applications.
3) The qualitative and quantitative exper
1) How does the method compare with recent large models such as GPT-4o, Gemini 2.5 Flash Image, and Qwen Image, since they are also generalists?
- The paper addresses the important goal of creating a unified, generalist model for diverse TIM tasks. Its training-free approach, which avoids costly retraining by manipulating a pre-trained model's internal states, is a significant practical advantage.
- The introduction of the OmniText-Bench dataset is a valuable contribution that addresses a clear gap in evaluation resources for complex, style-oriented TIM tasks.
- While the specific techniques are new, the high-level approach of using attention control and latent optimization for editing is established in the broader image editing field.
- The paper completely omits any discussion of computational cost. Methods involving per-image optimization are often significantly slower at inference. A runtime comparison against baselines is critical for understanding the practical trade-offs of the proposed framework.
- The method's generalizability is not fully
