OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

Agus Gunawan; Samuel Teodoro; Yun Chen; Soo Ye Kim; Jihyong Oh; Munchurl Kim

arXiv:2510.24093·cs.CV·October 29, 2025

3 Reviews

TL;DR

OmniText is a training-free framework for controllable text-image manipulation, including text removal, style control, and editing. It works by manipulating attention mechanisms and comes with a new benchmark dataset.

Contribution

We introduce OmniText, a training-free generalist for diverse text image manipulation tasks. It relies on attention inversion and redistribution together with novel loss functions. We also introduce a comprehensive benchmark dataset.

Findings

01 Achieves state-of-the-art results across multiple TIM tasks

02 Effectively removes and edits text with controlled styles

03 Comparable to specialist methods in performance

Abstract

Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. **Novel Generalist Design for TIM Tasks**: OmniText is the first training-free framework to jointly support diverse TIM tasks (removal, editing, insertion, rescaling, repositioning, style-based manipulation), addressing the long-standing task specialization issue in existing works. Its modular design (text removal + controllable inpainting) enables flexible adaptation to different tasks without retraining, filling a critical gap in the TIM field.
2. **Well-Grounded Attention Manipulation**: T

Weaknesses

1. **Insufficient Qualitative Results for Edge Cases**: The paper provides qualitative results for typical TIM tasks but lacks side-by-side comparisons for edge cases, such as long-text manipulation (over 10 characters), text on complex textures (e.g., patterned fabrics), or low-resolution input images. This makes it difficult to evaluate the framework’s robustness in challenging practical scenarios.
2. **Vague Explanation of Hyperparameter Tuning**: OmniText introduces hyperparameters (e.g., we

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1) This is the first training-free generalist method that supports text insertion, editing, removal, repositioning, and rescaling, with explicit control over style fidelity, text content, and text removal, which makes it more applicable in the real world.
2) OmniText-Bench is proposed for mockup-based evaluation. It consists of 150 sets of input images, target texts with masks, reference images, and ground truth, and covers five distinct applications.
3) The qualitative and quantitative exper

Weaknesses

1) How does the method compare with recent large models such as GPT-4o, Gemini 2.5 Flash Image, and Qwen Image, which are also generalists?

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper addresses the important goal of creating a unified, generalist model for diverse TIM tasks. Its training-free approach, which avoids costly retraining by manipulating a pre-trained model's internal states, is a significant practical advantage.
- The introduction of the OmniText-Bench dataset is a valuable contribution that addresses a clear gap in evaluation resources for complex, style-oriented TIM tasks.

Weaknesses

- While the specific techniques are new, the high-level approach of using attention control and latent optimization for editing is established in the broader image editing field.
- The paper completely omits any discussion of computational cost. Methods involving per-image optimization are often significantly slower at inference. A runtime comparison against baselines is critical for understanding the practical trade-offs of the proposed framework.
- The method's generalizability is not fully
