UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

Lichen Ma; Xiaolong Fu; Gaojing Zhou; Zipeng Guo; Ting Zhu; Yichun Liu; Yu Shi; Jason Li; Junshi Huang

arXiv:2601.08321 · cs.CV · May 12, 2026

TL;DR

UM-Text is a novel multimodal model that enables natural language-driven visual text editing with style consistency, leveraging a visual language model, a specialized encoder, and a large dataset for training.

Contribution

The paper introduces UM-Text, a unified multimodal framework with a regional consistency loss and a new large-scale dataset for improved visual text editing.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Effectively maintains style consistency in generated visual text.

03. Demonstrates strong generalization across diverse scenes.

Abstract

With the rapid advancement of image generation, visual text editing from natural language instructions has received increasing attention. The main challenge of this task is to fully understand both the instruction and the reference image, and then generate visual text whose style is consistent with the image. Previous methods often require complex steps to specify the text content and attributes such as font size, color, and layout, without considering stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing driven by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be designed according to the context information. To generate an accurate and harmonious…
