Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration

Jin Hyeon Kim; Paul Hyunbin Cho; Claire Kim; Jaewon Min; Jaeeun Lee; Jihye Park; Yeji Choi; Seungryong Kim

arXiv:2512.08922·cs.CV·December 10, 2025

Open Access

TL;DR

UniT introduces a unified diffusion transformer framework that combines vision-language understanding and OCR guidance to improve high-fidelity text restoration in degraded images, significantly reducing hallucinations and achieving state-of-the-art results.

Contribution

The paper presents UniT, a novel integrated framework combining diffusion transformers, vision-language models, and OCR modules for enhanced text-aware image restoration.

Findings

01. Achieves state-of-the-art F1-score on SA-Text and Real-Text benchmarks.

02. Substantially reduces text hallucinations compared to previous methods.

03. Effectively reconstructs fine-grained textual content in degraded images.
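
For readers unfamiliar with the metric in the first finding, the sketch below shows how an F1-score can be computed over recognized words. It is an illustration only: the function name `word_f1` and the exact-match, multiset-based matching rule are assumptions, and the actual evaluation protocol of SA-Text and Real-Text is not described on this page.

```python
from collections import Counter
from typing import List


def word_f1(predicted: List[str], reference: List[str]) -> float:
    """Word-level F1 with exact string matching (an assumed protocol)."""
    # Count each word so repeated words are matched at most as often
    # as they appear in both lists.
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    matched = sum((pred_counts & ref_counts).values())

    # No overlap (or an empty list) means precision or recall is zero.
    if matched == 0:
        return 0.0

    precision = matched / len(predicted)  # share of predictions that are correct
    recall = matched / len(reference)     # share of reference words recovered
    return 2 * precision * recall / (precision + recall)
```

Under this definition a hallucinated word lowers precision and a missed word lowers recall, so a method that invents plausible but wrong text is penalized even when the restored image looks sharp.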

Abstract

Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT…
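
The iterative interplay described in the abstract can be sketched as a simple control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the three callables (`vlm_read_text`, `tsm_spot_text`, `dit_denoise_step`), their signatures, and the ordering of operations within a step are all hypothetical, and the DiT's exact role is left open because the abstract is truncated here.

```python
from typing import Any, Callable, List, Tuple


def restore_with_iterative_guidance(
    degraded: Any,
    vlm_read_text: Callable[[Any, List[str]], str],        # hypothetical: (image, OCR hints) -> guidance text
    tsm_spot_text: Callable[[Any, int], List[str]],        # hypothetical: (latent, step) -> OCR predictions
    dit_denoise_step: Callable[[Any, int, str, Any], Any],  # hypothetical: (latent, step, guidance, image) -> latent
    init_latent: Any,
    num_steps: int,
) -> Tuple[Any, List[str]]:
    """Run a denoising loop in which textual guidance is refreshed every step."""
    # The VLM first reads the degraded image without any OCR hints.
    guidance = vlm_read_text(degraded, [])
    history = [guidance]
    latent = init_latent

    # Iterate from the noisiest step down to step 0.
    for step in reversed(range(num_steps)):
        # One denoising update conditioned on the current guidance (assumed interface).
        latent = dit_denoise_step(latent, step, guidance, degraded)
        # The text spotter produces intermediate OCR predictions from the current state.
        ocr_hints = tsm_spot_text(latent, step)
        # The VLM refines its guidance using those predictions.
        guidance = vlm_read_text(degraded, ocr_hints)
        history.append(guidance)

    return latent, history
```

Passing the components as callables keeps the sketch independent of any particular model library; in practice each one would wrap a trained network, and the returned `history` makes it possible to inspect how the guidance changes across steps.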

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Image Enhancement Techniques