Text-Aware Image Restoration with Diffusion Models

Jaewon Min; Jin Hyeon Kim; Paul Hyunbin Cho; Jaeeun Lee; Jihye Park; Minkyu Park; Sangpil Kim; Hyunhee Park; Seungryong Kim

arXiv:2506.09993 · cs.CV · July 4, 2025

TL;DR

This paper introduces a new task and benchmark for text-aware image restoration, proposing a diffusion-based multi-task framework that improves the fidelity of textual regions in degraded images, outperforming existing methods.

Contribution

The paper makes three contributions: TAIR, a novel text-aware image restoration task; SA-Text, a large-scale benchmark; and TeReDiff, a diffusion-based multi-task model that enhances textual fidelity in image restoration.

Findings

1. TeReDiff outperforms state-of-the-art methods in text recognition accuracy.
2. The benchmark SA-Text provides diverse, densely annotated images for training and evaluation.
3. Joint training of diffusion and text-spotting modules benefits textual and visual restoration.

Abstract

Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from…
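To make the notion of textual fidelity concrete, the following is a minimal sketch of one way to score how much of the ground-truth text survives restoration. It is an illustration only, not the paper's evaluation protocol: the function name `word_f1`, the bag-of-words matching, and the sample words are all assumptions, and a real text-spotting evaluation would also match detected regions to annotated polygons.

```python
from collections import Counter


def word_f1(predicted, ground_truth):
    """Case-insensitive bag-of-words precision, recall, and F1.

    Hypothetical helper: `predicted` holds words read from a restored
    image by any OCR or text-spotting model, and `ground_truth` holds
    the annotated transcription.
    """
    pred = Counter(w.lower() for w in predicted)
    gt = Counter(w.lower() for w in ground_truth)
    # Multiset intersection counts each word at most as often as it
    # appears in both lists.
    matched = sum((pred & gt).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gt.values()) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Made-up example: the restored image renders "DAILY" as the plausible
# but incorrect "DALLY", the kind of error the abstract calls
# text-image hallucination.
gt_words = ["OPEN", "DAILY", "COFFEE"]
restored_words = ["OPEN", "DALLY", "COFFEE"]
precision, recall, f1 = word_f1(restored_words, gt_words)
```

A hallucinated word lowers both precision and recall here, whereas a pixel-level metric could still rate the restored image highly.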

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating: 6 · Confidence: 5

Strengths

- Introduces TAIR, a more realistic and general setting than existing STISR, by restoring full-scene images with multiple, diverse text instances.
- Significantly reduces text hallucination and improves readability through text-conditioned diffusion and multi-stage training.
- Achieves good performance on both text recognition metrics and standard image restoration benchmarks, demonstrating balanced fidelity and perceptual quality.

Weaknesses

- The evaluation relies heavily on synthetic or curated datasets (e.g., SA-Text), which may not fully reflect the complexity and variability of real-world degraded images.
- Lacks in-depth discussion of scenarios where the method fails (e.g., extremely low-resolution or occluded text), reducing insight into limitations.

Reviewer 02 · Rating: 4 · Confidence: 4

Strengths

1. The author clearly identifies a critical limitation in diffusion-based image restoration: its inability to accurately recover text regions. To address this, the author introduces Text-Aware Image Restoration (TAIR) as a novel task that jointly optimizes visual quality and text fidelity, offering strong potential for practical applications.
2. The author constructs a dataset of 100K high-resolution images derived from SA-1B, densely annotated with text polygons and transcriptions through a sca

Weaknesses

1. The author primarily conducts training and evaluation on synthetic degradations (e.g., Real-ESRGAN). Moreover, while the Real-Text results (Table 3) demonstrate modest improvements (e.g., a +6% F1-score over FaithDiff), the absence of corresponding visual examples limits the clarity of qualitative gains.
2. The author provides no analysis of error propagation across the pipeline. Given that the method relies on accurate text spotting, recognition errors are likely to propagate and compromise

Reviewer 03 · Rating: 6 · Confidence: 4

Strengths

1. Excellent originality: This paper purposefully introduces a new task (TAIR), along with a corresponding dataset and a SOTA model capable of performing both tasks simultaneously.
2. Good quality and performance: The paper constructs a large-scale, high-quality dataset for TAIR and designs a novel model architecture that leverages joint training for both text-spotting and restoration tasks, achieving outstanding performance across multiple benchmarks.
3. The paper is well-written and easy to fo

Weaknesses

1. While the paper emphasizes that the new TAIR task differs from previous models by focusing on text-image hallucination (text readability), it fails to sufficiently demonstrate the model's superiority in this core aspect. The evidence is limited to a few comparative images and existing metrics that are not fully relevant. This lack of qualitative comparison methods specific to the TAIR task undermines the credibility of its performance evaluation.
2. The comparison in Table 4 is insufficient t

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Digital Media Forensic Detection

Methods: Diffusion