TL;DR
This paper introduces TIGER, a two-stage scene text image super-resolution framework that prioritizes text structure restoration before image enhancement, improving both readability and image quality.
Contribution
The paper proposes a novel two-stage super-resolution method with glyph structure guidance and introduces the UZ-ST dataset for scene text images.
Findings
TIGER reports state-of-the-art super-resolution results on the Real-CE and UZ-ST benchmarks.
It significantly improves text readability in super-resolved images.
The UZ-ST dataset enables comprehensive training and evaluation.
Abstract
Current image super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and uses them to guide full-image super-resolution. This ensures high fidelity and readability. To support comprehensive training and evaluation, we present the UZ-ST (UltraZoom-Scene Text) dataset, the first Chinese scene text dataset with extreme zoom. Extensive experiments show TIGER achieves state-of-the-art performance, enhancing readability and image quality.
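The "text-first, image-later" control flow described in the abstract can be sketched as follows. This is a toy illustration only, not the authors' implementation: the function names (`restore_text_mask`, `enhance_image`, `text_first_sr`) are hypothetical, nearest-neighbour upscaling stands in for the learned models, and a simple darkness threshold stands in for glyph restoration. It shows only the ordering: Stage 1 produces a high-resolution text mask, and Stage 2 consumes that mask as guidance while upscaling the full image.

```python
from typing import List, Tuple

# Grayscale image as rows of pixel intensities in [0, 1].
Image = List[List[float]]


def upscale_nearest(img: Image, scale: int) -> Image:
    """Nearest-neighbour upscaling; a stand-in for a learned super-resolver."""
    return [[px for px in row for _ in range(scale)]
            for row in img for _ in range(scale)]


def restore_text_mask(lr: Image, scale: int, threshold: float = 0.5) -> Image:
    """Stage 1 (hypothetical stand-in): build a high-resolution binary glyph mask.

    Assumption: dark pixels are text. A real system would restore glyph
    structure with a learned model rather than a threshold.
    """
    up = upscale_nearest(lr, scale)
    return [[1.0 if px < threshold else 0.0 for px in row] for row in up]


def enhance_image(lr: Image, mask: Image, scale: int) -> Image:
    """Stage 2 (hypothetical stand-in): upscale the full image under mask guidance.

    Guidance here just forces masked pixels to full ink (0.0); unmasked
    pixels keep their upscaled background values.
    """
    up = upscale_nearest(lr, scale)
    return [[0.0 if m == 1.0 else px for px, m in zip(row, mask_row)]
            for row, mask_row in zip(up, mask)]


def text_first_sr(lr: Image, scale: int = 2) -> Tuple[Image, Image]:
    """Run text restoration first, then mask-guided image enhancement."""
    mask = restore_text_mask(lr, scale)
    sr = enhance_image(lr, mask, scale)
    return mask, sr


if __name__ == "__main__":
    low_res = [[0.9, 0.4],
               [0.8, 0.95]]
    text_mask, sr_image = text_first_sr(low_res, scale=2)
    for row in sr_image:
        print(row)
```

Because the stages run in sequence, any error in the Stage 1 mask propagates directly into the Stage 2 output, which is the dependency several reviewers raise below.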
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
a. Extensive experiments on both Real-CE and UltraZoom-ST demonstrate the superior super-resolution performance of the proposed model, particularly in preserving text structure fidelity.
b. This paper exploits variations in camera focal length to construct 5,036 real-world training pairs, forming the UltraZoom-ST dataset. The dataset features multi-line text instances, diverse scenarios, and varied lighting conditions.
a. Incomplete citation: The contributions of this paper overlap heavily with [1], as both studies employ a foreground text prior to guide global text image inpainting. However, the presented work fails to cite or discuss the relevant work, thereby undermining the perceived novelty of the proposed method.
b. Critical dependency on pre-trained OCR models: The 'Text Restoration process' in Stage 1 relies heavily on the performance of the OCR model. When the OCR model fails to funct
1. The paper introduces a novel two-stage framework, decoupling glyph restoration from image enhancement to solve the trade-off between text readability and image fidelity in STISR. The “text-first, image-later” paradigm is intuitive and practically useful.
2. Experiments showed that TIGER achieves consistent SOTA results across multiple benchmarks and metrics, especially in OCR-A accuracy and FID, with clear quantitative and qualitative evidence.
3. The proposed UltraZoom-ST dataset provides a
1. While the two-stage framework seems intuitive and useful, the architecture itself closely resembles TADiSR. It seems that TIGER splits the diffusion process in TADiSR into two stages, where the text mask and the SR image are restored in order instead of jointly. This improves image fidelity and text readability; however, the computational cost could be 2x higher.
2. Most baselines in the experiments are not SR methods specialized for scene text images, except DiffTSR and TADiSR. In addition, methods
1. The “text-first, image-later” idea is interesting and provides a clear conceptual separation between text restoration and image enhancement.
2. The authors successfully trained a two-stage model and presented several good qualitative visualizations.
3. The paper introduces a new approach for constructing real paired data and contributes a dataset, UltraZoom-ST, to the community.
1. The system heavily depends on OCR for text detection, and any detection errors can directly degrade the overall performance.
2. The two-stage training process is complex and increases implementation difficulty.
3. The methodological novelty is limited, as the proposed two-stage text-first paradigm mainly combines existing diffusion-based restoration and OCR-guided enhancement strategies rather than introducing a fundamentally new framework.
4. The quantitative improvements are small, with
Clear decoupling and strong motivation. Separating glyph restoration (with semantic conditioning from OCR) from later full-image enhancement directly targets the field’s chronic trade-off between readability and perceptual quality and is well motivated for complex scripts (e.g., Chinese). The “text mask → ControlNet guidance” design is coherent.
Technically sound Stage-1 and Stage-2 pipelines. Stage-1 uses VAE latents, a UNet denoiser, and dual branches (appearance and structure) with a segment
Dependence on OCR correctness with limited analysis. Stage-1 assumes reliable text localization and transcription to condition glyph restoration, but OCR degrades under severe LR noise; the paper acknowledges failures when OCR fails, yet offers no quantitative sensitivity to OCR errors or calibration analysis (e.g., how often wrong OCR pushes Stage-1 toward wrong glyphs). A robustness study to OCR precision/recall or confidence thresholds is needed.
C2. Dataset realism and alignment validation.
