GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Mingyu Sung; Seungjae Ham; Kangwoo Kim; Yeokyoung Yoon; Sangseok Yun; Il-Min Kim; Jae-Mo Kang

arXiv:2510.26339·cs.CV·October 31, 2025

TL;DR

GLYPH-SR introduces a vision-language-guided diffusion framework that enhances image super-resolution to improve both scene-text readability and perceptual quality, outperforming previous methods especially in text recovery within complex natural scenes.

Contribution

The paper proposes GLYPH-SR, a novel diffusion-based model guided by OCR data and a ping-pong scheduler, explicitly optimizing for high-quality text recovery and perceptual realism in scene-text super-resolution.
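As a rough illustration of what a binary ping-pong schedule could look like, the sketch below alternates the active guidance branch between a text-focused mode and a scene-focused mode across denoising steps. The function name `ping_pong_schedule`, the `period` parameter, and the labels `"text"` and `"scene"` are assumptions made for illustration only; the paper's actual scheduler may be defined differently.

```python
from typing import List


def ping_pong_schedule(num_steps: int, period: int = 1,
                       start_with_text: bool = True) -> List[str]:
    """Return, for each denoising step, which guidance branch is active.

    Hypothetical sketch: the schedule is binary, so every step uses exactly
    one of two modes, and the mode flips every `period` steps.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if period < 1:
        raise ValueError("period must be at least 1")

    first, second = ("text", "scene") if start_with_text else ("scene", "text")
    # Integer division groups steps into blocks of length `period`;
    # even-numbered blocks use the first mode, odd-numbered blocks the second.
    return [first if (step // period) % 2 == 0 else second
            for step in range(num_steps)]


schedule = ping_pong_schedule(num_steps=6, period=1)
```

In a sampler loop, each entry of `schedule` would select which conditioning signal is applied at that step; how the two signals are actually weighted or injected is not specified in this summary.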

Findings

01

Improves OCR F1 score by up to 15.18 percentage points at ×8 scale.

02

Maintains competitive perceptual quality metrics like MANIQA, CLIP-IQA, and MUSIQ.

03

Effective in complex natural scenes with realistic text recovery.
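To make the OCR F1 finding concrete, the sketch below computes a word-level F1 score from OCR output and expresses the difference between two methods in percentage points. The matching protocol (case-insensitive bag-of-words overlap) and the function name `ocr_f1` are assumptions for illustration; the paper's exact evaluation protocol is not described in this summary, and the example strings are made up.

```python
from collections import Counter
from typing import Iterable


def ocr_f1(predicted_words: Iterable[str], reference_words: Iterable[str]) -> float:
    """Word-level F1 between OCR output and ground-truth text.

    Assumed protocol: case-insensitive multiset overlap of words.
    """
    pred = Counter(w.lower() for w in predicted_words)
    ref = Counter(w.lower() for w in reference_words)
    matched = sum((pred & ref).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one misread word out of three.
reference = ["OPEN", "24", "HOURS"]
baseline_f1 = ocr_f1(["OPEN", "24", "HOUNS"], reference)
restored_f1 = ocr_f1(["OPEN", "24", "HOURS"], reference)

# A gain in percentage points is the absolute difference of the two scores
# scaled by 100, not a relative improvement.
gain_pp = (restored_f1 - baseline_f1) * 100
```

Under this reading, a gain of 15.18 percentage points means the F1 score itself rises by 0.1518 in absolute terms.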

Abstract

Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 4

Strengths

1. The bi-objective view of SR (visual realism + text fidelity) is intuitive and important for practical use, addressing a point neglected in most STISR work. 2. The TS-ControlNet + ping-pong scheduler combination is an intuitive fit for jointly optimizing image perceptual quality and text legibility. 3. The four-way partitioned synthetic corpus fits the claimed objective of joint optimization for training purposes.

Weaknesses

1. Most baselines in the experiments, except DiffTSR, are not SR methods specialized for scene-text images. In addition, methods like DiffTSR are not built for restoring a full scene-text image, but for cropped images that contain only a single text line. The comparison could be unfair. 2. Although most baselines were not built for scene-text images, the proposed GLYPH-SR still cannot outperform them consistently, even in terms of OCR accuracy. 3. As mentioned in Sec. C.3, the restoration performance

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper compellingly argues that text legibility is a critical yet overlooked aspect of SR in practical applications. It provides a clear analysis of the systemic biases (metric and objective) in prior work that lead to text hallucination or conservative restoration, effectively framing the need for a dual-objective approach. 2. The proposed TS-ControlNet architecture and the binary ping-pong scheduler are elegant and effective solutions for fusing semantic text cues with global image prio

Weaknesses

1. The related work and experiment sections (Sections 2 and 4) lack a thorough discussion of several recent and highly relevant works that also leverage VLMs, text prompts, or diffusion models for text-aware image restoration. Notable omissions include, but are not limited to: a) Zhang et al. (2024), "Diffusion-based Blind Text Image Super-Resolution" b) Chen et al. (2024), "Image Super-Resolution with Text Prompt Diffusion" / "Universal Image Restoration with Text Prompt Diffusion"

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Proposed the GLYPH-SR framework with TS-ControlNet, allowing fine-grained control over both glyph-level details and scene-level realism. Furthermore, a ping-pong scheduler is introduced to dynamically balance visual fidelity and text legibility during the denoising process. 2. Constructed a factorized synthetic corpus separating text degradation from global image degradation, enabling controlled fine-tuning and clear ablation analysis. 3. Analyzed the trade-off between SR metrics

Weaknesses

1. The novelty could be further improved. - The text branch of the proposed TS-ControlNet continues to adopt the plain ControlNet structure, without any specific modifications for its text-focused role. Introducing task-oriented designs could potentially further improve its performance. - Although the paper considers the trade-off between SR and OCR metrics, it does not introduce a unified metric to evaluate both scene reconstruction and text restoration quality. 2. The ex

Reviewer 04·Rating 4·Confidence 4

Strengths

1. Clear goal (make text readable, not just “look sharp”). 2. Comprehensive evaluation with OCR metrics and perceptual IQA.

Weaknesses

1. Limited analysis of trade-offs (e.g., when text gets clearer, what happens to non-text textures?). 2. No multilingual or curved-text stress test. 3. Sensitivity to OCR detector quality is not studied.
