TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance
Keren Ye, Ignacio Garcia Dorado, Michalis Raptis, Mauricio Delbracio, Irene Zhu, Peyman Milanfar, Hossein Talebi

TL;DR
TextSR is a multimodal diffusion model that improves multilingual scene text image super-resolution by integrating OCR-guided text priors, leading to more accurate and legible text reconstruction in challenging images.
Contribution
The paper introduces TextSR, a novel diffusion-based super-resolution model that incorporates OCR and text priors to enhance multilingual scene text image quality.
Findings
Outperforms existing methods on TextZoom and TextVQA datasets
Effectively localizes text regions and models multilingual character shapes
Enhances text legibility and reduces hallucinated textures
Abstract
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8…
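The abstract is cut off before it says how TextSR uses UTF-8, so the sketch below does not reproduce the authors' method. It only illustrates, under stated assumptions, what a UTF-8 byte-level representation of multilingual OCR output could look like before being passed to an encoder. The `TextRegion` container, the `utf8_token_ids` helper, the fixed length, and the padding id of 256 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextRegion:
    """Hypothetical container for one detected text region and its OCR output."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    text: str                       # string recognized by OCR, any language


def utf8_token_ids(text: str, max_len: int = 16, pad_id: int = 256) -> List[int]:
    """Map a string to a fixed-length list of UTF-8 byte values (0..255).

    Assumption: byte-level ids with a single padding id outside the byte range.
    Truncation happens at the byte level, so a multi-byte character at the
    boundary may be cut in half.
    """
    ids = list(text.encode("utf-8"))[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


# Made-up detector/OCR results, used only to exercise the helper.
regions = [
    TextRegion(box=(10, 20, 120, 48), text="café"),
    TextRegion(box=(15, 60, 90, 88), text="東京"),
]
encoded = [utf8_token_ids(r.text, max_len=8) for r in regions]
```

A byte-level view like this covers any script with a vocabulary of 256 values plus padding, which is one plausible reason to mention UTF-8 in a multilingual setting; whether TextSR works this way is not stated in the text available here.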
Taxonomy
Methods: Diffusion
