DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution
Axi Niu, Kang Zhang, Qingsen Yan, Hao Jin, Jinqiu Sun, Yanning Zhang

TL;DR
DualTSR introduces a unified transformer-based framework for scene text image super-resolution that internally models visual and textual information, eliminating the need for external OCR priors and simplifying the architecture.
Contribution
DualTSR proposes a novel dual diffusion transformer that jointly models image and text distributions in a single end-to-end system for super-resolution.
Findings
Achieves high perceptual quality and text fidelity on Chinese benchmarks.
Simplifies architecture compared to multi-branch diffusion systems.
Effectively infers text priors without external OCR models.
Abstract
Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors…
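The dual diffusion objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear (straight-line) probability path for Conditional Flow Matching and an absorbing-state (masking) corruption for the discrete text branch, neither of which the abstract specifies. Plain Python lists stand in for tensors, and `MASK_ID`, `cfm_training_pair`, `mask_tokens`, `dual_loss`, and `text_weight` are hypothetical names chosen for illustration.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for a [MASK] token (assumption)


def cfm_training_pair(x1, rng):
    """Sample one flow-matching training point on a linear path (assumed path)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise endpoint
    t = rng.random()                                       # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    velocity = [b - a for a, b in zip(x0, x1)]             # d(xt)/dt on a straight line
    return xt, t, velocity


def mask_tokens(tokens, t, rng):
    """Corrupt text by masking each token independently with probability t (assumed scheme)."""
    masked = [rng.random() < t for _ in tokens]
    corrupted = [MASK_ID if m else tok for tok, m in zip(tokens, masked)]
    return corrupted, masked


def dual_loss(pred_v, target_v, token_logits, tokens, masked, text_weight=1.0):
    """Sum of a velocity regression loss and cross-entropy on masked tokens."""
    flow_loss = sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(target_v)
    nll_sum, count = 0.0, 0
    for logits, tok, m in zip(token_logits, tokens, masked):
        if not m:
            continue  # only masked positions contribute to the text loss
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        nll_sum += log_z - logits[tok]
        count += 1
    text_loss = nll_sum / max(count, 1)
    return flow_loss + text_weight * text_loss
```

In a real system, a single transformer would receive the noised image `xt`, the corrupted tokens, and the low-resolution input, and would output both the predicted velocity and the token logits, so that the two loss terms share one backbone. How the two terms are weighted in DualTSR is not stated in the abstract; `text_weight` is a placeholder.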
Taxonomy
Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment
