DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution

Axi Niu; Kang Zhang; Qingsen Yan; Hao Jin; Jinqiu Sun; Yanning Zhang

arXiv:2603.14207 · cs.CV · March 17, 2026

TL;DR

DualTSR introduces a unified transformer-based framework for scene text image super-resolution that internally models visual and textual information, eliminating the need for external OCR priors and simplifying the architecture.

Contribution

The paper proposes a dual-diffusion transformer that jointly models the image and text distributions in a single end-to-end super-resolution system.

Findings

01. Achieves high perceptual quality and text fidelity on Chinese benchmarks.

02. Simplifies architecture compared to multi-branch diffusion systems.

03. Effectively infers text priors without external OCR models.

Abstract

Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors…
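
The dual objective described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a straight-line probability path for the flow-matching branch, an absorbing-mask corruption for the discrete branch, and a simple weighted sum of the two losses; the source does not specify any of these. All names (`cfm_pair`, `mask_tokens`, `dual_loss`, `MASK`, `weight`) are hypothetical.

```python
import math
import random

MASK = "[MASK]"  # hypothetical absorbing token for the discrete (text) branch


def cfm_pair(x0, x1, t):
    """Assumed linear path: return the point x_t and its target velocity.

    x0 is a noise sample, x1 a clean (high-resolution) sample, t in [0, 1].
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def mask_tokens(tokens, t, rng):
    """Assumed corruption: each token is replaced by MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_ce_loss(probs, targets, corrupted):
    """Average negative log-likelihood, counted on masked positions only.

    probs[i] maps candidate characters to predicted probabilities.
    """
    terms = [
        -math.log(p[tgt])
        for p, tgt, c in zip(probs, targets, corrupted)
        if c == MASK
    ]
    return sum(terms) / len(terms) if terms else 0.0


def dual_loss(image_loss, text_loss, weight=1.0):
    """Weighted sum of both branch losses; the weighting is an assumption."""
    return image_loss + weight * text_loss
```

In a real model, a single shared backbone would produce both `v_pred` and `probs` from the corrupted image and text inputs, conditioned on the low-resolution image, so that both losses train the same parameters.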

Taxonomy

Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment