Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration

Umar Rashid (1), Muhammad Arslan Arshad (1), Ghulam Ahmad (1), Muhammad Zeeshan Anjum (1), Rizwan Khan (1), Muhammad Akmal (2) ((1) University of Engineering & Technology, New Campus, Lahore, Pakistan, (2) Sheffield Hallam University, Sheffield, UK)

arXiv:2511.06087 · cs.CV · November 11, 2025

TL;DR

This paper presents a hybrid CNN-ViT deep learning framework for restoring motion-blurred scene text images. It reports 32.20 dB PSNR and 0.934 SSIM with 2.83 million parameters and an average inference time of 61 ms.

Contribution

It introduces a novel CNN-ViT hybrid architecture specifically designed for motion-blurred scene text restoration, combining local and global feature modeling.

Findings

01. Achieves 32.20 dB PSNR and 0.934 SSIM on benchmark datasets.

02. Maintains a lightweight design with 2.83 million parameters.

03. Operates with an average inference time of 61 ms.
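
To give the PSNR figure above some scale, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE). It is not the authors' evaluation code. The function names `psnr` and `mse_for_psnr` are hypothetical, and the 8-bit intensity range (`max_value=255`) is an assumption, since the page does not state the range used.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are passed as flat sequences of pixel intensities.
    max_value=255 assumes 8-bit images (an assumption, not from the paper).
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_for_psnr(psnr_db, max_value=255.0):
    """Invert the PSNR formula: the MSE that corresponds to a given PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10.0)
```

Under the 8-bit assumption, `mse_for_psnr(32.20)` comes out at about 39, which is a root-mean-square error of a little over 6 gray levels per pixel. SSIM is a separate structural measure and is not covered by this sketch.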

Abstract

Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text…
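
The abstract's idea of a transformer module adding global awareness to CNN features can be illustrated with a small sketch of single-head scaled dot-product self-attention applied to a feature map. This is a generic illustration using NumPy, not the paper's architecture: the function name `self_attention`, the 8×8×16 feature map, and the random projection weights are all assumptions made for the example.

```python
import numpy as np

def self_attention(feature_map, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a CNN feature map.

    feature_map: array of shape (H, W, C), e.g. the output of a conv encoder.
    w_q, w_k, w_v: projection matrices of shape (C, D) (hypothetical weights).
    Returns an array of shape (H, W, D).
    """
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)        # one token per spatial location
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (HW, HW): each location vs all others
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v).reshape(h, w, -1)

# Stand-in for encoder features; values are random for illustration only.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out = self_attention(features, w_q, w_k, w_v)
print(out.shape)  # (8, 8, 16)
```

Each output location is a weighted mix of every location in the map, which is what lets attention relate distant strokes of a blurred character, whereas a convolution only sees a local window.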

Taxonomy

Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment