Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Aditya Chaudhary; Prachet Dev Singh; Ankit Jha

arXiv:2512.02512·cs.CV·December 4, 2025

Open Access

TL;DR

This paper introduces ViT-SR, a two-stage Vision Transformer approach to image super-resolution that uses self-supervised pretraining on colorization to improve performance, reaching an SSIM of 0.712 and a PSNR of 22.90 dB on the DIV2K benchmark.

Contribution

The paper proposes a novel two-stage training strategy for ViT in image restoration, combining self-supervised colorization pretraining with residual upsampling for super-resolution.
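A colorization pretext task needs no manual labels, because each color image supplies its own training pair. The sketch below illustrates that idea only. This page does not say how ViT-SR builds its grayscale inputs or which color space it predicts, so the BT.601 luma weights and the helper names `to_grayscale` and `make_colorization_pair` are assumptions made for illustration.

```python
# Hypothetical sketch of building one self-supervised colorization pair.
# Assumption: pixels are (r, g, b) tuples with values in [0, 1], and the
# grayscale input is a BT.601 luma image; the paper may do this differently.

def to_grayscale(rgb_image):
    """Convert a nested list of (r, g, b) pixels to a single-channel image."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


def make_colorization_pair(rgb_image):
    """Return (model_input, target): grayscale in, original colors out."""
    return to_grayscale(rgb_image), rgb_image


# A 1x2 image: one pure red pixel and one white pixel.
image = [[(1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
gray_input, color_target = make_colorization_pair(image)
```

The target is the unmodified color image, so any collection of color photographs can serve as pretraining data without annotation.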

Findings

01 Achieves SSIM of 0.712 and PSNR of 22.90 dB on DIV2K.

02 Self-supervised pretraining improves super-resolution performance.

03 Residual learning simplifies the super-resolution task.
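For readers unfamiliar with the PSNR figure quoted in the findings, the metric is defined as 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the reconstruction. The following is a minimal sketch of that definition. It assumes 8-bit pixel values (peak 255) stored as flat sequences; this page does not state the evaluation protocol behind the reported number (color channel, border cropping, peak value), so the sketch is not a reproduction of it.

```python
import math

# Minimal PSNR sketch. Assumption: pixels are given as flat sequences of
# numbers on a 0..max_value scale; the paper's exact protocol is not stated.

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no error to measure
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 10 grey levels, so MSE = 100.
score = psnr([100, 100, 100, 100], [110, 90, 110, 90])
```

Higher values mean a smaller mean squared error relative to the peak signal.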

Abstract

Single Image Super-Resolution (SISR) remains a difficult problem in computer vision. We present ViT-SR, a technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. In the first stage, a self-supervised pretraining phase on a colorization task lets the model learn rich, generalizable visual representations from the data itself. In the second stage, the pretrained model is fine-tuned for 4x super-resolution. The design uses residual learning to simplify the task: the model predicts a high-frequency residual image that is added to an initial bicubic interpolation. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may…
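The residual formulation in the abstract can be sketched as follows: the low-resolution image is first upsampled by a fixed interpolator, and the network only has to supply the difference between that baseline and the true high-resolution image. To stay dependency-free, this sketch swaps in nearest-neighbour upsampling where the abstract names bicubic interpolation, and the helper names are hypothetical. Only the composition step (baseline plus residual) is meant to mirror the described design.

```python
# Sketch of residual upsampling on single-channel images (nested lists).
# Assumption: nearest-neighbour upsampling stands in for bicubic here.

def upsample_nearest(image, scale):
    """Repeat every pixel `scale` times along both axes."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def residual_target(high_res, low_res, scale):
    """Training target: what must be added to the upsampled baseline."""
    base = upsample_nearest(low_res, scale)
    return [[h - b for h, b in zip(h_row, b_row)]
            for h_row, b_row in zip(high_res, base)]


def add_residual(base, residual):
    """Final output: interpolated baseline plus the predicted residual."""
    return [[b + r for b, r in zip(b_row, r_row)]
            for b_row, r_row in zip(base, residual)]


low_res = [[1, 2], [3, 4]]
base = upsample_nearest(low_res, 4)  # 8x8 baseline for 4x upscaling
# Toy high-resolution image: the baseline plus a checkerboard of fine detail.
high_res = [[base[i][j] + (i + j) % 2 for j in range(8)] for i in range(8)]
residual = residual_target(high_res, low_res, 4)
restored = add_residual(base, residual)
```

In this toy case the residual is the checkerboard pattern alone, which shows why the target is easier to learn than the full image: the smooth content is already carried by the interpolated baseline.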

Taxonomy

Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Sparse and Compressive Sensing Techniques