Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling
Aditya Chaudhary, Prachet Dev Singh, Ankit Jha

TL;DR
This paper introduces ViT-SR, a two-stage Vision Transformer approach to single-image super-resolution: self-supervised pretraining on a colorization task, followed by fine-tuning for 4x super-resolution. It reports an SSIM of 0.712 and a PSNR of 22.90 dB on the DIV2K benchmark.
Contribution
The paper proposes a novel two-stage training strategy for ViT in image restoration, combining self-supervised colorization pretraining with residual upsampling for super-resolution.
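As a rough illustration of why colorization needs no labels, the sketch below builds one pretraining pair from a single color image: the grayscale version is the input and the original is the target. The helper names and the Rec. 601 luminance weights are assumptions made for this example; the summary does not say how the authors derive their grayscale inputs.

```python
# Minimal sketch of a self-supervised colorization pair.
# ASSUMPTION: grayscale is computed with Rec. 601 luma weights; the paper
# summary does not specify the conversion, and these helpers are hypothetical.

def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) floats in [0, 1] to single luminance values."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


def make_colorization_pair(rgb_image):
    """Return (model_input, target): grayscale input, original color target."""
    return to_grayscale(rgb_image), rgb_image


# A 2x2 toy image: red, green, blue, white.
image = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)],
]
gray, target = make_colorization_pair(image)
```

The target is the unmodified image, so any unlabeled collection of color images can serve as pretraining data.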
Findings
Achieves an SSIM of 0.712 and a PSNR of 22.90 dB on DIV2K for 4x super-resolution.
Self-supervised pretraining improves super-resolution performance.
Residual learning simplifies the super-resolution task.
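For readers unfamiliar with the PSNR figure quoted above, here is a minimal sketch of the standard definition, 10 * log10(MAX^2 / MSE). It assumes single-channel images stored as nested lists with a peak value of 1.0; the summary does not state which color space or value range the authors evaluate on, so this is not necessarily their exact protocol.

```python
import math

# Minimal sketch of PSNR between a reference image and an estimate.
# ASSUMPTION: single-channel images as nested lists, peak value max_value.

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB; returns inf for identical images."""
    squared_errors = [
        (ref_px - est_px) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for ref_px, est_px in zip(ref_row, est_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


reference = [[0.0, 1.0], [0.5, 0.5]]
estimate = [[0.1, 0.9], [0.6, 0.4]]  # every pixel is off by 0.1
score = psnr(reference, estimate)    # MSE of about 0.01, so about 20 dB
```

Under the same definition, 22.90 dB corresponds to a mean squared error of roughly 0.005 relative to a peak value of 1.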
Abstract
Single Image Super-Resolution (SISR) remains a difficult problem in computer vision. We present ViT-SR, a technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. First, the model learns rich, generalizable visual representations from the data itself in a self-supervised pretraining phase on a colorization task. The pretrained model is then fine-tuned for 4x super-resolution. The model predicts a high-frequency residual image that is added to an initial bicubic interpolation, a design that simplifies the learning task. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may…
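The residual formulation in the abstract can be sketched as "upsampled input plus predicted residual". The code below is an illustration under stated assumptions, not the authors' implementation: nearest-neighbor repetition stands in for bicubic interpolation to keep the example dependency-free, and `predict_residual` is a hypothetical placeholder for the fine-tuned ViT.

```python
# Minimal sketch of residual upsampling: output = upsample(LR) + residual.
# ASSUMPTIONS: nearest-neighbor repetition replaces bicubic interpolation,
# and predict_residual is a hypothetical stand-in for the trained model.

def upsample_nearest(image, scale):
    """Repeat every pixel scale x scale times (stand-in for bicubic)."""
    return [
        [px for px in row for _ in range(scale)]
        for row in image
        for _ in range(scale)
    ]


def reconstruct(low_res, predict_residual, scale=4):
    """Add the predicted high-frequency residual to the upsampled base image."""
    base = upsample_nearest(low_res, scale)
    residual = predict_residual(low_res, scale)  # must match base in shape
    return [
        [b + r for b, r in zip(base_row, res_row)]
        for base_row, res_row in zip(base, residual)
    ]


def zero_residual(low_res, scale):
    """Placeholder 'model' that predicts no correction at all."""
    height, width = len(low_res) * scale, len(low_res[0]) * scale
    return [[0.0] * width for _ in range(height)]


low_res = [[0.2, 0.8]]                          # 1x2 toy image
super_res = reconstruct(low_res, zero_residual)  # 4x8 output
```

With a zero residual the output equals the interpolated base, so the network only has to learn the missing detail on top of a reasonable starting point.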
Taxonomy
Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Sparse and Compressive Sensing Techniques
