Recursive Generalization Transformer for Image Super-Resolution
Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang

TL;DR
The paper introduces the Recursive Generalization Transformer (RGT) for image super-resolution. RGT captures global context by recursively aggregating features into representative maps and attending to them with cross-attention, and it outperforms existing methods.
Contribution
It proposes a novel recursive-generalization self-attention mechanism combined with local attention and a hybrid adaptive integration for improved image super-resolution.
Findings
RGT achieves superior quantitative results on benchmark datasets.
The model effectively captures global spatial information.
Extensive experiments validate the effectiveness of the proposed approach.
Abstract
Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Because of the quadratic computational complexity of self-attention (SA) in Transformers, existing methods tend to apply SA within a local region to reduce overhead. However, the local design restricts the exploitation of global context, which is crucial for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose the recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of attention matrices (query, key, and value) are further scaled to mitigate the redundancy in…
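The two steps the abstract names for RG-SA (recursive aggregation into a small representative map, then cross-attention from every input position to that map) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the 2x2 average pooling stands in for whatever learned aggregation the paper uses, there are no learned query/key/value projections or channel scaling, and the function names and the square, power-of-two input size are assumptions made for the example.

```python
import numpy as np


def aggregate_recursively(x, h):
    """Halve the spatial size with 2x2 average pooling until it is at most h x h.

    x has shape (H, W, C). Average pooling is an assumption of this sketch;
    it replaces the learned aggregation. H and W are assumed to be equal and
    to be h times a power of two.
    """
    while x.shape[0] > h or x.shape[1] > h:
        H, W, C = x.shape
        H2, W2 = H // 2, W // 2
        x = x[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2, C).mean(axis=(1, 3))
    return x


def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def rg_cross_attention(x, h):
    """Cross-attention from all input positions to the representative map.

    Queries come from the full-resolution features; keys and values come from
    the aggregated map, so the attention matrix has H*W rows but only about
    h*h columns instead of H*W columns.
    """
    H, W, C = x.shape
    rep = aggregate_recursively(x, h)        # (h, h, C) for the assumed sizes
    q = x.reshape(H * W, C)                  # one query per input position
    kv = rep.reshape(-1, C)                  # representative tokens
    attn = softmax(q @ kv.T / np.sqrt(C))    # (H*W, h*h)
    out = attn @ kv                          # (H*W, C)
    return out.reshape(H, W, C), attn


# Usage: a 32x32 feature map with 8 channels and a 4x4 representative map.
features = np.random.default_rng(0).standard_normal((32, 32, 8))
output, attention = rg_cross_attention(features, h=4)
```

In this sketch the attention matrix for a 32x32 input has 1024 x 16 entries rather than the 1024 x 1024 entries of full self-attention, which illustrates why attending to a fixed-size representative map avoids the quadratic cost the abstract describes.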
Peer Reviews
Decision: ICLR 2024 poster
1. The idea of using global attention in Transformers is widespread. However, the authors effectively maintain low computational complexity, which is meaningful in image SR. 2. Additionally, the proposed HAI is simple yet effective. Both the ablation experiments (Table 1 (c), (d)) and the visual results (Figs. 3, 4, 5) strongly support the authors' claim about integrating global and local modules. 3. The main comparisons with recent methods demonstrate the superiority of this method. I also notice that
1. The experiments on RG-SA are not enough. The authors claim the superiority of RG-SA, but it is not compared with other global attention mechanisms. 2. The authors only provide visual comparisons on Urban100 and Manga109 datasets. Comparisons on other datasets are lacking.
1. The authors propose the recursive-generalization self-attention (RG-SA), which controls computational complexity while achieving global modeling. 2. They also design the hybrid adaptive integration (HAI). It is simple yet effective. 3. The paper's experiments are comprehensive. The ablation study demonstrates the effects of each component. 4. Quantitative and qualitative results indicate that the proposed method outperforms SwinIR and CAT-A. 5. The authors provide various visual results: fe
1. Some details in the paper are not clear. For example, the representative map size "h" is set to 4 for training but 16 for testing. Why not use the same setting? 2. RGT-S and RGT both adopt a larger window size than SwinIR. To establish a fairer comparison, it is recommended to use the same window size. 3. It would be beneficial to include comparisons with more recent methods, such as RGT, to evaluate the effectiveness of the proposed method. 4. A comparison of running times should be give
- The paper's writing and organization are good. All illustrations, tables, and visual results are intuitive and clear. - The motivation for the proposed method is reasonable. Global information is important in SR, and reducing the complexity of global attention is crucial for applying it to SR tasks. - The proposed components RG-SA and HAI are novel and valuable. - The ablation study is extensive. The effectiveness of each part in RGT is demonstrated. - The authors provide multiple mo
- When compared with CAT-A, the improvements of RGT on some datasets (Set5, Set14) are not very obvious (< 0.1 dB). - Although FLOPs are provided in Sec. 4.4, the running time of the model on real devices should also be provided. - The primary evaluation metrics used in the paper are PSNR and SSIM. However, these metrics may not reflect actual SR performance. Some perceptual metrics, such as LPIPS, should be evaluated.
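For readers unfamiliar with the decibel figures in the reviews (for example the "< 0.1 dB" gap), PSNR is derived from the mean squared error between a reference image and a reconstruction. The following is a minimal sketch over flat lists of pixel values, assuming an 8-bit range with a peak of 255; the function name and the flat-list input are choices made for this example, and it ignores any color-space conversion or border cropping that a particular evaluation protocol may apply.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    max_value is the peak pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Usage: every pixel off by 10 gray levels gives an MSE of 100.
score = psnr([100, 100, 100, 100], [110, 90, 110, 90])
```

Because PSNR is logarithmic in the MSE, a 0.1 dB difference corresponds to roughly a 2% change in mean squared error, which is why such gaps can be hard to see and why the reviewer asks for a perceptual metric as well.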
Taxonomy
Topics: Advanced Image Processing Techniques · Advanced Image Fusion Techniques · Image Processing Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Absolute Position Encodings · Linear Layer · Label Smoothing · Dropout · Adam · Softmax
