Xformer: Hybrid X-Shaped Transformer for Image Denoising

Jiale Zhang; Yulun Zhang; Jinjin Gu; Jiahua Dong; Linghe Kong; Xiaokang Yang

arXiv:2303.06440·cs.CV·February 27, 2024·6 cites

TL;DR

Xformer is a novel hybrid X-shaped Transformer architecture that enhances image denoising by capturing multi-scale features through spatial and channel-wise interactions, achieving state-of-the-art results.

Contribution

The paper introduces a hybrid X-shaped Transformer with dual branches and a Bidirectional Connection Unit for improved global and local feature modeling in image denoising.

Findings

01 Achieves state-of-the-art denoising performance

02 Effective multi-scale feature extraction

03 Comparable model complexity with superior results

Abstract

In this paper, we present a hybrid X-shaped vision Transformer, named Xformer, which performs notably on image denoising tasks. We explore strengthening the global representation of tokens from different scopes. In detail, we adopt two types of Transformer blocks. The spatial-wise Transformer block performs fine-grained local patches interactions across tokens defined by spatial dimension. The channel-wise Transformer block performs direct global context interactions across tokens defined by channel dimension. Based on the concurrent network structure, we design two branches to conduct these two interaction fashions. Within each branch, we employ an encoder-decoder architecture to capture multi-scale features. Besides, we propose the Bidirectional Connection Unit (BCU) to couple the learned representations from these two branches while providing enhanced information fusion. The joint…
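The difference between the two token definitions in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes a single head, identity projections (Q = K = V), non-overlapping square windows for the spatial case, and no shift operation, and all function names are hypothetical. It shows that spatial-wise attention mixes the pixels inside each local window, while channel-wise attention treats every channel's full H×W map as one token, so each attention weight summarizes the whole image.

```python
import math


def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    # Numerically stable row-wise softmax.
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(tokens):
    # Assumption: identity projections, so Q = K = V = tokens (one token per row).
    dim = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))
    scores = [[v / math.sqrt(dim) for v in row] for row in scores]
    weights = softmax_rows(scores)
    return matmul(weights, tokens), weights


def spatial_window_attention(feat, window):
    # feat is an H x W x C nested list; H and W must be divisible by `window`.
    # Tokens are the pixels (C-dim vectors) inside each non-overlapping window.
    height, width = len(feat), len(feat[0])
    out = [[None] * width for _ in range(height)]
    map_shape = None
    for top in range(0, height, window):
        for left in range(0, width, window):
            coords = [(i, j)
                      for i in range(top, top + window)
                      for j in range(left, left + window)]
            mixed, weights = self_attention([feat[i][j] for i, j in coords])
            map_shape = (len(weights), len(weights[0]))
            for (i, j), vec in zip(coords, mixed):
                out[i][j] = vec
    return out, map_shape


def channel_attention(feat):
    # Tokens are channels: each token is one channel's flattened H*W map.
    height, width = len(feat), len(feat[0])
    pixels = [feat[i][j] for i in range(height) for j in range(width)]  # (H*W) x C
    mixed, weights = self_attention(transpose(pixels))                  # C x (H*W)
    mixed_pixels = transpose(mixed)                                     # (H*W) x C
    out = [[mixed_pixels[i * width + j] for j in range(width)] for i in range(height)]
    return out, (len(weights), len(weights[0]))


# Toy 4 x 4 feature map with 3 channels.
feat = [[[float(i + j + c) for c in range(3)] for j in range(4)] for i in range(4)]
spatial_out, spatial_map = spatial_window_attention(feat, window=2)
channel_out, channel_map = channel_attention(feat)
print(spatial_map)  # (4, 4): one weight per pixel pair inside a 2 x 2 window
print(channel_map)  # (3, 3): one weight per channel pair, independent of H and W
```

In this sketch the spatial attention map grows with the window size, whereas the channel attention map is always C×C. That is one way to read the abstract's contrast between fine-grained local interactions and direct global context interactions.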

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. This paper is well motivated. It is reasonable to combine the advantages of spatial-wise and channel-wise self-attention to capture both local fine-grained features and global features across channels.
2. The paper is well organized and easy to follow despite some typos.

Weaknesses

1. The technical novelty is incremental. There are two core designs: the dual-branch architecture and the bilateral interactions between the two branches. Both are typical designs that have been extensively explored in other work, so the technical novelty is limited, especially compared to SwinIR and Restormer.
2. Compared to Restormer, Xformer offers limited performance improvement, especially in real image denoising scenarios, which are more important for evaluation.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

The idea is good and novel. The proposed Xformer exploits stronger global representation of tokens with a hybrid implementation of spatial-wise and channel-wise Transformer. The bidirectional connection unit (BCU) is proposed to couple the learned representations from two branches of Xformer. It is simple but effective according to the ablation. The authors provide extensive ablations to show the effects of some key components, like STB, CTB, BCU, and shift operation. The main comparisons are

Weaknesses

Some details are not clear enough. How did the authors determine the final model when training finished? For example, did they choose the model with the best validation performance, or simply use the model from the final iteration? In the ablation study, Table 1 (b), w/o BCU and BCU-1 appear comparable, as do BCU-2 and Complete BCU. Please give more analysis of their differences. If the proposed method could be used for other image restor

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 2

Strengths

1. The X-shaped architecture is elegant and reasonable.
2. The experimental results on the synthetic dataset are good.
3. The overall paper writing is good.

Weaknesses

There are several places that are not intuitive or clear:
1. The authors claim that "we make the last encoder involving STBs of two branches share parameters for the purpose of computational efficiency." However, it is unclear how much the parameter-sharing strategy affects performance. It is also unclear why sharing parameters is critical at this point in the network. Why not share parameters elsewhere?
2. The authors claim that "In short, the STB utilizes non-ove

Code & Models

Repositories

gladzhang/xformer · PyTorch · Official

Videos

Xformer: Hybrid X-Shaped Transformer for Image Denoising · SlidesLive

Taxonomy

Topics: Image and Signal Denoising Methods · Advanced Image Fusion Techniques · Image Processing Techniques and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Absolute Position Encodings · Linear Layer · Label Smoothing · Dropout · Adam · Softmax