SPFormer: Enhancing Vision Transformer with Superpixel Representation
Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie

TL;DR
SPFormer introduces superpixel-based adaptive patching to Vision Transformers, improving accuracy, interpretability, and robustness in image recognition tasks, especially on challenging benchmarks like ImageNet.
Contribution
It presents a novel superpixel-enhanced Vision Transformer that adaptively captures image content, boosting performance and explainability over traditional fixed-patch methods.
Findings
Achieves 1.4% higher accuracy than DeiT-T on ImageNet.
Provides inherent interpretability through superpixel structures.
Enhances robustness against rotations and occlusions.
Abstract
In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions, effectively capturing intricate details and applicable at both initial and intermediate feature levels. SPFormer, trainable end-to-end, exhibits superior performance across various benchmarks. Notably, it exhibits significant improvements on the challenging ImageNet benchmark, achieving a 1.4% increase over DeiT-T and 1.1% over DeiT-S respectively. A standout feature of SPFormer is its inherent explainability. The superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance the…
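The contrast the abstract draws between fixed patches and content-adaptive superpixels can be illustrated with a small NumPy sketch. This is not the SPFormer module (the paper's own update rule is not reproduced here); it is a generic soft k-means style clustering, shown only to make the idea concrete. The function names `grid_init` and `soft_superpixels`, the `temperature` parameter, and the toy image are all assumptions made for illustration. For brevity every pixel may attach to every superpixel, whereas practical superpixel methods usually restrict each pixel to nearby superpixels.

```python
import numpy as np

def grid_init(feats, grid_h, grid_w):
    """Average features inside each cell of a regular grid (the fixed-patch view)."""
    H, W, C = feats.shape
    ph, pw = H // grid_h, W // grid_w
    cells = feats[:grid_h * ph, :grid_w * pw].reshape(grid_h, ph, grid_w, pw, C)
    return cells.mean(axis=(1, 3)).reshape(grid_h * grid_w, C)

def soft_superpixels(feats, grid_h=2, grid_w=2, n_iters=5, temperature=0.05):
    """Hypothetical soft clustering: returns (H, W, K) assignments and (K, C) centers."""
    H, W, C = feats.shape
    pixels = feats.reshape(H * W, C)
    centers = grid_init(feats, grid_h, grid_w)  # start from fixed patches
    for _ in range(n_iters):
        # Squared feature distance from every pixel to every superpixel center: (N, K).
        d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        logits = -d2 / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        assign = np.exp(logits)
        assign /= assign.sum(axis=1, keepdims=True)  # softmax over superpixels
        # Each center becomes the assignment-weighted mean of the pixel features.
        mass = assign.sum(axis=0)
        centers = (assign.T @ pixels) / (mass[:, None] + 1e-8)
    return assign.reshape(H, W, -1), centers

# Toy image: the region boundary (after column 3) does not align with the 4x4 patch grid.
image = np.zeros((8, 8, 1))
image[:, 3:, 0] = 1.0
assign, centers = soft_superpixels(image)
labels = assign.argmax(axis=-1)  # hard superpixel index per pixel
```

In this toy case the fixed 4x4 patches on the left mix both regions, while the learned assignments follow the actual boundary: left-region pixels attach to the left-hand centers and right-region pixels to the right-hand ones.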
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
* The proposed SPFormer introduces an innovative way of jointly optimizing superpixel and pixel representations. The converged superpixels are naturally aligned semantically.
* The proposed SPFormer introduces an efficient superpixel representation which significantly lowers the resource requirements.
* The proposed SPFormer outperforms the chosen baselines by significant margins.
* The relationships between pixels and superpixels are poorly explained. Specifically, the neighboring relationships described by "pixels surrounding superpixel" and "neighboring superpixels" are ambiguous. As the interaction between pixels and superpixels is a fundamental part of this work, this ambiguity severely lowers the quality of the otherwise significant work.
* The comparison with SViT (Huang et al., 2023) seems problematic. According to the original paper of Huang et al., the accuracy o
The idea of combining superpixels with transformer architectures sounds interesting, especially since transformers are prone to high memory requirements. The introduction manages to explain the problem and the appeal of the solution. The related work section is extensive and well-written.
Section 3.1: There are a couple of things that are confusing. First of all, the notation is unorthodox: with ph/pw (or sh/sw) the authors probably mean the subscripted p_h, p_w (or s_h, s_w) rather than a product. Equation (1) is also confusing, as it is unclear what "p" really denotes. Assuming that "p" is a spatial coordinate (e.g. (x, y)), we run into the problem that the last dimensions of S_f are sh x sw, while they are h x w for S_a. This should be made clearer, e.g. with the help of
- I think the visualizations in this paper are decent enough for people to follow.
- The use of superpixels is an interesting idea, and I think it's good to see people trying out this approach.
- Experiments are done on ImageNet datasets for image classification tasks, and the method is only compared with relatively old approaches like DeiT. Most state-of-the-art approaches use Swin-like shifted-window attention mechanisms, and the reported numbers are relatively low for the year 2023. It's really challenging to justify the effectiveness of this approach.
- The main motivation for using superpixel representations is (as stated by the authors): 1) efficiency, and 2) can potentially handle high resolut
- The paper shows the superpixel representation is superior to the patch representation in a ViT architecture.
1. Incorporating superpixels into neural networks has been well studied, and the essential difference between the proposed module and previous studies is updating the pixel feature. However, the proposed module is not compared with other clustering modules (e.g., SSN [1] and GroupViT), and the advantage of the proposed module is not verified.
2. The authors discussed the existing architecture using superpixel representation, but the authors compare the proposed method only with patch-based archi
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Medical Image Segmentation Techniques · Image and Signal Denoising Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dense Connections · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Label Smoothing · Adam
