Sub-token ViT Embedding via Stochastic Resonance Transformers
Dong Lao, Yangchao Wu, Tian Yu Liu, Alex Wong, Stefano Soatto

TL;DR
This paper introduces the Stochastic Resonance Transformer (SRT), a training-free method that enhances Vision Transformer features with finer spatial details, significantly improving performance across various vision tasks.
Contribution
The paper proposes a novel, training-free sub-token spatial transformation technique for ViT models, improving spatial resolution and task performance without additional training.
Findings
Boosts performance on segmentation, classification, and depth estimation tasks by up to 14.9%.
Applicable across any ViT layer, enhancing spatial detail without fine-tuning.
Retains semantic richness while improving spatial granularity.
Abstract
Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, and results in semantically rich but spatially coarsely quantized feature maps. In order to retrieve spatial details beneficial to fine-grained inference tasks we propose a training-free method inspired by "stochastic resonance". Specifically, we perform sub-token spatial transformations to the input data, and aggregate the resulting ViT features after applying the inverse transformation. The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization. SRT is applicable across…
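The shift, encode, inverse-shift, and aggregate loop described in the abstract can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: `toy_patch_encoder` is a hypothetical stand-in for a ViT (it mean-pools each non-overlapping patch), the perturbations are circular integer translations smaller than one patch (chosen so that the inverse is exact), features are upsampled by nearest-neighbour repetition, and aggregation is a plain average. The image height and width are assumed to be multiples of the patch size.

```python
import numpy as np

def toy_patch_encoder(image, patch):
    """Hypothetical stand-in for a ViT: one feature vector per
    non-overlapping patch, computed here by mean pooling."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    blocks = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return blocks.mean(axis=(1, 3))  # token grid of shape (gh, gw, c)

def upsample_tokens(tokens, patch):
    """Nearest-neighbour upsampling of a token grid to pixel resolution."""
    return np.repeat(np.repeat(tokens, patch, axis=0), patch, axis=1)

def srt_features(image, encoder, patch, shifts):
    """Average the encoder features over sub-token translations.

    Each (dy, dx) shift is applied to the input, the shifted input is
    encoded, and the upsampled features are shifted back before averaging.
    Circular shifts (np.roll) are an assumption of this sketch.
    """
    acc = None
    for dy, dx in shifts:
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        feats = upsample_tokens(encoder(shifted, patch), patch)
        restored = np.roll(feats, shift=(-dy, -dx), axis=(0, 1))
        acc = restored if acc is None else acc + restored
    return acc / len(shifts)

# Toy input: a vertical edge that does not align with the 4-pixel patch grid.
patch = 4
image = np.zeros((8, 8, 1))
image[:, 2:, :] = 1.0

# Single pass: features are constant inside each patch.
coarse = upsample_tokens(toy_patch_encoder(image, patch), patch)

# Aggregated pass: all integer shifts smaller than one patch.
shifts = [(dy, dx) for dy in range(patch) for dx in range(patch)]
fine = srt_features(image, toy_patch_encoder, patch, shifts)

print("distinct values along a row, single pass:", len(np.unique(coarse[0, :, 0])))
print("distinct values along a row, aggregated: ", len(np.unique(np.round(fine[0, :, 0], 6))))
```

In this toy setting the single-pass feature map is piecewise constant over each patch, while the aggregated map varies within a patch. That is the sense in which averaging over sub-token shifts grounds the features on a finer spatial grid. With a real ViT, `encoder` would be replaced by the model's feature extractor, and boundary handling would need more care than a circular shift.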
Peer Reviews
Decision·ICML 2024 Poster
- The proposed method is intuitive and based on well-established principles in signal processing, while being remarkably simple to apply, similar in form to an augmentation and ensembling scheme.
- The method is essentially post-hoc and architecture-agnostic, as it only requires super-resolving features extracted from the final layer, which could then be leveraged in potential downstream tasks, notably dense prediction tasks that require higher-resolution feature attributions.
- Aside from the outlin
- ~~The problem the paper seeks to tackle, while somewhat intuitively reasoned, seems insufficiently motivated. The quantisation artefacts are presented as a result of discrete partitioning into uniform square patches, but the exact nature of these artefacts as well as their effect on predictions are hardly discussed or touched upon. This makes the problem statement more ambiguous than necessary.~~ **Edit:** *The last revision addresses this with a formalisation of the context of SRT.*
- While t
- This work introduces an interesting finding: super-resolution on the token embeddings can help improve the ViT's capability.
- The effectiveness of this method has been verified in multiple CV tasks.
- Perturbations are translation only. Translation is just one kind of perturbation, and it can be replaced by a simple convolution kernel. If translation-based perturbation works, then other, more complex perturbations such as rotation may also work. Also, since translation works, it seems it could be replaced by a convolutional network, followed by some deconvolution kernels in the upsampling stage, with some conv layers to do the aggregation. Then it will become a learnable model inste
- Clarity of Presentation: The paper is well-written and easily understandable. It effectively conveys the proposed approach and its rationale.
- Simplicity and Effectiveness: The simplicity of SRT is a notable strength. Despite its simplicity, it has demonstrated high effectiveness in five vision tasks, which is a valuable contribution to the field.
- Generalization Ability: SRT can be applied at any layer and on any task without fine-tuning.
- A theoretical guarantee is missing.
Taxonomy
Topics: Integrated Circuits and Semiconductor Failure Analysis
Methods: Attention Is All You Need · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Adam
