MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution
Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu

TL;DR
This paper introduces MSPE, a method that enables Vision Transformers to adapt effectively to varying input resolutions by using multi-scale patch embeddings, improving performance on low-resolution images without retraining.
Contribution
MSPE replaces the standard patch embedding with multiple variable-sized patch kernels and selects the best parameters for each input resolution, allowing ViTs to handle different resolutions without additional training or model modifications.
Findings
Improves accuracy on low-resolution images
Maintains competitive performance on high-resolution images
Applicable to various vision tasks such as classification, segmentation, and detection
Abstract
Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem was overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution, such as 224x224, for efficiency during training and inference. However, uniform input size conflicts with real-world scenarios where images naturally vary in resolution. Modifying the preset resolution of a model may severely degrade the performance. In this work, we propose to enhance the model adaptability to resolution variation by optimizing the patch embedding. The proposed method, called Multi-Scale Patch Embedding (MSPE), substitutes the standard patch embedding with multiple variable-sized patch kernels and selects the best parameters for different resolutions, eliminating the need to resize the original image. Our method does not require…
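To make the idea in the abstract concrete, here is a minimal NumPy sketch of a patch embedding that holds several kernel sizes and picks one per input resolution. It is an illustration under stated assumptions, not the authors' implementation. The class name `MultiScalePatchEmbed`, the helper `patch_embed`, the random kernel initialisation, and the selection rule (choose the kernel whose token grid is closest to the grid seen at training time) are all hypothetical; this summary does not state how MSPE actually selects or trains its kernels.

```python
import numpy as np


def patch_embed(image, kernel):
    """Non-overlapping patch embedding (stride equals patch size).

    image:  array of shape (H, W, C)
    kernel: array of shape (p, p, C, D)
    returns tokens of shape ((H // p) * (W // p), D)
    """
    ph, pw, c, d = kernel.shape
    h, w, _ = image.shape
    gh, gw = h // ph, w // pw
    # Crop any remainder, then cut the image into a (gh, gw) grid of patches.
    patches = image[: gh * ph, : gw * pw].reshape(gh, ph, gw, pw, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, ph * pw * c)
    # Linear projection of each flattened patch to a D-dimensional token.
    return patches @ kernel.reshape(ph * pw * c, d)


class MultiScalePatchEmbed:
    """Holds one kernel per patch size and selects one per input (assumed rule)."""

    def __init__(self, kernel_sizes, channels, dim, base_grid, seed=0):
        rng = np.random.default_rng(seed)
        self.base_grid = base_grid  # token grid side used at training time
        # Random weights stand in for learned kernels in this sketch.
        self.kernels = {
            p: 0.02 * rng.standard_normal((p, p, channels, dim))
            for p in kernel_sizes
        }

    def select(self, h, w):
        # Hypothetical selection rule: keep the token grid as close as
        # possible to the training-time grid.
        return min(
            self.kernels,
            key=lambda p: abs(h // p - self.base_grid) + abs(w // p - self.base_grid),
        )

    def __call__(self, image):
        p = self.select(image.shape[0], image.shape[1])
        return p, patch_embed(image, self.kernels[p])


# Assumed setup: a ViT trained at 224x224 with 16x16 patches (14x14 tokens).
embed = MultiScalePatchEmbed(kernel_sizes=(4, 8, 16), channels=3, dim=32, base_grid=14)
for size in (56, 112, 224):
    patch_size, tokens = embed(np.ones((size, size, 3)))
    print(size, patch_size, tokens.shape)
```

With this rule, images at each of the three resolutions are embedded directly, without resizing, and all produce the same number of tokens, so the downstream transformer blocks see a sequence length they already know.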
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Advanced Neural Network Applications
