MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution

Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu

arXiv:2405.18240 · cs.CV · May 29, 2024 · 2 citations

TL;DR

This paper introduces MSPE, a method that lets Vision Transformers adapt to varying input resolutions through multi-scale patch embeddings. It improves performance on low-resolution images without costly retraining of the full model.

Contribution

MSPE replaces the standard patch embedding with multiple variable-sized patch kernels, allowing ViTs to handle different resolutions without costly retraining or modifications to other parts of the model.
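To make the idea of choosing among several patch-kernel sizes concrete, here is a minimal Python sketch. The helper names `token_grid` and `select_patch_size`, the candidate sizes, the 14-token target grid, and the nearest-grid rule are hypothetical illustrations, not the selection rule published for MSPE.

```python
# Hypothetical sketch: pick one of several patch-kernel sizes for an input
# resolution so the token grid stays close to the grid the ViT was trained on.
# The kernel sizes and the "closest grid" rule are illustrative assumptions,
# not the selection rule published for MSPE.


def token_grid(resolution: int, patch_size: int) -> int:
    """Tokens per side for non-overlapping square patches (leftover pixels dropped)."""
    return resolution // patch_size


def select_patch_size(resolution: int, patch_sizes: list[int], target_grid: int = 14) -> int:
    """Return the candidate kernel size whose token grid is nearest target_grid."""
    # Ties are broken toward the smaller kernel (more tokens, finer detail).
    return min(patch_sizes, key=lambda p: (abs(token_grid(resolution, p) - target_grid), p))


if __name__ == "__main__":
    kernels = [4, 8, 16, 32]
    for res in (56, 112, 224, 448):
        p = select_patch_size(res, kernels)
        g = token_grid(res, p)
        print(f"{res}x{res} -> patch {p}, grid {g}x{g}")
```

With kernels of 4, 8, 16, and 32 pixels, inputs from 56×56 to 448×448 all map to a 14×14 grid, the same grid a ViT with 16-pixel patches sees at 224×224, so the sequence length the model was trained on is preserved without resizing the image.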

Findings

01 Improves accuracy on low-resolution images

02 Maintains competitive performance on high-resolution images

03 Applies to various vision tasks, such as classification, segmentation, and detection

Abstract

Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem has been overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution, such as 224×224, for efficiency during training and inference. However, a uniform input size conflicts with real-world scenarios where images naturally vary in resolution. Modifying the preset resolution of a model may severely degrade its performance. In this work, we propose to enhance the model's adaptability to resolution variation by optimizing the patch embedding. The proposed method, called Multi-Scale Patch Embedding (MSPE), substitutes the standard patch embedding with multiple variable-sized patch kernels and selects the best parameters for different resolutions, eliminating the need to resize the original image. Our method does not require…
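The core mechanism in the abstract, embedding an image with a kernel sized for its resolution instead of resizing the image, can be illustrated with a NumPy sketch of non-overlapping patch embedding (equivalent to a convolution whose stride equals its kernel size). The function name `patch_embed`, the channels-last layout, and the random stand-in weights are assumptions for illustration; MSPE's learned kernels and its actual parameter-selection procedure are not reproduced here.

```python
import numpy as np


def patch_embed(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Non-overlapping patch embedding (a conv with stride == kernel size).

    image:  (H, W, C) array, channels-last (an assumption of this sketch)
    kernel: (p, p, C, D) weights for one patch size p and embedding dim D
    returns (H // p, W // p, D) grid of token embeddings
    """
    p, _, c, d = kernel.shape
    h, w, _ = image.shape
    gh, gw = h // p, w // p
    x = image[: gh * p, : gw * p]  # drop border pixels that do not fill a whole patch
    # (gh*p, gw*p, C) -> (gh, p, gw, p, C) -> (gh, gw, p, p, C)
    patches = x.reshape(gh, p, gw, p, c).transpose(0, 2, 1, 3, 4)
    # Flatten each patch and project it with the identically flattened kernel.
    return patches.reshape(gh, gw, p * p * c) @ kernel.reshape(p * p * c, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embed_dim = 32
    # One random kernel per patch size, standing in for learned weights.
    kernels = {p: rng.standard_normal((p, p, 3, embed_dim)) for p in (8, 16)}
    hi_res = rng.standard_normal((224, 224, 3))
    lo_res = rng.standard_normal((112, 112, 3))
    print(patch_embed(hi_res, kernels[16]).shape)  # (14, 14, 32)
    print(patch_embed(lo_res, kernels[8]).shape)   # (14, 14, 32)
```

Both images yield a 14×14 token grid, so the low-resolution input reaches the transformer at its native resolution rather than being upsampled to 224×224.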

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Advanced Neural Network Applications