Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Pengchuan Zhang; Xiyang Dai; Jianwei Yang; Bin Xiao; Lu Yuan; Lei Zhang; Jianfeng Gao

arXiv:2103.15358·cs.CV·May 28, 2021·6 cites

Open Access 3 Repos

TL;DR

This paper introduces Multi-Scale Vision Longformer, a Vision Transformer for high-resolution image encoding that combines a multi-scale model structure with an attention mechanism of linear complexity in the number of input tokens. It outperforms existing ViT models, their ResNet counterparts, and the Pyramid Vision Transformer on a range of vision tasks.

Contribution

The paper proposes a new Vision Transformer architecture that integrates multi-scale encoding and Longformer-based attention for efficient high-resolution image processing.

Findings

01 Outperforms existing ViT models and their ResNet counterparts on a range of vision tasks

02 Achieves attention complexity that is linear in the number of input tokens, which suits high-resolution images

03 Source code and models are publicly available

Abstract

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, which is a variant of Longformer (Beltagy et al., 2020), originally developed for natural language processing, and achieves a linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work (Wang et al., 2021), on a range of vision tasks, including image…
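To make the linear-complexity claim concrete, the following is a minimal sketch of the general sliding-window idea behind Longformer-style attention, not the paper's implementation. It assumes a 1D token sequence for simplicity (the paper works with 2D image tokens), and the names `local_attention_1d`, `attention_pairs`, and the `window` parameter are hypothetical, chosen for illustration only.

```python
import math


def local_attention_1d(q, k, v, window):
    """Sliding-window attention over a 1D token sequence (illustrative only).

    Query i attends only to keys j with |i - j| <= window, so each query
    touches at most 2 * window + 1 keys regardless of sequence length.
    q, k, v are lists of equal-length float vectors.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores restricted to the local window.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        # Weighted average of the value vectors inside the window.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, range(lo, hi))) / z
            for c in range(len(v[0]))
        ])
    return out


def attention_pairs(n, window=None):
    """Count query-key pairs: full attention if window is None, else local."""
    if window is None:
        return n * n
    return sum(min(n, i + window + 1) - max(0, i - window) for i in range(n))


if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(n, attention_pairs(n), attention_pairs(n, window=7))
```

With a fixed window, the pair count is bounded by `n * (2 * window + 1)`, so it grows linearly in the number of tokens, whereas full attention grows as `n * n`. This is the property that makes long token sequences from high-resolution images affordable.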

Taxonomy

Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Average Pooling · 1x1 Convolution · Batch Normalization · Global Average Pooling · Bottleneck Residual Block · Adam · Linear Warmup With Linear Decay