Beyond Fixation: Dynamic Window Visual Transformer

Pengzhen Ren; Changlin Li; Guangrun Wang; Yun Xiao; Qing Du; Xiaodan Liang; Xiaojun Chang

arXiv:2203.12856·cs.CV·April 11, 2022

TL;DR

This paper introduces DW-ViT, a dynamic multi-scale window visual transformer that enhances model performance by adaptively adjusting window sizes for better multi-scale information modeling, outperforming fixed-window approaches.

Contribution

According to the authors, this is the first use of dynamic multi-scale windows in visual transformers; the paper reports improved performance over fixed-window models such as Swin Transformer.

Findings

01. DW-ViT outperforms state-of-the-art methods on ImageNet-1K, ADE20K, and COCO.

02. The improvements are consistent at similar parameter counts and computational cost.

03. DW-ViT is scalable and can be easily integrated into existing window-based transformers.

Abstract

Recently, there has been a surge of interest in visual transformers that reduce computational cost by limiting self-attention to a local window. Most current work uses a fixed single-scale window by default, ignoring the impact of window size on model performance. However, this may limit the potential of these window-based models to capture multi-scale information. In this paper, we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond models that employ a fixed single window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head…
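The head-group idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the official repository for that). It only shows one plausible way to give each group of attention heads its own window size and to partition a feature map into non-overlapping windows. The function names, the equal split of heads across groups, and the example window sizes are all assumptions made for illustration. The abstract is truncated here, so the dynamic part of the method is not modeled.

```python
from typing import Dict, List, Tuple


def assign_window_sizes(num_heads: int, window_sizes: List[int]) -> List[int]:
    """Split heads into equal groups and give each group one window size.

    The equal split is an illustrative assumption, not taken from the paper.
    """
    if num_heads % len(window_sizes) != 0:
        raise ValueError("num_heads must be divisible by the number of window sizes")
    heads_per_group = num_heads // len(window_sizes)
    return [ws for ws in window_sizes for _ in range(heads_per_group)]


def window_partition(height: int, width: int, window_size: int) -> List[List[Tuple[int, int]]]:
    """Group the (row, col) positions of a feature map into non-overlapping square windows."""
    if height % window_size or width % window_size:
        raise ValueError("feature map sides must be divisible by the window size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([(top + r, left + c)
                            for r in range(window_size)
                            for c in range(window_size)])
    return windows


def attention_scope_per_head(height: int, width: int, num_heads: int,
                             window_sizes: List[int]) -> Dict[int, int]:
    """Map each head index to the number of tokens a query in that head attends to."""
    sizes = assign_window_sizes(num_heads, window_sizes)
    return {head: len(window_partition(height, width, ws)[0])
            for head, ws in enumerate(sizes)}


if __name__ == "__main__":
    # Hypothetical setting: an 8x8 feature map, 4 heads, two window sizes.
    print(assign_window_sizes(4, [2, 4]))
    print(attention_scope_per_head(8, 8, 4, [2, 4]))
```

In this toy setting, heads in the first group attend within small windows and heads in the second group within larger ones, so a single attention layer sees the feature map at more than one scale.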

Code & Models

Repositories

pzhren/dw-vit (PyTorch, official)

Taxonomy

Topics: Advanced Computing and Algorithms · Image and Video Quality Assessment · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Softmax · Label Smoothing · Dropout