Beyond Fixation: Dynamic Window Visual Transformer
Pengzhen Ren, Changlin Li, Guangrun Wang, Yun Xiao, Qing Du, Xiaodan Liang, Xiaojun Chang

TL;DR
This paper introduces DW-ViT, a dynamic multi-scale window visual transformer that enhances model performance by adaptively adjusting window sizes for better multi-scale information modeling, outperforming fixed-window approaches.
Contribution
The authors present what is, to the best of their knowledge, the first use of dynamic multi-scale windows in visual transformers, and report improved performance over fixed-window models such as Swin Transformer.
Findings
DW-ViT outperforms state-of-the-art methods on ImageNet-1K, ADE20K, and COCO.
It achieves consistent improvements over fixed-window baselines at similar parameter counts and computational cost.
DW-ViT is scalable and easily integrable into existing window-based transformers.
Abstract
Recently, there has been a surge of interest in reducing the computational cost of visual transformers by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window by default, ignoring the impact of window size on model performance. This may limit the potential of window-based models for modeling multi-scale information. In this paper, we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond models that employ a fixed single-window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head…
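The head-group idea described in the abstract can be illustrated with a minimal sketch. It covers only the step the abstract states (assigning a different window size to each group of attention heads) and is not the authors' implementation. The function names, the 28 x 28 feature map, the four heads, and the window sizes 7 and 14 are all assumptions chosen for illustration; the source does not give the sizes DW-ViT uses.

```python
from typing import Dict, List, Tuple


def assign_windows_to_head_groups(num_heads: int, window_sizes: List[int]) -> Dict[int, int]:
    """Split heads evenly into one group per window size (hypothetical helper).

    Returns a mapping from head index to the window size that head uses.
    """
    if num_heads % len(window_sizes) != 0:
        raise ValueError("num_heads must be divisible by the number of window sizes")
    heads_per_group = num_heads // len(window_sizes)
    return {head: window_sizes[head // heads_per_group] for head in range(num_heads)}


def window_partition(height: int, width: int, window: int) -> List[List[Tuple[int, int]]]:
    """Group the (row, col) token coordinates of an H x W map into
    non-overlapping window x window blocks."""
    if height % window != 0 or width % window != 0:
        raise ValueError("feature map size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append(
                [(r, c) for r in range(top, top + window) for c in range(left, left + window)]
            )
    return windows


def attention_pairs(height: int, width: int, window: int) -> int:
    """Count the query-key pairs one head evaluates when attention is
    restricted to non-overlapping windows: H * W * window^2."""
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


if __name__ == "__main__":
    # Illustrative values only; the sizes used in the paper are not in the source.
    head_to_window = assign_windows_to_head_groups(num_heads=4, window_sizes=[7, 14])
    for head, window in head_to_window.items():
        pairs = attention_pairs(28, 28, window)
        print(f"head {head}: window {window}x{window}, query-key pairs {pairs}")
```

Under these assumed values, a head with the larger window sees more context per token but evaluates four times as many query-key pairs, which is the trade-off a fixed single-scale window settles once for every head.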
Taxonomy
Topics: Advanced Computing and Algorithms · Image and Video Quality Assessment · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Softmax · Label Smoothing · Dropout
