Polyline Path Masked Attention for Vision Transformer
Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang, Deyu Meng

TL;DR
This paper introduces Polyline Path Masked Attention (PPMA), a novel structured masking strategy for Vision Transformers that improves spatial adjacency modeling, leading to state-of-the-art results in image classification, detection, and segmentation tasks.
Contribution
It proposes a new polyline path mask for ViTs that better preserves adjacency relationships and integrates it into the self-attention mechanism, enhancing spatial modeling capabilities.
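As a rough illustration of what a path-based decay mask applied to attention might look like, the sketch below builds a mask over an H×W token grid and multiplies it into softmax attention weights. It is a minimal sketch under stated assumptions, not the authors' implementation. The assumptions are: the path between two tokens is a single L-shaped route (horizontal leg, then vertical leg); each token has per-direction decay factors in (0, 1]; and the mask is applied element-wise after the softmax without renormalization. The function names `polyline_mask` and `masked_attention` and the parameters `decay_h` and `decay_v` are hypothetical. The naive loops here are for readability only and do not reflect the efficient algorithm the paper mentions.

```python
import math


def polyline_mask(height, width, decay_h, decay_v):
    """Build an (H*W) x (H*W) decay mask over L-shaped grid paths.

    Assumption (illustrative only): the path from token (r1, c1) to
    token (r2, c2) first moves horizontally along row r1, then
    vertically along column c2. The mask entry is the product of the
    per-token decay factors met along that path. Tokens are indexed
    row-major: index = r * width + c.
    """
    n = height * width
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r1, c1 = divmod(i, width)
        for j in range(n):
            r2, c2 = divmod(j, width)
            weight = 1.0
            # Horizontal leg along row r1, between columns c1 and c2.
            lo, hi = sorted((c1, c2))
            for c in range(lo + 1, hi + 1):
                weight *= decay_h[r1][c]
            # Vertical leg along column c2, between rows r1 and r2.
            lo, hi = sorted((r1, r2))
            for r in range(lo + 1, hi + 1):
                weight *= decay_v[r][c2]
            mask[i][j] = weight
    return mask


def masked_attention(q, k, v, mask):
    """Softmax attention whose weights are scaled element-wise by mask.

    Assumption: the mask multiplies the softmax output directly and the
    rows are not renormalized afterwards.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = [
            sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            for j in range(n)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [mask[i][j] * e / total for j, e in enumerate(exps)]
        out.append(
            [sum(weights[j] * v[j][t] for j in range(n)) for t in range(len(v[0]))]
        )
    return out


# Tiny usage example on a 2x2 grid with hypothetical decay values.
decay_h = [[1.0, 0.5], [1.0, 0.25]]
decay_v = [[1.0, 1.0], [0.5, 0.1]]
mask = polyline_mask(2, 2, decay_h, decay_v)
```

With per-token decays, the mask value between two tokens depends on which route connects them, so distant tokens are down-weighted according to the grid path rather than their distance in a flattened 1D sequence. Setting every decay factor to 1.0 makes the mask all ones, and `masked_attention` then reduces to ordinary softmax attention.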
Findings
Achieves higher mIoU on the ADE20K segmentation benchmark than previous state-of-the-art models.
Outperforms previous state-of-the-art models in image classification and object detection.
Provides a theoretical analysis of the polyline path mask and an efficient algorithm for computing it.
Abstract
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding…
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
