SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction
Zekun Li, Yufan Liu, Bing Li, Weiming Hu, Kebin Wu, Pei Wang

TL;DR
This paper introduces SDTP, a novel transformer pyramid architecture that enhances multi-scale dense image prediction by exploiting semantic diversity and efficient cross-level interaction, outperforming existing methods.
Contribution
The paper proposes a new Semantic-aware Decoupled Transformer Pyramid with three key components, improving multi-scale feature interaction and semantic diversity handling in dense prediction tasks.
Findings
Outperforms state-of-the-art methods in dense image prediction.
Components are plug-and-play and adaptable to other models.
Effectively models multi-scale semantic interactions with reduced computation.
Abstract
Although transformers have achieved great progress on computer vision tasks, scale variation in dense image prediction remains a key challenge. Few effective multi-scale techniques have been applied to transformers, and current methods have two main limitations. On one hand, the self-attention module in the vanilla transformer fails to sufficiently exploit the diversity of semantic information because of its rigid mechanism. On the other hand, it is hard to build attention and interaction among different levels because of the heavy computational burden. To alleviate these problems, we first revisit the multi-scale problem in dense prediction, verifying the significance of diverse semantic representation and multi-scale interaction, and exploring the adaptation of the transformer to a pyramidal structure. Inspired by these findings, we propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP)…
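The computational burden the abstract mentions can be made concrete with a rough count of query-key pairs. The sketch below is illustrative arithmetic only, not the paper's method: the 512×512 input, the four strides, and the helper names `tokens_per_level` and `attention_pairs` are all assumptions chosen for the example. It compares dense self-attention applied within each pyramid level against dense attention over the tokens of all levels concatenated.

```python
# Illustrative arithmetic only. The image size, the strides, and the helper
# names are assumptions for this example, not values taken from the paper.

def tokens_per_level(height, width, strides):
    """Number of spatial tokens at each pyramid level."""
    return [(height // s) * (width // s) for s in strides]


def attention_pairs(num_queries, num_keys):
    """Query-key pairs scored by one dense attention layer."""
    return num_queries * num_keys


strides = (4, 8, 16, 32)  # assumed four-level feature pyramid
tokens = tokens_per_level(512, 512, strides)

# Dense self-attention restricted to each level separately.
per_level = sum(attention_pairs(n, n) for n in tokens)

# Dense attention over the concatenation of all levels,
# so that every token can interact with every other level.
total_tokens = sum(tokens)
joint = attention_pairs(total_tokens, total_tokens)

# Pairs that exist only because of cross-level interaction.
cross_level_only = joint - per_level

print("tokens per level:", tokens)
print("per-level pairs: ", per_level)
print("joint pairs:     ", joint)
print("cross-level pairs:", cross_level_only)
```

Under these assumptions the finest level dominates the token count, and adding dense cross-level interaction contributes a further set of pairs on top of the per-level cost. This is the kind of overhead that motivates looking for cheaper cross-level interaction schemes.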
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Softmax
