IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation
Lihua Fu, Haoyue Tian, Xiangping Bryce Zhai, Pan Gao, Xiaojiang Peng

TL;DR
IncepFormer is a novel Transformer-based architecture for semantic segmentation that combines pyramid structured encoders with Inception-like modules to achieve high accuracy and efficiency across multiple benchmarks.
Contribution
It introduces a pyramid structured Transformer encoder and integrates Inception-like modules with depth-wise convolutions for improved local and global feature extraction.
Findings
Achieves 47.7% mIoU on ADE20K with fewer parameters and FLOPs.
Attains 82.0% mIoU on Cityscapes with 39.6M parameters.
Outperforms state-of-the-art methods in accuracy and speed.
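The accuracy figures above are reported as mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou`, the flattened-label input format, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; benchmark toolkits may handle ignore labels and absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred, target: flat sequences of integer class labels, one per pixel.
    """
    # confusion[t][p] counts pixels whose ground-truth class is t and
    # whose predicted class is p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]                            # true positives
        fn = sum(confusion[c]) - tp                     # missed pixels of class c
        fp = sum(row[c] for row in confusion) - tp      # pixels wrongly labelled c
        union = tp + fp + fn
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: a flattened 2x3 label map with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, 3)
print(score)
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. A reported 47.7% mIoU is the same quantity averaged over all dataset classes and accumulated over the whole validation set.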
Abstract
Semantic segmentation usually benefits from global contexts, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters with these aspects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer has two critical contributions, as follows. First, it introduces a novel pyramid structured Transformer encoder which harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, and a light-weight feed-forward module in each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that our IncepFormer is superior to state-of-the-art methods…
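To make the phrase "Inception-like architecture with depth-wise convolutions" more concrete, the following is a rough sketch of the general idea only: several depth-wise branches with different kernel sizes run in parallel, and their outputs are concatenated along the channel axis. It is not the paper's actual module. The kernel sizes (3, 5, 7), the averaging kernels standing in for learned weights, the zero padding, and all function names are hypothetical choices made for illustration.

```python
def depthwise_conv2d(x, kernels):
    """Depth-wise 2D convolution with zero padding (output keeps H x W).

    x: list of C channels, each an H x W list of lists.
    kernels: list of C square k x k kernels, one per channel, so channels
    are filtered independently and never mixed.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, k = len(channel), len(channel[0]), len(kernel)
        pad = k // 2
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the map contribute zero.
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += channel[ii][jj] * kernel[di][dj]
                result[i][j] = acc
        out.append(result)
    return out


def box_kernel(k):
    """Hypothetical stand-in for learned weights: a k x k averaging kernel."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def inception_like_block(x, kernel_sizes=(3, 5, 7)):
    """One depth-wise branch per kernel size, concatenated along channels."""
    branches = []
    for k in kernel_sizes:
        branches.extend(depthwise_conv2d(x, [box_kernel(k)] * len(x)))
    return branches


# Two-channel 8x8 input of ones; three branches give 2 * 3 = 6 output channels.
x = [[[1.0] * 8 for _ in range(8)] for _ in range(2)]
y = inception_like_block(x)
print(len(y), len(y[0]), len(y[0][0]))
```

Because each branch sees a different receptive field, the concatenated output carries local features at several scales while the depth-wise form keeps the cost per branch low, which is the efficiency argument the abstract makes.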
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Convolution · Linear Layer
