Semantic Segmentation by Early Region Proxy
Yifan Zhang, Bo Pang, Cewu Lu

TL;DR
This paper introduces RegProxy, a novel region-based Transformer model for semantic segmentation that predicts at the region level, achieving superior performance and efficiency compared to traditional dense prediction methods.
Contribution
It proposes a region proxy approach that models image regions with learnable, flexible geometries and encodes them using Transformer self-attention, eliminating the need for dense pixel-wise prediction.
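As a rough illustration of what encoding region embeddings with self-attention means, the sketch below applies single-head scaled dot-product attention to a handful of toy region vectors. It is a simplified stand-in and not the RegProxy implementation: it has no learned query/key/value projections, no multiple heads, and no feed-forward layers, and the names self_attention and region_embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(regions):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the raw embeddings
    (identity projections); a real Transformer layer learns them.
    """
    dim = len(regions[0])
    encoded = []
    for query in regions:
        # Similarity of this region to every region, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in regions
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all region embeddings.
        encoded.append(
            [sum(w * value[j] for w, value in zip(weights, regions))
             for j in range(dim)]
        )
    return encoded


# Toy input: three regions with 2-dimensional embeddings.
region_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = self_attention(region_embeddings)
```

Each row of encoded mixes information from every region, which is how a sequence of region proxies can carry region-wise context without a dense pixel grid.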
Findings
Outperforms CNN models with fewer parameters and less computation.
Achieves 52.9 mIoU on ADE20K, surpassing the previous state of the art.
Demonstrates a superior performance-efficiency trade-off.
Abstract
Typical vision backbones manipulate structured features. As a compromise, semantic segmentation has long been modeled as per-point prediction on dense regular grids. In this work, we present a novel and efficient modeling that starts from interpreting the image as a tessellation of learnable regions, each of which has flexible geometrics and carries homogeneous semantics. To model region-wise context, we exploit Transformer to encode regions in a sequence-to-sequence manner by applying multi-layer self-attention on the region embeddings, which serve as proxies of specific regions. Semantic segmentation is now carried out as per-region prediction on top of the encoded region embeddings using a single linear classifier, where a decoder is no longer needed. The proposed RegProxy model discards the common Cartesian feature layout and operates purely at region level. Hence, it exhibits the…
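The per-region prediction step described in the abstract can be sketched in a few lines: a single linear classifier scores each encoded region embedding, and every pixel then inherits the label of the region it belongs to. This is a minimal sketch under stated assumptions, not the authors' code. In particular, the hard pixel-to-region map (region_map) is a hypothetical simplification of the paper's learnable region geometry, and the weights are toy values rather than trained parameters.

```python
def linear_classify(embeddings, weights, bias):
    """Score every region embedding with one linear layer.

    weights holds one row per class; bias holds one value per class.
    """
    return [
        [sum(w * x for w, x in zip(class_w, emb)) + class_b
         for class_w, class_b in zip(weights, bias)]
        for emb in embeddings
    ]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def segment(embeddings, weights, bias, region_map):
    """Predict one label per region, then paint it onto the pixels.

    Assumption: region_map[y][x] is the index of the single region that
    owns pixel (x, y); this hard assignment is illustrative only.
    """
    logits = linear_classify(embeddings, weights, bias)
    region_labels = [argmax(row) for row in logits]
    return [[region_labels[r] for r in row] for row in region_map]


# Toy setup: 3 regions, 2-dimensional embeddings, 2 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # row 0 scores class 0, row 1 scores class 1
bias = [0.0, 0.0]
region_map = [[0, 0, 1],
              [2, 2, 1]]            # a 2x3 image split into three regions

mask = segment(embeddings, weights, bias, region_map)
```

The classifier runs once per region rather than once per pixel, so its cost scales with the number of regions instead of the image resolution.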
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding
