Real-time Semantic Segmentation with Fast Attention
Ping Hu, Federico Perazzi, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Kate Saenko, Stan Sclaroff

TL;DR
This paper introduces a fast spatial attention mechanism within a novel CNN architecture that significantly improves real-time semantic segmentation accuracy and speed on high-resolution images and videos.
Contribution
The paper proposes a new fast spatial attention module and an efficient architecture that reduces computational costs while maintaining high accuracy for real-time semantic segmentation.
Findings
Achieves 74.4% mIoU at 72 FPS on Cityscapes
About 50% faster than the previous state of the art at comparable accuracy
Handles high-resolution inputs by further downsampling intermediate features, with minimal loss in accuracy
Abstract
In deep CNN based models for semantic segmentation, high accuracy relies on rich spatial context (large receptive fields) and fine spatial details (high resolution), both of which incur high computational costs. In this paper, we propose a novel architecture that addresses both challenges and achieves state-of-the-art performance for semantic segmentation of high-resolution images and videos in real-time. The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism and captures the same rich spatial context at a small fraction of the computational cost, by changing the order of operations. Moreover, to efficiently process high-resolution input, we apply an additional spatial reduction to intermediate feature stages of the network with minimal loss in accuracy thanks to the use of the fast attention…
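The "changing the order of operations" idea in the abstract can be illustrated with a small sketch. The excerpt above does not give the exact formulation, so the following is an assumption: the softmax of standard self-attention is replaced by L2 normalization of queries and keys plus a 1/n scale, which makes the product associative. Computing Q(KᵀV) instead of (QKᵀ)V then avoids the n × n affinity matrix and lowers the cost from O(n²c) to O(nc²) for n spatial positions and c channels. All function names and sizes here are hypothetical, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def l2_normalize_rows(m, eps=1e-12):
    """Scale each row (one spatial position) to unit L2 norm over channels."""
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / norm for x in row])
    return out


def attention_quadratic(q, k, v):
    """(Q K^T) V: builds an n x n affinity matrix first, cost O(n^2 c)."""
    n = len(q)
    affinity = matmul(q, transpose(k))   # n x n
    return [[x / n for x in row] for row in matmul(affinity, v)]


def attention_fast(q, k, v):
    """Q (K^T V): builds a c x c context matrix first, cost O(n c^2)."""
    n = len(q)
    context = matmul(transpose(k), v)    # c x c
    return [[x / n for x in row] for row in matmul(q, context)]


def random_matrix(rows, cols, rng):
    """Random features standing in for a flattened feature map."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, c = 64, 4  # hypothetical: 64 spatial positions, 4 channels

# Assumed normalization: unit-norm queries and keys instead of a softmax.
q = l2_normalize_rows(random_matrix(n, c, rng))
k = l2_normalize_rows(random_matrix(n, c, rng))
v = random_matrix(n, c, rng)

slow = attention_quadratic(q, k, v)
fast = attention_fast(q, k, v)

# Both orderings give the same output up to floating-point rounding.
max_diff = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
```

The reordering only works because no row-wise softmax sits between the two products; with a softmax over QKᵀ the multiplication is no longer associative and the n × n matrix has to be formed explicitly.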
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
