# In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images

**Authors:** Marin Oršić, Ivan Krešo, Petra Bevandić, Siniša Šegvić

arXiv: 1903.08469 · 2019-04-15

## TL;DR

This paper demonstrates that using pre-trained ImageNet architectures with lightweight upsampling and multi-resolution feature fusion significantly improves real-time semantic segmentation performance on road-driving images, outperforming custom lightweight models.

## Contribution

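The paper pairs a lightweight general-purpose architecture, used as the main recognition engine, with lightweight upsampling through lateral connections to restore prediction resolution. It also enlarges the receptive field by fusing shared features at multiple resolutions, all aimed at real-time segmentation.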

## Key findings

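- The Cityscapes test submission, SwiftNetRN-18, achieves 75.5% MIoU.
- It runs at 39.9 Hz on 1024x2048 images on a GTX1080Ti.
- The approach shows a substantial advantage over custom lightweight architectures on several road-driving datasets, both with ImageNet pre-trained parameters and when trained from scratch.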
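
The two headline numbers above can be unpacked with a little arithmetic. The sketch below, in plain Python, computes mean IoU from a confusion matrix using the standard per-class definition TP / (TP + FP + FN), and converts the reported 39.9 Hz at 1024x2048 into a per-frame time and a pixel throughput. The function names and the two-class toy confusion matrix are invented for illustration; they are not taken from the paper or from the Cityscapes evaluation scripts.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[t][p] counts pixels whose true class is t and whose
    predicted class is p. Classes that appear in neither the ground
    truth nor the predictions are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


def frame_time_ms(rate_hz):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / rate_hz


def pixels_per_second(height, width, rate_hz):
    """Pixels processed per second at a given resolution and frame rate."""
    return height * width * rate_hz


# Hypothetical two-class confusion matrix (illustration only, not real data).
toy_confusion = [
    [50, 10],
    [5, 35],
]
toy_miou = mean_iou(toy_confusion)

# Figures reported in the summary: 39.9 Hz on 1024x2048 images.
latency_ms = frame_time_ms(39.9)
throughput = pixels_per_second(1024, 2048, 39.9)
```

At 39.9 Hz the per-frame budget is about 25.1 ms, and a 1024x2048 input corresponds to roughly 83.7 million pixels per second.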

## Abstract

Recent success of semantic segmentation approaches on demanding road driving datasets has spurred interest in many related application fields. Many of these applications involve real-time prediction on mobile platforms such as cars, drones and various kinds of robots. Real-time setup is challenging due to extraordinary computational complexity involved. Many previous works address the challenge with custom lightweight architectures which decrease computational complexity by reducing depth, width and layer capacity with respect to general purpose architectures. We propose an alternative approach which achieves a significantly better performance across a wide range of computing budgets. First, we rely on a light-weight general purpose architecture as the main recognition engine. Then, we leverage light-weight upsampling with lateral connections as the most cost-effective solution to restore the prediction resolution. Finally, we propose to enlarge the receptive field by fusing shared features at multiple resolutions in a novel fashion. Experiments on several road driving datasets show a substantial advantage of the proposed approach, either with ImageNet pre-trained parameters or when we learn from scratch. Our Cityscapes test submission entitled SwiftNetRN-18 delivers 75.5% MIoU and achieves 39.9 Hz on 1024x2048 images on GTX1080Ti.
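
The upsampling step described in the abstract can be pictured with a minimal single-channel sketch in plain Python. It is not the authors' implementation: it assumes that a coarse feature map is bilinearly upsampled (corner-aligned) to the size of a finer encoder feature and blended with that lateral feature by elementwise summation, with a scalar `lateral_weight` standing in for whatever learned projection a real network would apply. All function names here are hypothetical.

```python
def upsample_bilinear(grid, out_h, out_w):
    """Corner-aligned bilinear upsampling of a 2D list of floats."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def lateral_fuse(coarse, lateral, lateral_weight=1.0):
    """Upsample `coarse` to the size of `lateral` and add the two.

    Assumption: elementwise summation is used for blending, and the
    scalar weight is a stand-in for a learned projection of the lateral.
    """
    h, w = len(lateral), len(lateral[0])
    up = upsample_bilinear(coarse, h, w)
    return [
        [up[i][j] + lateral_weight * lateral[i][j] for j in range(w)]
        for i in range(h)
    ]


def decode(features_fine_to_coarse, lateral_weight=1.0):
    """Walk from the coarsest feature back to the finest, fusing each lateral."""
    x = features_fine_to_coarse[-1]
    for lateral in reversed(features_fine_to_coarse[:-1]):
        x = lateral_fuse(x, lateral, lateral_weight)
    return x


# Toy encoder outputs (illustration only): a 3x3 fine map and a 2x2 coarse map.
fine = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
coarse = [[0.0, 2.0], [4.0, 6.0]]
restored = decode([fine, coarse], lateral_weight=0.5)
```

The decoder stays cheap in this picture because each stage only interpolates and adds; the heavy computation remains in the encoder, which is where a pre-trained backbone would sit.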

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.08469/full.md

## Figures

28 figures with captions in the complete paper: https://tomesphere.com/paper/1903.08469/full.md

## References

42 references, with the full list in the complete paper: https://tomesphere.com/paper/1903.08469/full.md

---
Source: https://tomesphere.com/paper/1903.08469