# Analyzing Modular CNN Architectures for Joint Depth Prediction and Semantic Segmentation

**Authors:** Omid Hosseini Jafari, Oliver Groth, Alexander Kirillov, Michael Ying Yang, Carsten Rother

arXiv: 1702.08009 · 2018-09-27

## TL;DR

This paper investigates how modular CNN architectures can jointly improve depth estimation and semantic segmentation by analyzing cross-modality influences and proposing a balanced fusion approach.

## Contribution

The paper proposes a way to quantify the cross-modality influence between depth and semantic prediction maps during joint refinement, and shows that this influence can be balanced through the network architecture.

## Key findings

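- Final accuracy is related to cross-modality influence, but not linearly: a larger influence does not necessarily improve accuracy.
- A beneficial balance between the cross-modality influences can be achieved through the network architecture.
- The proposed CNN architecture fuses state-of-the-art depth estimation and semantic labeling results, and improves both tasks on the NYU-Depth v2 benchmark.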

## Abstract

This paper addresses the task of designing a modular neural network architecture that jointly solves different tasks. As an example, we use the tasks of depth estimation and semantic segmentation given a single RGB image. The main focus of this work is to analyze the cross-modality influence between depth and semantic prediction maps on their joint refinement. While most previous works solely focus on measuring improvements in accuracy, we propose a way to quantify the cross-modality influence. We show that there is a relationship between final accuracy and cross-modality influence, although not a simple linear one. Hence, a larger cross-modality influence does not necessarily translate into an improved accuracy. We find that a beneficial balance between the cross-modality influences can be achieved by network architecture and conjecture that this relationship can be utilized to understand different network design choices. Towards this end, we propose a Convolutional Neural Network (CNN) architecture that fuses the state-of-the-art results for depth estimation and semantic labeling. By balancing the cross-modality influences between depth and semantic prediction, we achieve improved results for both tasks using the NYU-Depth v2 benchmark.
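
The abstract does not spell out how cross-modality influence is measured, so the following is only a minimal sketch of one plausible ablation-based proxy, not the paper's definition. It asks how much a refined output changes when the other modality's input is replaced by its spatial mean. The names `toy_refiner`, `cross_modality_influence`, `alpha`, and `beta` are hypothetical, and the linear refiner is a stand-in for a real joint refinement network.

```python
import numpy as np

def toy_refiner(alpha, beta):
    """Hypothetical joint refinement: each output mixes in the other modality."""
    def refine(depth, seg):
        # depth: (H, W) depth map; seg: (H, W, C) per-class scores.
        refined_depth = depth + alpha * seg[..., 0]    # semantic -> depth path
        refined_seg = seg + beta * depth[..., None]    # depth -> semantic path
        return refined_depth, refined_seg
    return refine

def cross_modality_influence(refine, depth, seg):
    """Ablation proxy (an assumption, not the paper's measure): mean absolute
    change in each output when the other input is flattened to its mean."""
    base_depth, base_seg = refine(depth, seg)
    # Replace each input by a spatially uninformative version of itself.
    flat_seg = np.broadcast_to(seg.mean(axis=(0, 1), keepdims=True), seg.shape)
    flat_depth = np.full_like(depth, depth.mean())
    depth_without_seg, _ = refine(depth, flat_seg)
    _, seg_without_depth = refine(flat_depth, seg)
    seg_to_depth = float(np.mean(np.abs(base_depth - depth_without_seg)))
    depth_to_seg = float(np.mean(np.abs(base_seg - seg_without_depth)))
    return seg_to_depth, depth_to_seg

# Synthetic inputs, used only to exercise the functions.
rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 10.0, size=(8, 8))
seg = rng.random((8, 8, 4))

weak = cross_modality_influence(toy_refiner(0.1, 0.0), depth, seg)
strong = cross_modality_influence(toy_refiner(0.5, 0.0), depth, seg)
```

In this toy setting the semantic-to-depth influence grows with `alpha`, while the depth-to-semantic influence is zero because `beta` is zero. A measure of this kind says nothing about accuracy on its own, which matches the abstract's point that a larger influence does not necessarily mean better results.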

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1702.08009/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1702.08009/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1702.08009/full.md

---
Source: https://tomesphere.com/paper/1702.08009