# 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

**Authors:** Christopher Choy, JunYoung Gwak, Silvio Savarese

arXiv: 1904.08755 · 2019-06-17

## TL;DR

This paper introduces 4D spatio-temporal convolutional neural networks utilizing sparse tensors and generalized convolutions to directly process 3D-videos, improving performance and robustness over traditional 2D and 3D methods.

## Contribution

It proposes the first 4D CNN architecture with a generalized sparse convolution and an open-source library, advancing 3D-video perception capabilities.

## Key findings

- 4D CNNs outperform 2D and hybrid methods in semantic segmentation
- They are more robust to noise in 3D-video data
- In some cases, 4D CNNs are faster than 3D counterparts

## Abstract

In many robotics and VR/AR applications, 3D-videos are readily-available sources of input (a continuous sequence of depth images, or LIDAR scans). However, those 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution that encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field that enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks and are faster than the 3D counterpart in some cases.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.08755/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/1904.08755/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/1904.08755/full.md

---
Source: https://tomesphere.com/paper/1904.08755