STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Yang He, Wei-Chen Chiu, Margret Keuper, Mario Fritz

TL;DR
This paper introduces a superpixel-based multi-view CNN that leverages spatio-temporal information from multiple RGBD views to improve semantic segmentation accuracy, especially in indoor video scenarios.
Contribution
It presents a novel spatio-temporal pooling layer and a multi-view approach that enhances segmentation by utilizing additional viewpoints and unlabeled frames.
Findings
Improves segmentation accuracy over state-of-the-art single-view and multi-view methods.
Uses unlabeled frames during training to improve the segmentation of labeled frames.
Evaluated on the NYU-Depth-V2 and SUN3D datasets.
Abstract
We propose a novel superpixel-based multi-view convolutional neural network for semantic image segmentation. The proposed network produces a high-quality segmentation of a single image by leveraging information from additional views of the same scene. Particularly in indoor videos, such as those captured by robotic platforms or handheld and body-worn RGBD cameras, nearby video frames provide diverse viewpoints and additional context of objects and scenes. To leverage such information, we first compute region correspondences by optical flow and image boundary-based superpixels. Given these region correspondences, we propose a novel spatio-temporal pooling layer to aggregate information over space and time. We evaluate our approach on the NYU-Depth-V2 and the SUN3D datasets and compare it to various state-of-the-art single-view and multi-view approaches. Besides a general improvement over the…
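The pooling step described in the abstract can be illustrated with a minimal sketch. It assumes that region correspondences are already available as an integer region id per pixel, shared across frames, and that aggregation is a plain average; the abstract does not state the pooling operator, so averaging is an assumption here. The function name `spatio_temporal_pool` and the data layout are hypothetical and not taken from the authors' code.

```python
def spatio_temporal_pool(frames, region_maps):
    """Average per-pixel feature vectors over each region, across all frames.

    frames:      list of frames; each frame is a list of rows, each row a list
                 of per-pixel feature vectors (lists of floats).
    region_maps: same layout, holding one region id per pixel. The same id in
                 different frames marks corresponding regions; None marks a
                 pixel without a correspondence (assumed convention).
    Returns a dict mapping region id -> pooled feature vector.
    """
    sums = {}
    counts = {}
    for frame, regions in zip(frames, region_maps):
        for feat_row, id_row in zip(frame, regions):
            for feat, rid in zip(feat_row, id_row):
                if rid is None:
                    continue  # skip pixels that belong to no tracked region
                if rid not in sums:
                    sums[rid] = [0.0] * len(feat)
                    counts[rid] = 0
                for k, value in enumerate(feat):
                    sums[rid][k] += value
                counts[rid] += 1
    # Divide accumulated sums by pixel counts to get the per-region average.
    return {rid: [s / counts[rid] for s in vec] for rid, vec in sums.items()}


# Two frames of size 1x2 with 2-dimensional features per pixel.
frames = [
    [[[1.0, 0.0], [3.0, 0.0]]],
    [[[5.0, 2.0], [0.0, 4.0]]],
]
# Region 0 appears in both frames; region 1 only in the first frame.
region_maps = [
    [[0, 1]],
    [[0, None]],
]
pooled = spatio_temporal_pool(frames, region_maps)
```

In this toy input, region 0 pools one pixel from each frame, while region 1 pools only from the first frame. In a network, the pooled vector of each region would typically be passed to a classifier so that every pixel in the region of the target frame receives the same prediction.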
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Video Surveillance and Tracking Methods
