Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos
Ankita Pasad, Ariel Gordon, Tsung-Yi Lin, Anelia Angelova

TL;DR
This paper introduces a method that uses unsupervised learning of 3D geometry and motion from videos to improve semantic segmentation accuracy and reduce labeling requirements by enforcing spatio-temporal consistency.
Contribution
The paper uses depth, egomotion, and camera intrinsics, all learned without supervision from video, as an additional supervision signal for a single-image segmentation model, improving its quality or, alternatively, reducing the number of labels it needs.
Findings
Significant improvement in segmentation quality on the ScanNet dataset
Reduced need for labeled data in training segmentation models
Effective enforcement of 3D-geometric and temporal consistency
Abstract
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames. The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model, significantly enhancing its quality, or, alternatively, reducing the number of labels the segmentation model needs. Our experiments were performed on the ScanNet dataset.
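The consistency idea in the abstract can be illustrated with a minimal NumPy sketch. It assumes that depth, intrinsics `K`, and egomotion `(R, t)` are already available as arrays (in the paper they are predicted by models learned without supervision), and it uses nearest-neighbor sampling with a squared-difference penalty. Those two choices, along with the function names `reproject_pixels` and `consistency_loss`, are illustrative assumptions and not the authors' implementation; a trainable version would need differentiable sampling.

```python
import numpy as np


def reproject_pixels(depth, K, R, t):
    """Map each pixel of frame A to its (x, y) location in frame B.

    depth: (H, W) depth map of frame A.
    K:     (3, 3) camera intrinsics.
    R, t:  rotation (3, 3) and translation (3,) taking points from
           camera A coordinates to camera B coordinates.
    Returns (H, W, 2) coordinates in frame B and an (H, W) validity mask.
    """
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)

    # Back-project pixels to 3D points in camera A, then move them to camera B.
    points_a = (pix @ np.linalg.inv(K).T) * depth[..., None]
    points_b = points_a @ R.T + t

    # Project into frame B's image plane.
    proj = points_b @ K.T
    z = proj[..., 2]
    valid = z > 1e-6
    safe_z = np.where(valid, z, 1.0)  # avoid division by zero
    xy = proj[..., :2] / safe_z[..., None]

    # Keep only points whose nearest pixel lies inside the image.
    valid &= (xy[..., 0] >= -0.5) & (xy[..., 0] <= w - 0.5)
    valid &= (xy[..., 1] >= -0.5) & (xy[..., 1] <= h - 0.5)
    return xy, valid


def consistency_loss(probs_a, probs_b, depth_a, K, R, t):
    """Mean squared difference between frame A's class probabilities and
    frame B's probabilities sampled at the reprojected pixel locations.

    probs_a, probs_b: (H, W, C) per-pixel class probabilities.
    Nearest-neighbor sampling is an assumption made for brevity.
    """
    h, w = depth_a.shape
    xy, valid = reproject_pixels(depth_a, K, R, t)
    xi = np.clip(np.rint(xy[..., 0]).astype(int), 0, w - 1)
    yi = np.clip(np.rint(xy[..., 1]).astype(int), 0, h - 1)
    warped_b = probs_b[yi, xi]  # (H, W, C), frame B seen from frame A
    if not valid.any():
        return 0.0
    return float(((probs_a - warped_b) ** 2)[valid].mean())
```

In a training loop, a term like this would be added to the usual supervised segmentation loss on labeled frames, so that unlabeled neighboring frames also constrain the segmentation model.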
Taxonomy
Topics: Advanced Vision and Imaging · Image Enhancement Techniques · Image Processing Techniques and Applications
